Parallel MM5 benchmarks, 2001

Click here to return to http://www.mmm.ucar.edu/mm5/mpp/helpdesk.
Disclaimer
Some results on this page are contributed (as thanked below). Contributed results are provided on this page along with results from runs we conducted ourselves with the following caveat: Reasonable efforts have been made to verify contributed results, including consulting with the contributors and inspection of configuration and raw benchmark result files. We have complete confidence in the integrity and competence of the contributors; however, we assume no responsibility for nor make any claims regarding the accuracy, veracity, or appropriateness for any purpose whatsoever of contributed results. Further, all results, whether contributed or generated by us, are for a single fixed-size MM5 case on the specific machines listed, and no claim is made of representativeness for other model scenarios, code versions, or hardware installations.

Press here for the newer Parallel MM5 2003 Benchmarks Page.
Press here for the newer Parallel MM5 2002 Benchmarks Page.
Press here for the older Parallel MM5 2000 Benchmarks Page.


Click here to download the input data for this MM5 benchmark case.
For additional information on MM5 benchmarks, please click here.
For information on the MM5 distributed-memory parallel code, please click here.
Scroll down this page for additional explanation of the figures shown here.



Figure 1a. MM5 floating-point performance on various platforms. (Updated Nov. 26, 2001)

Figure 1b. MM5 floating-point performance on various platforms (zoomed). (Updated Nov. 26, 2001)


Figures 1a-b show performance results in Mflop/second and in simulated hours per hour on a variety of platforms.

Timings for the Pittsburgh Supercomputing Center Terascale Computing System (PSC TCS) were conducted by J. Michalakes on October 20-21, 2001, with additional runs on October 26 and 29. Thanks to Ralph Roskies, Sergiu Sanielevici, Roberto Gomez, and others at PSC. The 6 TFlop/s peak TCS comprises 3000 1 GHz Compaq Alpha EV68 processors (750 ES45 nodes). The model was run as straight MPI (no OpenMP), using MPI over shared memory for communication within nodes and MPI over Quadrics for communication between nodes. These are full-node timings (i.e., using all 4 CPUs on each node) on a dedicated system. (10/23/2001; updated 10/26/2001)

Timings for the IBM Power4 were conducted as single-threaded MPI tasks on 16 and 32 1.3 GHz CPUs of a Regatta-H node (the 16-CPU run used 8 Power4 chips with 2 CPU cores each, i.e., two 8-CPU MCMs). Color added to plot for clarity. Contributed. Thanks Jim Tuccillo, IBM. (10/26/2001; 10/29/2001)

The AlphaServerSC/667 timings were completed on an AlphaServer configuration at Compaq. Each SMP node in the cluster contains four 667 MHz EV67 processors. The model was run as straight MPI (no OpenMP), using MPI over shared memory for communication within nodes and MPI over Quadrics for communication between nodes. Contributed. Thanks Steve Leventer, Compaq. Note: these timings have been updated since first being posted 5/19/00; they are now all full-node timings (i.e., 4 CPUs per node). (6/26/00)

The Fujitsu VPP5000 is a distributed-memory machine with vector processors, linked by a high-speed crossbar interconnect. The model was run using Fujitsu's implementation of MPI and with a one-dimensional data decomposition to preserve vector length in the I-dimension.
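
To make the decomposition point concrete, the sketch below contrasts a one-dimensional split over J with a two-dimensional split of the 136 x 112 benchmark grid described further down this page. The even-tile arithmetic is an illustrative assumption, not MM5's actual partitioning code.

# Illustrative sketch: local tile shapes for the 136 x 112 (I x J) benchmark
# grid under 1-D and 2-D decompositions across 8 tasks. The even split below
# ignores remainders and halo cells and is not taken from the MM5 source.
NI, NJ = 136, 112   # east/west (I) and north/south (J) grid dimensions

def tile_shape(ntasks_i, ntasks_j):
    # approximate per-task tile size in each horizontal dimension
    return NI // ntasks_i, NJ // ntasks_j

print("1-D over J:", tile_shape(1, 8))   # (136, 14) -- full I-dimension vector length kept
print("2-D split :", tile_shape(4, 2))   # (34, 56)  -- I-dimension vector length cut to 34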

The IBM SP WH2 timings were obtained on the NCAR/SCD IBM Winterhawk-II machine (four 375 MHz Power3 CPUs per node). This is the post-upgrade blackforest.ucar.edu. The model was run using four MPI tasks per node (no OpenMP), i.e., 4 processors per node. Note: these numbers have been updated since 5/19/00; we reran using the MP_SHARED_MEMORY environment variable, improving the 256-node time by approximately 14 percent. (5/22/00)

The HPTi ACL/667 timings were conducted on jet.fsl.noaa.gov, an Alpha Linux cluster at the NOAA Forecast Systems Laboratory with a single 667 MHz Alpha processor per node. The model was run using MPI-over-Myrinet message-passing between single-threaded processes, one per node. The code was compiled using Compaq compilers on Linux. Contributed. Thanks Greg Lindahl, HPTi. (3/10/00)

The Origin3000 400 MHz timings were obtained using MPI message-passing. Because the lines bunch up in the main figure, this plot is shown only in the zoomed figure and in the table below. Contributed. Thanks Elizabeth Hayes, SGI. (3/14/01)

The Origin2000 300 MHz timings were obtained in dedicated (exclusive access) mode using MPI message-passing on 64- and 128-processor configurations of R12000 processors. This plot is in color only for readability. Contributed. Thanks Wesley Jones, SGI. (1/5/00)

The Pentium-III ScaliMPI timings were conducted on 16 dual 800 MHz Pentium-III nodes using the SCA interconnect and ScaMPI. Contributed. Thanks Ole W. Saastad. (6/14/2001)

The Cray T90 timings were obtained on the Cray T932 at the NOAA Geophysical Fluid Dynamics Laboratory. All timings were made in dedicated mode and represent elapsed time, except the one-processor timing, which was obtained in non-dedicated mode and represents CPU time. All runs were in shared-memory mode (Cray Microtasking, not MPI). The T90 runs were also used to determine the floating-point operation count on which the Mflop/second estimates are based. Color added to plot for clarity. (1/5/00)

All runs were of a 36-kilometer resolution domain over Europe; the grid consisted of 136 cells in the east/west dimension, 112 cells in the north/south dimension, and 33 vertical layers (about 503,000 cells). The time step is 81 seconds. There is a link to the input data for this case at the top of this page. The operation count for this scenario is 2,398 million floating-point operations per average time step. I/O and model initialization were not included in the timings. All timing runs were performed at single (32-bit) floating-point precision except the T90 runs. Scaling is calculated as the speedup divided by the factor of increase in the number of processors.
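
As a rough illustration of how the two measures in Figures 1a-b relate, the short Python sketch below converts an elapsed wall-clock time per average time step into Mflop/second and simulated hours per hour, using the operation count (2,398 Mflop per step) and time step (81 seconds) given above. The sample elapsed time is hypothetical, not a measured value.

# Sketch: convert elapsed seconds per average MM5 time step into the two
# performance measures plotted in Figures 1a-b. The elapsed time used in
# the example call is made up for illustration.
MFLOP_PER_STEP = 2398.0      # floating-point operations per average step (Mflop)
SIM_SECONDS_PER_STEP = 81.0  # model time step (simulated seconds)

def performance(wall_seconds_per_step):
    mflops = MFLOP_PER_STEP / wall_seconds_per_step
    sim_hours_per_hour = SIM_SECONDS_PER_STEP / wall_seconds_per_step
    return mflops, sim_hours_per_hour

# Example: 0.30 wall-clock seconds per step (hypothetical)
mflops, ratio = performance(0.30)
print("%.0f Mflop/sec, %.0f simulated hours per hour" % (mflops, ratio))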

The results were as follows:

- NEW PSC TCS (11/26/2001),  16 to 512 CPU ( 8100 to 98960 Mflop/sec), 38 percent
-   "       "       "     ,  64 to 512 CPU (26178 to 98960 Mflop/sec), 47 percent
-   "       "       "     , 256 to 512 CPU (64925 to 98960 Mflop/sec), 76 percent
-   "       "       "     ,  16 to  64 CPU ( 8100 to 26178 Mflop/sec), 81 percent
-   "       "       "     ,  64 to 128 CPU (26178 to 46894 Mflop/sec), 90 percent
-   "       "       "     ,  16 to  32 CPU ( 8100 to 16092 Mflop/sec), 99 percent 
- OLD PSC TCS (10/20/2001),  16 to 512 CPU ( 7836 to 88934 Mflop/sec), 35 percent
-   "       "       "     ,  64 to 512 CPU (25339 to 88934 Mflop/sec), 44 percent
-   "       "       "     , 256 to 512 CPU (58097 to 88934 Mflop/sec), 77 percent
-   "       "       "     ,  16 to  64 CPU ( 7836 to 25339 Mflop/sec), 81 percent
-   "       "       "     ,  64 to 128 CPU (25339 to 43594 Mflop/sec), 86 percent
-   "       "       "     ,  16 to  32 CPU ( 7836 to 15175 Mflop/sec), 97 percent 
- IBM Power4 1.3Ghz,  16 to 32 CPU (8248 to 13701 Mflop/sec), 83 percent
- Fujitsu VPP5000 (8/9/00), 1 to 40 CPU (2156 to 40638 Mflop/sec), 47 percent
-   "       "       "    , 1 to 20 CPU (2156 to 27815 Mflop/sec), 77 percent
-   "       "       "    , 1 to 10 CPU (2156 to 16650 Mflop/sec), 89 percent
- Compaq AlphaServerSC/667, 4 to 512 CPU (1255 to 45317 Mflop/sec), 28 percent
-   "        "        "   , 4 to 256 CPU (1255 to 32471 Mflop/sec), 40 percent
-   "        "        "   , 4 to 128 CPU (1255 to 22015 Mflop/sec), 55 percent
-   "        "        "   , 4 to  64 CPU (1255 to 11709 Mflop/sec), 58 percent
- HPTi ACL/667, 1 to 128 CPU (330 to 22960 Mflop/sec), 54 percent
-   "    "   " (1 to  64 CPU (330 to 12680 Mflop/sec), 60 percent
-   "    "   " (1 to  32 CPU (330 to  7900 Mflop/sec), 75 percent
- SGI O2000 300 MHz, 1 to 120 CPU (158 to 15080 Mflop/sec), 80 percent
- SGI O3000 400 MHz, 4 to 64 CPU (815 to 14045 Mflop/sec), 108 percent (1)
- IBM WH2, 4 to 256 CPU (674 to 24219 Mflop/sec), 55 percent
-  "   " , 4 to 128 CPU (674 to 16767 Mflop/sec), 77 percent
-  "   " , 4 to  64 CPU (674 to  9082 Mflop/sec), 83 percent
- Pentium-III ScaliMPI (6/14/2001), 4 to 32 CPU (950 to 3460 Mflop/sec), 91 percent
- Cray T90, 1 to 20 CPU (569 to 8472 Mflop/sec), 74 percent
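
As a sanity check, the scaling percentages above can be reproduced from the listed Mflop/second values: because the floating-point operation count is the same for every run, the ratio of sustained Mflop/second rates serves as the speedup, and dividing by the factor of increase in processor count gives the scaling. A minimal check in Python, using the new PSC TCS entries from the list:

# Reproduce a scaling percentage from the list above. Speedup is taken as the
# ratio of sustained Mflop/sec (valid because the operation count is fixed);
# scaling is speedup divided by the factor of increase in processor count.
def scaling(mflops_small, mflops_large, cpus_small, cpus_large):
    speedup = mflops_large / mflops_small
    cpu_factor = cpus_large / cpus_small
    return 100.0 * speedup / cpu_factor

# New PSC TCS, 16 -> 512 CPUs (values from the list above)
print("%.0f percent" % scaling(8100.0, 98960.0, 16, 512))   # ~38 percent
# New PSC TCS, 16 -> 32 CPUs
print("%.0f percent" % scaling(8100.0, 16092.0, 16, 32))    # ~99 percent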

Notes

(1) Linear or superlinear scaling. This does not necessarily mean superior scalability; rather, it may be indicative of degraded run times because of memory/cache effects on the runs with smaller numbers of processors.


John Michalakes, michalak@ucar.edu
---
Updated: November 26, 2001. Updated PSC numbers (reran benchmarks on improved machine)
Updated: October 29, 2001. Added 32-CPU Power4; updated PSC TCS-II and added 32-CPU time
Updated: October 26, 2001. Added 16-CPU Power4; updated PSC TCS-II and added 16-CPU time
Updated: October 23, 2001. Added PSC TCS-II and Pentium-III Scali-MPI