Parallel MM5 benchmarks, 2002

Click here to return to http://www.mmm.ucar.edu/mm5/mpp/helpdesk .
Disclaimer
Some results on this page are contributed (as thanked below). Contributed results are provided on this page along with results from runs we conducted ourselves with the following caveat: Reasonable efforts have been made to verify contributed results, including consulting with the contributors and inspection of configuration and raw benchmark result files. We have complete confidence in the integrity and competence of the contributors; however, we assume no responsibility for nor make any claims regarding the accuracy, veracity, or appropriateness for any purpose whatsoever of contributed results. Further, all results, whether contributed or generated by us, are for a single fixed-size MM5 case on the specific machines listed, and no claim is made of representativeness for other model scenarios, code versions, or hardware installations.

Press here for the newer Parallel MM5 2003 Benchmarks Page.
Press here for the older Parallel MM5 2001 Benchmarks Page.
Press here for the older Parallel MM5 2000 Benchmarks Page.


Click here to download the input data for this MM5 benchmark case.
For additional information on MM5 benchmarks, please click here.
For information on the MM5 distributed-memory parallel code, please click here.
Scroll down this page for additional explanation of the figures shown here.



Figure 1a. MM5 floating-point performance on various platforms. (Updated Feb. 20, 2002)

Figure 1b. MM5 floating-point performance on various platforms (zoomed). (Updated Feb. 20, 2002)


Figures 1a-b show performance results in Mflop/second and in simulated hours per wall-clock hour on a variety of platforms.

Timings for the Pittsburgh Supercomputing Center Terascale Computing System (PSC TCS) were conducted by J. Michalakes on October 20-21, 2001, with additional runs on October 26 and 29. Thanks to Ralph Roskies, Sergiu Sanielevici, Roberto Gomez, and others at PSC. The 6 TFlop/s peak TCS comprises 3000 1 GHz Compaq Alpha EV68 processors (750 ES45 nodes). The model was run straight MPI (no OpenMP), using MPI over shared memory for communication within nodes and MPI over Quadrics for communication between nodes. These are full-node timings (i.e., using all 4 CPUs on each node) on a dedicated system. Two rails of the Quadrics interconnect were used for the Feb. 2002 updated runs. (10/23/2001; updated 10/26/2001, 2/18/2002)

Timings for the IBM Power4 were conducted as single-threaded MPI tasks on 16 and 32 1.3 GHz CPUs of a Regatta-H node (the 16-CPU run used 8 Power4 chips with 2 CPU cores each, i.e., two 8-CPU MCMs). Color added to plot for clarity. Contributed. Thanks Jim Tuccillo, IBM. (10/26/2001; 10/29/2001)

The Origin3000 600 MHz timings were conducted using straight MPI on a system with 128 R14k CPUs. Contributed. Thanks Wesley Jones, SGI. (2/20/02)

The AlphaServerSC/667 timings were completed on an AlphaServer configuration at Compaq. Each SMP node in the cluster contains four 667 MHz EV67 processors. The model was run straight MPI (no OpenMP), using MPI over shared memory for communication within nodes and MPI over Quadrics for communication between nodes. Contributed. Thanks Steve Leventer, Compaq. Note: these timings have been updated since first being posted 5/19/00. They are now all full-node timings (i.e., 4 CPUs per node). (6/26/00)

The Fujitsu VPP5000 is a distributed-memory machine with vector processors, linked by a high-speed crossbar interconnect. The model was run using Fujitsu's implementation of MPI and with a one-dimensional data decomposition to preserve vector length in the I-dimension.
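
The benefit of the one-dimensional decomposition is easiest to see with a small sketch. The Python fragment below is purely illustrative (it is not MM5 source; the model's own parallel infrastructure handles the actual decomposition) and simply contrasts per-task patch shapes on the benchmark's 136 x 112 horizontal grid: splitting only the J dimension keeps the full 136-point I rows, and hence the full vector length, on every task.

    # Illustrative sketch only; not MM5 code.  Patch-shape arithmetic for the
    # benchmark's 136 (I, east-west) x 112 (J, north-south) horizontal grid.

    def patch_1d(ni, nj, ntasks):
        """Split only J: each task keeps full-length I rows, so the inner
        (vectorized) loops still run over all ni points."""
        return ni, -(-nj // ntasks)            # (I points, J rows per task)

    def patch_2d(ni, nj, px, py):
        """Split both I and J: the inner-loop (vector) length shrinks to ~ni/px."""
        return -(-ni // px), -(-nj // py)

    NI, NJ = 136, 112
    print("1-D, 8 tasks:", patch_1d(NI, NJ, 8))     # (136, 14) -> vector length 136
    print("2-D, 4 x 2  :", patch_2d(NI, NJ, 4, 2))  # (34, 56)  -> vector length 34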

The Origin3000 400 MHz timings were obtained using MPI message passing. Because the lines are bunching up, these results are shown only in the zoomed figure and in the table below. Contributed. Thanks Elizabeth Hayes, SGI. (3/14/01)

The IBM SP WH2 timings were obtained on the NCAR/SCD IBM Winterhawk-II machine (four 375 MHz Power3 CPUs per node). This is the post-upgrade blackforest.ucar.edu. The model was run using four MPI tasks per node (no OpenMP). Note: these numbers have been updated since 5/19/00; we reran using the MP_SHARED_MEMORY environment variable (improving the 256-CPU time by approximately 14 percent). These timings use 4 processors per node. (5/22/00)

The iJet HPTi Xeon timings were conducted on iJet, a Pentium/Linux cluster at the NOAA Forecast Systems Laboratory. Each node has dual 2.2 GHz Xeon processors. The model was run straight MPI over Myrinet (no OpenMP). The MM5 code was compiled using the Intel compiler. Contributed. Thanks Craig Tierney, HPTi. (11/2002)

The HPTi ACL/667 timings were conducted on Jet, an Alpha Linux cluster at the NOAA Forecast Systems Laboratory with a single 667 MHz Alpha processor per node. The model was run using MPI-over-Myrinet message passing between single-threaded processes, one per node. The code was compiled using Compaq compilers on Linux. Contributed. Thanks Greg Lindahl, HPTi. (3/10/00)

The Linux/Athlon ScaliMPI timings were conducted on a cluster at the Institute for Geophysics and Planetary Physics and the Earth Sciences Department at UC Santa Cruz. The cluster comprises 132 dual 1.4 GHz Athlon nodes (1 GB RAM per node) using ScaMPI over the Dolphin SCI interconnect. The compiler was Portland Group v3.2-4. Contributed. Thanks Ole W. Saastad. (9/03/2002; updated 11/30/2002)

The Pentium-III ScaliMPI timings were conducted on 16 dual 800 MHz Pentium-III nodes using the SCI interconnect and ScaMPI. Contributed. Thanks Ole W. Saastad. (6/14/2001)

The Cray T90 timings were obtained on the Cray T932 at the NOAA Geophysical Fluid Dynamics Laboratory. All timings were made in dedicated mode and represent elapsed time, except the one-processor timing, which was obtained in non-dedicated mode and represents CPU time. All runs were in shared-memory mode (Cray Microtasking, not MPI). The T90 runs were also used to determine the floating-point operation count on which the Mflop/second estimates are based. Color added to plot for clarity. (1/5/00)

All runs were of a 36-kilometer resolution domain over Europe; the grid consisted of 136 cells in the east/west dimension, 112 north/south, and 33 vertical layers (approximately 503,000 cells). The time step is 81 seconds. There is a link to the input data for this case at the top of this page. The operation count for this scenario is 2,398 million floating-point operations per average time step. I/O and model initialization were not included in the timings. All timing runs were performed at single (32-bit) floating-point precision except the T90. Scaling is calculated as the speedup divided by the factor of increase in the number of processors.
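
For reference, the sketch below restates these definitions in Python. Only the operation count (2,398 Mflop per average step), the 81-second time step, and the scaling definition come from this page; the per-step wall-clock time passed to the functions is a hypothetical input, not a measured value.

    # Sketch of how the plotted metrics are derived from a timing run.
    MFLOP_PER_STEP = 2398.0   # average floating-point operations per step (Mflop)
    DT_SECONDS     = 81.0     # model time step, in simulated seconds

    def mflops(sec_per_step):
        """Sustained Mflop/second; I/O and initialization excluded."""
        return MFLOP_PER_STEP / sec_per_step

    def sim_hours_per_hour(sec_per_step):
        """Simulated hours of forecast produced per wall-clock hour."""
        return DT_SECONDS / sec_per_step

    def scaling(mflops_small, mflops_large, cpus_small, cpus_large):
        """Speedup divided by the factor of increase in processor count."""
        return (mflops_large / mflops_small) / (cpus_large / cpus_small)

    # Example with the PSC TCS entries listed below (586 Mflop/sec on 1 CPU,
    # 105923 Mflop/sec on 512 CPUs): 105923/586/512 = 0.35, i.e. 35 percent.
    print(round(scaling(586.0, 105923.0, 1, 512), 2))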

The results were as follows:

- NEW PSC TCS (11/26/2001),   1 to 512 CPU (  586 to 105923 Mflop/sec), 35 percent
-   "       "       "     ,   1 to 256 CPU (  586 to  74346 Mflop/sec), 50 percent
-   "       "       "     ,   1 to 128 CPU (  586 to  49252 Mflop/sec), 66 percent
-   "       "       "     ,   1 to  64 CPU (  586 to  29119 Mflop/sec), 78 percent
-   "       "       "     ,   1 to  32 CPU (  586 to  16063 Mflop/sec), 86 percent
-   "       "       "     ,   1 to   4 CPU (  586 to   2038 Mflop/sec), 87 percent
- IBM Power4 1.3Ghz,  16 to 32 CPU (8248 to 13701 Mflop/sec), 83 percent
- SGI O3000 600 MHz, 16 to 32  CPU (5015 to 10302 Mflop/sec), 102 percent (1)
-  "    "    "   " , 16 to 64  CPU (5015 to 19668 Mflop/sec), 98 percent
-  "    "    "   " , 16 to 128 CPU (5015 to 35319 Mflop/sec), 88 percent
- Fujitsu VPP5000 (8/9/00), 1 to 40 CPU (2156 to 40638 Mflop/sec), 47 percent
-   "       "       "    , 1 to 20 CPU (2156 to 27815 Mflop/sec), 77 percent
-   "       "       "    , 1 to 10 CPU (2156 to 16650 Mflop/sec), 89 percent
- Compaq AlphaServerSC/667, 4 to 512 CPU (1255 to 45317 Mflop/sec), 28 percent
-   "        "        "   , 4 to 256 CPU (1255 to 32471 Mflop/sec), 40 percent
-   "        "        "   , 4 to 128 CPU (1255 to 22015 Mflop/sec), 55 percent
-   "        "        "   , 4 to  64 CPU (1255 to 11709 Mflop/sec), 58 percent
- iJet Xeon 2.2 GHz, Myrinet MPI, 2 to 256 CPU (596 to 39242 Mflop/sec), 51 percent
-   "   "        "     "      " , 2 to 128 CPU (596 to 26204 Mflop/sec), 69 percent
-   "   "        "     "      " , 2 to  64 CPU (596 to 14272 Mflop/sec), 74 percent
-   "   "        "     "      " , 2 to  32 CPU (596 to  8443 Mflop/sec), 86 percent
-   "   "        "     "      " , 2 to  16 CPU (596 to  4229 Mflop/sec), 88 percent
- HPTi ACL/667, Myrinet MPI, 1 to 128 CPU (330 to 22960 Mflop/sec), 54 percent
-   "    "   "     "     "   1 to  64 CPU (330 to 12680 Mflop/sec), 60 percent
-   "    "   "     "     "   1 to  32 CPU (330 to  7900 Mflop/sec), 75 percent
- Linux/Athlon 1.4 GHz Scali MPI, 2 to 400 CPU (472 to 31548 Mflop/sec), 33 percent
-   "    "   "  "   "   "     "   2 to 256 CPU (472 to 27559 Mflop/sec), 46 percent
-   "    "   "  "   "   "     "   2 to 128 CPU (472 to 17630 Mflop/sec), 58 percent
-   "    "   "  "   "   "     "   2 to  64 CPU (472 to 10335 Mflop/sec), 68 percent
-   "    "   "  "   "   "     "   2 to  32 CPU (472 to  5848 Mflop/sec), 77 percent
-   "    "   "  "   "   "     "   2 to  16 CPU (472 to  3382 Mflop/sec), 90 percent
- SGI O2000 300 MHz, 1 to 120 CPU (158 to 15080 Mflop/sec), 80 percent
- IBM WH2, 4 to 256 CPU (674 to 24219 Mflop/sec), 55 percent
-  "   " , 4 to 128 CPU (674 to 16767 Mflop/sec), 77 percent
-  "   " , 4 to  64 CPU (674 to  9082 Mflop/sec), 83 percent
- Pentium-III ScaliMPI (6/14/2001), 4 to 32 CPU (950 to 3460 Mflop/sec), 91 percent
- Cray T90, 1 to 20 CPU (569 to 8472 Mflop/sec), 74 percent

Notes

(1) Linear or superlinear scaling. This does not necessarily mean superior scalability; rather, it may indicate that the runs on smaller numbers of processors were slowed by memory/cache effects.


John Michalakes, michalak@ucar.edu
---
Updated: November 30, 2002. Added iJet; revised 1.4 GHz Athlon performance.
Updated: September 3, 2002. Added 1.4 GHz Athlon performance.
Updated: February 20, 2002. Updated SGI Origin 3000 600 MHz R14k numbers
Updated: February 18, 2002. Updated PSC numbers (reran benchmarks with 2 Quadrics rails)
Updated: November 26, 2001. Updated PSC numbers (reran benchmarks on improved machine)
Updated: October 29, 2001. Added 32-cpu Power4; updated PSC TCS-II and added 32-cpu time
Updated: October 26, 2001. Added 16-cpu Power4; updated PSC TCS-II and added 16-cpu time
Updated: October 23, 2001. Added PSC TCS-II and Pentium-III Scali-MPI