GPU Acceleration of WSM5 Microphysics
John Michalakes, National Center for Atmospheric Research
Manish Vachharajani, University of Colorado at Boulder
This page contains the most recent results, updating the work published in the proceedings of the Workshop on Large-Scale Parallel Processing (LSPP) within the IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2008, Miami, Florida (see References, below).
The figure below updates the figure that appears in the LSPP proceedings and the PPL paper. It shows a near doubling of performance over the results presented in those papers for the NVIDIA Quadro 5600 GPU, achieved with a code optimization explained below, and also shows results for the newer NVIDIA GeForce GTX 280 GPU on one of the nodes of the University of Illinois' GPU cluster, qp.ncsa.uiuc.edu. The figure compares performance of the WSM5 microphysics kernel on the GPU and on several conventional CPUs. The first four bars from the left show performance of the kernel running on a single core. The next two bars show performance on all four cores; in other words, socket performance. The last two bars show performance on the GPU itself, and then effective performance once CPU-GPU transfer cost is accounted for. The Harpertown and Nehalem (workstation version) results were contributed by Roman Dubtsov of Intel Corp. in December 2008.
The GTX 280 achieves nearly 65 Gf/s on the optimized WSM5 microphysics kernel, but it is also clear that a much larger fraction of performance is being lost to data-transfer overhead. Much of this cost should be amortizable as more weather-model kernels are adapted to run on the GPU and reuse the model state data there, without moving it back and forth from the CPU. The newest results (since December 2008) for this kernel show that conventional multi-core processors are closing the gap with GPU performance (or have already closed it, if transfer costs are included).
Two optimizations to the WSM5 microphysics kernel have provided significant performance improvements since the first results presented in April 2008 at the LSPP workshop. The first optimization was to use the -use_fast_math option to the nvcc compiler, which causes higher-level mathematical intrinsic operations such as square root, log, and exponent to be computed in hardware, with modestly reduced precision, in the Special Function Units (SFUs) on the GPU. The WSM5 kernel, which uses these intrinsics heavily, ran about 1.25 times faster with -use_fast_math in the Quadro 5600 results presented earlier.
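The flag is passed at compile time. A typical invocation might look like the following sketch (the file names and the -arch setting are illustrative, not taken from the WSM5 build):

```
# Baseline: standard-precision device math functions
nvcc -O3 -arch=sm_13 -c wsm5_kernel.cu -o wsm5_kernel.o

# Fast math: calls such as expf/logf/sqrtf compile to reduced-precision
# SFU intrinsics (__expf, __logf, ...) with much higher throughput
nvcc -O3 -arch=sm_13 -use_fast_math -c wsm5_kernel.cu -o wsm5_kernel.o
```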
The second optimization involved rewriting the kernel to eliminate temporary arrays used to store results between successive loops over the vertical dimension of the WRF domain. Since there were no loop-carried dependencies over these vertical-dimension loops, they could be jammed into a single large loop. The results that had been stored in these temporary arrays in the GPU's slow DRAM memory could then be held in registers instead. A side effect of this restructuring was increased per-thread register usage, limiting the number of threads per block. Performance on the older Quadro 5600 improved by a factor of 1.67, while the newer GTX 280, which has twice as many registers as the older GPU, achieved a significantly greater improvement from this restructuring.
The code and data for the CUDA implementation of the WSM5 microphysics are available in a standalone test configuration that can be downloaded from this link. README and SAMPLE_SESSION files are included.
New (March, 2010): WRFv3.2 version of WSM5 microphysics may be downloaded from this link. Developed and contributed by Ahmed Saeed, Emad Elwany, Emad Joseph, Kareem Abdelsalam, Pakinam Yousry and Samia Hafez, senior undergraduate students advised by Prof. Dr. Mohamed Abougabal, Dr. Ahmed El-Mahdy and Dr. Layla Abouhadid at Faculty of Engineering, Alexandria University, Alexandria, Egypt.
The GPU WSM5 microphysics that comes with the standalone configuration may also be used when compiling the full WRF model (version 3.0.1 and later) for GPU-accelerated WSM5 microphysics.
To run, make sure that WSM5 physics is selected in the namelist.input file (mp_physics option 4) and then run on a GPU-enabled node or nodes. The GPU WSM5 option can be used when WRF is configured serial or dmpar (distributed-memory parallel). If WRF is compiled with the dmpar option, the code assumes one GPU device is available per MPI task. The GPU WSM5 microphysics option is not currently supported for the smpar or dm+sm (OpenMP or hybrid OpenMP & MPI) configurations of WRF.
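The relevant namelist.input setting can be sketched as a minimal fragment (other &physics entries and the per-domain column count are omitted here; your namelist will contain many more entries):

```
&physics
 mp_physics = 4, 4, 4,   ! 4 selects WSM5 microphysics, one value per domain
/
```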
References
· Michalakes, J. and M. Vachharajani, "GPU Acceleration of Numerical Weather Prediction," Parallel Processing Letters, Vol. 18, No. 4, World Scientific, Dec. 2008, pp. 531-548, http://www.worldscinet.com/ppl
Last updated March 29, 2010. Contact: michalak@ucar.edu