GPU Acceleration of WSM5 Microphysics


John Michalakes, National Center for Atmospheric Research

Manish Vachharajani, University of Colorado at Boulder


Introduction

This page contains the most recent results updating the work published in the proceedings of the workshop on Large Scale Parallel Processing (LSPP) within the IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2008, Miami, Florida (see References, below).

Recent results

The figure below is an update of the figure that appears in the LSPP proceedings and in the Parallel Processing Letters (PPL) paper. It shows a near doubling of performance for the NVIDIA Quadro 5600 GPU over the results presented in those papers, obtained with a code optimization explained below, as well as results for the newer NVIDIA GeForce GTX 280 GPU on one of the nodes of the University of Illinois' GPU cluster, qp.ncsa.uiuc.edu. The figure compares performance of the WSM5 microphysics kernel on the GPUs and on several conventional CPUs. The first four bars from the left show performance of the kernel running on a single core. The next two bars show performance on all four cores, that is, socket performance. The last two bars show performance on the GPU itself and then effective performance once CPU-GPU transfer cost is accounted for. The Harpertown and Nehalem (workstation version) results were contributed by Roman Dubtsov of Intel Corp. in December 2008.

The GTX 280 achieves nearly 65 Gf/s on the optimized WSM5 microphysics kernel, but it is also clear that a much larger fraction of performance is now being lost to data transfer overhead. Much of this cost should be amortizable as more weather model kernels are adapted to run on the GPU and reuse the model state data there without moving it back and forth from the CPU. The newest results for this kernel (since December 2008) show that conventional multi-core processors are closing the gap with GPU performance, or have already closed it if transfer costs are included.
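
As a rough illustration of the amortization argument, the host-side sketch below copies the model state to the GPU once, runs two physics kernels on the resident data, and copies the result back only at the end, so a single pair of transfers is shared by all kernels. This sketch is not taken from the WRF or WSM5 source; the kernel names, state layout, and launch configuration are placeholders.

    #include <cuda_runtime.h>

    /* Placeholder kernels standing in for WSM5 and a second physics package. */
    __global__ void wsm5_kernel(float *state, int n)          { /* microphysics on resident data */ }
    __global__ void other_physics_kernel(float *state, int n) { /* another physics package       */ }

    void physics_step(float *host_state, int n)
    {
        float *dev_state;
        size_t bytes = (size_t)n * sizeof(float);

        cudaMalloc(&dev_state, bytes);
        cudaMemcpy(dev_state, host_state, bytes, cudaMemcpyHostToDevice);   /* one copy in  */

        /* Both kernels reuse the state already resident on the GPU; there is no
           intermediate CPU-GPU traffic between physics packages. */
        wsm5_kernel<<<(n + 63) / 64, 64>>>(dev_state, n);
        other_physics_kernel<<<(n + 63) / 64, 64>>>(dev_state, n);

        cudaMemcpy(host_state, dev_state, bytes, cudaMemcpyDeviceToHost);   /* one copy out */
        cudaFree(dev_state);
    }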

Two optimizations to the WSM5 microphysics kernel have provided significant performance improvements since the first results presented in April 2008 at the LSPP workshop. The first was to use the -use_fast_math option to the nvcc compiler, which causes higher-level mathematical intrinsic operations such as square root, log, and exponent to be computed in hardware, with modestly reduced precision, in the GPU's Special Function Units (SFUs). The WSM5 kernel, which uses these intrinsics heavily, ran about 1.25 times faster with -use_fast_math in the Quadro 5600 results presented earlier.
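
As an illustration (the compile line and the device function below are sketches, not the actual build command or WSM5 code, and the saturation formula is only an example of the kind of expression involved), -use_fast_math is simply added to the nvcc invocation and causes single-precision calls such as expf(), logf(), and sqrtf() to compile to the reduced-precision SFU intrinsics:

    /* Compiled with something like (file name and other flags are placeholders):
           nvcc -O3 -use_fast_math -c wsm5_gpu.cu -o wsm5_gpu.cu.o                */
    __global__ void saturation_vapor_pressure(const float *t, float *es, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            /* With -use_fast_math, expf() compiles to the reduced-precision SFU
               intrinsic __expf(); the same effect can be obtained selectively by
               calling the intrinsic directly. */
            es[i] = 611.2f * expf(17.67f * (t[i] - 273.15f) / (t[i] - 29.65f));
        }
    }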

The second optimization involved rewriting the kernel to eliminate temporary arrays used to store results between successive loops over the vertical dimension of the WRF domain. Since there were no loop-carried dependences across these vertical loops, they could be jammed into a single large loop, and the results that had been held in the GPU's slow DRAM in temporary arrays could be kept in registers instead. A side effect of this restructuring was increased per-thread register usage, which limits the number of threads per block. Performance on the older Quadro 5600 improved by a factor of 1.67, while the newer GTX 280, which has twice as many registers as the older GPU, achieved a significantly greater improvement from this restructuring.
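
The schematic sketch below illustrates the restructuring; the arrays, stand-in functions, and column indexing are placeholders, not the actual WSM5 code. In the "before" kernel the two vertical loops communicate through a per-thread temporary array, which the compiler typically places in slow local (DRAM) memory; in the jammed "after" kernel the intermediate value lives in a register, at the cost of higher per-thread register pressure.

    #define KMAX 64                                    /* vertical levels (illustrative) */

    __device__ float f(float q, float t) { return q * t; }     /* stand-in computation */
    __device__ float g(float x)          { return 0.5f * x; }  /* stand-in computation */

    /* Before: successive k-loops pass results through a per-thread temporary array. */
    __global__ void column_before(float *q, const float *t, int kte)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;    /* one thread per column */
        float tmp[KMAX];                                    /* typically local (DRAM) memory */
        for (int k = 0; k < kte; k++)
            tmp[k] = f(q[col * KMAX + k], t[col * KMAX + k]);
        for (int k = 0; k < kte; k++)
            q[col * KMAX + k] += g(tmp[k]);
    }

    /* After: with no loop-carried dependence the loops are jammed, and the
       intermediate value is held in a register for one k iteration. */
    __global__ void column_after(float *q, const float *t, int kte)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        for (int k = 0; k < kte; k++) {
            float r = f(q[col * KMAX + k], t[col * KMAX + k]);  /* register, not DRAM */
            q[col * KMAX + k] += g(r);
        }
    }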

GPU WSM5 Code for Standalone Use and in the WRF Model

The code and data for the CUDA implementation of the WSM5 microphysics is available in a standalone test configuration that can be downloaded from this link. There are README and SAMPLE_SESSION files included.

New (March 2010): A WRFv3.2 version of the WSM5 microphysics may be downloaded from this link. It was developed and contributed by Ahmed Saeed, Emad Elwany, Emad Joseph, Kareem Abdelsalam, Pakinam Yousry, and Samia Hafez, senior undergraduate students advised by Prof. Dr. Mohamed Abougabal, Dr. Ahmed El-Mahdy, and Dr. Layla Abouhadid at the Faculty of Engineering, Alexandria University, Alexandria, Egypt.

The GPU WSM5 microphysics that comes with the standalone configuration may also be used when compiling the full WRF model (version 3.0.1 and later) with GPU-accelerated WSM5 microphysics; a sketch of the resulting configure.wrf additions appears after the steps below:

  1. Edit the makefile in the standalone driver source directory and set MKX to the correct number of levels in the WRF configuration you will be running. Also set XXX and YYY to the correct number of threads per block for your GPU (see the comment in the makefile).
  2. Type ‘make’ in the standalone driver directory. Copy the resulting object files wsm5.cu.o and wsm5_gpu.cu.o into the phys directory of the WRF model.
  3. In the top-level WRF directory, run the configure script and generate a configure.wrf file for your system and configuration (this must be Linux with Intel or GNU C, C++, and Fortran compilers, and you must have CUDA 1.1 or higher installed).
  4. Add -DRUN_ON_GPU to ARCH_LOCAL in configure.wrf.
  5. Add ../phys/wsm5.cu.o and ../phys/wsm5_gpu.cu.o to LIB_LOCAL in configure.wrf (define LIB_LOCAL if it does not already exist).
  6. If configuring WRF for serial or sm, you also need to include a definition for rsl_internal_microclock. A change is also needed in main/module_wrf_top.F (version 3.2) for serial compiles. See the attached note.
  7. Add -L/usr/local/cuda/lib -lcuda -lcudart to LIB_LOCAL (or the appropriate path for the CUDA libraries on your system; use lib64 instead of lib on 64-bit Linux).
  8. Compile and run WRF (see www2.mmm.ucar.edu/wrf/users for other information).
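
For reference, after steps 4, 5, and 7 the configure.wrf additions look something like the sketch below; this is an illustration only, since the existing contents of ARCH_LOCAL and the CUDA install path will differ on your system.

    # Sketch only: "<existing flags>" stands for whatever the configure script
    # already placed in ARCH_LOCAL, and the CUDA path is an assumption
    # (use lib64 instead of lib on 64-bit Linux).
    ARCH_LOCAL  =  <existing flags> -DRUN_ON_GPU
    LIB_LOCAL   =  ../phys/wsm5.cu.o ../phys/wsm5_gpu.cu.o \
                   -L/usr/local/cuda/lib -lcuda -lcudart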

To run, make sure that WSM5 physics is selected in the namelist.input file (mp_physics option 4) and then run on a GPU-enabled node or nodes. You can use the GPU WSM5 option with WRF when it is configured serial or dmpar (distributed-memory parallel). If you have compiled WRF with the dmpar option, the code assumes that there is one GPU device available per MPI task. The GPU WSM5 microphysics option is not currently supported for the smpar or dm+sm (OpenMP or hybrid OpenMP and MPI) configurations of WRF.
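
Selecting WSM5 in namelist.input looks something like the excerpt below; only the relevant line of the &physics group is shown, and the per-domain values are illustrative.

    &physics
     mp_physics = 4, 4, 4,      ! option 4 selects WSM5; one entry per domain
    /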

References

·  Michalakes, J. and M. Vachharajani, “GPU Acceleration of Numerical Weather Prediction,” Parallel Processing Letters, Vol. 18, No. 4, World Scientific, Dec. 2008, pp. 531-548. http://www.worldscinet.com/ppl


Last updated March 29, 2010.  michalak@ucar.edu