  4DVAR minutes 062907
Added by huangx, last edited by huangx on Jul 31, 2007

Reports from Hans, Tom, Xin, Yongsheng, Jimy and Xiaoyan were compiled as the "meeting memo" for June 2007.

Progress review

Hans presented the status of WRF 4D-Var at 1) NCEP/JCSDA, 2) GMAO, 3) the WRF users' workshop in Boulder and 4) the AMS NWP/WAF conference in Utah.

Optimization

Code optimization

We are at the point where inter-process communication due to halo updates in solve_em_ad() is pacing the performance of the adjoint. To a lesser extent, I/O also limits performance, especially in the larger cases. In the 91x73x17 Haitang case, halo update time is at least as large as the time spent computing when 16 MPI tasks are used. In the 151x145x38 Shanshan case, halo update time is twice the time spent computing when 192 tasks are used.

Here is the net effect of all of Tom's optimizations on the run time of a single execution of the adjoint for the serial 31x25x17 Haitang case. Adjoint output files written by each case are bitwise-identical. Two cases were run: "Old" is a case without any of the optimizations; "New" includes all of Tom's optimizations. Each case was run five times to illustrate run-to-run variability in the measured times. Times below are computed via several methods:

"LSF" = Total time for job measured by LSF. This includes I/O time and LSF overhead.
"main" = Time measured by WRF "main" timer
"I/O" = Sum of times measured by the following WRF timers:
"processing lateral boundary for domain"
"Writing filter output for domain"
"med_auxinput2_in4ad"
"med_auxinput3_in 1"
"med_auxinput3_in 2"
"med_auxinput3_in 3"
"solve_em_ad_time" = Time measured for solve_em_ad.

=============================================================================
Summary for old (r2539) code, serial, no memory vs. compute optimizations enabled:
LSF | main | I/O | solve_em_ad
------------------------------------
27 | 24.8 | 2.0 | 22.71
28 | 24.5 | 1.7 | 22.72
27 | 24.3 | 1.3 | 22.69
25 | 23.6 | 0.7 | 22.68
27 | 24.6 | 1.7 | 22.72
=============================================================================

=============================================================================
Summary for new code, serial, all memory vs. compute optimizations enabled:
LSF | "main" | "I/O" | "solve_em_ad"
---------+---------------
19 | 16.8 | 1.6 | 14.96
27 | 18.5 | 3.7 | 14.95
26 | 18.0 | 2.9 | 15.02
20 | 17.5 | 2.6 | 14.99
18 | 16.1 | 0.9 | 14.89
=============================================================================

Note that there is considerable run-to-run variability in times that include I/O.

Comparing the minima of these times, time measured by LSF has sped up by:
1 - (18/25) = 28%
Time for "main" has sped up by:
1 - (16.1/23.6) = 32%
Time for "solve_em_ad" alone has sped up by:
1 - (14.9/22.7) = 34%
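
As a cross-check (added here for illustration, not part of the original measurements), these speedups can be reproduced directly from the best-of-five times in the two tables above. A small Python sketch, using only the numbers already listed:

# Reproduce the best-of-five speedups quoted above.
# All values are copied from the old (r2539) and new timing tables.
old = {"LSF": [27, 28, 27, 25, 27],
       "main": [24.8, 24.5, 24.3, 23.6, 24.6],
       "solve_em_ad": [22.71, 22.72, 22.69, 22.68, 22.72]}
new = {"LSF": [19, 27, 26, 20, 18],
       "main": [16.8, 18.5, 18.0, 17.5, 16.1],
       "solve_em_ad": [14.96, 14.95, 15.02, 14.99, 14.89]}
for timer in ("LSF", "main", "solve_em_ad"):
    speedup = 1.0 - min(new[timer]) / min(old[timer])
    print(f"{timer}: sped up by {speedup:.0%}")   # -> 28%, 32%, 34%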

However, comparing the best "r2539" run with the worst new run, the time measured by LSF has actually slowed down! So the speedup due to the optimizations of solve_em_ad can be masked by run-to-run variability in individual runs.

Also note that the ratio of "I/O" time to "main" time ranges from 3% to 8% for the old code and from 6% to 20% for the new code. I/O time is very unpredictable since it depends on contention with other jobs for limited I/O resources. We would see even higher values for both cases if we ran more tests.

For parallel runs the benefits of these optimizations to solve_em_ad are reduced even further, since a large fraction of run time is spent in halo updates. I/O also consumes a larger fraction of run time.

So, the computation vs. memory trade-off has reached a point of diminishing returns. Tom is starting to tackle inter-process communication now.

First, finish the static load balance optimization. This may help some with a 16-task configuration, but will not help much with very large task counts.

Then resume the optimization of the halo updates. Thus far Tom has optimized 10 of them, removing 5 and shrinking 5 others (by eliminating arrays that do not need to be updated and by reducing halo thicknesses); 107 halo updates remain. This work must be done in a careful step-wise manner, with tests after each change to validate the correctness of the dependence analysis. It is also difficult to predict in advance how effective the optimizations will be, due to the complexity of the adjoint. At best we might see a 10-20% improvement for the 91x73x17 Haitang case on 16 tasks, probably more on 32 tasks. The 151x145x38 Shanshan case would probably see more improvement, though I/O will begin to dominate as halo update costs are reduced. Completing the halo optimization work will require 3-6 weeks.
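
To illustrate why halo costs grow relative to computation as the task count increases, here is a back-of-the-envelope Python sketch (not a measurement; the halo width of 1 and the near-square task layout are assumptions made only for this estimate). It compares the ratio of halo points to computed points per task for the two parallel cases mentioned above:

import math

# Rough estimate of the halo-to-compute point ratio per MPI task for a 2-D
# domain decomposition. Halo width and task layout are assumed values.
def halo_to_compute(nx, ny, ntasks, halo_width=1):
    px = int(math.sqrt(ntasks))               # near-square factorization of ntasks
    while ntasks % px:
        px -= 1
    py = ntasks // px
    sub_x, sub_y = nx / px, ny / py            # per-task subdomain dimensions
    compute = sub_x * sub_y                    # points computed per task
    halo = 2 * halo_width * (sub_x + sub_y)    # points exchanged per task
    return halo / compute

print(halo_to_compute(91, 73, 16))       # Haitang case, 16 tasks   -> ~0.20
print(halo_to_compute(151, 145, 192))    # Shanshan case, 192 tasks -> ~0.38

The ratio roughly doubles between the two configurations, which is consistent with the measured trend (halo time comparable to compute time at 16 tasks, about twice it at 192 tasks), even though the actual per-point costs of communication and computation differ.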

For the 151x145x38 case, I/O takes 30% of the adjoint run time when 192 tasks are used. (Of course this number may vary greatly from one run to the next.) If the halo update optimizations are successful, this fraction will increase. A short-term approach would be to try parallel netCDF; however, we would still be at the mercy of other jobs contending for I/O resources. A long-term approach would be to avoid disk I/O altogether via a coupling mechanism such as MCEL (possible now) or ESMF (contingent on ESMF support for staggered grids and multiple executables). Neither of these approaches can be done quickly.
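
A rough illustration of why the I/O fraction grows as the rest of the adjoint gets faster (the speedup factors below are purely hypothetical; only the 30% starting fraction comes from the measurement above):

# Hypothetical projection: the I/O share of adjoint run time as the non-I/O
# part (compute + halo updates) is sped up. Speedup factors are assumptions.
io_fraction = 0.30                       # measured for the 151x145x38 case
for non_io_speedup in (1.0, 1.25, 1.5, 2.0):
    non_io = (1.0 - io_fraction) / non_io_speedup
    new_fraction = io_fraction / (io_fraction + non_io)
    print(f"non-I/O sped up {non_io_speedup}x -> I/O is {new_fraction:.0%} of run time")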

Most of the "easy" optimizations are done. We are not seeing much benefit from them in the parallel runs because of the expense of the halo updates and because of run-to-run variability, which is largely due to I/O.

Tom has committed all of his optimization-related code back to the 4dvar branch. He will continue the optimization work on his own branch. Xin, Yongsheng and Hans have started to test the optimized code.

Multi-incremental formulation

Xin spent two weeks at MMM working with Hans, Yongsheng and Tom. Programming, testing and bug fixing of the single-processor multi-incremental 4D-Var are finished. Programming of the parallel multi-incremental 4D-Var is also complete; it is now in the testing and verification stage.

4D-Var Met test

Yongsheng conducted Typhoon Shanshan OSSE experiments to compare the performance of 4D-Var with that of 3D-Var.

We are starting to see problems as the number of model grid points and levels and the resolution increase (from 51x49x38 at 108 km to 151x145x38 at 36 km). Testing the optimized code from Tom Henderson revealed a performance bottleneck due to I/O.

For the first time, Yongsheng and Xin tested the multi-incremental formulation with the OSSE configuration.

Hans and Yongsheng started running verification on the Typhoon Haitang experiments, which were re-run with the latest code.

Develop and test the control of lateral boundary condition (Jbdy) perturbations in WRF-Var.

The lateral boundary noise problem was traced to an incompatibility between WRF and WRFPLUS.

Include additional physics in the WRF linear and adjoint models, and test in 4D-Var.

The DU cumulus scheme is a simplified KF scheme.

Forward code testing

a) We designed a group of experiments to evaluate the impact of the DU cumulus code in the WRF V2.2 model: no cumulus physics, KF, BMJ, GD and DU. We used the Typhoon Haitang case at 30-km horizontal resolution in these experiments. The 48-h track forecasts show that DU is closest to KF. The cumulus precipitation and the 48-h total precipitation show very similar patterns and similar maximum values for DU and KF.

b) We finished porting the DU cumulus code from WRF V2.2 into the nonlinear code of WRFPLUS and tested the WRFPLUS nonlinear code with the DU cumulus scheme. We compared the cumulus precipitation forecasts produced by the WRFPLUS nonlinear code with the DU and KF cumulus schemes; the precipitation distributions are similar in the two experiments.

TL & AD code testing

We are now developing the TL & AD code for the DU cumulus scheme.

AOB

The next meeting: 10 am 7/31/07, room 2072.
