
WW3 scalability to meet operational runtime requirements #775

Closed · MatthewMasarik-NOAA opened this issue Aug 30, 2022 · 2 comments
Labels: enhancement (New feature or request)

Comments

@MatthewMasarik-NOAA (Collaborator)

Is your feature request related to a problem? Please describe.

  • Current configurations of WW3 being considered for implementation in GFSv17/GEFSv13 require fixes that reduce overall runtime.

Describe the solution you'd like

  • Solutions are being sought that increase code performance by addressing bottlenecks and slow regions.
  • Changes to the MPI decomposition are not being requested at this time; otherwise, all suggestions for edits, refactoring, and threading fixes will be considered.
  • Any suggestion must pass all WW3 regtests without changing answers.

Describe alternatives you've considered

  • An update to the MPI parallelization. This would be a major undertaking, however, and has not been scheduled. Solutions outside of the MPI decomposition are needed in the short term to meet operational runtime needs.

Additional context

  • Profiling of the code has pointed to a few routines accounting for a significant share of the runtime. A major bottleneck in a subroutine called by w3wave (w3nmin) has been identified by @GeorgeVandenberghe-NOAA, @DeniseWorthen, and @mvertens, and is being worked on. Fixes for other potentially poorly performing areas are being solicited.
@MatthewMasarik-NOAA MatthewMasarik-NOAA added the enhancement New feature or request label Aug 30, 2022
@MatthewMasarik-NOAA (Collaborator, Author)

Progress update

Initial profiling

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 31.61    342.43   342.43                w3parall_mp_init_get_jsea_isproc_
 27.24    637.50   295.07                w3wavemd_mp_w3wave_
 10.44    750.64   113.14                w3src4md_mp_w3sds4_
  9.24    850.72   100.08                w3pro3md_mp_w3xyp3_
  5.60    911.42    60.70                w3uqckmd_mp_w3qck3_
  4.66    961.94    50.52                w3snl1md_mp_w3snl1_
  • The initial profiling above shows that the subroutines init_get_jsea_isproc and w3wave account for the most significant portion of runtime. This can be traced to a single bottleneck in w3nmin: w3wave calls w3nmin, which contains a nested loop whose body calls init_get_jsea_isproc on every iteration. The recently merged PR (Feat/w3wave scaling #784) is likely the main fix for this issue; it effectively removes the bottleneck in w3nmin identified in profiling. A sketch of the general pattern follows.
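
For illustration only, here is a minimal, self-contained sketch of that hoisting pattern. The loop bounds (NITER, NSEA, NAPROC) and the stand-in mapping routine are invented; only the idea of the ISEA-to-(JSEA, ISPROC) mapping comes from the profile above, and PR #784 should be consulted for the actual change:

```fortran
! Hypothetical sketch, not WW3 source: re-deriving the (JSEA, ISPROC)
! mapping inside a hot nested loop costs one call per point per iteration;
! hoisting the mapping into lookup tables reduces it to one call per point.
PROGRAM HOIST_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: NSEA = 100000, NAPROC = 8, NITER = 500
  INTEGER :: ISEA, ITER, JSEA, ISPROC
  INTEGER :: JSEA_TAB(NSEA), ISPROC_TAB(NSEA)
  INTEGER(8) :: ACC

  ! Hoisted: compute the mapping once, outside the hot loop.
  DO ISEA = 1, NSEA
    CALL GET_JSEA_ISPROC ( ISEA, JSEA_TAB(ISEA), ISPROC_TAB(ISEA) )
  END DO

  ACC = 0
  DO ITER = 1, NITER              ! hot nested loop (w3nmin-style)
    DO ISEA = 1, NSEA
      JSEA   = JSEA_TAB(ISEA)     ! O(1) table read replaces the call
      ISPROC = ISPROC_TAB(ISEA)
      ACC    = ACC + JSEA + ISPROC
    END DO
  END DO
  PRINT *, 'checksum:', ACC

CONTAINS

  ! Invented stand-in for the real init_get_jsea_isproc in w3parall:
  ! a round-robin ("card deck") distribution of sea points over NAPROC.
  SUBROUTINE GET_JSEA_ISPROC ( ISEA, JSEA, ISPROC )
    INTEGER, INTENT(IN)  :: ISEA
    INTEGER, INTENT(OUT) :: JSEA, ISPROC
    JSEA   = 1 + (ISEA-1)/NAPROC
    ISPROC = 1 + MOD(ISEA-1, NAPROC)
  END SUBROUTINE GET_JSEA_ISPROC

END PROGRAM HOIST_DEMO
```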

Subsequent profiling

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  9.79     41.32    41.32   229549    0.00     0.00  w3src4md_mp_w3sds4_
  8.72     78.12    36.80     480     0.08     0.08  w3pro3md_mp_w3xyp3_
  5.11     99.67    21.55   229549    0.00     0.00  w3snl1md_mp_w3snl1_
  • Profiling after the fix was included is shown above. From this it's clear the bottleneck has been removed. The next three subroutines identified in profiling now account for the largest remaining share of runtime. Each of these subroutines was investigated for potential further performance gains from adding OpenMP statements. The status of that work:
    • w3src4md: w3sds4() - a first look determined there was little to no opportunity for optimization.
    • w3pro3md: w3xyp3() - a first attempt suggested a small amount of optimization might be possible by placing OMP statements around inner loops. Further reading suggested that for nested loops one should (in general) first try applying OMP parallelization to the outer loops, and only as a potential second step descend into the inner ones (see the sketch after this list). On reanalyzing the loops, there were a number of serial inner loops under the outer loop in question, so this did not seem like a straightforward opportunity for optimization. Based on this, w3xyp3() was also put aside.
    • w3snl1md: w3snl1() - there were a couple of small loops where adding OMP directives was an option. A fix was made and tests are currently running to check for any improvements. If there are any, they will most likely be quite small. Once the runs for this subroutine fix return, this issue will be closed, since every subroutine that stood out in profiling will have been investigated.
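
For reference, a minimal sketch of the outer-versus-inner guidance mentioned above, assuming invented array names and sizes (this is not WW3 source):

```fortran
! Hypothetical sketch, not WW3 source: parallelize the outermost loop of a
! nest so each thread receives a large independent chunk of sea points;
! the small inner (spectral) loops stay serial within each thread.
PROGRAM OMP_OUTER_DEMO
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: NSEA = 2000, NSPEC = 600
  REAL    :: A(NSPEC, NSEA), S(NSEA)
  INTEGER :: ISEA, ISP

  A = 1.0

  !$OMP PARALLEL DO PRIVATE(ISEA, ISP)
  DO ISEA = 1, NSEA              ! outer loop: one directive, coarse grain
    S(ISEA) = 0.0
    DO ISP = 1, NSPEC            ! inner loop: serial per thread, no directive
      S(ISEA) = S(ISEA) + A(ISP, ISEA)
    END DO
  END DO
  !$OMP END PARALLEL DO

  PRINT *, 'sum =', SUM(S), '  max threads =', OMP_GET_MAX_THREADS()
END PROGRAM OMP_OUTER_DEMO
```

Built with OpenMP enabled (e.g. gfortran -fopenmp), the single outer directive pays its fork/join cost once per parallel region; directives on the short inner loops instead would pay that overhead once per outer iteration, which is consistent with the small or absent gains observed when threading small loops like those in w3snl1().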

@MatthewMasarik-NOAA (Collaborator, Author)

Follow up - w3snl1md: w3snl1()

Comparing runtimes with and without the OMP edits to w3snl1() shows that the OMP code added to this routine does not improve performance. This was the final subroutine identified in profiling as a candidate for improvement. Given that every routine flagged by profiling has now been investigated, this issue will be closed.
