
WW3 scalability to meet operational runtime requirements #775

Closed · MatthewMasarik-NOAA opened this issue Aug 30, 2022 · 2 comments
Labels: enhancement (New feature or request)

Comments

@MatthewMasarik-NOAA (Collaborator)

Is your feature request related to a problem? Please describe.

  • Current configurations of WW3 being considered for implementation in GFSv17/GEFSv13 require fixes that reduce overall runtime.

Describe the solution you'd like

  • Solutions are being sought that increase code performance by addressing bottlenecks and slow regions.
  • Changes to the MPI decomposition are not being requested at this time; otherwise, all suggestions for edits, refactoring, and threading fixes will be considered.
  • Any suggestion must pass all WW3 regtests without changing answers.

Describe alternatives you've considered

  • An update to the MPI parallelization. This would be a major undertaking, however, and has not been scheduled. Solutions outside of the MPI decomposition are needed in the short term to meet operational runtime needs.

Additional context

  • Profiling of the code has pointed to a few routines accounting for a significant share of the runtime. A major bottleneck in a subroutine called by w3wave (w3nmin) has been identified by @GeorgeVandenberghe-NOAA, @DeniseWorthen, and @mvertens, and is being worked on. Fixes for other potentially poorly performing areas are being solicited.
@MatthewMasarik-NOAA MatthewMasarik-NOAA added the enhancement New feature or request label Aug 30, 2022
@MatthewMasarik-NOAA (Collaborator, Author)

Progress update

Initial profiling

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 31.61    342.43   342.43                w3parall_mp_init_get_jsea_isproc_
 27.24    637.50   295.07                w3wavemd_mp_w3wave_
 10.44    750.64   113.14                w3src4md_mp_w3sds4_
  9.24    850.72   100.08                w3pro3md_mp_w3xyp3_
  5.60    911.42    60.70                w3uqckmd_mp_w3qck3_
  4.66    961.94    50.52                w3snl1md_mp_w3snl1_
  • The initial profiling above shows that the subroutines init_get_jsea_isproc and w3wave account for the most significant portion of runtime. This can be traced to a single bottleneck in w3nmin: w3wave calls w3nmin, which contains a nested loop whose body calls init_get_jsea_isproc on every iteration. The recently merged PR (Feat/w3wave scaling #784) is likely the main fix for this issue; it effectively removes the bottleneck in w3nmin identified in profiling. A sketch of the general pattern follows.
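
For illustration only, here is a minimal, self-contained sketch of that hoisting pattern. The loop bounds (NITER, NSEA, NAPROC) and the stand-in mapping routine are invented; only the idea of the ISEA-to-(JSEA, ISPROC) mapping comes from the profile above, and PR #784 should be consulted for the actual change:

```fortran
! Hypothetical sketch, not WW3 source: re-deriving the (JSEA, ISPROC)
! mapping inside a hot nested loop costs one call per point per iteration;
! hoisting the mapping into lookup tables reduces it to one call per point.
PROGRAM HOIST_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: NSEA = 100000, NAPROC = 8, NITER = 500
  INTEGER :: ISEA, ITER, JSEA, ISPROC
  INTEGER :: JSEA_TAB(NSEA), ISPROC_TAB(NSEA)
  INTEGER(8) :: ACC

  ! Hoisted: compute the mapping once, outside the hot loop.
  DO ISEA = 1, NSEA
    CALL GET_JSEA_ISPROC ( ISEA, JSEA_TAB(ISEA), ISPROC_TAB(ISEA) )
  END DO

  ACC = 0
  DO ITER = 1, NITER              ! hot nested loop (w3nmin-style)
    DO ISEA = 1, NSEA
      JSEA   = JSEA_TAB(ISEA)     ! O(1) table read replaces the call
      ISPROC = ISPROC_TAB(ISEA)
      ACC    = ACC + JSEA + ISPROC
    END DO
  END DO
  PRINT *, 'checksum:', ACC

CONTAINS

  ! Invented stand-in for the real init_get_jsea_isproc in w3parall:
  ! a round-robin ("card deck") distribution of sea points over NAPROC.
  SUBROUTINE GET_JSEA_ISPROC ( ISEA, JSEA, ISPROC )
    INTEGER, INTENT(IN)  :: ISEA
    INTEGER, INTENT(OUT) :: JSEA, ISPROC
    JSEA   = 1 + (ISEA-1)/NAPROC
    ISPROC = 1 + MOD(ISEA-1, NAPROC)
  END SUBROUTINE GET_JSEA_ISPROC

END PROGRAM HOIST_DEMO
```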

Subsequent profiling

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  9.79     41.32    41.32   229549    0.00     0.00  w3src4md_mp_w3sds4_
  8.72     78.12    36.80     480     0.08     0.08  w3pro3md_mp_w3xyp3_
  5.11     99.67    21.55   229549    0.00     0.00  w3snl1md_mp_w3snl1_
  • Profiling after the fix was included is shown above. From this it's clear the bottleneck has been removed. The next three subroutines identified in profiling now account for the largest remaining share of runtime. Each of these subroutines was investigated for potential further performance gains from adding OpenMP statements. The status of that work:
    • w3src4md: w3sds4() - a first look determined there was little to no opportunity for optimization.
    • w3pro3md: w3xyp3() - a first attempt suggested a small amount of optimization might be possible by placing OMP statements around inner loops. Further reading suggested that for nested loops one should (in general) first try applying OMP parallelization to the outer loops, and only as a potential second step descend into the inner ones (see the sketch after this list). On reanalyzing the loops, there were a number of serial inner loops under the outer loop in question, so this did not seem like a straightforward opportunity for optimization. Based on this, w3xyp3() was also put aside.
    • w3snl1md: w3snl1() - there were a couple of small loops where adding OMP directives was an option. A fix was made and tests are currently running to check for any improvements. If there are any, they will most likely be quite small. Once the runs for this subroutine fix return, this issue will be closed, since every subroutine that stood out in profiling will have been investigated.
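
For reference, a minimal sketch of the outer-versus-inner guidance mentioned above, assuming invented array names and sizes (this is not WW3 source):

```fortran
! Hypothetical sketch, not WW3 source: parallelize the outermost loop of a
! nest so each thread receives a large independent chunk of sea points;
! the small inner (spectral) loops stay serial within each thread.
PROGRAM OMP_OUTER_DEMO
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: NSEA = 2000, NSPEC = 600
  REAL    :: A(NSPEC, NSEA), S(NSEA)
  INTEGER :: ISEA, ISP

  A = 1.0

  !$OMP PARALLEL DO PRIVATE(ISEA, ISP)
  DO ISEA = 1, NSEA              ! outer loop: one directive, coarse grain
    S(ISEA) = 0.0
    DO ISP = 1, NSPEC            ! inner loop: serial per thread, no directive
      S(ISEA) = S(ISEA) + A(ISP, ISEA)
    END DO
  END DO
  !$OMP END PARALLEL DO

  PRINT *, 'sum =', SUM(S), '  max threads =', OMP_GET_MAX_THREADS()
END PROGRAM OMP_OUTER_DEMO
```

Built with OpenMP enabled (e.g. gfortran -fopenmp), the single outer directive pays its fork/join cost once per parallel region; directives on the short inner loops instead would pay that overhead once per outer iteration, which is consistent with the small or absent gains observed when threading small loops like those in w3snl1().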

@MatthewMasarik-NOAA (Collaborator, Author)

Follow up - w3snl1md: w3snl1()

Comparing runtimes with and without the OMP edits to w3snl1() shows that the OMP code added to this routine does not improve performance. This was the final subroutine identified in profiling as a candidate for improvement. Given that every routine flagged by profiling has now been investigated, this issue will be closed.
