
Performance outline #635

Closed
5 of 10 tasks
Tracked by #2943
charleskawczynski opened this issue Jul 13, 2022 · 2 comments

Comments


charleskawczynski commented Jul 13, 2022

Key items to tackle

  • Fuse more column operations using `bycolumn`: this is fairly low-hanging fruit and also makes threading more efficient. Most of this is complete; however, `non_orographic_gravity_wave_tendency!` and `orographic_gravity_wave_tendency!` still need to be reworked. Opened Improve design of nonorographic gravity wave parameterization #897 to track.
  • Explicitly split vertical and horizontal operations to increase work per thread
  • Track down allocations: some are caused by inference issues. This is mostly finished; one outstanding issue remains in ClimaCore (see Inference failure ClimaCore.jl#1024)
  • GC pauses can happen at different times on different processes, which will hamper scaling efficiency (processes will wait for other processes that are running GC). Once we've reduced allocations, we can disable automatic GC with `GC.enable(false)` and trigger collections manually and intermittently with `GC.gc()`.
  • We still communicate whole elements instead of just boundaries: this will require a bit of work in ClimaCore, but we'd like to see hard numbers before going down this route
  • Use `*` over `/` where the division is not already optimized away (e.g., precompute the reciprocal once and multiply by it in the hot loop)
  • Check impact of loops over mesh components (iterator infrastructure) compared to pre-computed mesh
  • Potential performance optimization: move the LU factorization of the matrix W = -I + dtγ * J from the linsolve! function (which solves the equation W * newton_residual = ΔY) to the Wfact! function (which computes W). This will speed things up if we take multiple Newton iterations during the implicit step of a Runge-Kutta stage without re-computing W for each iteration, or if we only compute W once per timestep and hold it fixed for all the Runge-Kutta stages. Factorization may be the most expensive part of either Wfact! or linsolve!, so minimizing the number of factorizations could give us a significant performance improvement. This is not a pure optimization, though: it changes behavior, makes built-in assumptions about source terms, and can impact stability.
  • Measure performance of IO. We've added a callbacks flame graph, which shows time spent during most IO calls.
  • Use sparse mat-mul operations for AxisTensor conversions. See PR add special cases for projection from Covariant => Contravariant ClimaCore.jl#853
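
The LU-caching idea above can be sketched as follows. This is a minimal dense-matrix sketch, not ClimaAtmos code: the `WCache` struct and these function signatures are hypothetical, chosen only to show factoring once in `Wfact!` and reusing the factorization across repeated `linsolve!` calls.

```julia
using LinearAlgebra

# Illustrative cache for W = -I + dtγ * J; the struct and dense-matrix
# representation are assumptions for this sketch.
mutable struct WCache
    W::Matrix{Float64}
    W_lu::Factorization{Float64}  # LU factorization, computed once in Wfact!
end

# Wfact! computes W and factorizes it immediately, so repeated calls to
# linsolve! (one per Newton iteration) reuse the same factorization.
function Wfact!(cache::WCache, J::Matrix{Float64}, dtγ::Float64)
    n = size(J, 1)
    cache.W .= dtγ .* J .- Matrix(I, n, n)  # W = -I + dtγ * J
    cache.W_lu = lu(cache.W)                # the expensive step, done once
    return cache
end

# linsolve! now only performs the cheap triangular solves against the
# cached LU factorization (W * ΔY = b).
function linsolve!(ΔY::Vector{Float64}, cache::WCache, b::Vector{Float64})
    ldiv!(ΔY, cache.W_lu, b)
    return ΔY
end
```

With this split, taking several Newton iterations per stage (or holding W fixed across stages) pays the factorization cost once rather than once per solve.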

Performance notes / specs

  • Desired config: 0.1 s wall clock per 400 s simulated time step
  • Current state: 8 s wall clock per 400 s simulated time step
  • This means performance optimizations and parallelism need to buy us an 80x speedup (8 s / 80 = 0.1 s)
  • Recompute the performance target based on CFL or the target time step at the target resolution; it looks like this would add another factor of 3
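
The arithmetic behind these targets, spelled out (all numbers come from the notes above; the factor of 3 is the estimated extra cost if the target time step shrinks):

```julia
current_step_time = 8.0   # s wall clock per 400 s simulated step (current)
target_step_time  = 0.1   # s wall clock per 400 s simulated step (desired)
required_speedup  = current_step_time / target_step_time
println(required_speedup)       # 80.0, from optimization + parallelism
cfl_factor = 3                  # estimated factor from the CFL-based retarget
println(required_speedup * cfl_factor)  # 240.0, if the time step shrinks 3x
```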
bors bot added a commit that referenced this issue Oct 3, 2022
821: make GC deterministic in distributed r=simonbyrne a=simonbyrne

# PULL REQUEST

## Purpose and Content
This should reduce MPI Waitall time by manually triggering the GC across all processes at the same time.
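
A minimal sketch of this approach, assuming a step loop we control. The `step_with_deterministic_gc!` helper and the `gc_steps` knob are illustrative, not part of the actual PR; the point is that every rank reaches `GC.gc()` at the same step number, so no rank stalls in `MPI.Waitall` waiting on a peer that paused for an automatic collection.

```julia
# Hedged sketch: disable automatic GC and trigger it manually every
# `gc_steps` steps, at the same step on every MPI rank.
function step_with_deterministic_gc!(step_model!, nsteps; gc_steps = 100)
    GC.enable(false)                 # stop automatic collections
    try
        for step in 1:nsteps
            step_model!(step)
            if step % gc_steps == 0  # all ranks hit this at the same step
                GC.gc()              # collect deterministically
            end
        end
    finally
        GC.enable(true)              # restore automatic GC
    end
end
```

As the PR notes, `gc_steps` must be tuned: too large and ranks run out of memory between collections, too small and the collections themselves dominate.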

## Benefits and Risks
The number of steps will require some tuning to avoid out-of-memory errors

## Linked Issues
- Item 3 of #635 
- Mentioned in #686
- Supersedes #687


## PR Checklist
- [x] This PR has a corresponding issue OR is linked to an SDI.
- [x] I have followed CliMA's codebase [contribution](https://clima.github.io/ClimateMachine.jl/latest/Contributing/) and [style](https://clima.github.io/ClimateMachine.jl/latest/DevDocs/CodeStyle/) guidelines OR N/A.
- [x] I have followed CliMA's [documentation policy](https://github.com/CliMA/policies/wiki/Documentation-Policy).
- [x] I have checked all issues and PRs and I certify that this PR does not duplicate an open PR.
- [x] I linted my code on my local machine prior to submission OR N/A.
- [x] Unit tests are included OR N/A.
- [x] Code used in an integration test OR N/A.
- [x] All tests ran successfully on my local machine OR N/A.
- [x] All classes, modules, and function contain docstrings OR N/A.
- [x] Documentation has been added/updated OR N/A.


Co-authored-by: Simon Byrne <simonbyrne@gmail.com>
@charleskawczynski

Superseded by #2632

@charleskawczynski

  • I've excluded the time-stepper LU factorization item from further tracking in this issue because there are other, higher-level optimizations we can apply (like parallelizing the function calls that the timestepper makes) that effectively nullify the benefit of reducing the frequency of LU factorizations, which also trades off against the approximation that some physics operate more slowly than others.

  • I've excluded the impact of loops over mesh components (iterator infrastructure) compared to a pre-computed mesh because I don't think it is relevant to GPU performance, which is what we're primarily targeting. @sriharshakandala can correct me if I'm wrong there.
