Performance roadmap #2632

Open · 12 of 30 tasks · Tracked by #2943
charleskawczynski opened this issue Feb 6, 2024 · 2 comments

charleskawczynski (Member) commented Feb 6, 2024

This issue is a continuation of #635, but to reduce the noise I'm excluding some items (some already addressed, others explained in #635).

Memory access patterns

We should make sure that all kernels are inlined, that we use shared/local memory when possible, and that reads/writes are coalesced.
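As a point of reference, here is a minimal CUDA.jl sketch (generic code, not ClimaCore internals) of the access pattern we want in hand-written kernels: consecutive threads touch consecutive elements, so reads and writes within a warp coalesce into a small number of memory transactions.

```julia
using CUDA

# Thread i handles element i, so neighboring threads in a warp access
# neighboring addresses (coalesced). A strided pattern such as
# x[1 + (i - 1) * stride] would instead split each warp's access into
# many separate memory transactions.
function coalesced_axpy!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

n = 2^20
x = CUDA.rand(Float32, n)
y = CUDA.rand(Float32, n)
@cuda threads=256 blocks=cld(n, 256) coalesced_axpy!(y, x, 1.5f0)
```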

Reducing loads and stores

The primary way to improve performance beyond our current state is to reduce the number of memory loads and stores. One way to do that is to fuse operations, which allows the compiler to hoist (and eliminate) memory loads/stores. Another is to explicitly pass less data through broadcast expressions (where possible).
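To make the load/store accounting concrete, here is a plain-Julia sketch (ordinary arrays rather than ClimaCore Fields): two separate broadcasts traverse memory twice and load the shared operand twice, while a hand-fused loop loads it once per element.

```julia
# Two separate broadcast expressions: two passes over memory, and `x` is
# loaded from memory in each pass.
function unfused!(y1, y2, x, a, b)
    @. y1 = a * x
    @. y2 = b * x
    return nothing
end

# Fused version: one pass over memory; `x[i]` is loaded once and reused for
# both outputs, which is the kind of load elimination fusion enables.
function fused!(y1, y2, x, a, b)
    @inbounds for i in eachindex(x, y1, y2)
        xi = x[i]
        y1[i] = a * xi
        y2[i] = b * xi
    end
    return nothing
end

x = rand(10^6); y1 = similar(x); y2 = similar(x)
unfused!(y1, y2, x, 2.0, 3.0)
fused!(y1, y2, x, 2.0, 3.0)
```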

There are a few different options / paths to capturing some of the performance we've left on the table, and each approach has its own limitations, pros, and cons:

  • Optimize / fuse simple fieldvector operations in timestepper
    • Pros: should be somewhat simple / straightforward, no changes to ClimaAtmos
    • Cons: impact may be about 10%
  • FD operators read data redundantly (i.e., the +half and -half values); we could improve this by first reading into shared memory, similar to the spectral element operators (see the shared-memory sketch after this list).
    • Pros: could improve vertical kernels by as much as 2x, no ClimaAtmos changes. This is probably a good idea to implement at some point.
    • Cons: Does not impact horizontal kernels.
  • Fusing operations (e.g., @fuse begin @. a = b; @. c = d end)
    • Pros:
      • Will improve CPU/GPU performance
      • It's an optimization that can be done incrementally (lower risk of failure, easy to prototype) and applied to ClimaAtmos / other repos
    • Cons:
      • Requires changing ClimaAtmos (we have many broadcast expressions)
      • Fusion will (likely) be limited to similar BC expressions (we cannot mix cell center / face, horizontal / vertical, or union-split types), which will limit the number of fusions we can perform
  • Use lazy evaluation of BC expressions (see the lazy-broadcast sketch after this list)
    • Pros:
      • There's potential for fusing many operations this way (maybe even beyond cell center / face)
      • This could also help us reduce code duplication (e.g., reusing functions/expressions for tendencies and diagnostics)
      • The code could become more modular, and this may allow us to more easily unit test individual kernels
    • Cons:
      • There is a risk of exploding compilation times
      • We'll need some way to recover usable stack traces (e.g., enabling eager execution)
      • There is a risk that profiling the code may become more complicated
      • A significant portion of ClimaAtmos will need to be re-written to a functional-style approach
  • Reduce reads / writes from LocalGeometry (if possible)
    • Pros:
      • (possibly) no changes to ClimaAtmos required
    • Cons:
      • We need a prototype to prove that there is performance available
      • This optimization will only improve kernels that use the LocalGeometry
      • Optimizations beyond this will nullify this effort $^1$

$^1$ It's important to note that one optimization can nullify another. That is, if we perform two optimizations:

  1. eliminate loading X from kernel A and eliminate loading Y from kernel B
  2. fuse kernels A and B

we could end up with the same number of loads and stores as if we had performed only optimization 1) or 2) alone.
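To illustrate the finite-difference shared-memory idea from the list above, here is a minimal CUDA.jl sketch (a generic centered difference on a plain vector, not ClimaCore's FD operators): each block stages its tile of the column, plus one halo point on each side, into shared memory, so neighboring stencil evaluations reuse values instead of re-reading them from global memory.

```julia
using CUDA

function ddz_shared!(du, u, dz)
    tile = CuStaticSharedArray(Float32, 258)      # 256 threads + 2 halo points
    nt = blockDim().x
    i = (blockIdx().x - 1) * nt + threadIdx().x   # global (vertical) index
    t = threadIdx().x + 1                         # index into the shared tile
    n = length(u)
    if i <= n
        @inbounds tile[t] = u[i]                  # each value read from global memory once per block
        if threadIdx().x == 1 && i > 1
            @inbounds tile[1] = u[i - 1]          # left halo
        end
        if (threadIdx().x == nt || i == n) && i < n
            @inbounds tile[t + 1] = u[i + 1]      # right halo
        end
    end
    sync_threads()
    if 1 < i < n                                  # interior points only
        @inbounds du[i] = (tile[t + 1] - tile[t - 1]) / (2f0 * dz)
    end
    return nothing
end

u = CUDA.rand(Float32, 1024)
du = similar(u)
@cuda threads=256 blocks=cld(length(u), 256) ddz_shared!(du, u, 1f0)
```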
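And to make the lazy-evaluation option concrete, here is a minimal Base-only sketch (plain arrays and hypothetical tendency names, not ClimaAtmos code): tendencies return un-materialized broadcast objects, the caller composes them, and a single `materialize!` evaluates the whole expression in one pass without allocating intermediates.

```julia
import Base.Broadcast as BC

# Hypothetical tendency terms that return *lazy* broadcast objects rather than
# materialized arrays; nothing is read, written, or allocated here.
tendency_a(u, c) = BC.broadcasted(*, c, u)
tendency_b(u, ν) = BC.broadcasted(*, ν, u)

u  = rand(10^5)
du = similar(u)

# Compose the lazy expressions; still no evaluation has happened.
total = BC.broadcasted(+, tendency_a(u, 0.5), tendency_b(u, 0.01))

# Materialize once: a single fused loop writes `du`, and intermediate
# tendency arrays are never allocated or stored.
BC.materialize!(du, total)
```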

Removing unnecessary work

We can remove unnecessary work, e.g., in precomputed quantities, or by using a caching system.

Parallelism

There are other optimizations we can perform, which can also have a notable impact: for example, parallelizing work, reducing allocations (to reduce the frequency of garbage collection), reducing MPI communication, and emitting more efficient low-level code. Some of these items are listed below:

Scaling

Minimize the number of DSS calls and GC calls.

Misc

There are other miscellaneous items, specified in the task list.

Tasks

(Task list of 30 tracked sub-issues; 12 completed.)
charleskawczynski (Member, Author) commented Mar 5, 2024

I've removed the prototype (as we have already developed https://github.com/CliMA/MultiBroadcastFusion.jl, which has performance tests) to reduce the noise in this issue.

I'm pleasantly surprised that the generic/recursive pattern appears (somehow) more performant than the hard-coded one, but I'll take it!

tapios (Contributor) commented Mar 5, 2024

Really nice and helpful. Thank you!
