Performance roadmap #2632

Open · 12 of 30 tasks · Tracked by #2943
charleskawczynski opened this issue Feb 6, 2024 · 2 comments

charleskawczynski (Member) commented Feb 6, 2024

This issue is a continuation of #635, but to reduce the noise I'm excluding some items (some already addressed, others explained in #635).

Memory access patterns

We should make sure that all kernels are inlined, that we use shared/local memory when possible, and that reads/writes are coalesced.
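As a point of reference, here is a minimal CUDA.jl sketch (generic code, not ClimaCore internals) of the access pattern we want in hand-written kernels: consecutive threads touch consecutive elements, so reads and writes within a warp coalesce into a small number of memory transactions.

```julia
using CUDA

# Thread i handles element i, so neighboring threads in a warp access
# neighboring addresses (coalesced). A strided pattern such as
# x[1 + (i - 1) * stride] would instead split each warp's access into
# many separate memory transactions.
function coalesced_axpy!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

n = 2^20
x = CUDA.rand(Float32, n)
y = CUDA.rand(Float32, n)
@cuda threads=256 blocks=cld(n, 256) coalesced_axpy!(y, x, 1.5f0)
```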

Reducing loads and stores

The primary way to improve performance beyond our current state is to reduce the number of memory loads and stores. One way to do that is to fuse operations, which allows the compiler to hoist (and eliminate) memory loads/stores. Another is to explicitly pass less data through broadcast expressions (where possible).
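To make the load/store accounting concrete, here is a plain-Julia sketch (ordinary arrays rather than ClimaCore Fields): two separate broadcasts traverse memory twice and load the shared operand twice, while a hand-fused loop loads it once per element.

```julia
# Two separate broadcast expressions: two passes over memory, and `x` is
# loaded from memory in each pass.
function unfused!(y1, y2, x, a, b)
    @. y1 = a * x
    @. y2 = b * x
    return nothing
end

# Fused version: one pass over memory; `x[i]` is loaded once and reused for
# both outputs, which is the kind of load elimination fusion enables.
function fused!(y1, y2, x, a, b)
    @inbounds for i in eachindex(x, y1, y2)
        xi = x[i]
        y1[i] = a * xi
        y2[i] = b * xi
    end
    return nothing
end

x = rand(10^6); y1 = similar(x); y2 = similar(x)
unfused!(y1, y2, x, 2.0, 3.0)
fused!(y1, y2, x, 2.0, 3.0)
```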

There are a few different options / paths to capturing some of the performance we've left on the table, and each approach has its own limitations, pros, and cons:

  • Optimize / fuse simple fieldvector operations in timestepper
    • Pros: should be somewhat simple / straightforward, no changes to ClimaAtmos
    • Cons: impact may be about 10%
  • FD operators read data redundantly (i.e., the +half and -half values); we could improve this by first reading into shared memory, similar to the spectral element operators (see the shared-memory sketch after this list).
    • Pros: could improve vertical kernels by as much as 2x, no ClimaAtmos changes. This is probably a good idea to implement at some point.
    • Cons: Does not impact horizontal kernels.
  • Fusing operations (e.g., @fuse begin @. a = b; @. c = d end)
    • Pros:
      • Will improve CPU/GPU performance
      • It's an optimization that can be done incrementally (lower risk of failure, easy to prototype) and applied to ClimaAtmos / other repos
    • Cons:
      • Requires changing ClimaAtmos (we have many broadcast expressions)
      • Fusion will (likely) be limited to similar BC expressions (we cannot mix cell center / face, horizontal / vertical, or union-split types), which will limit the number of fusions we can perform
  • Use lazy evaluation of BC expressions (see the lazy-broadcast sketch after this list)
    • Pros:
      • There's potential for fusing many operations this way (maybe even beyond cell center / face)
      • This could also help us reduce code duplication (e.g., reusing functions/expressions for tendencies and diagnostics)
      • The code could become more modular, and this may allow us to more easily unit test individual kernels
    • Cons:
      • There is a risk of exploding compilation times
      • We'll need some way to recover usable stack traces (e.g., enabling eager execution)
      • There is a risk that profiling the code may become more complicated
      • A significant portion of ClimaAtmos will need to be re-written to a functional-style approach
  • Reduce reads / writes from LocalGeometry (if possible)
    • Pros:
      • (possibly) no changes to ClimaAtmos required
    • Cons:
      • We need a prototype to prove that there is performance available
      • This optimization will only improve kernels that use the LocalGeometry
      • Optimizations beyond this will nullify this effort $^1$

$^1$ It's important to note that one optimization can nullify another. That is, if we perform two optimizations:

  1. eliminate loading X from kernel A and eliminate loading Y from kernel B
  2. fuse kernels A and B

we could end up with the same number of loads and stores as if we had performed only optimization 1) or 2) alone.
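To illustrate the finite-difference shared-memory idea from the list above, here is a minimal CUDA.jl sketch (a generic centered difference on a plain vector, not ClimaCore's FD operators): each block stages its tile of the column, plus one halo point on each side, into shared memory, so neighboring stencil evaluations reuse values instead of re-reading them from global memory.

```julia
using CUDA

function ddz_shared!(du, u, dz)
    tile = CuStaticSharedArray(Float32, 258)      # 256 threads + 2 halo points
    nt = blockDim().x
    i = (blockIdx().x - 1) * nt + threadIdx().x   # global (vertical) index
    t = threadIdx().x + 1                         # index into the shared tile
    n = length(u)
    if i <= n
        @inbounds tile[t] = u[i]                  # each value read from global memory once per block
        if threadIdx().x == 1 && i > 1
            @inbounds tile[1] = u[i - 1]          # left halo
        end
        if (threadIdx().x == nt || i == n) && i < n
            @inbounds tile[t + 1] = u[i + 1]      # right halo
        end
    end
    sync_threads()
    if 1 < i < n                                  # interior points only
        @inbounds du[i] = (tile[t + 1] - tile[t - 1]) / (2f0 * dz)
    end
    return nothing
end

u = CUDA.rand(Float32, 1024)
du = similar(u)
@cuda threads=256 blocks=cld(length(u), 256) ddz_shared!(du, u, 1f0)
```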
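And to make the lazy-evaluation option concrete, here is a minimal Base-only sketch (plain arrays and hypothetical tendency names, not ClimaAtmos code): tendencies return un-materialized broadcast objects, the caller composes them, and a single `materialize!` evaluates the whole expression in one pass without allocating intermediates.

```julia
import Base.Broadcast as BC

# Hypothetical tendency terms that return *lazy* broadcast objects rather than
# materialized arrays; nothing is read, written, or allocated here.
tendency_a(u, c) = BC.broadcasted(*, c, u)
tendency_b(u, ν) = BC.broadcasted(*, ν, u)

u  = rand(10^5)
du = similar(u)

# Compose the lazy expressions; still no evaluation has happened.
total = BC.broadcasted(+, tendency_a(u, 0.5), tendency_b(u, 0.01))

# Materialize once: a single fused loop writes `du`, and intermediate
# tendency arrays are never allocated or stored.
BC.materialize!(du, total)
```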

Removing unnecessary work

We can remove unnecessary work, e.g., in precomputed quantities, or by using a caching system.

Parallelism

There are other optimizations we can perform, which can also have a notable impact: for example, parallelizing work, reducing allocations (to reduce the frequency of garbage collection), reducing MPI communication, and emitting more efficient low-level code. Some of these items are listed below:

Scaling

Minimize the number of DSS calls and GC calls.

Misc

There are other miscellaneous items, specified in the task list.

Tasks

(Task list of 30 tracked sub-issues; 12 completed.)
charleskawczynski (Member, Author) commented Mar 5, 2024

I've removed the prototype (as we have already developed https://github.com/CliMA/MultiBroadcastFusion.jl, which has performance tests) to reduce the noise in this issue.

I'm pleasantly surprised that the generic/recursive pattern appears (somehow) more performant than the hard-coded one, but I'll take it!

tapios (Contributor) commented Mar 5, 2024

Really nice and helpful. Thank you!
