Loop Blocking for fn GPU Backend #1787

Merged
fthaler merged 78 commits into GridTools:master on Oct 29, 2024

Conversation

fthaler
Contributor

@fthaler fthaler commented Jun 18, 2024

Implements loop blocking for the GPU fn backend. Thread block size (that is, CUDA/HIP threads per block) and loop block size (that is, loop iterations per CUDA/HIP thread) can now be specified as template parameters.
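
As a rough illustration of how the two sizes interact, the number of CUDA/HIP blocks needed per dimension is the domain size divided by (threads per block × loop iterations per thread), rounded up. The sketch below is host-side C++ only; `sizes3`, `grid3`, `ceil_div` and `blocks_needed` are hypothetical names chosen for this example and are not the GridTools API.

```cpp
#include <cassert>

// Hypothetical compile-time size triple (i, j, k); not a GridTools type.
template <int I, int J, int K>
struct sizes3 {
    static constexpr int i = I;
    static constexpr int j = J;
    static constexpr int k = K;
};

// Number of blocks per dimension.
struct grid3 {
    int i, j, k;
};

// Integer division rounded up.
inline int ceil_div(int n, int d) {
    assert(n >= 0 && d > 0);
    return (n + d - 1) / d;
}

// Each block covers ThreadBlockSize * LoopBlockSize points per dimension,
// so the grid needs ceil(n / (ThreadBlockSize * LoopBlockSize)) blocks.
template <class ThreadBlockSize, class LoopBlockSize>
grid3 blocks_needed(int ni, int nj, int nk) {
    return {ceil_div(ni, ThreadBlockSize::i * LoopBlockSize::i),
            ceil_div(nj, ThreadBlockSize::j * LoopBlockSize::j),
            ceil_div(nk, ThreadBlockSize::k * LoopBlockSize::k)};
}
```

With assumed thread block sizes of 32×8×1 and loop block sizes of 1×1×5, a 128×128×80 domain would need 4×16×16 blocks under this model.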

Further changes:

  • Set __launch_bounds__ in the fn GPU kernel based on the thread block size.
  • Activate vertical loop blocking in the fn nabla kernels on newer CUDA versions that support GT_PROMISE.
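
A minimal sketch of the first item, assuming the bound is simply the product of the per-dimension thread block sizes. `thread_block_size`, `kernel_sketch` and the `SKETCH_*` macros are hypothetical names for this example, not the PR's code. The macros expand to nothing outside CUDA/HIP so the snippet also compiles as plain C++.

```cpp
#include <cassert>

// Shims: use the real qualifiers under CUDA/HIP, nothing on a plain host compiler.
#if defined(__CUDACC__) || defined(__HIPCC__)
#define SKETCH_KERNEL __global__
#define SKETCH_LAUNCH_BOUNDS(n) __launch_bounds__(n)
#else
#define SKETCH_KERNEL
#define SKETCH_LAUNCH_BOUNDS(n)
#endif

// Hypothetical compile-time thread block size; not a GridTools type.
template <int I, int J, int K>
struct thread_block_size {
    static constexpr int threads = I * J * K;  // total threads per block
};

// The launch bound is known at compile time because the block size is a
// template parameter, so the compiler can budget registers accordingly.
template <class ThreadBlockSize>
SKETCH_KERNEL void SKETCH_LAUNCH_BOUNDS(ThreadBlockSize::threads) kernel_sketch(int *out) {
    assert(out != nullptr);
    *out = ThreadBlockSize::threads;  // placeholder body: report the bound
}
```

Because the bound changes the compiler's register allocation, it can shift performance in either direction, which is consistent with the benchmark observations listed under "Performance changes".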

Performance changes:

  • __launch_bounds__ affects performance of the fn_cartesian_vertical_advection benchmark significantly (positively or negatively, depending on domain size).
  • Performance of fn nabla benchmarks improves significantly on newer CUDA versions.
  • Performance on Daint is currently reduced because its CUDA version is too old.

@gridtoolsjenkins
Collaborator

Hi there, this is jenkins continuous integration...
Do you want me to verify this patch?

@fthaler
Contributor Author

fthaler commented Jun 19, 2024

launch jenkins

@fthaler fthaler requested review from havogt and iomaganaris June 19, 2024 08:59
@havogt
Contributor

havogt commented Jun 20, 2024

launch perftests

@havogt
Contributor

havogt commented Jun 20, 2024

launch jenkins

@fthaler
Contributor Author

fthaler commented Jun 24, 2024

launch jenkins

@fthaler
Contributor Author

fthaler commented Jun 24, 2024

launch jenkins

@fthaler
Contributor Author

fthaler commented Jun 24, 2024

launch perftest

@fthaler
Contributor Author

fthaler commented Sep 25, 2024

launch perftest

@fthaler
Contributor Author

fthaler commented Sep 25, 2024

launch perftest

include/gridtools/fn/backend/common.hpp (outdated, resolved)
include/gridtools/fn/backend/gpu.hpp (outdated, resolved)
        return index_at_dim<I>(blockIdx) * (ThreadBlockSize::value * LoopBlockSize::value) +
               index_at_dim<I>(threadIdx) * LoopBlockSize::value;
    } else {
        return integral_constant<int, 0>();
Contributor

I am a bit lost here, maybe you can add a few comments.

Contributor Author

Added some at various places, let me know if that’s enough.
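
One way to read the offset expression from the `gpu.hpp` excerpt under review: each block owns `ThreadBlockSize * LoopBlockSize` consecutive points along a dimension, and each thread within it owns `LoopBlockSize` consecutive points. The host-side sketch below models a single dimension. `block_offset` and `all_visited_once` are hypothetical helpers, and the bounds check at the domain edge is an assumption about how partial blocks are handled, not something shown in the excerpt.

```cpp
#include <cassert>
#include <vector>

// First index handled by a thread; block_idx and thread_idx stand in for
// blockIdx and threadIdx along one dimension.
constexpr int block_offset(int block_idx, int thread_idx, int thread_block_size, int loop_block_size) {
    return block_idx * (thread_block_size * loop_block_size) + thread_idx * loop_block_size;
}

// Simulates all blocks and threads over [0, n) and checks that every index
// is visited exactly once.
inline bool all_visited_once(int n, int thread_block_size, int loop_block_size) {
    assert(n >= 0 && thread_block_size > 0 && loop_block_size > 0);
    std::vector<int> counts(n, 0);
    const int per_block = thread_block_size * loop_block_size;
    const int blocks = (n + per_block - 1) / per_block;  // rounded up
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < thread_block_size; ++t) {
            const int first = block_offset(b, t, thread_block_size, loop_block_size);
            // Assumed bounds check: skip iterations past the end of the domain.
            for (int l = 0; l < loop_block_size && first + l < n; ++l)
                ++counts[first + l];
        }
    }
    for (int c : counts)
        if (c != 1)
            return false;
    return true;
}
```

For instance, with 8 threads per block and 5 iterations per thread, thread 3 of block 2 starts at index 2 · 40 + 3 · 5 = 95.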

include/gridtools/sid/loop.hpp (resolved)
include/gridtools/sid/loop.hpp (outdated, resolved)
jenkins/envs/daint_nvcc_cray.sh (outdated, resolved)
@fthaler
Contributor Author

fthaler commented Sep 25, 2024

launch perftest

@fthaler
Contributor Author

fthaler commented Sep 25, 2024

launch perftest

@fthaler
Contributor Author

fthaler commented Oct 7, 2024

launch jenkins

@fthaler
Contributor Author

fthaler commented Oct 7, 2024

launch perftest

@fthaler
Contributor Author

fthaler commented Oct 22, 2024

launch perftest

@fthaler
Contributor Author

fthaler commented Oct 22, 2024

launch jenkins

@fthaler fthaler requested a review from havogt October 22, 2024 10:56
@fthaler
Contributor Author

fthaler commented Oct 22, 2024

launch jenkins

@fthaler
Contributor Author

fthaler commented Oct 22, 2024

launch perftest

Contributor

@havogt havogt left a comment

lgtm

@fthaler fthaler merged commit 32daaa5 into GridTools:master Oct 29, 2024
70 checks passed
havogt pushed a commit that referenced this pull request Oct 30, 2024