Loop Blocking for fn GPU Backend #1787
Conversation
// Offset of the first loop iteration handled by this thread along dimension I:
// each GPU thread block covers ThreadBlockSize * LoopBlockSize iterations, and
// each thread within it handles LoopBlockSize consecutive iterations.
return index_at_dim<I>(blockIdx) * (ThreadBlockSize::value * LoopBlockSize::value) +
    index_at_dim<I>(threadIdx) * LoopBlockSize::value;
} else {
    // Dimension I is not blocked: the offset is a compile-time zero.
    return integral_constant<int, 0>();
I am a bit lost here, maybe you can add a few comments.
Added some at various places, let me know if that’s enough.
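For readers following the thread, a minimal standalone CUDA sketch of the indexing scheme discussed above may help. The kernel name `blocked_copy` and the flat 1D layout are illustrative assumptions, not the GridTools code: each thread block covers `ThreadBlockSize * LoopBlockSize` iterations, and each thread processes a contiguous run of `LoopBlockSize` of them.

```cuda
// Hypothetical illustration of the loop-blocking index computation above;
// not the GridTools implementation.
template <int ThreadBlockSize, int LoopBlockSize>
__global__ void blocked_copy(float const *in, float *out, int n) {
    // First element of this thread's chunk, mirroring the snippet:
    // blockIdx * (ThreadBlockSize * LoopBlockSize) + threadIdx * LoopBlockSize.
    int first = blockIdx.x * (ThreadBlockSize * LoopBlockSize)
              + threadIdx.x * LoopBlockSize;
    // Each thread then loops over its own block of LoopBlockSize iterations.
    for (int i = 0; i < LoopBlockSize; ++i) {
        int idx = first + i;
        if (idx < n)
            out[idx] = in[idx];
    }
}
```

Launched as `blocked_copy<128, 4><<<blocks, 128>>>(in, out, n)`, each thread block would cover 512 contiguous elements.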
lgtm
Implements loop blocking for the GPU fn backend. Thread block size (that is, CUDA/HIP threads per block) and loop block size (that is, loop iterations per CUDA/HIP thread) can now be specified as template parameters.

Further changes:
- Set `__launch_bounds__` in the fn GPU kernel based on the thread block size.
- Activate vertical loop blocking in the fn nabla kernels on newer CUDA versions that support `GT_PROMISE`.

Performance changes:
- `__launch_bounds__` affects performance of the `fn_cartesian_vertical_advection` benchmark significantly (positively or negatively, depending on domain size).
- Performance of fn nabla benchmarks improves significantly on newer CUDA versions.
- Performance on Daint is currently reduced because the CUDA version installed there is too old.
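To make the `__launch_bounds__` point concrete: the qualifier promises the compiler an upper bound on threads per block, which it uses when allocating registers per thread, so occupancy and thus performance can shift in either direction. Below is a hedged sketch of deriving the bound from the thread-block-size template parameter; the kernel `axpy_blocked` and the launch snippet are illustrative assumptions, not the fn backend code.

```cuda
// Illustrative only: tie __launch_bounds__ to a thread-block-size template
// parameter, in the spirit of what this PR does for the fn GPU kernel.
template <int ThreadBlockSize, int LoopBlockSize>
__global__ void __launch_bounds__(ThreadBlockSize)
    axpy_blocked(float a, float const *x, float *y, int n) {
    // Same blocked indexing as above: LoopBlockSize iterations per thread.
    int first = blockIdx.x * (ThreadBlockSize * LoopBlockSize)
              + threadIdx.x * LoopBlockSize;
    for (int i = 0; i < LoopBlockSize; ++i) {
        int idx = first + i;
        if (idx < n)
            y[idx] = a * x[idx] + y[idx];
    }
}

// Usage sketch: the runtime block size must not exceed the promised bound.
// constexpr int tbs = 128, lbs = 4;
// int blocks = (n + tbs * lbs - 1) / (tbs * lbs);
// axpy_blocked<tbs, lbs><<<blocks, tbs>>>(a, x, y, n);
```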