
CUDA variant optimised beyond k-caching #104

Merged: reuterbal merged 4 commits into develop from nams-cuda-beyond-k-caching on Dec 20, 2024

Conversation

MichaelSt98 (Contributor) commented:

Optimisation beyond k-caching:

There is likely still potential for further optimisation.

Implemented with input from @lukasm91


k-caching vs "opt"-variant performance (compiled with nvhpc 22.11 and executed with 1 262144 128):

  • double-precision
    • k-caching: 600 - 700 GF/s
    • "opt": 700 - 850 GF/s
  • single-precision
    • k-caching: 2300 - 3000 GF/s
    • "opt": 2700 - 5000 GF/s

@reuterbal (Collaborator) left a comment:


This is really impressive!

  • I left some comments on the CMake side, which would be good to take care of
  • Tests are currently failing, not sure why
  • Please describe the new variant in the README, also adding some details on what constitutes the optimisation

Comment on lines 230 to 232
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:
-O3 -use_fast_math -lineinfo -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
# -O0 -g -G -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
Collaborator:

Maybe switch according to build-type?

Suggested change (replacing the lines quoted above):

if(CMAKE_BUILD_TYPE STREQUAL "Debug")
  target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-O0 -g -G>)
else()
  target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:-O3 -use_fast_math -lineinfo>)
endif()
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)

Comment on the following lines:

${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}
)
if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>>)
Collaborator:

This looks a little weird. Could this be the reason for the failing tests?

Comment on the following lines:

target_include_directories(
dwarf-cloudsc-c-cuda-opt-lib
PUBLIC
${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}
Collaborator:

What precisely do we need from that?
It would be better to link against CUDA::cudart (or similar, see https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#cuda-toolkit-rt-lib) as a PUBLIC_LIBS target instead
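As a minimal sketch of that suggestion, assuming plain CMake (3.17+ for FindCUDAToolkit); in this project the target would more likely be passed through ecbuild's PUBLIC_LIBS argument rather than a direct target_link_libraries call:

# Sketch only: use the imported CUDA runtime target instead of raw include paths.
find_package(CUDAToolkit REQUIRED)

# CUDA::cudart carries both the include directories and the link line, so the
# explicit target_include_directories() call quoted above becomes unnecessary.
target_link_libraries(dwarf-cloudsc-c-cuda-opt-lib PUBLIC CUDA::cudart)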

@@ -54,7 +54,7 @@ if( HAVE_CLOUDSC_C_CUDA )
 target_compile_options(dwarf-cloudsc-c-cuda-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>>)
 else()
 target_compile_options(dwarf-cloudsc-c-cuda-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:
---ptxas-options=-O3 -use_fast_math -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
+--ptxas-options=-O3 -use_fast_math -lineinfo -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>)
Collaborator:

I would suggest collecting these flags in variables, separating optimisation flags from base flags. Something like

set( CLOUDSC_CUDA_OPT_FLAGS "..." )
set( CLOUDSC_CUDA_FLAGS "..." )

The first one can then be made build-type dependent, e.g., using the -O0 flags for Debug as suggested in the other comment, and then applied to each target as

target_compile_options( dwarf-... PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--ptxas-options=${CLOUDSC_CUDA_OPT_FLAGS} ${CLOUDSC_CUDA_FLAGS}>)

That way you only have to change them in one place.
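A slightly fuller sketch of that idea in plain CMake; the variable names are taken from the comment above, but the concrete flag sets and the quoting around the generator expression are assumptions, not the merged code:

# Optimisation flags: build-type dependent, as suggested in the other comment.
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
  set(CLOUDSC_CUDA_OPT_FLAGS -O0 -g -G)
else()
  set(CLOUDSC_CUDA_OPT_FLAGS -O3 -use_fast_math -lineinfo)
endif()

# Base flags shared by every CUDA variant.
set(CLOUDSC_CUDA_FLAGS
  -maxrregcount=128
  -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES})

# Quoting keeps the expanded lists inside the generator expression, so each
# entry is forwarded to nvcc as a separate option; repeat per target.
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:${CLOUDSC_CUDA_OPT_FLAGS};${CLOUDSC_CUDA_FLAGS}>")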

@MichaelSt98 (Contributor, Author), replying to the review above:

This is really impressive!

  • I left some comments on the CMake side, which would be good to take care of
  • Tests are currently failing, not sure why
  • Please describe the new variant in the README, also adding some details on what constitutes the optimisation

Tests are failing due to the compute capability?!

/home/runner/work/dwarf-p-cloudsc/dwarf-p-cloudsc/nvhpc-install/Linux_x86_64/21.9/cuda/11.4/include/cuda/std/detail/__atomic:11:4: error: #error "CUDA atomics are only supported for sm_60 and up on *nix and sm_70 and up on Windows."
   11 | #  error "CUDA atomics are only supported for sm_60 and up on *nix and sm_70 and up on Windows."
      |    ^~~~~
make[2]: *** [cloudsc-dwarf/src/cloudsc_cuda/CMakeFiles/dwarf-cloudsc-c-cuda-opt-lib.dir/build.make:95: cloudsc-dwarf/src/cloudsc_cuda/CMakeFiles/dwarf-cloudsc-c-cuda-opt-lib.dir/cloudsc/cloudsc_c_opt.cu.o] Error 1

Therefore, I guess we have to disable the new CUDA optimised variant for testing?!

@reuterbal (Collaborator):

It shouldn't be run but it should be built.
Try setting the CUDA architecture to 80 in the Github toolchain file: https://github.com/ecmwf-ifs/dwarf-p-cloudsc/blob/main/arch/toolchains/github-ubuntu-nvhpc.cmake
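For instance, something along these lines in the toolchain file (exact placement in github-ubuntu-nvhpc.cmake is an assumption; 80 corresponds to compute capability 8.0 / sm_80):

# arch/toolchains/github-ubuntu-nvhpc.cmake (sketch)
# Build device code for compute capability 8.0 so the CUDA atomics header accepts it;
# the kernels are only compiled on the CI runner, not executed there.
set(CMAKE_CUDA_ARCHITECTURES 80)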

@MichaelSt98 force-pushed the nams-cuda-beyond-k-caching branch from 99afc47 to 7d92eef on December 19, 2024 at 14:13
@MichaelSt98 force-pushed the nams-cuda-beyond-k-caching branch from 7d92eef to 6beb1d9 on December 19, 2024 at 14:28
@reuterbal (Collaborator) left a comment:


Very nice, many thanks!

I know that the CMake changes are a bit tedious but I really like how it looks now!

@mlange05 (Collaborator) left a comment:


Brilliant! Very, very impressive. :shipit:

@reuterbal merged commit 8cec441 into develop on Dec 20, 2024
32 checks passed
@reuterbal deleted the nams-cuda-beyond-k-caching branch on December 20, 2024 at 08:57
@marsdeno (Contributor):

My hat's off 🙇

@reuterbal (Collaborator):

A peculiar addition to this: @MichaelSt98's CMake changes now apply --ptxas-options=-O3 -O3, which seems to tip double-precision throughput over the (effective) 1 TF/s barrier, and I'm seeing about 4.3 TF/s in single precision.
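For reference, a sketch of how the two optimisation levels could end up on the nvcc invocation under the new flag handling; this is an assumed form based on the snippets earlier in this thread, not a quote of the merged CMake:

# --ptxas-options=-O3 optimises the PTX-to-SASS stage, while the plain -O3 is
# forwarded by nvcc to the host compiler; together they replace the earlier
# single --ptxas-options=-O3.
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:--ptxas-options=-O3;-O3;-use_fast_math;-lineinfo>")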
