-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUDA variant optimised beyond k-caching #104
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really impressive!
- I left some comments on the CMake side, which would be good to take care of
- Tests are currently failing, not sure why
- Please describe the new variant in the README, also adding some details on what constitutes the optimisation
src/cloudsc_cuda/CMakeLists.txt
Outdated
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>: | ||
-O3 -use_fast_math -lineinfo -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>) | ||
# -O0 -g -G -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe switch according to build-type?
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>: | |
-O3 -use_fast_math -lineinfo -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>) | |
# -O0 -g -G -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>) | |
if(CMAKE_BUILD_TYPE STREQUAL "Debug") | |
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-O0 -g -G>) | |
else() | |
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE | |
$<$<COMPILE_LANGUAGE:CUDA>:-O3 -use_fast_math -lineinfo>) | |
endif() | |
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>) |
src/cloudsc_cuda/CMakeLists.txt
Outdated
${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES} | ||
) | ||
if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES) | ||
target_compile_options(dwarf-cloudsc-c-cuda-opt-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a little weird. Could this be the reason for the failing tests?
src/cloudsc_cuda/CMakeLists.txt
Outdated
target_include_directories( | ||
dwarf-cloudsc-c-cuda-opt-lib | ||
PUBLIC | ||
${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What precisely do we need from that?
It would be better to link against CUDA::cudart
(or similar, see https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#cuda-toolkit-rt-lib) as a PUBLIC_LIBS target instead
src/cloudsc_cuda/CMakeLists.txt
Outdated
@@ -54,7 +54,7 @@ if( HAVE_CLOUDSC_C_CUDA ) | |||
target_compile_options(dwarf-cloudsc-c-cuda-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>>) | |||
else() | |||
target_compile_options(dwarf-cloudsc-c-cuda-lib PRIVATE $<$<COMPILE_LANGUAGE:CUDA>: | |||
--ptxas-options=-O3 -use_fast_math -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>) | |||
--ptxas-options=-O3 -use_fast_math -lineinfo -maxrregcount=128 -gencode arch=compute_${CMAKE_CUDA_ARCHITECTURES},code=sm_${CMAKE_CUDA_ARCHITECTURES}>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest collecting these flags in variables, separating between optimisation and base flags. Something like
set( CLOUDSC_CUDA_OPT_FLAGS "..." )
set( CLOUDSC_CUDA_FLAGS "..." )
The first one can then be build_type dependent, e.g., use the -O0 flags for debug as suggested in the other comment, and then apply them to each target as
target_compile_options< dwarf-... PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--ptxas-options=${CLOUDSC_CUDA_OPT_FLAGS} ${CLOUDSC_CUDA_FLAGS}>)
That way you have to change them only in one place.
Tests are failing due to the compute capability?!
Therefore, I guess we have to disable the new CUDA optimised variant for testing?! |
It shouldn't be run but it should be built. |
99afc47
to
7d92eef
Compare
…ared mem copies overlaped with compute
7d92eef
to
6beb1d9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, many thanks!
I know that the CMake changes are a bit tedious but I really like how it looks now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brilliant! Very, very impressive.
My hat's off 🙇 |
A peculiar addition to this: @MichaelSt98's CMake changes make sure to now apply |
Optimisation beyond k-caching:
likely still potential for further optimisation ...
Implemented with input from @lukasm91
k-caching vs "opt"-variant performance (compiled with nvhpc 22.11 and executed with
1 262144 128
):