
Remove GPUClock Cost Function #4802

Merged
merged 1 commit into ECP-WarpX:development on Mar 25, 2024
Conversation

ax3l
Member

@ax3l ax3l commented Mar 25, 2024

Removing the GPUClock cost function for the following reasons.

Incomplete Implementation:
The implementation was only added to selected kernels, and it is not generalized to handle the varying occupancy of different kernels, even if it were used in all of them. The implementation is also verbose.

Unused:
Over the last years, our host-side timer implementation was extended to synchronize kernels at minimal overhead cost. This, combined with a heuristic, is what is actually used.

Research scope shifted:
In the last years, we realized that we do not need more precise scalar cost functions, but rather vector cost functions from which to build better load-balance performance models.

Costly when used:
The implementation performs an atomic add from every GPU thread, instead of, e.g., using just one per warp or a generalized time that scales with actual GPU time. This puts severe strain on memory bandwidth.
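The per-warp alternative mentioned above could look roughly like the following hypothetical CUDA sketch (this is not code from this PR or from WarpX; `accumulate_cost` and its arguments are illustrative): each warp reduces its cycle counts internally via register shuffles, so only one lane issues an atomic operation instead of all 32.

```cpp
// Hypothetical sketch: accumulate per-thread cycle counts into a global
// cost counter with one atomicAdd per warp instead of one per thread.
__device__ void accumulate_cost (unsigned long long* cost,
                                 unsigned long long my_cycles)
{
    // Tree reduction across the 32 lanes of the warp using register
    // shuffles; after the loop, lane 0 holds the warp-wide sum.
    for (int offset = 16; offset > 0; offset /= 2) {
        my_cycles += __shfl_down_sync(0xffffffffu, my_cycles, offset);
    }
    // Only lane 0 writes: 1 atomic per warp instead of 32 per warp.
    if ((threadIdx.x % 32) == 0) {
        atomicAdd(cost, my_cycles);
    }
}
```

This reduces global-memory atomic traffic by roughly the warp width, though it still costs registers in every instrumented kernel, which is the next point below.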

Costly, even if not used:
Once compiled in (the default), the implementation adds about four unnecessary registers to all instrumented GPU kernels.

Thanks to @AlexanderSinn for bringing this up again.

@ax3l ax3l added backend: cuda Specific to CUDA execution (GPUs) Performance optimization backend: hip Specific to ROCm execution (GPUs) backend: sycl Specific to DPC++/SYCL execution (CPUs/GPUs) labels Mar 25, 2024
@ax3l ax3l force-pushed the topic-rm-gpuclock branch 7 times, most recently from 1d6f23d to 37db926 Compare March 25, 2024 01:54
@RemiLehe RemiLehe merged commit f49a63f into ECP-WarpX:development Mar 25, 2024
45 checks passed