
Remove GPUClock Cost Function #4802

Merged
merged 1 commit into ECP-WarpX:development on Mar 25, 2024
Conversation

ax3l
Member

@ax3l ax3l commented Mar 25, 2024

Removing the GPUClock cost function for the following reasons.

Incomplete Implementation:
The implementation was only added to selected kernels, and it is not generalized to handle the varying occupancy of different kernels, even if it were used in all of them. The implementation is also verbose.

Unused:
Over the last years, our host-side timer implementation was extended to synchronize kernels at minimal overhead cost. This, combined with a heuristic, is what is actually used.

Research scope shifted:
In the last years, we realized that we do not need more precise scalar cost functions, but rather vector cost functions from which to build better load-balance performance models.

Costly when used:
The implementation performs an atomic add from every GPU thread, instead of, e.g., using just one per warp or a generalized time that scales with actual GPU time. This puts severe strain on memory bandwidth.
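The per-warp alternative mentioned above could look roughly like the following hypothetical CUDA sketch (this is not code from this PR or from WarpX; `accumulate_cost` and its arguments are illustrative): each warp reduces its cycle counts internally via register shuffles, so only one lane issues an atomic operation instead of all 32.

```cpp
// Hypothetical sketch: accumulate per-thread cycle counts into a global
// cost counter with one atomicAdd per warp instead of one per thread.
__device__ void accumulate_cost (unsigned long long* cost,
                                 unsigned long long my_cycles)
{
    // Tree reduction across the 32 lanes of the warp using register
    // shuffles; after the loop, lane 0 holds the warp-wide sum.
    for (int offset = 16; offset > 0; offset /= 2) {
        my_cycles += __shfl_down_sync(0xffffffffu, my_cycles, offset);
    }
    // Only lane 0 writes: 1 atomic per warp instead of 32 per warp.
    if ((threadIdx.x % 32) == 0) {
        atomicAdd(cost, my_cycles);
    }
}
```

This reduces global-memory atomic traffic by roughly the warp width, though it still costs registers in every instrumented kernel, which is the next point below.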

Costly, even if not used:
Once compiled in (the default), the implementation adds about four unnecessary registers to all instrumented GPU kernels.

Thanks to @AlexanderSinn for bringing this up again.

@ax3l ax3l added backend: cuda Specific to CUDA execution (GPUs) Performance optimization backend: hip Specific to ROCm execution (GPUs) backend: sycl Specific to DPC++/SYCL execution (CPUs/GPUs) labels Mar 25, 2024
@ax3l ax3l force-pushed the topic-rm-gpuclock branch 7 times, most recently from 1d6f23d to 37db926 Compare March 25, 2024 01:54
@RemiLehe RemiLehe merged commit f49a63f into ECP-WarpX:development Mar 25, 2024
45 checks passed