Thanks for the elaborate PR description and experimental evaluation!
I mostly just have some optional nitpicks. The only question I have concerns `delay_on_dc_gpu_or_prevent_hoisting`, which does not call `__threadfence_block` for SM 70 and above (with an exception for SM 80), while we decided to keep the `__threadfence_block` around in other places.
if (gridDim.x < GridThreshold)
{
  __threadfence_block();
}
nit: I assume this is a heuristic that experimental evaluation on small problem sizes showed to be more performant. Maybe a comment motivating it would help future readers 🙂
LGTM
Decoupled look-back is at the core of many CUB algorithms. This PR provides a few optimizations that help reduce contention on L2 and improve overall performance. It is intended as the first in a series of optimizations and tunings, so selecting the best parameters is out of scope for now. This PR also addresses the following issue by relying on strong operations (`.relaxed`, `.release`, etc.) instead of hints (`.cg`).

Optimizations
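For readers less familiar with the distinction drawn above between strong operations and cache hints, here is a minimal host-side C++ analog. It is not the CUB implementation. It assumes a tile descriptor whose payload is too wide to share a word with its status, so a separate 32-bit flag is published with a release store and polled with relaxed loads followed by an acquire fence. All names (`TileDescriptor`, `set_inclusive`, `wait_for_valid`) and the exponential backoff in the polling loop are hypothetical, chosen only for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical CPU analog of a tile descriptor; NOT CUB's ScanTileState.
constexpr std::uint32_t TILE_INVALID   = 0;
constexpr std::uint32_t TILE_INCLUSIVE = 1;

struct TileDescriptor
{
  // Payload assumed too wide to be packed into one word with the status.
  struct Payload
  {
    double sum;
    std::uint64_t count;
  };

  Payload payload{};
  // Status flag stored separately, as a 32-bit word rather than an 8-bit one.
  std::atomic<std::uint32_t> status{TILE_INVALID};
};

// Producer: write the payload first, then publish it with a release store.
inline void set_inclusive(TileDescriptor& tile, TileDescriptor::Payload value)
{
  tile.payload = value;
  tile.status.store(TILE_INCLUSIVE, std::memory_order_release);
}

// Consumer: poll the flag with relaxed loads and back off between polls
// (the backoff is an illustrative assumption). The acquire fence after the
// loop pairs with the producer's release store, so the payload read is safe.
inline TileDescriptor::Payload wait_for_valid(TileDescriptor& tile)
{
  std::chrono::nanoseconds delay{50};
  while (tile.status.load(std::memory_order_relaxed) == TILE_INVALID)
  {
    std::this_thread::sleep_for(delay);
    if (delay < std::chrono::microseconds{10})
    {
      delay *= 2; // exponential backoff, capped at roughly 10 microseconds
    }
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return tile.payload;
}
```

A producer thread would call `set_inclusive(tile, {42.0, 7})` while a consumer blocks in `wait_for_valid(tile)`. The release/acquire pairing, rather than any caching hint, is what guarantees that the consumer sees the complete payload once it observes the flag.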
- `WaitForValid` to reduce contention and help the signal propagate faster.
- `U32` instead of `U8` (for the same reasons above) when message size doesn't let us use a single architectural word, and we have to store flags separately.

Fixes
The increase in tile state size revealed issues with `.cg` loads on P100 when the message size doesn't fit the architectural word. The fix consists of voting in the spin loop: `while (WARP_ANY((status == SCAN_TILE_INVALID), 0xffffffff));`. Although this might be considered a breaking change, since `ScanTileState<T, false>::WaitForValid` didn't use to be cooperative, I think it's okay because `ScanTileState<T, true>::WaitForValid` is cooperative, and we haven't guaranteed that anyway.

Results
To benchmark the proposed optimizations, I've selected various GPUs and all algorithms that depend on decoupled look-back. Since the algorithm is sensitive to the compute / memory clock ratio, I've run benchmarks with both base and TDP-locked clocks. In general, on large input problem sizes, the speedup is large enough that locking clocks isn't necessary to observe it. Below is the distribution of execution times for old (main) and new (optimized) decoupled look-back in the case of the device exclusive sum on A100.
Apart from the speedup, the deviation of the new version is smaller. To illustrate broader results, I've grouped multiple benchmarks by the underlying algorithm. For instance, the `select.if` speedups for different input patterns and operations are combined into a single list as follows:

Turns into:
This is further presented as a bar in the bar plot below. Therefore, error bars should be read as the range of speedups across different scenarios rather than as run-to-run variance. Here's the aggregate result for base clocks:
And the image below is for TDP-locked clocks: