
Optimize Decoupled Look-Back #606

Merged: 2 commits, Feb 7, 2023

Conversation

gevtushenko (Collaborator)

Decoupled Look-Back is at the core of many CUB algorithms. This PR provides a few optimizations that reduce contention on L2 and improve overall performance. It is intended as the first in a series of optimizations/tunings, so selecting the best parameters is out of scope for now. This PR also addresses the following issue by relying on strong operations (.relaxed, .release, etc.) instead of cache hints (.cg).
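To illustrate the strong-vs-hint distinction, here is a minimal sketch of a GPU-scoped relaxed load expressed with inline PTX; the actual helpers in cub/detail/strong_load.cuh may differ in names and signatures:

```cuda
// Sketch (not CUB's exact code): a "strong" GPU-scoped relaxed load,
// as opposed to the ld.global.cg cache *hint* used previously.
// ld.relaxed requires sm_70+; older architectures need a fence-based fallback.
__device__ __forceinline__ unsigned int LoadRelaxed(const unsigned int *ptr)
{
  unsigned int value;
  asm volatile("ld.relaxed.gpu.u32 %0, [%1];"
               : "=r"(value)
               : "l"(ptr)
               : "memory");
  return value;
}
```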

Optimizations

  • Introduce a larger delay before loading the first look-back window, since its data is unlikely to have been updated yet.
  • Try loading tile states once before falling into the spin loop. Once the first window is in the partial state, the previous ones are likely to be as well; in other words, there's no point in waiting before loading.
  • Introduce a delay into the spin loop of WaitForValid to reduce contention and help the signal propagate faster (see the sketch after this list).
  • Make the tile state at least four bytes (it used to be two) to distribute the load across a larger number of cache lines.
  • Make the flag size U32 instead of U8 (for the same reason as above) when the message size doesn't let us use a single architectural word and we have to store flags separately.
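Put together, the waiting pattern looks roughly like the following. This is a hedged sketch with illustrative names (Delay, WaitForTile) and placeholder delay values; the actual CUB code differs:

```cuda
// Illustrative backoff helper: sleep on Volta+ where __nanosleep exists,
// otherwise fall back to a block fence, which also keeps the compiler
// from hoisting the subsequent load.
__device__ __forceinline__ void Delay(unsigned int ns)
{
#if __CUDA_ARCH__ >= 700
  __nanosleep(ns);
#else
  __threadfence_block();
#endif
}

__device__ unsigned int WaitForTile(volatile unsigned int *tile_status, int tile)
{
  Delay(450);                              // larger initial delay: the first
                                           // window is unlikely to be ready yet
  unsigned int status = tile_status[tile]; // try once before spinning
  while (status == 0 /* SCAN_TILE_INVALID */)
  {
    Delay(350);                            // backoff inside the spin loop cuts
                                           // L2 contention and helps the update
                                           // propagate faster
    status = tile_status[tile];
  }
  return status;
}
```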

Fixes

The increase in tile state size revealed issues with .cg loads on P100 when the message size doesn't fit in an architectural word. The fix consists of voting in the spin loop: while (WARP_ANY((status == SCAN_TILE_INVALID), 0xffffffff));. Although this might be considered a breaking change, since ScanTileState<T, false>::WaitForValid didn't use to be cooperative, I think it's okay: ScanTileState<T, true>::WaitForValid is cooperative, and we never guaranteed non-cooperative behavior anyway.
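In context, the cooperative wait could look roughly like this. LoadStatus is a hypothetical stand-in for the strong tile-state load, Delay comes from the sketch above, and WARP_ANY is CUB's warp-wide vote from cub/util_ptx.cuh:

```cuda
// Sketch of the cooperative spin loop: every lane keeps iterating until
// *all* lanes have observed a valid status, so the warp stays converged.
__device__ void WaitForValidSketch(int tile_idx, unsigned int &status)
{
  do
  {
    Delay(350);                   // backoff, as in the sketch above
    status = LoadStatus(tile_idx);
  } while (WARP_ANY((status == SCAN_TILE_INVALID), 0xffffffff));
}
```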

Results

To benchmark the proposed optimizations, I've selected various GPUs and all algorithms that depend on decoupled look-back. Since the algorithm is sensitive to the compute/memory clock ratio, I've run benchmarks with both base and TDP-locked clocks. In general, on large problem sizes the speedup is significant enough to be visible even without locking clocks. Below is the distribution of execution times for the old (main) and new (optimized) decoupled look-back for the device exclusive sum on A100.

[Figure: execution-time density, main vs. optimized decoupled look-back (device exclusive sum, A100)]

Apart from the speedup, the new version also shows a smaller deviation. To present the broader results, I've grouped multiple benchmarks by the underlying algorithm. For instance, the select.if speedups for different input patterns and operations are combined into a single list as follows:

|    T     |  Op  |  Pattern  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|----------|------|-----------|------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   U32    | Mid  |    Seq    |    2^28    |   3.302 ms |       3.31% |   2.484 ms |       0.35% |  -817.584 us | -24.76% |   FAIL   |
|   U32    | Mid  |   Const   |    2^28    |   2.946 ms |       3.06% |   2.148 ms |       0.32% |  -797.100 us | -27.06% |   FAIL   |
|   U32    | Mid  |   Rand    |    2^28    |   3.642 ms |       2.64% |   2.867 ms |       0.35% |  -774.785 us | -21.27% |   FAIL   |
|   U32    | Zero |    Seq    |    2^28    |   3.610 ms |       2.86% |   2.820 ms |       0.35% |  -790.071 us | -21.89% |   FAIL   |
|   U32    | Zero |   Const   |    2^28    |   3.543 ms |       3.05% |   2.750 ms |       0.37% |  -792.352 us | -22.37% |   FAIL   |
|   U32    | Zero |   Rand    |    2^28    |   3.551 ms |       3.04% |   2.748 ms |       0.31% |  -803.270 us | -22.62% |   FAIL   |
|   U32    | Even |    Seq    |    2^28    |   3.358 ms |       3.23% |   2.449 ms |       0.34% |  -908.775 us | -27.07% |   FAIL   |
|   U32    | Even |   Const   |    2^28    |   3.704 ms |       3.05% |   2.748 ms |       0.33% |  -955.250 us | -25.79% |   FAIL   |
|   U32    | Even |   Rand    |    2^28    |   3.389 ms |       2.64% |   2.524 ms |       0.27% |  -864.592 us | -25.51% |   FAIL   |

Turns into:

speedups = [24.76, 27.06, 21.27, 21.89, 22.37, 22.62, 27.07, 25.79, 25.51]

Each such list becomes a single bar in the plots below, so the error bars should be read as the spread of speedups across scenarios rather than as run-to-run variance. (The FAIL status in the table above simply flags a statistically significant difference between the two versions; here, every flagged difference is a speedup.) Here's the aggregate result for base clocks:

[Figure: speedup distribution per algorithm, base clocks]

And the image below is for TDP-locked clocks:

[Figure: speedup distribution per algorithm, TDP-locked clocks]

Timeline:

  • Dec 22, 2022: @gevtushenko added the labels "P0: must have" (absolutely necessary; critical issue, major blocker, etc.), "type: bug: functional" (does not work as intended), and "area: performance" (does not perform as intended).
  • Dec 22, 2022: @gevtushenko added a commit to gevtushenko/thrust that referenced this pull request.
  • Dec 22, 2022: @gevtushenko added the "testing: gpuCI in progress" label and changed the title from "Optimize Decouled Look-Back" to "Optimize Decoupled Look-Back".
  • Dec 23, 2022: @gevtushenko added the "testing: gpuCI passed" label and removed "testing: gpuCI in progress".
@elstehle (Collaborator) left a comment:

Thanks for the elaborate PR description and experimental evaluation!

I mostly just have some optional nitpicks. The only question I have is about delay_on_dc_gpu_or_prevent_hoisting, which does not issue a __threadfence_block for SM 70 and above (with the exception of SM 80), while we decided to keep the __threadfence_block around in other places.

Resolved review threads: cub/detail/strong_load.cuh, cub/agent/single_pass_scan_operators.cuh
Comment on lines +122 to +125:

```cuda
if (gridDim.x < GridThreshold)
{
  __threadfence_block();
}
```
Review comment (Collaborator):

nit: I assume this is a heuristic that experimental evaluation on small problem sizes showed to be more performant. Maybe a comment motivating it would help future readers understand 🙂
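For illustration, the kind of motivating comment the reviewer asks for might read as follows; the rationale stated here is an assumption, not a confirmed explanation of the heuristic:

```cuda
// Assumption (illustrative): with a small grid, few tiles contend on the
// tile-state array, so a cheap block-level fence (which also keeps the
// compiler from hoisting the subsequent load) is preferable to a longer
// sleep-based delay.
if (gridDim.x < GridThreshold)
{
  __threadfence_block();
}
```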

Resolved review thread: cub/agent/single_pass_scan_operators.cuh
@gonzalobg (Collaborator) left a comment:

LGTM

Merging this pull request may close the following issue: "Question: how lookback used in scan avoid loading stale data in L1 cache?"