
Optimize Decoupled Look-Back #606

Merged: 2 commits, Feb 7, 2023

Conversation

gevtushenko (Collaborator)

Decoupled Look-Back is at the core of many CUB algorithms. This PR provides a few optimizations that reduce contention on L2 and improve overall performance. It is intended as the first in a series of optimizations/tunings, so selecting the best parameters is out of scope for now. This PR also addresses the following issue by relying on strong operations (.relaxed, .release, etc.) instead of cache hints (.cg).
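To illustrate the strong-vs-hint distinction, here is a minimal sketch of a GPU-scoped relaxed load expressed with inline PTX; the actual helpers in cub/detail/strong_load.cuh may differ in names and signatures:

```cuda
// Sketch (not CUB's exact code): a "strong" GPU-scoped relaxed load,
// as opposed to the ld.global.cg cache *hint* used previously.
// ld.relaxed requires sm_70+; older architectures need a fence-based fallback.
__device__ __forceinline__ unsigned int LoadRelaxed(const unsigned int *ptr)
{
  unsigned int value;
  asm volatile("ld.relaxed.gpu.u32 %0, [%1];"
               : "=r"(value)
               : "l"(ptr)
               : "memory");
  return value;
}
```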

Optimizations

  • Introduce a larger delay before loading the first look-back window, since its data is unlikely to have been updated yet.
  • Try loading tile states once before falling into the spin loop. Once the first window is in the partial state, the previous ones are likely to be as well; in other words, there's no point in waiting before loading.
  • Introduce a delay into the spin loop of WaitForValid to reduce contention and help the signal propagate faster (see the sketch after this list).
  • Make the tile state at least four bytes (it used to be two) to distribute the load across a larger number of cache lines.
  • Make the flag size U32 instead of U8 (for the same reason as above) when the message size doesn't let us use a single architectural word and we have to store flags separately.
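Put together, the waiting pattern looks roughly like the following. This is a hedged sketch with illustrative names (Delay, WaitForTile) and placeholder delay values; the actual CUB code differs:

```cuda
// Illustrative backoff helper: sleep on Volta+ where __nanosleep exists,
// otherwise fall back to a block fence, which also keeps the compiler
// from hoisting the subsequent load.
__device__ __forceinline__ void Delay(unsigned int ns)
{
#if __CUDA_ARCH__ >= 700
  __nanosleep(ns);
#else
  __threadfence_block();
#endif
}

__device__ unsigned int WaitForTile(volatile unsigned int *tile_status, int tile)
{
  Delay(450);                              // larger initial delay: the first
                                           // window is unlikely to be ready yet
  unsigned int status = tile_status[tile]; // try once before spinning
  while (status == 0 /* SCAN_TILE_INVALID */)
  {
    Delay(350);                            // backoff inside the spin loop cuts
                                           // L2 contention and helps the update
                                           // propagate faster
    status = tile_status[tile];
  }
  return status;
}
```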

Fixes

The increase in tile state size revealed issues with .cg loads on P100 when the message size doesn't fit in an architectural word. The fix consists of voting in the spin loop: while (WARP_ANY((status == SCAN_TILE_INVALID), 0xffffffff));. Although this might be considered a breaking change, since ScanTileState<T, false>::WaitForValid didn't use to be cooperative, I think it's okay: ScanTileState<T, true>::WaitForValid is cooperative, and we never guaranteed non-cooperative behavior anyway.
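In context, the cooperative wait could look roughly like this. LoadStatus is a hypothetical stand-in for the strong tile-state load, Delay comes from the sketch above, and WARP_ANY is CUB's warp-wide vote from cub/util_ptx.cuh:

```cuda
// Sketch of the cooperative spin loop: every lane keeps iterating until
// *all* lanes have observed a valid status, so the warp stays converged.
__device__ void WaitForValidSketch(int tile_idx, unsigned int &status)
{
  do
  {
    Delay(350);                   // backoff, as in the sketch above
    status = LoadStatus(tile_idx);
  } while (WARP_ANY((status == SCAN_TILE_INVALID), 0xffffffff));
}
```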

Results

To benchmark the proposed optimizations, I've selected various GPUs and all algorithms that depend on decoupled look-back. Since the algorithm is sensitive to the compute/memory clock ratio, I've run benchmarks with both base and TDP-locked clocks. In general, on large problem sizes the speedup is significant enough to be visible even without locking clocks. Below is the distribution of execution times for the old (main) and new (optimized) decoupled look-back for the device exclusive sum on A100.

[Figure: execution-time density, main vs. optimized decoupled look-back (device exclusive sum, A100)]

Apart from the speedup, the new version also shows a smaller deviation. To present the broader results, I've grouped multiple benchmarks by the underlying algorithm. For instance, the select.if speedups for different input patterns and operations are combined into a single list as follows:

|    T     |  Op  |  Pattern  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|----------|------|-----------|------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   U32    | Mid  |    Seq    |    2^28    |   3.302 ms |       3.31% |   2.484 ms |       0.35% |  -817.584 us | -24.76% |   FAIL   |
|   U32    | Mid  |   Const   |    2^28    |   2.946 ms |       3.06% |   2.148 ms |       0.32% |  -797.100 us | -27.06% |   FAIL   |
|   U32    | Mid  |   Rand    |    2^28    |   3.642 ms |       2.64% |   2.867 ms |       0.35% |  -774.785 us | -21.27% |   FAIL   |
|   U32    | Zero |    Seq    |    2^28    |   3.610 ms |       2.86% |   2.820 ms |       0.35% |  -790.071 us | -21.89% |   FAIL   |
|   U32    | Zero |   Const   |    2^28    |   3.543 ms |       3.05% |   2.750 ms |       0.37% |  -792.352 us | -22.37% |   FAIL   |
|   U32    | Zero |   Rand    |    2^28    |   3.551 ms |       3.04% |   2.748 ms |       0.31% |  -803.270 us | -22.62% |   FAIL   |
|   U32    | Even |    Seq    |    2^28    |   3.358 ms |       3.23% |   2.449 ms |       0.34% |  -908.775 us | -27.07% |   FAIL   |
|   U32    | Even |   Const   |    2^28    |   3.704 ms |       3.05% |   2.748 ms |       0.33% |  -955.250 us | -25.79% |   FAIL   |
|   U32    | Even |   Rand    |    2^28    |   3.389 ms |       2.64% |   2.524 ms |       0.27% |  -864.592 us | -25.51% |   FAIL   |

Turns into:

speedups = [24.76, 27.06, 21.27, 21.89, 22.37, 22.62, 27.07, 25.79, 25.51]

Each such list becomes a single bar in the plots below, so the error bars should be read as the spread of speedups across scenarios rather than as run-to-run variance. (The FAIL status in the table above simply flags a statistically significant difference between the two versions; here, every flagged difference is a speedup.) Here's the aggregate result for base clocks:

[Figure: speedup distribution per algorithm, base clocks]

And the image below is for TDP-locked clocks:

[Figure: speedup distribution per algorithm, TDP-locked clocks]

Timeline:

  • Dec 22, 2022: @gevtushenko added the labels "P0: must have" (absolutely necessary; critical issue, major blocker, etc.), "type: bug: functional" (does not work as intended), and "area: performance" (does not perform as intended).
  • Dec 22, 2022: @gevtushenko added a commit to gevtushenko/thrust that referenced this pull request.
  • Dec 22, 2022: @gevtushenko added the "testing: gpuCI in progress" label and changed the title from "Optimize Decouled Look-Back" to "Optimize Decoupled Look-Back".
  • Dec 23, 2022: @gevtushenko added the "testing: gpuCI passed" label and removed "testing: gpuCI in progress".
@elstehle (Collaborator) left a comment:

Thanks for the elaborate PR description and experimental evaluation!

I mostly just have some optional nitpicks. The only question I have is about delay_on_dc_gpu_or_prevent_hoisting, which does not issue a __threadfence_block for SM 70 and above (with the exception of SM 80), while we decided to keep the __threadfence_block around in other places.

Resolved review threads: cub/detail/strong_load.cuh, cub/agent/single_pass_scan_operators.cuh
Comment on lines +122 to +125:

```cuda
if (gridDim.x < GridThreshold)
{
  __threadfence_block();
}
```
Review comment (Collaborator):

nit: I assume this is a heuristic that experimental evaluation on small problem sizes showed to be more performant. Maybe a comment motivating it would help future readers understand 🙂
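For illustration, the kind of motivating comment the reviewer asks for might read as follows; the rationale stated here is an assumption, not a confirmed explanation of the heuristic:

```cuda
// Assumption (illustrative): with a small grid, few tiles contend on the
// tile-state array, so a cheap block-level fence (which also keeps the
// compiler from hoisting the subsequent load) is preferable to a longer
// sleep-based delay.
if (gridDim.x < GridThreshold)
{
  __threadfence_block();
}
```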

Resolved review thread: cub/agent/single_pass_scan_operators.cuh
@gonzalobg (Collaborator) left a comment:

LGTM

Merging this pull request may close the following issue: "Question: how lookback used in scan avoid loading stale data in L1 cache?"