This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Fix RLE when items[0] is NaN #598

Merged: gevtushenko merged 4 commits into NVIDIA:main on Nov 30, 2022

Conversation

@elstehle elstehle (Collaborator) commented Nov 28, 2022

This is a suggestion to fix #596.

Situation

The root cause is in cub/agent/agent_reduce_by_key.cuh, where, for the very first tile of items, we're using keys[0] as the tile_predecessor that is later fed into the BlockDiscontinuity:

```cpp
tile_predecessor = (tile_idx == 0) ? keys[0] : d_keys_in[tile_offset - 1];

// ...
BlockDiscontinuityKeys(temp_storage.scan_storage.discontinuity)
  .FlagHeads(head_flags, keys, prev_keys, flag_op, tile_predecessor);
```

Problem

Since NaN == NaN is false, if keys[0] is NaN, BlockDiscontinuityKeys evaluates tile_predecessor == keys[0] to false and flags keys[0] as the beginning of a new run.
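
For context, a minimal repro sketch along the lines of #596 (my own illustration, not code from the PR; the float input and the particular values are assumptions). Since NaN compares unequal to everything, including itself, each leading NaN should form its own run of length 1:

```cpp
// Hypothetical repro sketch for the NaN issue; not part of the PR.
#include <cub/device/device_run_length_encode.cuh>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

int main()
{
  const int num_items = 4;
  const float h_in[num_items] = {NAN, NAN, 1.0f, 1.0f};

  float *d_in, *d_unique;
  int *d_counts, *d_num_runs;
  cudaMalloc(&d_in, num_items * sizeof(float));
  cudaMalloc(&d_unique, num_items * sizeof(float));
  cudaMalloc(&d_counts, num_items * sizeof(int));
  cudaMalloc(&d_num_runs, sizeof(int));
  cudaMemcpy(d_in, h_in, num_items * sizeof(float), cudaMemcpyHostToDevice);

  // First call queries temporary storage; second call runs the encode.
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceRunLengthEncode::Encode(d_temp, temp_bytes, d_in, d_unique,
                                     d_counts, d_num_runs, num_items);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceRunLengthEncode::Encode(d_temp, temp_bytes, d_in, d_unique,
                                     d_counts, d_num_runs, num_items);

  int h_num_runs = 0;
  cudaMemcpy(&h_num_runs, d_num_runs, sizeof(int), cudaMemcpyDeviceToHost);

  // Since NaN != NaN, each leading NaN is its own run:
  // expected runs are {NaN:1, NaN:1, 1.0f:2}, i.e. 3 runs.
  std::printf("num_runs = %d (expected 3)\n", h_num_runs);
  return 0;
}
```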

Suggested Solution

After BlockDiscontinuity has run, we reset the flag on the very first item:

```cpp
// scan_items[0].key carries the head flag of the very first item. Force it
// to 0: the first item can never begin a *new* run, it heads the first one.
if (threadIdx.x == 0 && tile_idx == 0)
{
  scan_items[0].key = 0;
}
```

Alternative fix

An alternative fix would be to invoke a different overload of BlockDiscontinuityKeys for the very first tile of items, i.e., one that does not take the tile_predecessor. However, this comes at the cost of increased kernel size, as we'd end up with four BlockDiscontinuityKeys instantiations instead of two. This is the part that would have to be changed:

```cpp
if (IS_LAST_TILE)
{
  // Use custom flag operator to additionally flag the first out-of-bounds
  // item
  GuardedInequalityWrapper<EqualityOpT> flag_op(equality_op, num_remaining);
  BlockDiscontinuityKeys(temp_storage.scan_storage.discontinuity)
    .FlagHeads(head_flags, keys, prev_keys, flag_op, tile_predecessor);
}
else
{
  InequalityWrapper<EqualityOpT> flag_op(equality_op);
  BlockDiscontinuityKeys(temp_storage.scan_storage.discontinuity)
    .FlagHeads(head_flags, keys, prev_keys, flag_op, tile_predecessor);
}
```

@miscco miscco (Collaborator) left a comment:

Thanks a lot, that is a really nice, contained fix.

@gevtushenko gevtushenko (Collaborator) left a comment:

I'll put the algorithm description here to explain the current issue.

As the first step, ReduceByKey flags the first element of each run in the key sequence. Under normal conditions we'd get:

```
keys  {    1, 1, 2, ...};
pred  {undef, 1, 1, ...};
flags {    1, 0, 1, ...};
```

But instead, for the first tile we provide a tile predecessor item that's always equal to the very first key:

```
keys  {/* 1 */ 1, 1, 2, ...};
pred  {        1, 1, 1, ...};
flags {        0, 0, 1, ...};
```

After that, we scan pairs of the form make_pair(flag, 1) using ReduceBySegmentOp:

```
scan_items {{0, 1}, {0, 1}, {1, 1}, ...};
excl_scan  {          ... , {0, 2}};
```

Later, in the scatter phase, we check the flag of each item i; whenever we see the start of a new run, we write pred[i] as a unique key at offset excl_scan[i].key, together with excl_scan[i].value as the count of that run.

This contract means that the very first key only gets written when the next unique key is processed. So the first head flag has to be false; otherwise we'll attempt to write the first key at an undefined offset (which actually lies out of bounds of the num_items-element array) and then write it again when processing the second unique key.

This second write should mask the newly introduced bug, since the count is calculated correctly and the unique key does get written. But the out-of-bounds access is not great, and it would introduce further issues when the ScatterTwoPhase path is triggered.
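
To make this contract concrete, here is a small host-side simulation (my own sketch, not CUB code; the names excl, running, and the output buffers are mine) of the flag / scan / scatter pipeline on a NaN input, with the first flag correctly forced to zero:

```cpp
// Host-side sketch of the contract described above; not CUB code.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

int main()
{
  const std::vector<float> keys = {NAN, NAN, 1.0f, 1.0f};
  const int n = static_cast<int>(keys.size());

  // Head flags: flags[i] == 1 iff keys[i] starts a new run. The contract
  // requires flags[0] == 0: the first run's key is only written out when
  // the next run head is encountered.
  std::vector<int> flags(n, 0);
  for (int i = 1; i < n; ++i)
  {
    flags[i] = (keys[i] == keys[i - 1]) ? 0 : 1; // NaN != NaN -> 1
  }

  // Exclusive scan of (flag, 1) pairs with ReduceBySegmentOp semantics:
  // the key component accumulates flags (output offsets), the value
  // component counts items and resets at each run head.
  std::vector<std::pair<int, int>> excl(n);
  std::pair<int, int> running{0, 0};
  for (int i = 0; i < n; ++i)
  {
    excl[i] = running;
    running.first += flags[i];
    running.second = flags[i] ? 1 : running.second + 1;
  }

  // Scatter: each run head emits the *preceding* run. If flags[0] were 1,
  // this loop would read keys[-1] and write at an undefined offset, i.e.
  // the out-of-bounds access described above.
  std::vector<float> unique_out(n);
  std::vector<int> counts_out(n);
  for (int i = 0; i < n; ++i)
  {
    if (flags[i])
    {
      unique_out[excl[i].first] = keys[i - 1]; // pred
      counts_out[excl[i].first] = excl[i].second;
    }
  }

  // The trailing run is emitted separately from the inclusive totals.
  unique_out[running.first] = keys[n - 1];
  counts_out[running.first] = running.second;

  const int num_runs = running.first + 1;
  std::printf("num_runs = %d\n", num_runs); // 3: {NaN:1, NaN:1, 1:2}
  for (int r = 0; r < num_runs; ++r)
  {
    std::printf("run %d: key = %f, count = %d\n", r, unique_out[r], counts_out[r]);
  }
  return 0;
}
```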

I'd advise resetting the flag before filling the scan items, instead of adjusting the scan items afterwards. In this case the algorithm behaves as expected:

```cpp
// Clear the head flag of the very first item before it is zipped into
// scan_items, so both the scan and the scatter see a zero flag.
if (threadIdx.x == 0 && tile_idx == 0)
{
  head_flags[0] = 0;
}

#pragma unroll
for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
{
  scan_items[ITEM].value = values[ITEM];
  scan_items[ITEM].key   = head_flags[ITEM];
}
```

Besides that, the alternative solution discussed in this PR wouldn't work, for the same reason: the overload of BlockDiscontinuity that doesn't take a tile predecessor always sets the first item's flag to 1.
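
A minimal device-side sketch (again mine, not from the PR) confirming that claim with a toy block of four threads holding identical keys:

```cpp
// Sketch demonstrating that the FlagHeads overload without a tile
// predecessor unconditionally flags the first item; not part of the PR.
#include <cub/block/block_discontinuity.cuh>
#include <cub/thread/thread_operators.cuh>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void FlagHeadsDemo()
{
  using BlockDiscontinuityT = cub::BlockDiscontinuity<int, 4>;
  __shared__ typename BlockDiscontinuityT::TempStorage temp_storage;

  int keys[1] = {42}; // every thread holds the same key
  int flags[1];

  // No tile_predecessor overload: thread0's first item is always a head.
  BlockDiscontinuityT(temp_storage).FlagHeads(flags, keys, cub::Inequality());

  if (threadIdx.x == 0)
  {
    printf("flags[0] = %d\n", flags[0]); // prints 1 despite all-equal keys
  }
}

int main()
{
  FlagHeadsDemo<<<1, 4>>>();
  cudaDeviceSynchronize();
  return 0;
}
```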

@elstehle elstehle (Collaborator, Author) commented:

Thanks for the thorough explanation and your suggestion, @senior-zero. I've adopted it 👍

@gevtushenko gevtushenko (Collaborator) left a comment:

Thank you for the fixes! I'll start CI.

@gevtushenko gevtushenko added the labels "P1: should have" (Necessary, but not critical.) and "type: bug: functional" (Does not work as intended.) on Nov 29, 2022
gevtushenko added a commit to gevtushenko/thrust that referenced this pull request on Nov 29, 2022
@gevtushenko gevtushenko added the labels "testing: gpuCI in progress" (Started gpuCI testing.) and "testing: gpuCI passed" (Passed gpuCI testing.), and removed "testing: gpuCI in progress", on Nov 29, 2022
@elstehle elstehle (Collaborator, Author) commented:

# cub::DeviceRunLengthEncode::Encode

## [0] Tesla V100-SXM2-32GB

|  T  |  Elements  |  Pattern  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-----|------------|-----------|------------|-------------|------------|-------------|-----------|---------|----------|
| I8  |    2^20    |  random   |  19.068 us |       5.55% |  19.343 us |       6.42% |  0.275 us |   1.44% |   PASS   |
| I8  |    2^22    |  random   |  49.194 us |       1.53% |  49.096 us |       1.62% | -0.098 us |  -0.20% |   PASS   |
| I8  |    2^24    |  random   | 184.715 us |       0.46% | 184.799 us |       0.48% |  0.084 us |   0.05% |   PASS   |
| I8  |    2^26    |  random   | 743.839 us |       0.32% | 743.867 us |       0.31% |  0.027 us |   0.00% |   PASS   |
| I8  |    2^28    |  random   |   2.978 ms |       0.50% |   2.979 ms |       0.50% |  0.644 us |   0.02% |   PASS   |
| I16 |    2^20    |  random   |  20.463 us |       3.60% |  20.492 us |       3.41% |  0.030 us |   0.15% |   PASS   |
| I16 |    2^22    |  random   |  54.299 us |       1.52% |  54.363 us |       1.45% |  0.065 us |   0.12% |   PASS   |
| I16 |    2^24    |  random   | 200.045 us |       0.55% | 200.096 us |       0.50% |  0.051 us |   0.03% |   PASS   |
| I16 |    2^26    |  random   | 785.927 us |       0.47% | 786.012 us |       0.47% |  0.085 us |   0.01% |   PASS   |
| I16 |    2^28    |  random   |   3.127 ms |       0.50% |   3.127 ms |       0.50% |  0.099 us |   0.00% |   PASS   |
| I32 |    2^20    |  random   |  29.564 us |       4.04% |  29.661 us |       3.73% |  0.098 us |   0.33% |   PASS   |
| I32 |    2^22    |  random   |  89.184 us |       1.74% |  89.311 us |       1.71% |  0.127 us |   0.14% |   PASS   |
| I32 |    2^24    |  random   | 325.111 us |       0.89% | 325.154 us |       0.93% |  0.043 us |   0.01% |   PASS   |
| I32 |    2^26    |  random   |   1.276 ms |       0.60% |   1.276 ms |       0.57% |  0.086 us |   0.01% |   PASS   |
| I32 |    2^28    |  random   |   5.130 ms |       0.50% |   5.130 ms |       0.50% |  0.322 us |   0.01% |   PASS   |
| I64 |    2^20    |  random   |  29.464 us |       3.64% |  29.545 us |       3.80% |  0.080 us |   0.27% |   PASS   |
| I64 |    2^22    |  random   |  70.925 us |       1.53% |  70.980 us |       1.46% |  0.055 us |   0.08% |   PASS   |
| I64 |    2^24    |  random   | 236.658 us |       0.86% | 236.649 us |       0.88% | -0.009 us |  -0.00% |   PASS   |
| I64 |    2^26    |  random   | 900.727 us |       0.74% | 900.695 us |       0.74% | -0.032 us |  -0.00% |   PASS   |
| I64 |    2^28    |  random   |   3.557 ms |       0.73% |   3.557 ms |       0.73% | -0.060 us |  -0.00% |   PASS   |

@gevtushenko gevtushenko merged commit 6c496f4 into NVIDIA:main Nov 30, 2022
This pull request closed issue #596, "DeviceRunLengthEncode returns incorrect result with nan".