This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Fix RLE when items[0] is NaN #598

Merged: gevtushenko merged 4 commits into NVIDIA:main on Nov 30, 2022

Conversation

@elstehle elstehle (Collaborator) commented Nov 28, 2022

This is a suggestion to fix #596.

Situation

The root cause is in cub/agent/agent_reduce_by_key.cuh, where, for the very first tile of items, we're using keys[0] as the tile_predecessor that is later fed into the BlockDiscontinuity:

```cpp
tile_predecessor = (tile_idx == 0) ? keys[0] : d_keys_in[tile_offset - 1];

// ...
BlockDiscontinuityKeys(temp_storage.scan_storage.discontinuity)
  .FlagHeads(head_flags, keys, prev_keys, flag_op, tile_predecessor);
```

Problem

Since NaN == NaN is false, if keys[0] is NaN, BlockDiscontinuityKeys evaluates tile_predecessor == keys[0] to false and flags keys[0] as the beginning of a new run.
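
For context, a minimal repro sketch along the lines of #596 (my own illustration, not code from the PR; the float input and the particular values are assumptions). Since NaN compares unequal to everything, including itself, each leading NaN should form its own run of length 1:

```cpp
// Hypothetical repro sketch for the NaN issue; not part of the PR.
#include <cub/device/device_run_length_encode.cuh>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

int main()
{
  const int num_items = 4;
  const float h_in[num_items] = {NAN, NAN, 1.0f, 1.0f};

  float *d_in, *d_unique;
  int *d_counts, *d_num_runs;
  cudaMalloc(&d_in, num_items * sizeof(float));
  cudaMalloc(&d_unique, num_items * sizeof(float));
  cudaMalloc(&d_counts, num_items * sizeof(int));
  cudaMalloc(&d_num_runs, sizeof(int));
  cudaMemcpy(d_in, h_in, num_items * sizeof(float), cudaMemcpyHostToDevice);

  // First call queries temporary storage; second call runs the encode.
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceRunLengthEncode::Encode(d_temp, temp_bytes, d_in, d_unique,
                                     d_counts, d_num_runs, num_items);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceRunLengthEncode::Encode(d_temp, temp_bytes, d_in, d_unique,
                                     d_counts, d_num_runs, num_items);

  int h_num_runs = 0;
  cudaMemcpy(&h_num_runs, d_num_runs, sizeof(int), cudaMemcpyDeviceToHost);

  // Since NaN != NaN, each leading NaN is its own run:
  // expected runs are {NaN:1, NaN:1, 1.0f:2}, i.e. 3 runs.
  std::printf("num_runs = %d (expected 3)\n", h_num_runs);
  return 0;
}
```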

Suggested Solution

After BlockDiscontinuity has run, we reset the flag on the very first item:

```cpp
// scan_items[0].key carries the head flag of the very first item. Force it
// to 0: the first item can never begin a *new* run, it heads the first one.
if (threadIdx.x == 0 && tile_idx == 0)
{
  scan_items[0].key = 0;
}
```

Alternative fix

An alternative fix would be to invoke a different overload of BlockDiscontinuityKeys for the very first tile of items, i.e., one that does not take the tile_predecessor. However, this comes at the cost of increased kernel size, as we'd end up with four BlockDiscontinuityKeys instantiations instead of two. This is the part that would have to be changed:

```cpp
if (IS_LAST_TILE)
{
  // Use custom flag operator to additionally flag the first out-of-bounds
  // item
  GuardedInequalityWrapper<EqualityOpT> flag_op(equality_op, num_remaining);
  BlockDiscontinuityKeys(temp_storage.scan_storage.discontinuity)
    .FlagHeads(head_flags, keys, prev_keys, flag_op, tile_predecessor);
}
else
{
  InequalityWrapper<EqualityOpT> flag_op(equality_op);
  BlockDiscontinuityKeys(temp_storage.scan_storage.discontinuity)
    .FlagHeads(head_flags, keys, prev_keys, flag_op, tile_predecessor);
}
```

@miscco miscco (Collaborator) left a comment:

Thanks a lot, that is a really nice, contained fix.

@gevtushenko gevtushenko (Collaborator) left a comment:

I'll put the algorithm description here to explain the current issue.

As the first step, ReduceByKey flags the first element of each run in the key sequence. Under normal conditions we'd get:

```
keys  {    1, 1, 2, ...};
pred  {undef, 1, 1, ...};
flags {    1, 0, 1, ...};
```

But instead, for the first tile we provide a tile predecessor item that's always equal to the very first key:

```
keys  {/* 1 */ 1, 1, 2, ...};
pred  {        1, 1, 1, ...};
flags {        0, 0, 1, ...};
```

After that, we scan pairs of the form make_pair(flag, 1) using ReduceBySegmentOp:

```
scan_items {{0, 1}, {0, 1}, {1, 1}, ...};
excl_scan  {          ... , {0, 2}};
```

Later, in the scatter phase, we check the flag of each item i; whenever we see the start of a new run, we write pred[i] as a unique key at offset excl_scan[i].key, together with excl_scan[i].value as the count of that run.

This contract means that the very first key only gets written when the next unique key is processed. So the first head flag has to be false; otherwise we'll attempt to write the first key at an undefined offset (which actually lies out of bounds of the num_items-element array) and then write it again when processing the second unique key.

This second write should mask the newly introduced bug, since the count is calculated correctly and the unique key does get written. But the out-of-bounds access is not great, and it would introduce further issues when the ScatterTwoPhase path is triggered.
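
To make this contract concrete, here is a small host-side simulation (my own sketch, not CUB code; the names excl, running, and the output buffers are mine) of the flag / scan / scatter pipeline on a NaN input, with the first flag correctly forced to zero:

```cpp
// Host-side sketch of the contract described above; not CUB code.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

int main()
{
  const std::vector<float> keys = {NAN, NAN, 1.0f, 1.0f};
  const int n = static_cast<int>(keys.size());

  // Head flags: flags[i] == 1 iff keys[i] starts a new run. The contract
  // requires flags[0] == 0: the first run's key is only written out when
  // the next run head is encountered.
  std::vector<int> flags(n, 0);
  for (int i = 1; i < n; ++i)
  {
    flags[i] = (keys[i] == keys[i - 1]) ? 0 : 1; // NaN != NaN -> 1
  }

  // Exclusive scan of (flag, 1) pairs with ReduceBySegmentOp semantics:
  // the key component accumulates flags (output offsets), the value
  // component counts items and resets at each run head.
  std::vector<std::pair<int, int>> excl(n);
  std::pair<int, int> running{0, 0};
  for (int i = 0; i < n; ++i)
  {
    excl[i] = running;
    running.first += flags[i];
    running.second = flags[i] ? 1 : running.second + 1;
  }

  // Scatter: each run head emits the *preceding* run. If flags[0] were 1,
  // this loop would read keys[-1] and write at an undefined offset, i.e.
  // the out-of-bounds access described above.
  std::vector<float> unique_out(n);
  std::vector<int> counts_out(n);
  for (int i = 0; i < n; ++i)
  {
    if (flags[i])
    {
      unique_out[excl[i].first] = keys[i - 1]; // pred
      counts_out[excl[i].first] = excl[i].second;
    }
  }

  // The trailing run is emitted separately from the inclusive totals.
  unique_out[running.first] = keys[n - 1];
  counts_out[running.first] = running.second;

  const int num_runs = running.first + 1;
  std::printf("num_runs = %d\n", num_runs); // 3: {NaN:1, NaN:1, 1:2}
  for (int r = 0; r < num_runs; ++r)
  {
    std::printf("run %d: key = %f, count = %d\n", r, unique_out[r], counts_out[r]);
  }
  return 0;
}
```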

I'd advise resetting the flag before filling the scan items, instead of adjusting the scan items afterwards. In this case the algorithm behaves as expected:

```cpp
// Clear the head flag of the very first item before it is zipped into
// scan_items, so both the scan and the scatter see a zero flag.
if (threadIdx.x == 0 && tile_idx == 0)
{
  head_flags[0] = 0;
}

#pragma unroll
for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
{
  scan_items[ITEM].value = values[ITEM];
  scan_items[ITEM].key   = head_flags[ITEM];
}
```

Besides that, the alternative solution discussed in this PR wouldn't work, for the same reason: the overload of BlockDiscontinuity that doesn't take a tile predecessor always sets the first item's flag to 1.
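
A minimal device-side sketch (again mine, not from the PR) confirming that claim with a toy block of four threads holding identical keys:

```cpp
// Sketch demonstrating that the FlagHeads overload without a tile
// predecessor unconditionally flags the first item; not part of the PR.
#include <cub/block/block_discontinuity.cuh>
#include <cub/thread/thread_operators.cuh>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void FlagHeadsDemo()
{
  using BlockDiscontinuityT = cub::BlockDiscontinuity<int, 4>;
  __shared__ typename BlockDiscontinuityT::TempStorage temp_storage;

  int keys[1] = {42}; // every thread holds the same key
  int flags[1];

  // No tile_predecessor overload: thread0's first item is always a head.
  BlockDiscontinuityT(temp_storage).FlagHeads(flags, keys, cub::Inequality());

  if (threadIdx.x == 0)
  {
    printf("flags[0] = %d\n", flags[0]); // prints 1 despite all-equal keys
  }
}

int main()
{
  FlagHeadsDemo<<<1, 4>>>();
  cudaDeviceSynchronize();
  return 0;
}
```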

@elstehle elstehle (Collaborator, Author) commented:

Thanks for the thorough explanation and your suggestion, @senior-zero. I've adopted it 👍

@gevtushenko gevtushenko (Collaborator) left a comment:

Thank you for the fixes! I'll start CI.

@gevtushenko gevtushenko added the labels "P1: should have" (Necessary, but not critical.) and "type: bug: functional" (Does not work as intended.) on Nov 29, 2022
gevtushenko added a commit to gevtushenko/thrust that referenced this pull request on Nov 29, 2022
@gevtushenko gevtushenko added the labels "testing: gpuCI in progress" (Started gpuCI testing.) and "testing: gpuCI passed" (Passed gpuCI testing.), and removed "testing: gpuCI in progress", on Nov 29, 2022
@elstehle elstehle (Collaborator, Author) commented:

# cub::DeviceRunLengthEncode::Encode

## [0] Tesla V100-SXM2-32GB

|  T  |  Elements  |  Pattern  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-----|------------|-----------|------------|-------------|------------|-------------|-----------|---------|----------|
| I8  |    2^20    |  random   |  19.068 us |       5.55% |  19.343 us |       6.42% |  0.275 us |   1.44% |   PASS   |
| I8  |    2^22    |  random   |  49.194 us |       1.53% |  49.096 us |       1.62% | -0.098 us |  -0.20% |   PASS   |
| I8  |    2^24    |  random   | 184.715 us |       0.46% | 184.799 us |       0.48% |  0.084 us |   0.05% |   PASS   |
| I8  |    2^26    |  random   | 743.839 us |       0.32% | 743.867 us |       0.31% |  0.027 us |   0.00% |   PASS   |
| I8  |    2^28    |  random   |   2.978 ms |       0.50% |   2.979 ms |       0.50% |  0.644 us |   0.02% |   PASS   |
| I16 |    2^20    |  random   |  20.463 us |       3.60% |  20.492 us |       3.41% |  0.030 us |   0.15% |   PASS   |
| I16 |    2^22    |  random   |  54.299 us |       1.52% |  54.363 us |       1.45% |  0.065 us |   0.12% |   PASS   |
| I16 |    2^24    |  random   | 200.045 us |       0.55% | 200.096 us |       0.50% |  0.051 us |   0.03% |   PASS   |
| I16 |    2^26    |  random   | 785.927 us |       0.47% | 786.012 us |       0.47% |  0.085 us |   0.01% |   PASS   |
| I16 |    2^28    |  random   |   3.127 ms |       0.50% |   3.127 ms |       0.50% |  0.099 us |   0.00% |   PASS   |
| I32 |    2^20    |  random   |  29.564 us |       4.04% |  29.661 us |       3.73% |  0.098 us |   0.33% |   PASS   |
| I32 |    2^22    |  random   |  89.184 us |       1.74% |  89.311 us |       1.71% |  0.127 us |   0.14% |   PASS   |
| I32 |    2^24    |  random   | 325.111 us |       0.89% | 325.154 us |       0.93% |  0.043 us |   0.01% |   PASS   |
| I32 |    2^26    |  random   |   1.276 ms |       0.60% |   1.276 ms |       0.57% |  0.086 us |   0.01% |   PASS   |
| I32 |    2^28    |  random   |   5.130 ms |       0.50% |   5.130 ms |       0.50% |  0.322 us |   0.01% |   PASS   |
| I64 |    2^20    |  random   |  29.464 us |       3.64% |  29.545 us |       3.80% |  0.080 us |   0.27% |   PASS   |
| I64 |    2^22    |  random   |  70.925 us |       1.53% |  70.980 us |       1.46% |  0.055 us |   0.08% |   PASS   |
| I64 |    2^24    |  random   | 236.658 us |       0.86% | 236.649 us |       0.88% | -0.009 us |  -0.00% |   PASS   |
| I64 |    2^26    |  random   | 900.727 us |       0.74% | 900.695 us |       0.74% | -0.032 us |  -0.00% |   PASS   |
| I64 |    2^28    |  random   |   3.557 ms |       0.73% |   3.557 ms |       0.73% | -0.060 us |  -0.00% |   PASS   |

@gevtushenko gevtushenko merged commit 6c496f4 into NVIDIA:main Nov 30, 2022
This pull request closed issue #596, "DeviceRunLengthEncode returns incorrect result with nan".