Thanks a lot, that is a really nice contained fix
I'll put the algorithm description here to explain the current issue.
As the first step, `ReduceByKey` marks the first unique element in the key sequence. Under normal conditions we'd get:
```
keys  {     1, 1, 2, ...};
pred  { undef, 1, 1, ...};
flags {     1, 0, 1, ...};
```
But instead, we provide a tile predecessor item that's always equal to the very first key:
```
keys  {/* 1 */ 1, 1, 2, ...};
pred  {        1, 1, 1, ...};
flags {        0, 0, 1, ...};
```
After that, we scan pairs `make_pair(flag, 1)` using `ReduceBySegmentOp`:
```
scan_items {{0, 1}, {0, 1}, {1, 1}, ...};
excl_scan  {   ...         , {0, 2}};
```
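For reference, the combine step of that segmented scan behaves roughly like the sketch below (a simplified model of `cub::ReduceBySegmentOp` with a sum reduction, not the exact library source): the flag part accumulates, and the value part restarts whenever the right-hand item begins a new run.

```cpp
// Simplified model of combining two (flag, value) scan items:
// .key counts run heads seen so far, .value aggregates within the current run
// (here a plain count, since every input value is 1).
struct ScanItemSketch
{
    int key;   // accumulated head flags
    int value; // per-run aggregate
};

ScanItemSketch Combine(ScanItemSketch a, ScanItemSketch b)
{
    ScanItemSketch out;
    out.key   = a.key + b.key;
    out.value = b.key ? b.value : a.value + b.value;
    return out;
}
```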
Later, at the scatter phase, we check `flags` for each `i`, and when we notice a new sequence we write `pred` as a unique key at the `excl_scan.key` offset, along with `excl_scan.value` as the count of that unique run.
This contract means that the very first key is going to be written only when the next unique key is processed. So the first head flag has to be false; otherwise we'd attempt to write the first key at an undefined offset (one that actually lies outside the `num_items`-sized array) and then write it again when processing the second unique key.
This second write should mask the newly introduced bug, since the count is calculated correctly and the unique key does get written, but the out-of-bounds access is not great. It should also introduce more issues when `ScatterTwoPhase` is triggered.
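To make that contract concrete, here is a simplified, single-threaded model of the scatter step (illustrative only; names such as `d_unique_out` and `d_aggregates_out` are assumptions, not a quote of the agent code):

```cpp
// When item i starts a new run, the key of the run that just ended (pred[i])
// and that run's aggregate (excl_scan[i].value) are written at the output
// offset excl_scan[i].key.
for (int i = 0; i < num_items; ++i)
{
    if (flags[i])
    {
        d_unique_out[excl_scan[i].key]     = pred[i];
        d_aggregates_out[excl_scan[i].key] = excl_scan[i].value;
    }
}
// If flags[0] were forced to 1, the first iteration would use an undefined
// offset (out of bounds), and the first key would then be written a second
// time when the next unique key is processed.
```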
I'd advise resetting the flags before filling the scan items, instead of adjusting the scan items afterwards. In that case the algorithm behaves as expected:
```cpp
// Clear the head flag of the very first item of the very first tile
// before the scan items are built from the flags.
if (threadIdx.x == 0 && tile_idx == 0)
{
    head_flags[0] = 0;
}

#pragma unroll
for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
{
    scan_items[ITEM].value = values[ITEM];
    scan_items[ITEM].key   = head_flags[ITEM];
}
```
Besides that, the alternative solution discussed in this PR wouldn't work, for the same reasons: the overload of `BlockDiscontinuity` that doesn't take a tile predecessor always sets the first item's flag to `1`.
Thanks for the elaborate explanation and your suggestion, @senior-zero. I've adopted it 👍
Thank you for the fixes! I'll start CI.
This is a suggestion to fix #596.
Situation
The root cause is in `cub/agent/agent_reduce_by_key.cuh`, where, for the very first tile of items, we're using `keys[0]` as the `tile_predecessor` that is later fed into the `BlockDiscontinuity`:
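Roughly, the relevant pattern looks like the sketch below (paraphrased for illustration; identifiers and the temp-storage layout are approximate, not the exact agent source):

```cpp
if (tile_idx == 0)
{
    // Very first tile: the first key of the tile itself is used as the
    // tile predecessor that BlockDiscontinuity compares keys[0] against.
    KeyT tile_predecessor = keys[0];

    BlockDiscontinuityKeysT(temp_storage.discontinuity)
        .FlagHeads(head_flags, keys, prev_keys, inequality_op, tile_predecessor);
}
```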
Problem
Since `NaN == NaN` is false: if `keys[0]` is `NaN`, `BlockDiscontinuityKeys` evaluates `tile_predecessor == keys[0]` as false and will flag `keys[0]` as the beginning of a new run.
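A minimal standalone illustration of that comparison behaviour (plain C++, independent of CUB):

```cpp
#include <cstdio>
#include <limits>

int main()
{
    float key = std::numeric_limits<float>::quiet_NaN();

    // NaN never compares equal to anything, including itself, so an
    // equality-based head-flag check sees a NaN key as the start of a new run.
    std::printf("key == key: %s\n", (key == key) ? "true" : "false"); // prints "false"
    return 0;
}
```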
Suggested Solution
After having run `BlockDiscontinuity`, we reset the flag on the very first item.
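A minimal sketch of that idea, assuming the same first-tile code path shown above (the variant adopted from the review discussion instead clears the flag before the scan items are filled):

```cpp
// After FlagHeads has produced head_flags for the very first tile, clear the
// flag of the very first item: its key is covered by the tile-predecessor
// contract, so it must not be treated as the start of a new run here.
if (threadIdx.x == 0 && tile_idx == 0)
{
    head_flags[0] = 0;
}
```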
Alternative fix
An alternative fix would be to invoke a different overload of `BlockDiscontinuityKeys` for the very first tile of items, namely the one that does not take a `tile_predecessor`. However, this comes at the cost of increased kernel size, as we'd end up with four `BlockDiscontinuityKeys` instantiations instead of two. This is the part that would have to be changed:
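For illustration, that branch could look roughly like the following (a hypothetical sketch, not the actual patch; as noted in the review above, the predecessor-less overload always flags the first item, so this alternative doesn't actually address the issue):

```cpp
if (tile_idx == 0)
{
    // First tile: use the FlagHeads overload without a tile predecessor.
    BlockDiscontinuityKeysT(temp_storage.discontinuity)
        .FlagHeads(head_flags, keys, prev_keys, inequality_op);
}
else
{
    // Subsequent tiles: compare against the last key of the previous tile.
    KeyT tile_predecessor = d_keys_in[tile_offset - 1];
    BlockDiscontinuityKeysT(temp_storage.discontinuity)
        .FlagHeads(head_flags, keys, prev_keys, inequality_op, tile_predecessor);
}
```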