
[BUG] GpuShuffleCoalesce op time metric doesn't include concat batch time #5891

Closed
tgravescs opened this issue Jun 22, 2022 · 1 comment · Fixed by #5950
Assignees
Labels
bug Something isn't working P1 Nice to have for release

Comments

@tgravescs (Collaborator):

Describe the bug
I was looking at two applications: one had debug metrics enabled and one didn't. With debug metrics enabled, I noticed that in GpuShuffleCoalesce the op time metric doesn't include the concat batch time. For example:

GpuShuffleCoalesce
concat batch time total (min, med, max (stageId: taskId))
2.4 s (7 ms, 23 ms, 49 ms (stage 42.0: task 12338))
output columnar batches: 99
input rows: 2,569,770
GPU semaphore wait time total (min, med, max (stageId: taskId))
10.7 s (0 ms, 112 ms, 249 ms (stage 42.0: task 12375))
input columnar batches: 4,200
op time total (min, med, max (stageId: taskId))
55 ms (0 ms, 0 ms, 7 ms (stage 42.0: task 12383))
output rows: 2,569,770

I looked at the application without debug metrics enabled, and it only has the op time metric, so the user can't really see how much time is actually being spent.

GpuShuffleCoalesce

op time total (min, med, max (stageId: taskId))
60 ms (0 ms, 0 ms, 2 ms (stage 42.0: task 12346))

We should fix GpuShuffleCoalesce to include the concat batch time in op time. If there is a reason not to do this, we need to make sure all metrics are available with the default spark.rapids.sql.metrics.level.

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jun 22, 2022
@jlowe (Contributor) commented Jun 22, 2022:

> If there is a reason we don't do this we need to make sure all metrics are available with default spark.rapids.sql.metrics.level

IMO it should always be a bug if the op time metric does not include all time spent actively performing computation specific to this node (as opposed to waiting for input iterators or other computation performed outside of this plan node). The point of the op time metric is to encapsulate all computation specifically performed in this node and not in other nodes. For the case of GpuShuffleCoalesce, concat time is a subset of the op time.

The problem in the code is that the op time metric does not cover the concatenation time performed in the next() method. One possible fix is to update this timed range to compute the elapsed time once and then add it to both the concat time and op time metrics.
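The double-accounting pattern jlowe describes can be sketched as an iterator wrapper that measures next() once and adds the same elapsed time to both counters, making concat time a subset of op time. This is a minimal, self-contained illustration only: the real spark-rapids code is Scala and uses its own GpuMetric type, and the `Metrics`/`CoalescingIterator` names here are hypothetical, not the plugin's API.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical metric accumulators standing in for the plugin's GpuMetric.
class Metrics {
    final AtomicLong opTimeNs = new AtomicLong();
    final AtomicLong concatTimeNs = new AtomicLong();
}

// Wraps an upstream iterator. The time spent producing a batch in next()
// (standing in for the concat work) is measured once, then added to BOTH
// the concat-time and op-time metrics, so op time covers concat time.
class CoalescingIterator implements Iterator<List<Integer>> {
    private final Iterator<List<Integer>> input;
    private final Metrics metrics;

    CoalescingIterator(Iterator<List<Integer>> input, Metrics metrics) {
        this.input = input;
        this.metrics = metrics;
    }

    @Override public boolean hasNext() { return input.hasNext(); }

    @Override public List<Integer> next() {
        long start = System.nanoTime();
        List<Integer> batch = input.next(); // stands in for the concat work
        long elapsed = System.nanoTime() - start;
        metrics.concatTimeNs.addAndGet(elapsed);
        metrics.opTimeNs.addAndGet(elapsed); // previously missing from op time
        return batch;
    }
}
```

With this shape, op time is always at least as large as concat time by construction, which matches the expectation that op time encapsulates all computation performed in the node.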

@sameerz sameerz added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Jun 28, 2022
@res-life res-life self-assigned this Jun 29, 2022