Describe the bug
I was looking at two applications: one had debug metrics enabled and one didn't. With debug metrics enabled I noticed that in GpuShuffleCoalesce the op time metric doesn't include the concat batch time. For example:
GpuShuffleCoalesce
concat batch time total (min, med, max (stageId: taskId))
2.4 s (7 ms, 23 ms, 49 ms (stage 42.0: task 12338))
output columnar batches: 99
input rows: 2,569,770
GPU semaphore wait time total (min, med, max (stageId: taskId))
10.7 s (0 ms, 112 ms, 249 ms (stage 42.0: task 12375))
input columnar batches: 4,200
op time total (min, med, max (stageId: taskId))
55 ms (0 ms, 0 ms, 7 ms (stage 42.0: task 12383))
output rows: 2,569,770
I looked at the application without debug metrics enabled, and it only has the op time metric, so the user can't really see how much time is actually being spent.
GpuShuffleCoalesce
op time total (min, med, max (stageId: taskId))
60 ms (0 ms, 0 ms, 2 ms (stage 42.0: task 12346))
We should fix GpuShuffleCoalesce to include the concat batch time in op time. If there is a reason not to do this, we need to make sure all of these metrics are available at the default spark.rapids.sql.metrics.level.
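To illustrate why time recorded only in a debug-level metric is invisible at the default metrics level, here is a minimal, hypothetical sketch of level-gated metric accumulation. The `Level` enum, `LevelGatedMetrics` class, and method names are all assumptions for illustration, loosely modeled on the idea behind spark.rapids.sql.metrics.level; they are not the plugin's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: metrics are gated by a configured verbosity level.
// Updates to a metric whose level is above the configured level are dropped,
// so time recorded only under a DEBUG metric vanishes at the default level.
class LevelGatedMetrics {
    enum Level { ESSENTIAL, MODERATE, DEBUG }

    private final Level configured;
    private final Map<String, long[]> values = new HashMap<>();

    LevelGatedMetrics(Level configured) {
        this.configured = configured;
    }

    // Record elapsed nanoseconds only if the metric's level is enabled.
    void add(String name, Level metricLevel, long ns) {
        if (metricLevel.ordinal() <= configured.ordinal()) {
            values.computeIfAbsent(name, k -> new long[1])[0] += ns;
        }
    }

    long get(String name) {
        long[] v = values.get(name);
        return v == null ? 0L : v[0];
    }
}
```

With this gating, any concat time that is only added to a DEBUG-level metric is lost at the default level, which is why folding it into op time (or promoting the metric) matters.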
> If there is a reason we don't do this we need to make sure all metrics are available with default spark.rapids.sql.metrics.level
IMO it should always be a bug if the op time metric does not include all time spent actively performing computation specific to this node (as opposed to waiting for input iterators or other computation performed outside of this plan node). The point of the op time metric is to encapsulate all computation specifically performed in this node and not in other nodes. For the case of GpuShuffleCoalesce, concat time is a subset of the op time.
The problem in the code is that the op time metric does not cover the concatenation time spent in the next() method. One possible fix is to update this range to compute the elapsed time and then add it to both the concat time and op time metrics.
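The fix described above can be sketched as follows. This is a minimal illustration in Java, not the plugin's actual Scala code: `MetricStub` and `CoalesceIteratorSketch` are hypothetical stand-ins (the real code uses GpuMetric and works on columnar batches), but the accounting pattern is the point: measure the elapsed time once in next() and add the same value to both metrics, so op time remains a superset of concat time.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical accumulator standing in for a real SQL metric.
class MetricStub {
    private final AtomicLong ns = new AtomicLong();
    void add(long delta) { ns.addAndGet(delta); }
    long get() { return ns.get(); }
}

// Sketch of an iterator that times the per-batch work in next() and
// charges the elapsed time to BOTH the concat and op time metrics.
class CoalesceIteratorSketch implements Iterator<List<Integer>> {
    private final Iterator<List<Integer>> batches; // stand-in for shuffle batches
    private final MetricStub concatTime;
    private final MetricStub opTime;

    CoalesceIteratorSketch(Iterator<List<Integer>> batches,
                           MetricStub concatTime, MetricStub opTime) {
        this.batches = batches;
        this.concatTime = concatTime;
        this.opTime = opTime;
    }

    @Override public boolean hasNext() { return batches.hasNext(); }

    @Override public List<Integer> next() {
        long start = System.nanoTime();
        try {
            return batches.next(); // stand-in for the actual concatenation work
        } finally {
            long elapsed = System.nanoTime() - start;
            concatTime.add(elapsed); // existing concat batch time metric
            opTime.add(elapsed);     // proposed fix: count it in op time too
        }
    }
}
```

Measuring once and adding to both metrics also avoids calling System.nanoTime() twice per batch and guarantees the two metrics stay consistent with each other.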