
[BUG] GpuShuffleCoalesce op time metric doesn't include concat batch time #5891

Closed
tgravescs opened this issue Jun 22, 2022 · 1 comment · Fixed by #5950
Assignees
Labels
bug Something isn't working P1 Nice to have for release

Comments

@tgravescs (Collaborator):

Describe the bug
I was looking at two applications: one had debug metrics enabled and one didn't. With debug metrics enabled, I noticed that in GpuShuffleCoalesce the op time metric doesn't include the concat batch time. For example:

GpuShuffleCoalesce
concat batch time total (min, med, max (stageId: taskId))
2.4 s (7 ms, 23 ms, 49 ms (stage 42.0: task 12338))
output columnar batches: 99
input rows: 2,569,770
GPU semaphore wait time total (min, med, max (stageId: taskId))
10.7 s (0 ms, 112 ms, 249 ms (stage 42.0: task 12375))
input columnar batches: 4,200
op time total (min, med, max (stageId: taskId))
55 ms (0 ms, 0 ms, 7 ms (stage 42.0: task 12383))
output rows: 2,569,770

I looked at the application without debug metrics enabled, and it only has the op time metric, so the user can't really see how much time is actually being spent.

GpuShuffleCoalesce

op time total (min, med, max (stageId: taskId))
60 ms (0 ms, 0 ms, 2 ms (stage 42.0: task 12346))

We should fix GpuShuffleCoalesce to include the concat batch time in op time. If there is a reason not to do this, we need to make sure all metrics are available with the default spark.rapids.sql.metrics.level.

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jun 22, 2022
@jlowe (Contributor) commented Jun 22, 2022:

> If there is a reason we don't do this we need to make sure all metrics are available with default spark.rapids.sql.metrics.level

IMO it should always be a bug if the op time metric does not include all time spent actively performing computation specific to this node (as opposed to waiting for input iterators or other computation performed outside of this plan node). The point of the op time metric is to encapsulate all computation specifically performed in this node and not in other nodes. For the case of GpuShuffleCoalesce, concat time is a subset of the op time.

The problem in the code is that the op time metric does not cover the concatenation time performed in the next() method. One possible fix is to update this timed range to compute the elapsed time once and then add it to both the concat time and op time metrics.
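The double-accounting pattern jlowe describes can be sketched as an iterator wrapper that measures next() once and adds the same elapsed time to both counters, making concat time a subset of op time. This is a minimal, self-contained illustration only: the real spark-rapids code is Scala and uses its own GpuMetric type, and the `Metrics`/`CoalescingIterator` names here are hypothetical, not the plugin's API.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical metric accumulators standing in for the plugin's GpuMetric.
class Metrics {
    final AtomicLong opTimeNs = new AtomicLong();
    final AtomicLong concatTimeNs = new AtomicLong();
}

// Wraps an upstream iterator. The time spent producing a batch in next()
// (standing in for the concat work) is measured once, then added to BOTH
// the concat-time and op-time metrics, so op time covers concat time.
class CoalescingIterator implements Iterator<List<Integer>> {
    private final Iterator<List<Integer>> input;
    private final Metrics metrics;

    CoalescingIterator(Iterator<List<Integer>> input, Metrics metrics) {
        this.input = input;
        this.metrics = metrics;
    }

    @Override public boolean hasNext() { return input.hasNext(); }

    @Override public List<Integer> next() {
        long start = System.nanoTime();
        List<Integer> batch = input.next(); // stands in for the concat work
        long elapsed = System.nanoTime() - start;
        metrics.concatTimeNs.addAndGet(elapsed);
        metrics.opTimeNs.addAndGet(elapsed); // previously missing from op time
        return batch;
    }
}
```

With this shape, op time is always at least as large as concat time by construction, which matches the expectation that op time encapsulates all computation performed in the node.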

@sameerz sameerz added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Jun 28, 2022
@res-life res-life self-assigned this Jun 29, 2022