[BUG] The rs. serialization time metric is misleading #11173

Closed
jihoonson opened this issue Jul 11, 2024 · 0 comments · Fixed by #11180

jihoonson commented Jul 11, 2024

Describe the bug
The rs. serialization time metric indicates how much time is spent serializing shuffle data in a stage. It is shown in the Spark UI when the metrics level is set to debug. The metric is computed here by measuring the time to execute the whole write logic and then excluding the input processing time (records.next) from it. However, we are currently not taking the time spent in records.hasNext into account, which is where the actual processing seems to happen rather than in records.next. Here is a breakdown of the time of a write call in a task.

  • Total time to execute the entire logic in the call: 56,780 ms (this time includes everything below)
  • records.hasNext: 56,757 ms
  • records.next: 0.629 ms
  • Time to prepare a columnar batch (incrementing its reference count and computing its size): 5 ms
  • Wait time on BytesInFlightLimiter: 1 ms
  • Write task enqueue time: 10 ms
  • Wait time for all write tasks to finish: 0.066 ms

As you can see, this write call spent 99.9% of its execution time in records.hasNext.
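
To make this concrete, here is a minimal sketch of the measurement pattern as I understand it (the names below are hypothetical stand-ins, not the actual spark-rapids code):

```scala
// Hypothetical reconstruction of the current measurement (not the actual code).
// Only the records.next time is subtracted; records.hasNext is counted as write time.
def measureWriteTimeNs[T](records: Iterator[T])(serializeAndEnqueue: T => Unit): Long = {
  val start = System.nanoTime()
  var inputTimeNs = 0L
  while (records.hasNext) {              // time spent here is NOT excluded today
    val fetchStart = System.nanoTime()
    val record = records.next()          // only this time is excluded
    inputTimeNs += System.nanoTime() - fetchStart
    serializeAndEnqueue(record)          // stand-in for the real write path
  }
  (System.nanoTime() - start) - inputTimeNs
}
```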

Steps/Code to reproduce bug
Set spark.rapids.sql.metrics.level to DEBUG and run any query that involves a non-trivial shuffle.

Expected behavior
We should exclude the records.hasNext time when measuring writeTimeNs, just as we do for records.next. This will make the current way of estimating the serialization time more accurate.
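
A minimal sketch of what the fix could look like, again with hypothetical names rather than the actual code:

```scala
// Hypothetical sketch of the proposed fix: time records.hasNext as well,
// and exclude it from writeTimeNs just like records.next.
def measureWriteTimeNsFixed[T](records: Iterator[T])(serializeAndEnqueue: T => Unit): Long = {
  val start = System.nanoTime()
  var inputTimeNs = 0L
  def timed[A](body: => A): A = {
    val t0 = System.nanoTime()
    val result = body
    inputTimeNs += System.nanoTime() - t0
    result
  }
  while (timed(records.hasNext)) {       // hasNext time is now excluded as well
    val record = timed(records.next())
    serializeAndEnqueue(record)          // stand-in for the real write path
  }
  (System.nanoTime() - start) - inputTimeNs
}
```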

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark configuration settings related to the issue: spark.rapids.sql.metrics.level should be set to DEBUG.

Additional context
The more I think about it, even after we fix this bug, the rs. serialization time metric will still be a bit misleading. Here is how we compute it today.

  1. We measure the time to write the shuffle data with multiple write threads. This is called writeTimeNs in the code.
  2. We measure the accumulated I/O time and record serialization time as if they happened sequentially.
  3. We compute the ratios of the I/O time and the record serialization time to the sum of the results from step 2.
  4. rs. serialization time is computed as writeTimeNs multiplied by the record serialization ratio from step 3.
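
Putting those steps into a rough formula (an illustrative sketch; the names are hypothetical):

```scala
// Illustrative reconstruction of steps 1-4 above; the names are hypothetical.
def rsSerializationTimeNs(
    writeTimeNs: Long,     // step 1: wall-clock time of the multi-threaded write
    totalSerTimeNs: Long,  // step 2: accumulated record serialization time
    totalIoTimeNs: Long    // step 2: accumulated I/O time
): Long = {
  // Step 3: fraction of the (sequentialized) total attributed to serialization.
  val serRatio = totalSerTimeNs.toDouble / (totalSerTimeNs + totalIoTimeNs)
  // Step 4: scale the overall write time by that fraction.
  (writeTimeNs * serRatio).toLong
}
```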

This logic assumes that the serialization and the I/O do not happen at the same time. In practice, however, they can overlap, since one thread can serialize records while another is writing already-serialized data to disk. As such, this metric is an approximation, and I'm not sure how useful it can be.

I suppose we could first make it crystal clear in the doc that this metric is an approximation, so that people use it only to get a sense of how expensive the serialization could be, rather than trusting the number as it is. However, I can't find any existing doc explaining this metric. Please correct me if I'm wrong.

As a long-term solution, we may want to come up with better metrics, maybe something like avgSerializationTimePerRecord, that would be more obvious and thus easier to interpret.
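
For example, such a metric could roughly look like this (purely a sketch of the idea; nothing like this exists today):

```scala
// Purely illustrative: an average serialization time per record would not
// depend on how serialization and I/O overlap across write threads.
def avgSerializationTimePerRecordNs(totalSerTimeNs: Long, recordCount: Long): Double =
  if (recordCount == 0) 0.0 else totalSerTimeNs.toDouble / recordCount
```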

@jihoonson jihoonson added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jul 11, 2024
@jihoonson jihoonson self-assigned this Jul 12, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jul 16, 2024