More accurate estimation for the result serialization time in RapidsShuffleThreadedWriterBase #11180
Conversation
Force-pushed e931294 to 6cf48ab ("… time estimation", Signed-off-by: Jihoon Son <ghoonson@gmail.com>)
What about the time spent blocking on the limiter -- is that still desired in the serialization time?
Good point. It should be excluded as well. In fact, there are other things we may want to exclude from the serialization time estimation. They were trivial in my testing, as seen in #11173, but could have a larger impact with different cluster settings or data sets. I will fix it soon.
Alright, the batch size computation time and the wait time on the limiter are both excluded from the serialization time estimation now. The former is usually trivial, but may become non-trivial in some cases when there are lots of columns. It is not expensive to compute anyway.
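The exclusions discussed above can be sketched as a small timing helper: wrap the limiter acquisition and the batch-size computation in timed blocks, then subtract the accumulated spans from the overall write span. This is a hedged illustration only; `SerializationTimer`, `timedAcquire`, and `timedCompute` are hypothetical names, not the plugin's actual API, though `waitTimeOnLimiterNs` mirrors the field discussed in this thread.

```scala
// Illustrative sketch, not the plugin's real implementation.
// Accumulates time spans that should NOT count as serialization time.
class SerializationTimer {
  private var waitTimeOnLimiterNs: Long = 0L
  private var computeTimeNs: Long = 0L

  // Time spent blocking on the limiter (excluded from serialization time).
  def timedAcquire[T](acquire: => T): T = {
    val start = System.nanoTime()
    try acquire
    finally waitTimeOnLimiterNs += System.nanoTime() - start
  }

  // Time spent computing the batch size (excluded from serialization time).
  def timedCompute[T](body: => T): T = {
    val start = System.nanoTime()
    try body
    finally computeTimeNs += System.nanoTime() - start
  }

  // Serialization estimate = total write span minus the excluded spans.
  def estimatedSerializationNs(totalWriteNs: Long): Long =
    totalWriteNs - waitTimeOnLimiterNs - computeTimeNs
}
```

With this shape, any newly discovered non-serialization span can be excluded by adding one more timed wrapper, without touching the final subtraction.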
// writeTime is the amount of time it took to push bytes through the stream
// minus the amount of time it took to get the batch from the upstream execs
This comment is out of date
Thanks, fixed now.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/RapidsShuffleInternalManagerBase.scala (outdated, resolved)
val recordWriteTime: AtomicLong = new AtomicLong(0L)
var computeTime: Long = 0L
// Time spent waiting on the limiter
var waitTimeOnLimiterNs: Long = 0L
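The mixed field types in the diff above suggest a threading split: in a threaded writer, several serialization tasks may add their elapsed time concurrently, so `recordWriteTime` is an `AtomicLong`, while fields updated only from the single writer thread can stay plain `Long`s. The sketch below illustrates that pattern; `RecordWriteTiming` and `timedWrite` are hypothetical names, not code from the plugin.

```scala
import java.util.concurrent.atomic.AtomicLong

// Illustrative sketch of concurrent write-time accumulation.
object RecordWriteTiming {
  // Updated from multiple serialization threads, hence AtomicLong.
  val recordWriteTime: AtomicLong = new AtomicLong(0L)

  // Measures one record write and adds the elapsed nanoseconds atomically.
  def timedWrite(writeOneRecord: () => Unit): Unit = {
    val start = System.nanoTime()
    writeOneRecord()
    recordWriteTime.addAndGet(System.nanoTime() - start)
  }
}
```

`addAndGet` makes the accumulation safe without a lock, which matters when many small record writes complete at once.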
For future work, we may want to expose waitTimeOnLimiterNs as a metric. It's hard to tell that we are waiting on the limiter otherwise. Filed #11187.
Thanks. This will be useful!
Minor nit, looking good
write(new TimeTrackingIterator(records))
}

def write(records: TimeTrackingIterator): Unit = {
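A wrapper like `TimeTrackingIterator` would let the writer measure how long it spends fetching batches from upstream execs, so that span can be excluded from the serialization estimate. The sketch below is an assumption about its shape based on this thread; the actual class in the plugin may differ in fields, visibility, and element type.

```scala
// Hedged sketch: an iterator wrapper that accumulates the time spent
// pulling records from the upstream iterator.
class TimeTrackingIterator[T](delegate: Iterator[T]) extends Iterator[T] {
  private var fetchTimeNs: Long = 0L

  override def hasNext: Boolean = {
    val start = System.nanoTime()
    try delegate.hasNext
    finally fetchTimeNs += System.nanoTime() - start
  }

  override def next(): T = {
    val start = System.nanoTime()
    try delegate.next()
    finally fetchTimeNs += System.nanoTime() - start
  }

  // Total time spent fetching input, to subtract from the write span.
  def inputFetchTimeNs: Long = fetchTimeNs
}
```

Wrapping `records` once at the public `write` entry point means the private overload never has to distinguish fetch time from serialization time itself; it just reads `inputFetchTimeNs` at the end.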
nit, let's mark this private. I like the addition of the new method.
Made this and a couple of others private.
build
Fixes #11173.
The shuffle result serialization time metric currently includes the input data processing time as well, which is misleading. This PR excludes that processing time from the serialization time estimation.