Multi-threaded shuffle writer for RapidsShuffleManager [databricks] #6052
Conversation
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
Force-pushed from a20efdb to c7688ae
*/
class ThreadSafeShuffleWriteMetricsReporter(wrapped: ShuffleWriteMetricsReporter)
    extends ShuffleWriteMetrics {
  override private[spark] def incBytesWritten(v: Long): Unit = synchronized {
I think most of these are updated in batches; just curious whether this synchronization costs us much waiting time? Might be something to look at later if you haven't.
I haven't looked at this. I ran into it pretty late, when I found that the tests from Spark were actually testing the metrics. Yes, we could absolutely do better than this, but I figure that could be a punt.
+1, especially when we consider the impact of synchronized blocks on cache performance. Thread-safe accumulators could be a good option to explore later.
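For reference, a rough, untested sketch of what a lock-free variant could look like using java.util.concurrent.atomic.LongAdder. The class and method names here are made up for illustration, and it assumes the class would still live under an org.apache.spark package (like the current reporter) so the private[spark] increment methods remain accessible:

import java.util.concurrent.atomic.LongAdder

import org.apache.spark.shuffle.ShuffleWriteMetricsReporter

// Illustrative only: accumulate per-thread updates without a lock, then fold them
// into the wrapped reporter once, e.g. when the map task finishes.
// Assumes this lives under an org.apache.spark package so the private[spark]
// increment methods on the wrapped reporter are visible.
class LockFreeWriteMetrics(wrapped: ShuffleWriteMetricsReporter) {
  private val bytesWritten = new LongAdder()
  private val recordsWritten = new LongAdder()

  // Called concurrently from the writer threads; LongAdder avoids contention on a
  // single lock or a hot CAS location.
  def addBytesWritten(v: Long): Unit = bytesWritten.add(v)
  def addRecordsWritten(v: Long): Unit = recordsWritten.add(v)

  // Called once from the task thread after all writer threads have finished.
  def flush(): Unit = {
    wrapped.incBytesWritten(bytesWritten.sumThenReset())
    wrapped.incRecordsWritten(recordsWritten.sumThenReset())
  }
}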
conf("spark.rapids.shuffle.multiThreaded.writer.threads") | ||
.doc("The number of threads to use for writing shuffle blocks per executor.") | ||
.integerConf | ||
.createWithDefault(20) |
This is going to need a similar follow-up issue to try to automatically tune this, like the one we are working on for the multi-threaded input readers. 20 may be a very poor choice on an executor with many configured cores.
I think we can fold this into: #5039.
I suggest using the ForkJoinPool class in this case. We can scale the number of threads dynamically using the maximum pool size API. getStealCount can be used to evaluate how busy the threads in the pool are; if that value gets too high, we can increase the number of workers.
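A rough sketch of that idea, for discussion only: sample getStealCount periodically and use the steal rate as a proxy for how contended the writer pool is. The threshold, pool sizes, and the swap-the-pool approach are all assumptions rather than anything in this PR; note also that a standard ForkJoinPool does not let you raise its parallelism in place, so "growing" here means replacing the pool.

import java.util.concurrent.ForkJoinPool

// Illustrative only: grow the writer pool when the steal rate suggests it is
// undersized. Thresholds and sizes are hypothetical.
object AdaptiveWriterPool {
  @volatile private var pool = new ForkJoinPool(4)
  private var lastSteals = 0L

  def execute(task: Runnable): Unit = pool.execute(task)

  // Call periodically, e.g. from a monitoring thread.
  def maybeGrow(maxThreads: Int): Unit = synchronized {
    val steals = pool.getStealCount
    val delta = steals - lastSteals
    lastSteals = steals
    // A high steal count since the last check means workers keep running out of
    // local work and stealing from each other, i.e. the pool is busy.
    if (delta > 1000 && pool.getParallelism < maxThreads) {
      val old = pool
      pool = new ForkJoinPool(math.min(pool.getParallelism * 2, maxThreads))
      lastSteals = 0L        // the new pool starts its steal count from zero
      old.shutdown()         // let tasks already submitted to the old pool finish
    }
  }
}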
The last build failed due to #6054.
extends ProxyRapidsShuffleInternalManagerBase(conf, isDriver)
  with ShuffleManager
nit: new line at the end of the file.
I am not sure we are following a rule about adding a newline at the end of each file, if I understand you correctly @amahussein. I know this is a nit, but I wanted to make sure I am not missing something.
OK, I added these.
extends ProxyRapidsShuffleInternalManagerBase(conf, isDriver)
  with ShuffleManager
nit: new line at the end of the file.
myMapStatus = Some(MapStatus(blockManager.shuffleServerId, partLengths, mapId))
} catch {
  // taken directly from BypassMergeSortShuffleWriter
  case e: Exception =>
Should we check for specific exception classes instead of the generic Exception? I wonder if catching the generic exception would hide bugs and crashes that we were not expecting.
I am also changing this code because of the comment @jlowe had. I'll push in a second.
Oh wait, this code is logging the exception, aborting the writes, and re-throwing. It was taken from Apache Spark verbatim, so I'd like to keep it as is.
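For context, a minimal sketch of the log/abort/re-throw pattern being described; logError (which assumes Spark's Logging trait is mixed in) and the abort helper are hypothetical stand-ins, not the exact code in this PR:

try {
  myMapStatus = Some(MapStatus(blockManager.shuffleServerId, partLengths, mapId))
} catch {
  case e: Exception =>
    // Log so the failure shows up in the executor logs...
    logError("Error while writing shuffle blocks", e)
    // ...clean up whatever partial output was produced (hypothetical helper)...
    abortWrites(e)
    // ...and re-throw so Spark marks the task attempt as failed and can retry it.
    throw e
}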
@amahussein @jlowe should be ready for another look.
This adds an experimental feature for RapidsShuffleManager where it can be configured to write shuffle blocks using multiple threads.
You would do this by:
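The configuration snippet itself did not survive here; the following is a plausible reconstruction based only on the configs named in this PR (the shuffle manager class-name pattern comes from the linked docs, and the [spark version] and N placeholders are explained just below):

--conf spark.shuffle.manager=com.nvidia.spark.rapids.[spark version].RapidsShuffleManager
--conf spark.rapids.shuffle.manager.mode=MULTI_THREADED
--conf spark.rapids.shuffle.multiThreaded.writer.threads=N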
Where [spark version] is something like spark321 (see other class names here: https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/docs/additional-functionality/rapids-shuffle.md#spark-app-configuration), and N is the number of threads you want to use (defaults to 20). I removed an internal config spark.rapids.shuffle.transport.enabled, which would have allowed you to run the cache-only shuffle (for testing, and to have a local-mode app), and added spark.rapids.shuffle.manager.mode, which can be UCX (default), CACHE_ONLY (for testing), or the experimental MULTI_THREADED.

Note there is no flow control here. The shuffle writer is going to get an iterator and will pull on it until it is done, no matter what the consequences are. This is one of the reasons why this is experimental.
This PR copies/adapts some tests from Spark as well, and most of the writer follows Spark's implementation, except this one is in Scala and uses a thread pool.
I've tested this in Apache Spark 3.1.2 and 3.2.1. I haven't tested it in other environments yet; posting here for comments as I work my way through the other Spark versions that need testing.
Closes #6060