[QST] Understanding of "Maximum pool size exceeded" #5373

Answered by jlowe
martinstuder asked this question in General

"Maximum pool size exceeded" from RMM means the GPU memory pool has been exhausted, and it was unable to satisfy a GPU memory allocation request. There can be lots of causes. Try to run with too much GPU data generated per task or running too many tasks simultaneously on the GPU are primary causes, so setting spark.rapids.sql.concurrentGpuTasks=1 from a higher initial value will reduce at least some of that memory pressure.

Increasing the number of shuffle partitions should also help, assuming your processing does not have high key skew, causing most of the data to show up in only a few task partitions.
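As a rough illustration, both suggestions can be applied at submit time. spark.rapids.sql.concurrentGpuTasks is the RAPIDS Accelerator config named above, and spark.sql.shuffle.partitions is the standard Spark shuffle-partition setting; the values and the application jar name below are placeholders you would tune for your own workload and GPU memory size:

```shell
# Sketch only: lower GPU task concurrency and raise shuffle partitions
# to reduce per-task GPU memory pressure. Values are illustrative.
spark-submit \
  --conf spark.rapids.sql.concurrentGpuTasks=1 \
  --conf spark.sql.shuffle.partitions=400 \
  your_app.jar
```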

If I understand correctly, increasing spark.task.resource.gpu.amount (e.g. to 0.2 or …

Answer selected by sameerz
This discussion was converted from issue #1365 on April 28, 2022 22:51.