
[BUG] Failed to split an empty string with error "ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal" #11183

Closed
viadea opened this issue Jul 13, 2024 · 7 comments
Labels: bug (Something isn't working)

viadea (Collaborator) commented Jul 13, 2024

Describe the bug
The query fails to split an empty string with the error:

ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal

Steps/Code to reproduce bug

sc.parallelize(Seq("")).toDF.withColumn("b", split($"value", "#")).show

Expected behavior
CPU mode works fine, and the GPU should follow the same behavior:

scala> spark.conf.set("spark.rapids.sql.enabled","false")

scala> sc.parallelize(Seq("")).toDF.withColumn("b", split($"value", "#")).show
+-----+---+
|value|  b|
+-----+---+
|     | []|
+-----+---+
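
A possible interim workaround, until the underlying cudf fix is available, is to disable GPU acceleration for the split expression so it falls back to the CPU. This is only a sketch: it assumes the spark.rapids.sql.expression.StringSplit config key exposed by the spark-rapids plugin; verify the exact key against the supported-ops docs for your plugin version.

scala> // Assumed config key; disables GPU handling of split so it runs on the CPU.
scala> spark.conf.set("spark.rapids.sql.expression.StringSplit", "false")

scala> sc.parallelize(Seq("")).toDF.withColumn("b", split($"value", "#")).show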

Environment details (please complete the following information)

  • Environment location: Cloud (Dataproc 2.1)
  • Spark configuration settings related to the issue: Spark RAPIDS 24.08 snapshot jar
viadea added the bug (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Jul 13, 2024
mattahrens removed the "? - Needs Triage" label on Jul 15, 2024
ttnghia (Collaborator) commented Jul 16, 2024

This is due to a bug in cudf: rapidsai/cudf#16284.

LIN-Yu-Ting commented Jul 17, 2024

We also encountered this issue with spark-rapids 24.08-SNAPSHOT:

ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal
	at ai.rapids.cudf.ParquetChunkedReader.hasNext(Native Method)
	at ai.rapids.cudf.ParquetChunkedReader.hasNext(ParquetChunkedReader.java:155)
	at com.nvidia.spark.rapids.ParquetTableReader.hasNext(GpuParquetScan.scala:2688)
	at com.nvidia.spark.rapids.GpuDataProducer.foreach(GpuDataProducer.scala:56)
	at com.nvidia.spark.rapids.GpuDataProducer.foreach$(GpuDataProducer.scala:55)
	at com.nvidia.spark.rapids.ParquetTableReader.foreach(GpuParquetScan.scala:2664)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.$anonfun$apply$2(GpuDataProducer.scala:168)
	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:98)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.$anonfun$apply$1(GpuDataProducer.scala:159)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.$anonfun$readBufferToBatches$3(GpuParquetScan.scala:2573)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:132)
	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.readBufferToBatches(GpuParquetScan.scala:2560)
	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.readBatches(GpuParquetScan.scala:2530)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.liftedTree1$1(GpuMultiFileReader.scala:483)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.readBuffersToBatch(GpuMultiFileReader.scala:482)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.$anonfun$next$1(GpuMultiFileReader.scala:675)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.$anonfun$next$1$adapted(GpuMultiFileReader.scala:630)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.next(GpuMultiFileReader.scala:630)
	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:73)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:73)
	at scala.Option.exists(Option.scala:376)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:73)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:97)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:73)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:477)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$hasNext$4(GpuAggregateExec.scala:2005)
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
	at scala.Option.getOrElse(Option.scala:189)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.hasNext(GpuAggregateExec.scala:2005)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:332)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
[Screenshot 2024-07-17 4:24 PM]

jlowe (Member) commented Jul 17, 2024

@LIN-Yu-Ting is the query using a string split operation as described in this issue? If not, it would be best to file this as a new issue, as the root cause is likely to be very different. Also, if you have a way for us to reproduce the issue you're seeing, that would be a big help.

ttnghia (Collaborator) commented Jul 17, 2024

I see that the exception is thrown from ai.rapids.cudf.ParquetChunkedReader.hasNext(Native Method).

jlowe (Member) commented Jul 17, 2024

I see that the exception is thrown from ai.rapids.cudf.ParquetChunkedReader.hasNext(Native Method)

Yes, that's definitely different from the string split originally reported here. However, I'm wondering if the chunked reader error was the result of a "sticky" CUDA error that actually occurred elsewhere first. If this is the only error being reported in the executor logs, then it would be a different issue than the one reported here.
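
For illustration only, here is a minimal conceptual sketch of that "sticky" behavior. HypotheticalCudaContext is an invented stand-in, not the cudf or CUDA API; the point is that once a fatal error is recorded, a later unrelated call is the one that surfaces it.

// Conceptual sketch: HypotheticalCudaContext is invented for illustration.
final class HypotheticalCudaContext {
  private var fatalError: Option[String] = None

  def launch(op: String, succeeds: Boolean): Unit = {
    // A previously recorded fatal error "sticks" and is re-reported by later, unrelated calls.
    fatalError.foreach(e => throw new RuntimeException(s"$op failed: $e"))
    if (!succeeds) fatalError = Some("cudaErrorInvalidDevice: invalid device ordinal")
  }
}

val ctx = new HypotheticalCudaContext
ctx.launch("stringSplitKernel", succeeds = false)   // the real failure happens here, silently
ctx.launch("parquetChunkedRead", succeeds = true)   // this unrelated call is the one that throws

So when triaging, the first CUDA error in the executor log matters more than the stack trace that eventually reports it.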

LIN-Yu-Ting commented
@jlowe Sorry about that. I originally thought it had the same root cause, since we are also using the split() function in our SQL operation, which is why I pasted the exception here. However, the exception keeps happening even after we remove the split function from our query. I have posted a new issue here.

viadea (Collaborator, Author) commented Aug 5, 2024

Closing this issue since it is already fixed in the 24.08 branch.

viadea closed this as completed Aug 5, 2024