
[BUG] Failed to split an empty string with error "ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal" #11183

Closed
viadea opened this issue Jul 13, 2024 · 7 comments
Labels: bug (Something isn't working)

viadea (Collaborator) commented Jul 13, 2024

Describe the bug
The query fails to split an empty string with the error:

ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal

Steps/Code to reproduce bug

sc.parallelize(Seq("")).toDF.withColumn("b", split($"value", "#")).show

Expected behavior
CPU mode works fine, and the GPU should follow the same behavior:

scala> spark.conf.set("spark.rapids.sql.enabled","false")

scala> sc.parallelize(Seq("")).toDF.withColumn("b", split($"value", "#")).show
+-----+---+
|value|  b|
+-----+---+
|     | []|
+-----+---+
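
A possible interim workaround, until the underlying cudf fix is available, is to disable GPU acceleration for the split expression so it falls back to the CPU. This is only a sketch: it assumes the spark.rapids.sql.expression.StringSplit config key exposed by the spark-rapids plugin; verify the exact key against the supported-ops docs for your plugin version.

scala> // Assumed config key; disables GPU handling of split so it runs on the CPU.
scala> spark.conf.set("spark.rapids.sql.expression.StringSplit", "false")

scala> sc.parallelize(Seq("")).toDF.withColumn("b", split($"value", "#")).show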

Environment details (please complete the following information)

  • Environment location: Cloud (Dataproc 2.1)
  • Spark configuration settings related to the issue: Spark RAPIDS 24.08 snapshot jar
viadea added the bug (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Jul 13, 2024
mattahrens removed the "? - Needs Triage" label on Jul 15, 2024
ttnghia (Collaborator) commented Jul 16, 2024

This is due to a bug in cudf: rapidsai/cudf#16284.

LIN-Yu-Ting commented Jul 17, 2024

We also encountered this issue with spark-rapids 24.08-SNAPSHOT:

ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal
	at ai.rapids.cudf.ParquetChunkedReader.hasNext(Native Method)
	at ai.rapids.cudf.ParquetChunkedReader.hasNext(ParquetChunkedReader.java:155)
	at com.nvidia.spark.rapids.ParquetTableReader.hasNext(GpuParquetScan.scala:2688)
	at com.nvidia.spark.rapids.GpuDataProducer.foreach(GpuDataProducer.scala:56)
	at com.nvidia.spark.rapids.GpuDataProducer.foreach$(GpuDataProducer.scala:55)
	at com.nvidia.spark.rapids.ParquetTableReader.foreach(GpuParquetScan.scala:2664)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.$anonfun$apply$2(GpuDataProducer.scala:168)
	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:98)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.$anonfun$apply$1(GpuDataProducer.scala:159)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:156)
	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.$anonfun$readBufferToBatches$3(GpuParquetScan.scala:2573)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:132)
	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.readBufferToBatches(GpuParquetScan.scala:2560)
	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.readBatches(GpuParquetScan.scala:2530)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.liftedTree1$1(GpuMultiFileReader.scala:483)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.readBuffersToBatch(GpuMultiFileReader.scala:482)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.$anonfun$next$1(GpuMultiFileReader.scala:675)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.$anonfun$next$1$adapted(GpuMultiFileReader.scala:630)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.next(GpuMultiFileReader.scala:630)
	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:73)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:73)
	at scala.Option.exists(Option.scala:376)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:73)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:97)
	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:73)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:477)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$hasNext$4(GpuAggregateExec.scala:2005)
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
	at scala.Option.getOrElse(Option.scala:189)
	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.hasNext(GpuAggregateExec.scala:2005)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:332)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
[Screenshot 2024-07-17 4:24 PM]

jlowe (Member) commented Jul 17, 2024

@LIN-Yu-Ting is the query using a string split operation as described in this issue? If not, it would be best to file this as a new issue, as the root cause is likely to be very different. Also, if you have a way for us to reproduce the issue you're seeing, that would be a big help.

ttnghia (Collaborator) commented Jul 17, 2024

I see that the exception is thrown from ai.rapids.cudf.ParquetChunkedReader.hasNext(Native Method).

jlowe (Member) commented Jul 17, 2024

I see that the exception is thrown from ai.rapids.cudf.ParquetChunkedReader.hasNext(Native Method)

Yes, that's definitely different from the string split originally reported here. However, I'm wondering if the chunked reader error was the result of a "sticky" CUDA error that actually occurred elsewhere first. If this is the only error being reported in the executor logs, then it would be a different issue than the one reported here.
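
For illustration only, here is a minimal conceptual sketch of that "sticky" behavior. HypotheticalCudaContext is an invented stand-in, not the cudf or CUDA API; the point is that once a fatal error is recorded, a later unrelated call is the one that surfaces it.

// Conceptual sketch: HypotheticalCudaContext is invented for illustration.
final class HypotheticalCudaContext {
  private var fatalError: Option[String] = None

  def launch(op: String, succeeds: Boolean): Unit = {
    // A previously recorded fatal error "sticks" and is re-reported by later, unrelated calls.
    fatalError.foreach(e => throw new RuntimeException(s"$op failed: $e"))
    if (!succeeds) fatalError = Some("cudaErrorInvalidDevice: invalid device ordinal")
  }
}

val ctx = new HypotheticalCudaContext
ctx.launch("stringSplitKernel", succeeds = false)   // the real failure happens here, silently
ctx.launch("parquetChunkedRead", succeeds = true)   // this unrelated call is the one that throws

So when triaging, the first CUDA error in the executor log matters more than the stack trace that eventually reports it.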

LIN-Yu-Ting commented
@jlowe Sorry about that. I originally thought it had the same root cause, since we are also using the split() function in our SQL operation, which is why I pasted the exception here. However, the exception keeps happening even after we remove the split function from our query. I have posted a new issue here.

viadea (Collaborator, Author) commented Aug 5, 2024

Closing this issue since it is already fixed in the 24.08 branch.

viadea closed this as completed Aug 5, 2024