
[BUG] CSV/JSON data sources should avoid globbing paths when inferring schema #11158

Open
thirtiseven opened this issue Jul 9, 2024 · 2 comments
Labels
bug Something isn't working

Comments


thirtiseven commented Jul 9, 2024

Describe the bug

apache/spark#29659 fixed an issue with the CSV and JSON data sources in Spark SQL when both of the following are true:

  • no user-specified schema
  • some file paths contain escaped glob metacharacters, such as [], {}, and *

This causes the Spark unit test SPARK-32810: JSON data source should be able to read files with escaped glob metacharacter in the paths to fail with the plugin enabled.

Steps/Code to reproduce bug

scala> val data = Seq(("Alice", 28), ("Bob", 34), ("Cathy", 23))
data: Seq[(String, Int)] = List((Alice,28), (Bob,34), (Cathy,23))

scala> val df = data.toDF("name", "age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.write.mode("OVERWRITE").json("""[abc].json""")
24/07/09 08:02:19 WARN GpuOverrides:
!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced
  !Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because JSON output is not supported
  ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
    @Expression <AttributeReference> name#7 could run on GPU
    @Expression <AttributeReference> age#8 could run on GPU


scala> spark.conf.set("spark.rapids.sql.format.json.enabled", true)

scala> spark.conf.set("spark.rapids.sql.format.json.read.enabled", true)

scala> val dfRead = spark.read.json("""\[abc\].json""")
24/07/09 08:02:43 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because unsupported file format: org.apache.spark.sql.execution.datasources.text.TextFileFormat

dfRead: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> dfRead.show()
24/07/09 08:02:48 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(age#21L as string) AS age#27 will run on GPU
      *Expression <Cast> cast(age#21L as string) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

24/07/09 08:02:48 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 8)
org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:713)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File file:/home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize$lzycompute(GpuTextBasedPartitionReader.scala:346)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize(GpuTextBasedPartitionReader.scala:342)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readPartFile$1(GpuTextBasedPartitionReader.scala:368)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readPartFile(GpuTextBasedPartitionReader.scala:362)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readToTable$1(GpuTextBasedPartitionReader.scala:467)
	at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:180)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readToTable(GpuTextBasedPartitionReader.scala:467)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readBatch$1(GpuTextBasedPartitionReader.scala:393)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readBatch(GpuTextBasedPartitionReader.scala:391)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.next(GpuTextBasedPartitionReader.scala:612)
	at com.nvidia.spark.rapids.PartitionReaderWithBytesRead.next(dataSourceUtil.scala:62)
	at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:36)
	at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:44)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
	... 23 more
24/07/09 08:02:48 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 8) (spark-haoyang executor driver): org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
	(same stack trace as above)

24/07/09 08:02:48 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 8) (spark-haoyang executor driver): org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
	(same stack trace as above)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
  at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2863)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:3084)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:767)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:776)
  ... 47 elided
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
  (same stack trace as above)

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> dfRead.show()
+---+-----+
|age| name|
+---+-----+
| 28|Alice|
| 23|Cathy|
| 34|  Bob|
+---+-----+

Expected behavior
Files with escaped glob metacharacters in their paths should be read successfully on the GPU, matching the CPU result shown above.
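
For context, the failing getFileStatus call in the stack traces above receives a percent-encoded file URI (note the %5B / %5D in the "does not exist" path). Below is a minimal, speculative sketch of the suspected difference in path handling, using only the standard Hadoop Path and java.net.URI constructors; the path string is hypothetical, not the exact one from the logs.

import java.net.URI
import org.apache.hadoop.fs.Path

// Hypothetical percent-encoded URI string, shaped like the one in the error above.
val encoded = "file:/home/user/%5Babc%5D.json/part-00000.json"

// Building a Path directly from the encoded string keeps the literal
// "%5Babc%5D.json" path component, which does not exist on disk.
val literalPath = new Path(encoded)

// Parsing the string as a URI first decodes %5B/%5D back to "[" and "]",
// so the Path points at the "[abc].json" directory that actually exists.
val decodedPath = new Path(new URI(encoded))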

@thirtiseven thirtiseven added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jul 9, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jul 9, 2024
@mattahrens (Collaborator) commented:

Need to test with Parquet and ORC to make sure the issue does not exist with those data sources.

@thirtiseven (Collaborator, Author) commented:

> test with Parquet and ORC to make sure the issue does not exist with those data sources.

Verified by testing that Parquet and ORC do not have this issue.
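
A minimal sketch of the analogous Parquet check (illustrative only, not necessarily the exact commands used for the verification), mirroring the JSON repro above; the same pattern applies to ORC via write.orc / read.orc:

val df = Seq(("Alice", 28), ("Bob", 34), ("Cathy", 23)).toDF("name", "age")
df.write.mode("overwrite").parquet("[abc].parquet")     // directory name contains glob metacharacters
val dfRead = spark.read.parquet("""\[abc\].parquet""")  // metacharacters escaped on read
dfRead.show()                                           // expected to succeed on the GPU, unlike the JSON case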

@revans2 revans2 added the ? - Needs Triage Need team to review and classify label Jul 10, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jul 16, 2024