
[BUG] CSV/JSON data sources should avoid globbing paths when inferring schema #11158

Open
thirtiseven opened this issue Jul 9, 2024 · 2 comments
Labels
bug Something isn't working

Comments


thirtiseven commented Jul 9, 2024

Describe the bug

apache/spark#29659 fixed an issue with the CSV and JSON data sources in Spark SQL when both of the following are true:

  • no user-specified schema
  • some file paths contain escaped glob metacharacters, such as [], {}, and *

This causes the Spark unit test SPARK-32810: JSON data source should be able to read files with escaped glob metacharacter in the paths to fail with the plugin enabled.

Steps/Code to reproduce bug

scala> val data = Seq(("Alice", 28), ("Bob", 34), ("Cathy", 23))
data: Seq[(String, Int)] = List((Alice,28), (Bob,34), (Cathy,23))

scala> val df = data.toDF("name", "age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.write.mode("OVERWRITE").json("""[abc].json""")
24/07/09 08:02:19 WARN GpuOverrides:
!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced
  !Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because JSON output is not supported
  ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
    @Expression <AttributeReference> name#7 could run on GPU
    @Expression <AttributeReference> age#8 could run on GPU


scala> spark.conf.set("spark.rapids.sql.format.json.enabled", true)

scala> spark.conf.set("spark.rapids.sql.format.json.read.enabled", true)

scala> val dfRead = spark.read.json("""\[abc\].json""")
24/07/09 08:02:43 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because unsupported file format: org.apache.spark.sql.execution.datasources.text.TextFileFormat

dfRead: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> dfRead.show()
24/07/09 08:02:48 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(age#21L as string) AS age#27 will run on GPU
      *Expression <Cast> cast(age#21L as string) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

24/07/09 08:02:48 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 8)
org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:713)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File file:/home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize$lzycompute(GpuTextBasedPartitionReader.scala:346)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize(GpuTextBasedPartitionReader.scala:342)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readPartFile$1(GpuTextBasedPartitionReader.scala:368)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readPartFile(GpuTextBasedPartitionReader.scala:362)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readToTable$1(GpuTextBasedPartitionReader.scala:467)
	at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:180)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readToTable(GpuTextBasedPartitionReader.scala:467)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readBatch$1(GpuTextBasedPartitionReader.scala:393)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readBatch(GpuTextBasedPartitionReader.scala:391)
	at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.next(GpuTextBasedPartitionReader.scala:612)
	at com.nvidia.spark.rapids.PartitionReaderWithBytesRead.next(dataSourceUtil.scala:62)
	at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:36)
	at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:44)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
	... 23 more
24/07/09 08:02:48 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 8) (spark-haoyang executor driver): org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
	(same stack trace as above)

24/07/09 08:02:48 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 8) (spark-haoyang executor driver): org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
	(same stack trace as above)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
  at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2863)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:3084)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:767)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:776)
  ... 47 elided
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
  (same stack trace as above)

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> dfRead.show()
+---+-----+
|age| name|
+---+-----+
| 28|Alice|
| 23|Cathy|
| 34|  Bob|
+---+-----+

Expected behavior
Files with escaped glob metacharacters in their paths should be read successfully on the GPU, matching the CPU result shown above.
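
For context, the failing getFileStatus call in the stack traces above receives a percent-encoded file URI (note the %5B / %5D in the "does not exist" path). Below is a minimal, speculative sketch of the suspected difference in path handling, using only the standard Hadoop Path and java.net.URI constructors; the path string is hypothetical, not the exact one from the logs.

import java.net.URI
import org.apache.hadoop.fs.Path

// Hypothetical percent-encoded URI string, shaped like the one in the error above.
val encoded = "file:/home/user/%5Babc%5D.json/part-00000.json"

// Building a Path directly from the encoded string keeps the literal
// "%5Babc%5D.json" path component, which does not exist on disk.
val literalPath = new Path(encoded)

// Parsing the string as a URI first decodes %5B/%5D back to "[" and "]",
// so the Path points at the "[abc].json" directory that actually exists.
val decodedPath = new Path(new URI(encoded))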

@thirtiseven thirtiseven added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jul 9, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jul 9, 2024
@mattahrens (Collaborator) commented:

Need to test with Parquet and ORC to make sure the issue does not exist with those data sources.

@thirtiseven (Collaborator, Author) commented:

> test with Parquet and ORC to make sure the issue does not exist with those data sources.

Verified by testing that Parquet and ORC do not have this issue.
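
A minimal sketch of the analogous Parquet check (illustrative only, not necessarily the exact commands used for the verification), mirroring the JSON repro above; the same pattern applies to ORC via write.orc / read.orc:

val df = Seq(("Alice", 28), ("Bob", 34), ("Cathy", 23)).toDF("name", "age")
df.write.mode("overwrite").parquet("[abc].parquet")     // directory name contains glob metacharacters
val dfRead = spark.read.parquet("""\[abc\].parquet""")  // metacharacters escaped on read
dfRead.show()                                           // expected to succeed on the GPU, unlike the JSON case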

@revans2 revans2 added the ? - Needs Triage Need team to review and classify label Jul 10, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jul 16, 2024