
Fix file format checks to be exact and handle Delta Lake column mapping [databricks] #9279

Merged: 5 commits, Sep 22, 2023

Conversation

jlowe (Contributor) commented on Sep 20, 2023:

Fixes #9255.

This fixes the following problems:

  • Format checks were allowing derived classes of the CPU class to be considered supported
  • DeltaParquetFileFormat semantics were being completely ignored (i.e., column mapping and deletion vectors)

File format checks were updated to compare against specific CPU classes, so we don't accidentally assume we know how to replace the semantics when a derived class that alters the behavior is used instead.
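For illustration, a minimal sketch of the distinction (the class names are real; the two method bodies are illustrative, not the merged code). A subclass-permissive check matches DeltaParquetFileFormat, which extends ParquetFileFormat but alters its semantics, while an exact-class check does not:

    import org.apache.spark.sql.execution.datasources.FileFormat
    import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat

    // Subclass-permissive: DeltaParquetFileFormat extends ParquetFileFormat,
    // so it would incorrectly be treated as a plain Parquet read here.
    def isSupportedFormatLoose(format: Class[_ <: FileFormat]): Boolean =
      classOf[ParquetFileFormat].isAssignableFrom(format)

    // Exact: only the specific CPU class whose semantics we know how to
    // replace is accepted; derived classes stay on the CPU.
    def isSupportedFormatExact(format: Class[_ <: FileFormat]): Boolean =
      format == classOf[ParquetFileFormat]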

The interface for Delta Lake providers was updated to handle read interfaces, and ExternalSource was updated accordingly.
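A hedged sketch of the shape of that provider interface (isSupportedFormat and createMultiFileReaderFactory are method names that appear in this PR; the parameter list below is an assumption for illustration, not the verbatim merged interface):

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.sql.connector.read.PartitionReaderFactory
    import org.apache.spark.sql.execution.datasources.FileFormat
    import org.apache.spark.util.SerializableConfiguration

    trait DeltaProvider {
      // Which exact file format classes this provider knows how to handle.
      def isSupportedFormat(format: Class[_ <: FileFormat]): Boolean

      // Read-side hook: build the GPU multi-file reader factory for a scan
      // over a Delta table. GpuFileSourceScanExec is the RAPIDS GPU scan
      // exec; this parameter list is illustrative.
      def createMultiFileReaderFactory(
          format: FileFormat,
          broadcastedConf: Broadcast[SerializableConfiguration],
          fileScan: GpuFileSourceScanExec): PartitionReaderFactory
    }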

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
jlowe self-assigned this on Sep 20, 2023.
jlowe (Contributor, Author) commented on Sep 20, 2023:

build

revans2 (Collaborator) left a comment:

Looks good, just one question. There are also some test failures, so I am not going to approve it yet.

@@ -41,4 +49,31 @@ object Delta20xProvider extends DeltaIOProvider {
.disabledByDefault("Delta Lake update support is experimental")
).map(r => (r.getClassFor.asSubclass(classOf[RunnableCommand]), r)).toMap
}

override def isSupportedFormat(format: Class[_ <: FileFormat]): Boolean = {
format == classOf[DeltaParquetFileFormat] || format == classOf[GpuDelta20xParquetFileFormat]

revans2 (Collaborator) commented on this diff:

When would we ever have replaced the file format and be tagging the plan again? I understand if we are just being cautious, but with the other checks for an exact class it feels like we would have made some horrible franken mix of CPU exec and GPU FileFormat to hit this. That or I don't really understand the context that this is called in. If that is the case, then we need some better docs for the method definition.

jlowe (Contributor, Author) replied:

This checks for the GPU version of the format because isSupportedFormat is how ExternalSource figures out which source, of many, is supposed to handle a particular call. For example, one of the interfaces is createMultiFileReaderFactory and that happens after we've already converted to a GPU format. I'll add some comments to the various isSupported methods of the provider interfaces to make this clearer.
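
For example, a hedged sketch of that dispatch pattern (the object name, provider list, and signature here are illustrative, following the DeltaProvider sketch above):

    object ExternalSourceSketch {
      // Illustrative: each Delta Lake provider claims the format classes it
      // owns, including the GPU format classes after plan conversion.
      private val providers: Seq[DeltaProvider] = Seq(Delta20xProvider)

      def createMultiFileReaderFactory(
          format: FileFormat,
          broadcastedConf: Broadcast[SerializableConfiguration],
          fileScan: GpuFileSourceScanExec): PartitionReaderFactory =
        providers.find(_.isSupportedFormat(format.getClass))
          .map(_.createMultiFileReaderFactory(format, broadcastedConf, fileScan))
          .getOrElse(throw new IllegalStateException(
            s"No provider supports format ${format.getClass}"))
    }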

revans2 (Collaborator) replied:

Okay, could you add some docs to the isSupportedFormat declaration about who could call this and what is expected?

jlowe (Contributor, Author) replied:

I decided this API is inherently confusing and fragile given it's looking at non-CPU formats. I ended up solving this via #9283 and will update this PR to be based on that. I'll also update the isSupportedFormat calls to check for the explicit CPU class we're expecting.

jlowe marked this pull request as draft on September 21, 2023 at 20:34.
jlowe (Contributor, Author) commented on Sep 21, 2023:

Putting this in draft until it's been upmerged with #9283.

@LIN-Yu-Ting commented:

@jlowe, very interesting development. I saw that:

  1. you modified GpuFileSourceScanExec.scala to handle DeltaParquetFileFormat via ExternalSource.scala;
  2. the DeltaProvider inside ExternalSource, implemented by DeltaProviderImpl.scala, executes the createMultiFileReaderFactory function;
  3. inside, createMultiFileReaderFactory is dispatched based on the Delta table version, and each version ultimately calls GpuDeltaParquetFileFormat to return the reader factory.

At the end, you prepare metadata such as the StructType for the Delta table with columnMappingMode:

    GpuParquetMultiFilePartitionReaderFactory(
      fileScan.conf,
      broadcastedConf,
      prepareSchema(fileScan.relation.dataSchema),
      prepareSchema(fileScan.requiredSchema),
      prepareSchema(fileScan.readPartitionSchema),
      pushedFilters,
      fileScan.rapidsConf,
      fileScan.allMetrics,
      fileScan.queryUsesInputFile,
      fileScan.alluxioPathsMap)

which is different from the original return value inside GpuFileSourceScanExec:

      GpuParquetMultiFilePartitionReaderFactory(
        sqlConf,
        broadcastedHadoopConf,
        relation.dataSchema,
        requiredSchema,
        readPartitionSchema,
        pushedDownFilters.toArray,
        rapidsConf,
        allMetrics,
        queryUsesInputFile,
        alluxioPathReplacementMap)

jlowe (Contributor, Author) commented on Sep 22, 2023:

> you prepare metadata such as StructType for DeltaTable with columnMappingMode

Yes, this reflects the behavior implemented by DeltaParquetFileFormat, the format used by the CPU when reading Delta Lake tables. See https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/DeltaParquetFileFormat.scala#L128-L136
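
Roughly, that prepareSchema step translates the logical read schema to the physical column names stored in the Parquet data files when columnMapping.mode = name. A minimal sketch of the idea (delta.columnMapping.physicalName is Delta Lake's field-metadata key; the body is illustrative, ignores nested fields, and is not the verbatim implementation):

    import org.apache.spark.sql.types.{StructField, StructType}

    // Illustrative: rewrite each top-level field to the physical name that
    // Delta records in the field metadata when columnMapping.mode = name.
    def prepareSchema(schema: StructType): StructType = {
      val key = "delta.columnMapping.physicalName"
      StructType(schema.fields.map { field =>
        if (field.metadata.contains(key)) {
          field.copy(name = field.metadata.getString(key))
        } else {
          field
        }
      })
    }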

jlowe marked this pull request as ready for review on September 22, 2023 at 15:23.
jlowe (Contributor, Author) commented on Sep 22, 2023:

build

Labels: bug (Something isn't working)

Successfully merging this pull request may close these issues:

  • [BUG] Unable to read DeltaTable with columnMapping.mode = name

4 participants