[Kernel][Defaults] Handle legacy map types in Parquet files #3097

vkorukanti · 2024-05-14T22:23:40Z

Description

Currently, Kernel's Parquet reader explicitly looks for the `key_value` repeated group under the Parquet map type, but older Parquet writers could give the repeated group any name. Instead of looking up the `key_value` element by name, fetch the first element in the map group's field list. See [here](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) for more details.
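The difference between the two lookups can be sketched as follows. This is a minimal, self-contained illustration, not the code from this patch: `LegacyMapSketch`, `Node`, `repeatedGroupByName`, and `repeatedGroupByPosition` are hypothetical names, `Node` is a toy stand-in for a Parquet schema node rather than the parquet-mr `GroupType`, and the legacy group name `map` is only an example of a non-standard name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class LegacyMapSketch {
    /** Hypothetical stand-in for a Parquet schema node; this is not the parquet-mr API. */
    public static final class Node {
        public final String name;
        public final boolean repeated;
        public final List<Node> children;

        public Node(String name, boolean repeated, Node... children) {
            this.name = name;
            this.repeated = repeated;
            this.children = Collections.unmodifiableList(Arrays.asList(children));
        }
    }

    /** Strict lookup: finds the repeated group only if it is literally named "key_value". */
    public static Node repeatedGroupByName(Node mapGroup) {
        for (Node child : mapGroup.children) {
            if (child.name.equals("key_value")) {
                return child;
            }
        }
        return null; // files from legacy writers end up here
    }

    /** Lenient lookup: takes the first child of the map group, whatever it is called. */
    public static Node repeatedGroupByPosition(Node mapGroup) {
        if (mapGroup.children.isEmpty()) {
            throw new IllegalArgumentException("map group has no fields: " + mapGroup.name);
        }
        Node first = mapGroup.children.get(0);
        // Extra sanity check added for this sketch (an assumption, not taken from the patch).
        if (!first.repeated) {
            throw new IllegalArgumentException("expected a repeated group, got: " + first.name);
        }
        return first;
    }

    public static void main(String[] args) {
        // Legacy layout: the repeated group is named "map" instead of "key_value".
        Node legacy = new Node("my_map", false,
            new Node("map", true, new Node("key", false), new Node("value", false)));
        // Standard layout: the repeated group is named "key_value".
        Node standard = new Node("my_map", false,
            new Node("key_value", true, new Node("key", false), new Node("value", false)));

        System.out.println(repeatedGroupByName(legacy) == null);
        System.out.println(repeatedGroupByPosition(legacy).name);
        System.out.println(repeatedGroupByPosition(standard).name);
    }
}
```

The positional lookup handles both layouts with one code path, which is why no writer-version check is needed in this sketch.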

How was this patch tested?

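The [test](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetThriftCompatibilitySuite.scala#L29) and the sample file written by legacy writers are taken from Apache Spark™.

Some columns in the test file (arrays with 2-level encoding, another legacy format) are not yet supported. I will follow up with a separate PR, since supporting them involves a bit of refactoring in `ArrayColumnReader`.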

scottsand-db · 2024-05-14T22:27:28Z

...kernel-defaults/src/main/java/io/delta/kernel/defaults/internal/parquet/MapColumnReader.java

+            int initialBatchSize,
+            MapType typeFromClient,
+            GroupType typeFromFile) {
+        // Repeated element can be any name. Latest Parquet versions use "key_value" as the name,


Can we check which version the file was written with?

i.e. latest -> use `key_value`; only for older versions allow an arbitrary name.

vkorukanti · 2024-05-14

As far as I know, there is no clear or confirmed way to identify which version (i.e., Parquet format version) a file was written with. Each writer followed its own convention for recording that metadata.

vkorukanti added 3 commits May 14, 2024 15:12

[Kernel][Defaults] Handle legacy map types in Parquet files

de04516

fixes

7ce71a2

f

cfa4653

vkorukanti added the kernel label May 14, 2024

scottsand-db reviewed May 14, 2024

scottsand-db approved these changes May 14, 2024

fix the file path in test

ec8b866

vkorukanti merged commit b9fe0e1 into delta-io:master May 14, 2024
10 checks passed
