Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel][Defaults] Handle legacy map types in Parquet files #3097

Merged
merged 4 commits into from
May 14, 2024

Conversation

vkorukanti
Copy link
Collaborator

Description

Currently, Kernel's Parquet reader explicitly looks for the key_value repeated group under the Parquet map type, but the older versions of Parquet writers wrote any name for the repeated group. Instead of looking for the explicit key_value element, fetch the first element in the list. See here for more details.

How was this patch tested?

The test and sample file written by legacy writers are taken from Apache Spark™.

Some columns (arrays with 2-level encoding, another legacy format) from the test file are currently not supported. I will follow up with a separate PR. It involves bit refactoring on the ArrayColumnReader.

int initialBatchSize,
MapType typeFromClient,
GroupType typeFromFile) {
// Repeated element can be any name. Latest Parquet versions use "key_value" as the name,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check which version it was written as?

i.e. latest -> use key_value. older? only then allow arbitrary name

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, there is no clear or confirmed way to identify which version (i.e., Parquet format version) the file was written with. Each writer followed its own way to add the metadata.

@vkorukanti vkorukanti merged commit b9fe0e1 into delta-io:master May 14, 2024
10 checks passed
vkorukanti added a commit that referenced this pull request May 14, 2024
Currently, Kernel's Parquet reader explicitly looks for the `key_value`
repeated group under the Parquet map type, but the older versions of
Parquet writers wrote any name for the repeated group. Instead of
looking for the explicit `key_value` element, fetch the first element in
the list. See
[here](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps)
for more details.

The
[test](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetThriftCompatibilitySuite.scala#L29)
and sample file written by legacy writers are taken from Apache Spark™.

Some columns (arrays with 2-level encoding, another legacy format) from
the test file are currently not supported. I will follow up with a
separate PR. It involves bit refactoring on the ArrayColumnReader.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants