Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Add system config for unsafe Parquet ID fallback. #9324

Merged
merged 4 commits into from
Jan 22, 2024

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Dec 17, 2023

This adds a SystemConfig setting to enable or disable unsafe Parquet ID fallback.

Unsafe Parquet ID fallback (ParquetSchemaUtil.pruneColumnsFallback) is Netflix-specific behavior that assigns column IDs by position if a file is missing IDs and there is no name mapping. This pre-dates the public release (a5eb3f6). This is only applicable for Netflix datasets that resolved columns by position in Parquet and could not use name mapping.

This PR is the first step to removing fallback ID assignment by adding an environment config that can enable it. We need to remove fallback ID assignment because it is not safe. For data from Spark or other systems with name-based evolution, fallback ID assignment will produce incorrect results when data file schemas have changed (dropped a column). Also, integrations have recently used this behavior accidentally by producing incorrect Parquet files with no field IDs. Removing this will help Iceberg producers to adhere to the spec.

Currently, fallback assignment behavior is triggered when a NameMapping is null and a file has no field IDs. This adds the ability to disable fallback assignment by passing an empty NameMapping that will result in the original file schema after mapping.

@rdblue
Copy link
Contributor Author

rdblue commented Dec 17, 2023

@jackye1995, this is the fix I mentioned for Parquet data files.

@rdblue rdblue force-pushed the add-system-config-for-unsafe-id-fallback branch from 7c1e751 to 99a169f Compare December 19, 2023 00:57
@rdblue
Copy link
Contributor Author

rdblue commented Jan 14, 2024

@jackye1995, @danielcweeks, @RussellSpitzer, could you look at this? I'd like to ideally get it into the next release since we have been allowing unsafe reads.

@@ -1119,27 +1120,29 @@ public <D> CloseableIterable<D> build() {

ParquetReadOptions options = optionsBuilder.build();

NameMapping mapping;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: You could also check the system property when picking the default value in the constructor or add a method called nameMapping() to avoid two variables with similar names in build().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went to add this, but then I thought it was a bit confusing to have both the field and method. I think it should be okay to have an extra variable.

@Deprecated
public static final ConfigEntry<Boolean> NETFLIX_UNSAFE_PARQUET_ID_FALLBACK_ENABLED =
new ConfigEntry<>(
"iceberg.netflix.unsafe-parquet-id-fallback.enabled",
Copy link
Contributor

@aokolnychyi aokolnychyi Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure including Netflix in the property name gives us any benefit. I'd consider just calling it any of these:

iceberg.unsafe-parquet-id-fallback.enabled
iceberg.parquet.unsafe-id-fallback.enabled

Up to you, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having Netflix in the name because it signals that people should not use this. It is not a feature, it is an artifact of Iceberg's history.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am +1 on adding a conf now and disabling it in 2.0. I think the fallback can lead to silent data correctness issues and that's not good.

Is there an easy way to test this?

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I've added the 1.5 milestone as well to make sure that we can remove this in 2.0

@Fokko Fokko added this to the Iceberg 1.5.0 milestone Jan 19, 2024
@rdblue rdblue merged commit 0f509d2 into apache:main Jan 22, 2024
41 checks passed
@rdblue
Copy link
Contributor Author

rdblue commented Jan 22, 2024

Merged. Thanks for reviewing, @aokolnychyi and @Fokko!

geruh pushed a commit to geruh/iceberg that referenced this pull request Jan 26, 2024
Co-authored-by: Fokko Driesprong <fokko@apache.org>
adnanhemani pushed a commit to adnanhemani/iceberg that referenced this pull request Jan 30, 2024
Co-authored-by: Fokko Driesprong <fokko@apache.org>
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants