-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: Add system config for unsafe Parquet ID fallback. #9324
Parquet: Add system config for unsafe Parquet ID fallback. #9324
Conversation
@jackye1995, this is the fix I mentioned for Parquet data files. |
7c1e751
to
99a169f
Compare
@jackye1995, @danielcweeks, @RussellSpitzer, could you look at this? I'd like to ideally get it into the next release since we have been allowing unsafe reads. |
@@ -1119,27 +1120,29 @@ public <D> CloseableIterable<D> build() { | |||
|
|||
ParquetReadOptions options = optionsBuilder.build(); | |||
|
|||
NameMapping mapping; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: You could also check the system property when picking the default value in the constructor or add a method called nameMapping()
to avoid two variables with similar names in build()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went to add this, but then I thought it was a bit confusing to have both the field and method. I think it should be okay to have an extra variable.
@Deprecated | ||
public static final ConfigEntry<Boolean> NETFLIX_UNSAFE_PARQUET_ID_FALLBACK_ENABLED = | ||
new ConfigEntry<>( | ||
"iceberg.netflix.unsafe-parquet-id-fallback.enabled", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure including Netflix in the property name gives us any benefit. I'd consider just calling it any of these:
iceberg.unsafe-parquet-id-fallback.enabled
iceberg.parquet.unsafe-id-fallback.enabled
Up to you, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having Netflix in the name because it signals that people should not use this. It is not a feature, it is an artifact of Iceberg's history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am +1 on adding a conf now and disabling it in 2.0. I think the fallback can lead to silent data correctness issues and that's not good.
Is there an easy way to test this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I've added the 1.5 milestone as well to make sure that we can remove this in 2.0
Merged. Thanks for reviewing, @aokolnychyi and @Fokko! |
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
This adds a
SystemConfig
setting to enable or disable unsafe Parquet ID fallback.Unsafe Parquet ID fallback (
ParquetSchemaUtil.pruneColumnsFallback
) is Netflix-specific behavior that assigns column IDs by position if a file is missing IDs and there is no name mapping. This pre-dates the public release (a5eb3f6). This is only applicable for Netflix datasets that resolved columns by position in Parquet and could not use name mapping.This PR is the first step to removing fallback ID assignment by adding an environment config that can enable it. We need to remove fallback ID assignment because it is not safe. For data from Spark or other systems with name-based evolution, fallback ID assignment will produce incorrect results when data file schemas have changed (dropped a column). Also, integrations have recently used this behavior accidentally by producing incorrect Parquet files with no field IDs. Removing this will help Iceberg producers to adhere to the spec.
Currently, fallback assignment behavior is triggered when a
NameMapping
is null and a file has no field IDs. This adds the ability to disable fallback assignment by passing an emptyNameMapping
that will result in the original file schema after mapping.