Flexible parquet struct converter #4714
Conversation
Our Parquet struct schema keeps evolving, so Parquet files of different versions frequently coexist. Instead of failing the query on a schema mismatch, it is better to return null for new fields when the data file is old.
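The intended behavior can be illustrated with a minimal sketch (class and method names here are hypothetical, not from the PR): each field of the table's struct schema is resolved against the file's row, and fields that the older file does not contain simply come back as null instead of raising an error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the PR's idea: resolve each table-schema field
// against a (possibly older) file row; missing fields yield null.
public class StructResolver
{
    public static List<Object> resolve(List<String> tableFields, Map<String, Object> fileRow)
    {
        List<Object> values = new ArrayList<>();
        for (String field : tableFields) {
            // An old file simply lacks the newer column, so this returns null.
            values.add(fileRow.getOrDefault(field, null));
        }
        return values;
    }

    public static void main(String[] args)
    {
        Map<String, Object> oldFileRow = new HashMap<>();
        oldFileRow.put("id", 1);
        // The table schema later gained a "score" field the old file lacks.
        List<Object> row = resolve(Arrays.asList("id", "score"), oldFileRow);
        System.out.println(row); // [1, null]
    }
}
```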
@@ -735,7 +743,7 @@ public ParquetStructConverter(Type prestoType, String columnName, GroupType entr
List<Type> prestoTypeParameters = prestoType.getTypeParameters();
List<parquet.schema.Type> fieldTypes = entryType.getFields();
checkArgument(
        prestoTypeParameters.size() == fieldTypes.size(),
Sometimes this check helps to catch problems with the data definition in the metastore, so I think silently returning nulls if the schemas do not match is not the right way to go.
This check is good but too restrictive: it makes it hard to evolve the schema seamlessly. We return nulls and log a message below so that queries can still proceed. If users pay attention to the logging, they will realize what happened.
Even if the number matches, the schema still could be incompatible. Users still need to know what they are doing.
Can you add unit tests?
    parquet.schema.Type fieldType = fieldTypes.get(i);
    converters.add(createConverter(prestoTypeParameters.get(i), columnName + "." + fieldType.getName(), fieldType, i));
}
if (prestoTypeParameters.size() != fieldTypes.size()) {
    log.info("Parquet column " + columnName + " field number mismatch, metastore has: "
I don't think we need the log here (especially at INFO level).
It's good to log something, since we no longer fail the query. I can make this a warning.
Sure, will add some unit tests.
Force-pushed from 306170b to b316a2e.
Added a unit test to the patch.
Force-pushed from 5d449a5 to 706735e.
Updated the patch: 1) we now support both adding and removing struct fields; 2) fields can be added at any position in the struct; 3) added more test cases covering these scenarios.
Force-pushed from 706735e to cc17148.
Added another patch that supports changing the order of existing fields in a struct. With these patches, we now fully support Parquet struct schema evolution. Schema mismatches will be logged, though the query still executes.
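The add/remove/reorder support described above amounts to matching struct fields by name rather than by position. A rough illustration (class and method names are hypothetical, not from the patch): each table-schema field is looked up by name in the file schema, so added, removed, and reordered fields all resolve correctly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of name-based field matching for Parquet
// struct schema evolution (add, remove, and reorder all work).
public class FieldMatcher
{
    // Returns the file-schema index of each table field, or -1 if the
    // field is absent from the file (an old file predating the column).
    public static int[] matchByName(List<String> tableFields, List<String> fileFields)
    {
        int[] indexes = new int[tableFields.size()];
        for (int i = 0; i < tableFields.size(); i++) {
            indexes[i] = fileFields.indexOf(tableFields.get(i));
        }
        return indexes;
    }

    public static void main(String[] args)
    {
        // The file was written before "score" existed, with its fields
        // in a different order than the current table schema.
        int[] indexes = matchByName(
                Arrays.asList("id", "name", "score"),
                Arrays.asList("name", "id"));
        System.out.println(Arrays.toString(indexes)); // [1, 0, -1]
    }
}
```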
Force-pushed from aebcbe4 to 7753059.
@@ -684,32 +693,40 @@ public void end()
    void afterValue();
}

private interface BlockFieldConverter extends BlockConverter
Is it possible to update all BlockConverter into BlockFieldConverter? Then we would just need one BlockConverter, with a fieldIndex.
Most of the converters can share the same interface, except ParquetListEntryConverter and ParquetMapEntryConverter. The field index doesn't apply to those two entry converters.
@zhenxiao now that the new parquet reader supports structs, we should add this same feature to it.
@dain yes, we are working on it and stress testing it.
@jxiang Is there anything I can do to help out with this PR? I'm facing the same problem: we have mismatching schemas for structs due to schema evolution, and it's not feasible for us to backfill the old Parquet files to match the new schemas. I can apply this change to my fork, but I think other people may find this feature useful as well. That being said, is there a different approach to schema evolution with Parquet files for cases like this, without applying this patch?
@zhenxiao the code still doesn't allow struct types to evolve: https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java#L743-L748 — anything we can do to help with this PR?
We have a rebased version here:
Force-pushed from eee4f3b to 333ecb2.
Adding the updated pull request for reference: #6675
Force-pushed from 2b018e0 to 5b14fbb.
Thanks @zhenxiao, I pushed the rebased version to this branch.
Force-pushed from 5b14fbb to 0f526c9.
@dain could you take a look when you get a chance? Thanks.
@jxiang yep. I see there are two (or three) PRs related to this. Can you help me understand which ones I should review, and in which order?
Yeah, as @zhenxiao said, this is the first one. Thanks.
Continue with: