Flexible parquet struct converter #4714

Closed
jxiang wants to merge 1 commit from the extra_parquet_struct_fields branch

Conversation

jxiang
Contributor

@jxiang jxiang commented Mar 3, 2016

No description provided.

@jxiang
Contributor Author

jxiang commented Mar 3, 2016

Our Parquet struct schema keeps evolving, so Parquet files written with different schema versions frequently coexist. Instead of failing the query on a schema mismatch, it is better to return null for new fields when the data file is old.

@@ -735,7 +743,7 @@ public ParquetStructConverter(Type prestoType, String columnName, GroupType entr
List<Type> prestoTypeParameters = prestoType.getTypeParameters();
List<parquet.schema.Type> fieldTypes = entryType.getFields();
checkArgument(
prestoTypeParameters.size() == fieldTypes.size(),
Contributor

Sometimes this check helps to catch problems with the data definition in the metastore, so I think silently returning nulls if the schemas do not match is not the right way to go.

Contributor Author

This check is good but too restrictive: it makes it hard to evolve the schema seamlessly. We return nulls and log a message below so that queries can still run. Users who pay attention to the log should realize what has happened.

Even if the field counts match, the schemas could still be incompatible. Users still need to know what they are doing.

@nezihyigitbasi
Contributor

can you add unit tests?

parquet.schema.Type fieldType = fieldTypes.get(i);
converters.add(createConverter(prestoTypeParameters.get(i), columnName + "." + fieldType.getName(), fieldType, i));
}
if (prestoTypeParameters.size() != fieldTypes.size()) {
log.info("Parquet column " + columnName + " field number mismatch, metastore has: "
Contributor

I don't think we need the log here (especially in INFO level)

Contributor Author

It's good to log something since we don't fail the query any more. I can make this a warning.

@jxiang
Contributor Author

jxiang commented Mar 4, 2016

Sure, will add some unit tests.

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch from 306170b to b316a2e Compare March 6, 2016 00:18
@jxiang
Contributor Author

jxiang commented Mar 6, 2016

Added a unit test to the patch.

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from 5d449a5 to 706735e Compare March 15, 2016 18:42
@jxiang
Contributor Author

jxiang commented Mar 15, 2016

Updated the patch: 1) we now support both adding and removing struct fields; 2) fields can be added at any position in the struct; 3) added more test cases for these scenarios.
However, changing the order of existing fields in a struct is not supported. In that case, a schema mismatch error is thrown.
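
One way to picture the restriction at this stage of the patch is a check that the fields shared by both schemas keep the same relative order. The sketch below is an illustration under that assumption, not code from the patch; the class StructOrderCheck and the method sameRelativeOrder are hypothetical names, and fields are compared by exact name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StructOrderCheck
{
    // Hypothetical helper: true when the fields present in both schemas
    // appear in the same relative order in each of them.
    static boolean sameRelativeOrder(List<String> metastoreFields, List<String> parquetFields)
    {
        return retain(metastoreFields, parquetFields).equals(retain(parquetFields, metastoreFields));
    }

    // Keeps the elements of 'fields' that also occur in 'other', preserving order.
    private static List<String> retain(List<String> fields, List<String> other)
    {
        Set<String> allowed = new HashSet<>(other);
        List<String> kept = new ArrayList<>();
        for (String field : fields) {
            if (allowed.contains(field)) {
                kept.add(field);
            }
        }
        return kept;
    }

    public static void main(String[] args)
    {
        List<String> metastore = Arrays.asList("a", "b", "c");
        // Field removed from the file: still compatible.
        System.out.println(sameRelativeOrder(metastore, Arrays.asList("a", "c"))); // true
        // Extra field in the middle of the file schema: still compatible.
        System.out.println(sameRelativeOrder(metastore, Arrays.asList("a", "x", "b", "c"))); // true
        // Existing fields reordered: this is the case reported as a mismatch.
        System.out.println(sameRelativeOrder(metastore, Arrays.asList("b", "a", "c"))); // false
    }
}
```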

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch from 706735e to cc17148 Compare March 16, 2016 19:59
@jxiang
Contributor Author

jxiang commented Mar 16, 2016

Added another patch that supports changing the order of existing fields in a struct. With these patches, we now fully support Parquet struct schema evolution.

  • The schema from the metastore is the source of truth.
  • Field order follows the metastore schema.
  • A field that is in the metastore schema but not in the Parquet schema has a null value.
  • A field that is in the Parquet schema but not in the metastore schema is ignored.

A schema mismatch is logged, but the query still executes.
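
A minimal Java sketch of the four rules above, assuming fields are matched by exact name. The class StructFieldResolver and its methods resolve and project are hypothetical names used for illustration only; they are not taken from the patch, and the sketch leaves out logging and type checks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StructFieldResolver
{
    // Hypothetical helper: for each metastore field, returns the index of the
    // same-named field in the Parquet file, or -1 if the file does not have it.
    static int[] resolve(List<String> metastoreFields, List<String> parquetFields)
    {
        Map<String, Integer> parquetIndex = new HashMap<>();
        for (int i = 0; i < parquetFields.size(); i++) {
            parquetIndex.put(parquetFields.get(i), i);
        }
        int[] mapping = new int[metastoreFields.size()];
        for (int i = 0; i < metastoreFields.size(); i++) {
            mapping[i] = parquetIndex.getOrDefault(metastoreFields.get(i), -1);
        }
        return mapping;
    }

    // Projects one Parquet row onto the metastore field order. Fields missing
    // from the file become null; fields only in the file are never read.
    static List<Object> project(int[] mapping, List<Object> parquetRow)
    {
        List<Object> result = new ArrayList<>();
        for (int index : mapping) {
            result.add(index < 0 ? null : parquetRow.get(index));
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<String> metastore = Arrays.asList("id", "name", "email");
        List<String> parquet = Arrays.asList("name", "legacy", "id");
        int[] mapping = resolve(metastore, parquet);
        System.out.println(Arrays.toString(mapping)); // [2, 0, -1]
        System.out.println(project(mapping, Arrays.<Object>asList("alice", "x", 7L))); // [7, alice, null]
    }
}
```

In this example the file stores name, legacy, id while the metastore declares id, name, email: the output follows the metastore order, email is null, and legacy is dropped.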

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from aebcbe4 to 7753059 Compare March 23, 2016 16:40
@@ -684,32 +693,40 @@ public void end()
void afterValue();
}

private interface BlockFieldConverter extends BlockConverter
Collaborator

Is it possible to convert all BlockConverters into BlockFieldConverters? Then we would need just one BlockConverter, with a fieldIndex.

Contributor Author

Most of the converters can share the same interface, except ParquetListEntryConverter and ParquetMapEntryConverter. The field index doesn't apply to these two entry converters.
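
The split described here can be sketched as a base interface plus a sub-interface that carries the field position. Only the interface names and afterValue() come from the diff context in this conversation; the fieldIndex() method and the two implementing classes are hypothetical stand-ins, since the patch's actual signatures are not shown.

```java
import java.util.Arrays;
import java.util.List;

class ConverterInterfaces
{
    // Shared by every converter.
    interface BlockConverter
    {
        void afterValue();
    }

    // Adds the position of the field inside its struct.
    // fieldIndex() is a hypothetical method name.
    interface BlockFieldConverter
            extends BlockConverter
    {
        int fieldIndex();
    }

    // Stand-in for a converter that handles one struct field.
    static final class PrimitiveFieldConverter
            implements BlockFieldConverter
    {
        private final int fieldIndex;

        PrimitiveFieldConverter(int fieldIndex)
        {
            this.fieldIndex = fieldIndex;
        }

        @Override
        public void afterValue() {}

        @Override
        public int fieldIndex()
        {
            return fieldIndex;
        }
    }

    // Stand-in for a list or map entry converter: it has no struct field index.
    static final class ListEntryConverter
            implements BlockConverter
    {
        @Override
        public void afterValue() {}
    }

    public static void main(String[] args)
    {
        List<BlockConverter> converters = Arrays.<BlockConverter>asList(new PrimitiveFieldConverter(2), new ListEntryConverter());
        for (BlockConverter converter : converters) {
            if (converter instanceof BlockFieldConverter) {
                System.out.println("field index " + ((BlockFieldConverter) converter).fieldIndex());
            }
            else {
                System.out.println("entry converter, no field index");
            }
        }
    }
}
```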

@dain
Contributor

dain commented Jan 11, 2017

@zhenxiao now that the new parquet reader supports structs, we should add this same feature to it

@zhenxiao
Collaborator

@dain yes, we are working on it and stress testing it.
The new Parquet reader uses nested path names to look up ColumnDescriptors, and it now has most of the schema evolution support.
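
A rough sketch of why a lookup by nested path name tolerates schema evolution: a column is found by its path rather than by its position, so a path that is absent from an old file simply yields no match. This does not use the Parquet library. The class ColumnPathLookup is hypothetical, column paths are modelled as plain string arrays, and the sketch assumes field names contain no dots.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ColumnPathLookup
{
    // Maps a dotted path such as "user.name" to the column's position in the file.
    private final Map<String, Integer> columnsByPath = new LinkedHashMap<>();

    ColumnPathLookup(List<String[]> leafPaths)
    {
        for (int i = 0; i < leafPaths.size(); i++) {
            columnsByPath.put(String.join(".", leafPaths.get(i)), i);
        }
    }

    // Returns the column position for the requested path, or empty when the
    // file was written before that field existed.
    Optional<Integer> find(String... path)
    {
        return Optional.ofNullable(columnsByPath.get(String.join(".", path)));
    }

    public static void main(String[] args)
    {
        ColumnPathLookup oldFile = new ColumnPathLookup(Arrays.asList(
                new String[] {"user", "id"},
                new String[] {"user", "name"}));
        System.out.println(oldFile.find("user", "name")); // Optional[1]
        System.out.println(oldFile.find("user", "email")); // Optional.empty
    }
}
```

A caller that gets an empty result could fill the field with nulls, which matches the behaviour requested in this PR.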

@markcho

markcho commented Feb 10, 2017

@jxiang Is there anything that I can do to help out with this PR?

I'm facing the same problem: we have mismatching struct schemas due to schema evolution, and it's not very feasible for us to backfill the old Parquet files to match the new schemas.

I can apply this change to my fork but I think other people may find this feature useful as well.

That being said, is there a different approach to schema evolution involving Parquet files for cases similar to this, without applying this patch?

@billonahill

+1 to @markcho's comment. We also need this patch. cc/ @Yaliang.

@Gauravshah

@zhenxiao the code still doesn't allow struct types to evolve: https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java#L743-L748

Is there anything we can do to help with this PR?

@zhenxiao
Collaborator

we have a rebased version here:
ba5a3e4

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from eee4f3b to 333ecb2 Compare February 14, 2017 19:27
@Gauravshah

adding the updated pull request for reference #6675

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from 2b018e0 to 5b14fbb Compare February 14, 2017 19:41
@jxiang
Contributor Author

jxiang commented Feb 14, 2017

Thanks @zhenxiao, I pushed the rebased version to this branch.

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch from 5b14fbb to 0f526c9 Compare February 14, 2017 19:58
@jxiang
Contributor Author

jxiang commented Feb 14, 2017

@dain could you take a look when you get a chance? Thanks.

@dain
Contributor

dain commented Feb 14, 2017

@jxiang yep. I see there are two (or three) PRs related to this. Can you help me understand which ones I should review in which order?

@dain dain self-requested a review February 14, 2017 23:14
@zhenxiao
Collaborator

@dain this PR is the very first one, now @jxiang has all schema evolution stuff in one commit. #6675 is built on top of this

@jxiang
Contributor Author

jxiang commented Feb 15, 2017

Yeah, as @zhenxiao said, this is the first one. Thanks.

@dain dain assigned nezihyigitbasi and unassigned dain Mar 17, 2017
@dain dain requested review from nezihyigitbasi and removed request for dain March 17, 2017 19:10
@nezihyigitbasi
Contributor

@zhenxiao @jxiang AFAIU #6675 supersedes this one. If that's correct please close this PR and then we can work on the other one.

@zhenxiao
Collaborator

continue with:
#6675

@jxiang jxiang closed this Apr 17, 2017
@jxiang jxiang deleted the extra_parquet_struct_fields branch April 18, 2017 16:32