Flexible parquet struct converter #4714

jxiang · 2016-03-03T22:04:05Z

No description provided.

jxiang · 2016-03-03T22:07:42Z

Our Parquet struct schema keeps evolving, and we very frequently have Parquet files of different versions co-existing. Instead of failing the query on a schema mismatch, it is better to return null for new fields when the data file is old.

nezihyigitbasi · 2016-03-04T17:45:07Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java

@@ -735,7 +743,7 @@ public ParquetStructConverter(Type prestoType, String columnName, GroupType entr
            List<Type> prestoTypeParameters = prestoType.getTypeParameters();
            List<parquet.schema.Type> fieldTypes = entryType.getFields();
            checkArgument(
-                    prestoTypeParameters.size() == fieldTypes.size(),


Sometimes this check helps to catch problems with the data definition in the metastore, so I think silently returning nulls if the schemas do not match is not the right way to go.

This check is good but too restrictive. It makes it hard to evolve the schema seamlessly. We return nulls and log a message below so that queries can still go on. Users who pay attention to the logging should realize what has happened.

Even if the number of fields matches, the schema could still be incompatible. Users still need to know what they are doing.
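
To illustrate the point about matching counts, here is a minimal sketch. The class and method names (`FieldCountCheck`, `sameCount`, `sameNames`) and the field names are hypothetical and are not part of the patch. It only shows that two schemas can have the same number of fields and still disagree on what those fields are.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not code from ParquetHiveRecordCursor.
final class FieldCountCheck
{
    private FieldCountCheck() {}

    // The kind of check the original checkArgument performs: counts only.
    static boolean sameCount(List<String> metastoreFields, List<String> fileFields)
    {
        return metastoreFields.size() == fileFields.size();
    }

    // A stricter comparison: same names in the same order.
    static boolean sameNames(List<String> metastoreFields, List<String> fileFields)
    {
        return metastoreFields.equals(fileFields);
    }

    public static void main(String[] args)
    {
        List<String> metastore = Arrays.asList("id", "name");
        List<String> file = Arrays.asList("id", "email");
        System.out.println("same count: " + sameCount(metastore, file));
        System.out.println("same names: " + sameNames(metastore, file));
    }
}
```

With these inputs the count check passes while the name comparison does not, so the count check alone cannot guarantee compatibility.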

nezihyigitbasi · 2016-03-04T17:48:00Z

can you add unit tests?

nezihyigitbasi · 2016-03-04T17:49:26Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java

                parquet.schema.Type fieldType = fieldTypes.get(i);
                converters.add(createConverter(prestoTypeParameters.get(i), columnName + "." + fieldType.getName(), fieldType, i));
            }
+            if (prestoTypeParameters.size() != fieldTypes.size()) {
+                log.info("Parquet column " + columnName + " field number mismatch, metastore has: "


I don't think we need the log here (especially at INFO level).

It's good to log something since we don't fail the query any more. I can make this a warning.

jxiang · 2016-03-04T18:59:14Z

Sure, will add some unit tests.

jxiang · 2016-03-06T00:19:59Z

Added a unit test to the patch.

jxiang · 2016-03-15T18:47:53Z

Updated the patch: 1) we now support both adding and removing struct fields; 2) fields can be added at any position in the struct; 3) added more test cases for these scenarios.
However, changing the order of existing fields in a struct is not supported. In that case, a schema mismatch error is thrown.

jxiang · 2016-03-16T20:14:54Z

Added another patch that supports changing the order of existing fields in a struct. With these patches, now we fully support Parquet struct schema evolution.

The schema from the metastore is the source of truth:
The field order is taken from the metastore schema.
A field that is in the metastore schema but not in the Parquet schema has a null value.
A field that is not in the metastore schema but is in the Parquet schema is ignored.

A schema mismatch is logged, but the query still executes.
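
The rules above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the code in the patch: `StructFieldMapping`, `resolve`, and `project` are hypothetical names, fields are matched by exact name, and a row is modeled as a simple `List<Object>` rather than Presto blocks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; matching by exact field name is an assumption.
final class StructFieldMapping
{
    private StructFieldMapping() {}

    /**
     * For each metastore field, in metastore order, returns the index of the
     * field with the same name in the file schema, or -1 if the file lacks it.
     * File fields the metastore does not declare never appear in the result.
     */
    static int[] resolve(List<String> metastoreFields, List<String> fileFields)
    {
        Map<String, Integer> fileIndexByName = new HashMap<>();
        for (int i = 0; i < fileFields.size(); i++) {
            fileIndexByName.put(fileFields.get(i), i);
        }
        int[] mapping = new int[metastoreFields.size()];
        for (int i = 0; i < metastoreFields.size(); i++) {
            mapping[i] = fileIndexByName.getOrDefault(metastoreFields.get(i), -1);
        }
        return mapping;
    }

    /** Reorders one row of file values into metastore order; missing fields become null. */
    static List<Object> project(int[] mapping, List<Object> fileRow)
    {
        List<Object> result = new ArrayList<>(mapping.length);
        for (int fileIndex : mapping) {
            result.add(fileIndex < 0 ? null : fileRow.get(fileIndex));
        }
        return result;
    }
}
```

For example, with metastore fields `(a, b, c)` and file fields `(c, x, a)`, `resolve` maps `a` to file index 2, `b` to -1 (read as null), and `c` to file index 0, while the extra file field `x` is never read.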

zhenxiao · 2016-03-30T01:54:49Z

presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java

@@ -684,32 +693,40 @@ public void end()
        void afterValue();
    }

+    private interface BlockFieldConverter extends BlockConverter


Is it possible to change every BlockConverter into a BlockFieldConverter? Then we would need just one BlockConverter interface, with fieldIndex.

Most converters can share the same interface, except ParquetListEntryConverter and ParquetMapEntryConverter. The field index doesn't apply to these two entry converters.
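
A rough sketch of the split described in this reply, assuming `afterValue()` belongs to `BlockConverter` as the diff context suggests. Everything else here is hypothetical: `getFieldIndex`, `addValue`, and the two simplified converter classes are stand-ins for illustration and do not mirror the real Presto converters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the two interfaces under discussion.
final class ConverterSketch
{
    private ConverterSketch() {}

    // Callback shared by every converter.
    interface BlockConverter
    {
        void afterValue();
    }

    // Assumed shape: a converter that knows its position in the parent struct.
    interface BlockFieldConverter extends BlockConverter
    {
        int getFieldIndex();
    }

    // A struct field converter writes its value into the slot for its field index.
    static final class StructFieldConverter implements BlockFieldConverter
    {
        private final int fieldIndex;
        private final Object[] structSlots;
        private Object pending;

        StructFieldConverter(int fieldIndex, Object[] structSlots)
        {
            this.fieldIndex = fieldIndex;
            this.structSlots = structSlots;
        }

        void addValue(Object value)
        {
            pending = value;
        }

        @Override
        public int getFieldIndex()
        {
            return fieldIndex;
        }

        @Override
        public void afterValue()
        {
            structSlots[fieldIndex] = pending;
        }
    }

    // A list entry converter appends in arrival order, so a field index has no meaning.
    static final class ListEntryConverter implements BlockConverter
    {
        private final List<Object> elements = new ArrayList<>();
        private Object pending;

        void addValue(Object value)
        {
            pending = value;
        }

        @Override
        public void afterValue()
        {
            elements.add(pending);
        }

        List<Object> getElements()
        {
            return elements;
        }
    }
}
```

In this model, struct slots that no converter writes to simply stay null, which is one way a field index could support missing fields. Entry converters only append, so forcing a field index onto them would add an unused parameter.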

dain · 2017-01-11T22:57:16Z

@zhenxiao now that the new parquet reader supports structs, we should add this same feature to it

zhenxiao · 2017-01-13T00:57:10Z

@dain yes, we are working / stress testing it
The new Parquet reader uses nested path names to look up ColumnDescriptors, so it already has most of the schema evolution support.

markcho · 2017-02-10T01:13:51Z

@jxiang Is there anything that I can do to help out with this PR?

I'm facing the same problem: we have mismatching schemas for structs due to schema evolution, and it's not very feasible for us to backfill the old Parquet files to match the new schemas.

I can apply this change to my fork but I think other people may find this feature useful as well.

That being said, is there a different approach to schema evolution involving Parquet files for cases similar to this, without applying this patch?

billonahill · 2017-02-10T01:36:45Z

+1 to @markcho's comment. We're in need of this patch as well. cc/ @Yaliang.

Gauravshah · 2017-02-14T18:34:32Z

@zhenxiao the code still doesn't allow for struct types to evolve https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java#L743-L748

anything we can do to help in this pr ?

zhenxiao · 2017-02-14T18:51:24Z

we have a rebased version here:
ba5a3e4

Gauravshah · 2017-02-14T19:33:10Z

adding the updated pull request for reference #6675

jxiang · 2017-02-14T19:48:37Z

Thanks @zhenxiao, I pushed the rebased version to this branch.

jxiang · 2017-02-14T20:00:55Z

@dain could you take a look when you get a chance? Thanks.

dain · 2017-02-14T23:14:17Z

@jxiang yep. I see there are two (or three) PRs related to this. Can you help me understand which ones I should review in which order?

zhenxiao · 2017-02-14T23:40:10Z

@dain this PR is the very first one, now @jxiang has all schema evolution stuff in one commit. #6675 is built on top of this

jxiang · 2017-02-15T16:11:26Z

Yeah, as @zhenxiao said, this is the first one. Thanks.

nezihyigitbasi · 2017-04-10T20:13:32Z

@zhenxiao @jxiang AFAIU #6675 supersedes this one. If that's correct please close this PR and then we can work on the other one.

zhenxiao · 2017-04-17T23:33:36Z

continue with:
#6675

facebook-github-bot added the CLA Signed label Mar 3, 2016

nezihyigitbasi reviewed Mar 4, 2016

jxiang force-pushed the extra_parquet_struct_fields branch from 306170b to b316a2e Compare March 6, 2016 00:18

jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from 5d449a5 to 706735e Compare March 15, 2016 18:42

jxiang force-pushed the extra_parquet_struct_fields branch from 706735e to cc17148 Compare March 16, 2016 19:59

jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from aebcbe4 to 7753059 Compare March 23, 2016 16:40

zhenxiao reviewed Mar 30, 2016

This was referenced Feb 4, 2017

Parquet schema evolution on non-primitive type twitter-forks/presto#71

Closed

Parquet partition schema evolution on non-primitive columns #7305

Closed

jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from eee4f3b to 333ecb2 Compare February 14, 2017 19:27

jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from 2b018e0 to 5b14fbb Compare February 14, 2017 19:41

Support Schema Evolution in Parquet

0f526c9

jxiang force-pushed the extra_parquet_struct_fields branch from 5b14fbb to 0f526c9 Compare February 14, 2017 19:58

dain self-requested a review February 14, 2017 23:14

zhenxiao mentioned this pull request Mar 1, 2017

Parquet Hive Record Cursor does not support "nullable records" #7453

Closed

dain assigned nezihyigitbasi and unassigned dain Mar 17, 2017

dain requested review from nezihyigitbasi and removed request for dain March 17, 2017 19:10

jxiang closed this Apr 17, 2017

jxiang deleted the extra_parquet_struct_fields branch April 18, 2017 16:32
