
[NSE-1171] Support merge parquet schema and read missing schema #1175

Merged

Conversation

jackylee-ch
Contributor

What changes were proposed in this pull request?

This PR adds support for merging Parquet schemas in ArrowFileFormat.infer_schema, and for handling missing columns, including filters on them, when reading Parquet.
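For context, here is a minimal sketch of the targeted behavior, illustrated with Spark's standard Parquet mergeSchema option (paths and column names are hypothetical; the PR implements the analogous behavior for the arrow data source):

    // Hypothetical illustration, not code from this PR. Two Parquet files
    // share column `id`, but only the second one contains `extra`.
    import org.apache.spark.sql.functions.col

    spark.range(0, 5).toDF("id").write.parquet("/tmp/t/p=1")
    spark.range(5, 10).toDF("id")
      .withColumn("extra", col("id") * 2)
      .write.parquet("/tmp/t/p=2")

    // With schema merging, the inferred schema is the union of both file
    // schemas; rows from p=1 read `extra` as null.
    val df = spark.read.option("mergeSchema", "true").parquet("/tmp/t")
    df.filter(col("extra").isNotNull).show()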

How was this patch tested?

unit tests.

@github-actions

github-actions bot commented Dec 1, 2022

#1171

@jackylee-ch
Contributor Author

This PR can be tested with the test case "Filter applied on merged Parquet schema with new column should work", together with #1162.
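For reference, a minimal sketch of such a test (hypothetical; it assumes a ScalaTest suite with Spark's SQLTestUtils-style withTempPath helper, and the actual suite in this PR may differ):

    test("Filter applied on merged Parquet schema with new column should work") {
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        // The first file only has column `a`; the second one adds `b`.
        spark.range(0, 10).toDF("a").write.parquet(s"$path/p=1")
        spark.range(10, 20).toDF("a")
          .selectExpr("a", "a + 1 as b")
          .write.parquet(s"$path/p=2")

        val df = spark.read.option("mergeSchema", "true").parquet(path)
        // Rows from the first file have b = null, so the filter must handle
        // the missing column instead of failing. Expected matches: b in 11..14.
        assert(df.filter("b is not null and b < 15").count() == 4)
      }
    }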


    val nullVectors = if (hashMissingColumns) {
      val vectors =
        ArrowWritableColumnVector.allocateColumns(batchSize, requiredSchema)
Collaborator

It looks like these column vectors are allocated for the whole required schema, and then only the truly null vectors for the missing columns are kept in ArrowUtils.scala. Can this part be optimized? It seems unnecessary to create null vectors for columns that are not missing. Thanks!

Contributor Author

Hmm, good catch. I'll give it a try.
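One possible shape for that optimization, sketched below (hypothetical, not the PR's final code; fileSchema and the case-insensitive matching are assumptions based on the quoted hunks): allocate null vectors only for the required fields that are absent from the file.

    // Hypothetical sketch: build a schema containing only the missing fields
    // and allocate null vectors just for those, not for all of requiredSchema.
    // Case-insensitive matching mirrors the equalsIgnoreCase lookup below.
    val fileFieldNames = fileSchema.fields.map(_.name.toLowerCase).toSet
    val missingSchema = StructType(
      requiredSchema.filterNot(f => fileFieldNames.contains(f.name.toLowerCase)))
    val nullVectors =
      ArrowWritableColumnVector.allocateColumns(batchSize, missingSchema)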

    .getOrElse {
      // The missing column needs to be found in nullVectors.
      val nullVector =
        nullVectors.find(_.getValueVector.getName.equalsIgnoreCase(field.name)).get
Collaborator

Just a small suggestion for code refinement. I think the separate handling of the case-sensitive and case-insensitive modes can be simplified. We can define a lambda expression as below, and then use this eql directly when looking for a matched field, e.g. anArray.find(x => eql(x, "MATCH_TARGET")), without splitting the logic with if/else. We can apply the same refinement in other places. Right?

    val eql = if (caseSensitive) {
      (a: String, b: String) => a.equals(b)
    } else {
      (a: String, b: String) => a.equalsIgnoreCase(b)
    }
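With that in place, the lookup from the earlier hunk can be written without branching, for example (a sketch; caseSensitive and field come from the surrounding code):

    // One code path for both modes: the comparison strategy lives in eql.
    val nullVector =
      nullVectors.find(v => eql(v.getValueVector.getName, field.name)).get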

Contributor Author

done

@jackylee-ch jackylee-ch changed the title [NSE-1171][WIP] Support merge parquet schema and read missing schema [NSE-1171] Support merge parquet schema and read missing schema Dec 7, 2022
@jackylee-ch
Contributor Author

cc @zhouyuan @PHILO-HE

@zhouyuan
Collaborator

zhouyuan commented Dec 9, 2022

@jackylee-ch could you please also add a small Scala unit test for this feature?

@jackylee-ch
Contributor Author

@jackylee-ch could you please also add a small Scala unit test for this feature?

Sure

@zhouyuan zhouyuan merged commit cf11842 into oap-project:main Dec 9, 2022
zhouyuan pushed a commit to zhouyuan/native-sql-engine that referenced this pull request Dec 14, 2022
…project#1175)

* Support merge parquet schema and read missing schema

* fix error

* optimize null vectors

* optimize code

* optimize code

* change code

* add schema merge suite tests

* add test for struct type
zhouyuan added a commit that referenced this pull request Dec 14, 2022
* [NSE-1170] Set correct row number in batch scan w/ partition columns (#1172)

* [NSE-1171] Throw RuntimeException when reading duplicate fields in case-insensitive mode (#1173)

* throw exception if one more columns matched in case insensitive mode

* add schema check in arrow v2

* bump h2/pgsql version (#1176)

* bump h2/pgsql version

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* ignore one failed test

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NSE-956] allow to write parquet with compression (#1014)

This patch adds support for writing parquet with compression

df.coalesce(1).write.format("arrow").option("parquet.compression","zstd").save(path)

Signed-off-by: Yuan Zhou yuan.zhou@intel.com

* [NSE-1161] Support read-write parquet conversion to read-write arrow (#1162)

* add ArrowConvertExtension

* do not convert parquet fileformat while writing to partitioned/bucketed/sorted output

* fix cache failed

* care about write codec

* disable convertor extension by default

* add some comments

* remove wrong compress type check (#1178)

Compression has been supported since #1014, so the extra compression check in ArrowConvertorExtension can be removed now.

* fix to use right arrow branch (#1179)


Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NSE-1171] Support merge parquet schema and read missing schema (#1175)

* Support merge parquet schema and read missing schema

* fix error

* optimize null vectors

* optimize code

* optimize code

* change code

* add schema merge suite tests

* add test for struct type

* to use 1.5 branch arrow

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Signed-off-by: Yuan Zhou yuan.zhou@intel.com
Co-authored-by: Jacky Lee <lijunqing@baidu.com>