[HUDI-3974] Fix schema projection to skip non-existent preCombine field #5427
Conversation
    }).collect(Collectors.toList());
List<Integer> prunedIds = names.stream()
    .filter(name -> {
      int id = schema.findIdByName(name);
Can you help me understand something. I understand that if the non-existent preCombine field is part of the names, we ignore it.
But if someone runs a query like "select a,b,c from tbl" where b does not even exist in the table, we have to throw an exception. Can you confirm that case is not affected by this fix?
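The distinction the reviewer is drawing can be sketched as follows. This is a hedged illustration, not Hudi's actual API: the class, method, and field names (`ProjectionSketch`, `project`, `ts`, `_row_key`) are hypothetical. The point is that user-requested query columns must resolve strictly, while mandatory merge columns (such as the preCombine field) may be skipped leniently when absent.

```java
import java.util.*;

// Hypothetical sketch: query columns are validated strictly, while
// mandatory merge columns (e.g. the preCombine field) are skipped if absent.
public class ProjectionSketch {
    static List<String> project(Set<String> tableSchema,
                                List<String> queryColumns,
                                List<String> mandatoryColumns) {
        for (String c : queryColumns) {
            if (!tableSchema.contains(c)) {
                // strict: a column the user asked for must exist
                throw new IllegalArgumentException("Column not found: " + c);
            }
        }
        List<String> result = new ArrayList<>(queryColumns);
        for (String c : mandatoryColumns) {
            // lenient: a mandatory merge column missing from the table is skipped
            if (tableSchema.contains(c) && !result.contains(c)) {
                result.add(c);
            }
        }
        return result;
    }
}
```

Under this split, `select a,b,c` with a missing `b` still fails, while a configured-but-absent preCombine field is silently dropped from the projection.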
Actually, on second thought, the filtering should not happen here. I'm going to rethink how to make the fix.
@@ -324,7 +326,14 @@ object HoodieSparkUtils extends SparkAdapterSupport {
     val name2Fields = tableAvroSchema.getFields.asScala.map(f => f.name() -> f).toMap
     // Here have to create a new Schema.Field object
     // to prevent throwing exceptions like "org.apache.avro.AvroRuntimeException: Field already used".
-    val requiredFields = requiredColumns.map(c => name2Fields(c))
+    val requiredFields = requiredColumns.filter(c => {
Actually, we should not relax this here, because requiredColumns will also contain the query columns.
Instead we should provide HoodieMergeOnReadRDD with 2 parquet readers:
- Primed for merging (i.e. with a schema containing the record key and preCombine key)
- Primed for NO merging (i.e. whose schema could be essentially empty)
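The two-reader proposal above can be sketched as a schema-selection rule. This is a hedged illustration of the idea only: `ReaderSchemaSketch` and `readerSchema` are hypothetical names, not part of `HoodieMergeOnReadRDD`.

```java
import java.util.*;

// Hypothetical sketch of the two-reader idea: the reader primed for merging
// widens the query projection with the merge keys, while the reader primed
// for no merging projects only what the query asks for (possibly nothing,
// e.g. for COUNT(*)).
public class ReaderSchemaSketch {
    static List<String> readerSchema(boolean requiresMerging,
                                     List<String> queryColumns,
                                     List<String> mergeKeys) {
        if (!requiresMerging) {
            // no merging: the schema can be exactly the query projection
            return new ArrayList<>(queryColumns);
        }
        List<String> widened = new ArrayList<>(queryColumns);
        for (String k : mergeKeys) {
            // merging: record key / preCombine key must be read even if not queried
            if (!widened.contains(k)) {
                widened.add(k);
            }
        }
        return widened;
    }
}
```

The design benefit is that the lenient handling of merge keys never leaks into the strict validation of query columns, since the two concerns live in different readers.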
Can't we do this while appending the mandatory columns? I.e., compare with the table schema and drop missing fields, so that we apply this filtering only to the mandatory columns we are looking to add and do not touch the query columns.
alexey: I am not very sure about the amount of changes required for the proposal you have made, but let's try to make minimal changes to make progress without requiring more testing. In any case, we will revisit the preCombine field handling altogether for 0.12 and put in some fixes.
I'm going to rethink the minimal changes needed to unblock the 0.11 release. The changes in their current shape introduce the problem with non-existent query columns that you mentioned.
(branch updated from fe6cc9d to c71805c)
.filter(name -> {
  int id = schema.findIdByName(name);
  if (id < 0) {
    LOG.warn(String.format("cannot prune col: %s does not exist in hudi table", name));
Prior to this patch, we were throwing an exception here, and now we are not? Is this change intended?
A couple of comments/clarifications:
 * @param hudiFields projection names required by Hudi merging.
 * @return a projected InternalSchema.
 */
public static InternalSchema pruneInternalSchema(InternalSchema schema, List<String> queryFields, List<String> hudiFields) {
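A hedged sketch of the contract this two-argument overload appears to encode, mirroring the `findIdByName` snippet above. The real `InternalSchema` API is richer than this; the map-based lookup and the `PruneSketch` name here are illustrative only. Query fields must resolve, while Hudi merge fields that don't resolve are warned about and skipped.

```java
import java.util.*;

// Hypothetical sketch of the pruning contract: strict on queryFields,
// lenient on hudiFields (e.g. a configured but absent preCombine field).
public class PruneSketch {
    static List<Integer> prune(Map<String, Integer> nameToId,
                               List<String> queryFields,
                               List<String> hudiFields) {
        List<Integer> ids = new ArrayList<>();
        for (String name : queryFields) {
            Integer id = nameToId.get(name);
            if (id == null) {
                // a query field missing from the table is still an error
                throw new IllegalArgumentException("query field not in table: " + name);
            }
            ids.add(id);
        }
        for (String name : hudiFields) {
            Integer id = nameToId.get(name);
            if (id == null) {
                // mirrors the LOG.warn in the snippet above: skip, don't fail
                System.err.println("cannot prune col: " + name + " does not exist in hudi table");
                continue;
            }
            if (!ids.contains(id)) {
                ids.add(id);
            }
        }
        return ids;
    }
}
```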
With the addition of this new method, is the method at L59 called anywhere? I would expect all callers to use this one instead of that one.
Given we're punting on this fix for 0.11, I think we can avoid making these changes in light of #5430 following up fairly soon. WDYT?
Agree. This is more of a band-aid fix for the 0.11.0 release. Since we don't need this for 0.11.0, we should close this one in favor of #5430, which is a proper fix and improvement. Closing this PR.