
[HUDI-3974] Fix schema projection to skip non-existent preCombine field #5427

Closed

@yihua (Contributor) commented on Apr 25, 2022

What is the purpose of the pull request

(For example: This pull request adds a quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

        }).collect(Collectors.toList());
    List<Integer> prunedIds = names.stream()
        .filter(name -> {
          int id = schema.findIdByName(name);
Contributor
Can you help me understand something: I understand that if the non-existent preCombine field is part of the names, we ignore it.
But if someone runs a query like "select a,b,c from tbl" where b does not even exist in the table, we have to throw an exception. Can you confirm that case is not affected by this fix?

Contributor (Author)

Actually, on second thought, the filtering should not happen here. I'm going to rethink how to make the fix.

@@ -324,7 +326,14 @@ object HoodieSparkUtils extends SparkAdapterSupport {
     val name2Fields = tableAvroSchema.getFields.asScala.map(f => f.name() -> f).toMap
     // Here have to create a new Schema.Field object
     // to prevent throwing exceptions like "org.apache.avro.AvroRuntimeException: Field already used".
-    val requiredFields = requiredColumns.map(c => name2Fields(c))
+    val requiredFields = requiredColumns.filter(c => {
Contributor

We should not relax this here, actually, because requiredColumns also contains the query columns.

Instead we should provide HoodieMergeOnReadRDD with two Parquet readers (see the sketch after this list):

  1. One primed for merging (i.e., with a schema containing the record key and preCombine key)
  2. One primed for no merging (i.e., whose schema could be essentially empty)
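A minimal sketch of this two-reader idea, in plain Java (names and structure are illustrative only and are not the actual HoodieMergeOnReadRDD API):

import java.util.Iterator;
import java.util.function.Function;

// Illustrative sketch: hold two reader factories and pick one per file split.
class TwoReaderSketch<SPLIT, ROW> {
  private final Function<SPLIT, Iterator<ROW>> mergingReader;     // projection includes record key and preCombine field
  private final Function<SPLIT, Iterator<ROW>> nonMergingReader;  // projection is the query columns only (possibly empty)

  TwoReaderSketch(Function<SPLIT, Iterator<ROW>> mergingReader,
                  Function<SPLIT, Iterator<ROW>> nonMergingReader) {
    this.mergingReader = mergingReader;
    this.nonMergingReader = nonMergingReader;
  }

  Iterator<ROW> read(SPLIT split, boolean hasLogFilesToMerge) {
    // Splits with log files need merging and therefore the wider projection;
    // base-file-only splits can read with the narrow projection.
    return hasLogFilesToMerge ? mergingReader.apply(split) : nonMergingReader.apply(split);
  }
}

The point is only that the projection required for merging stays separate from the projection the query itself needs.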

Contributor

Can't we do this while appending the mandatory columns? I.e., compare with the table schema and drop the missing fields, so that we do this filtering only for the mandatory columns that we are adding and do not touch the query columns.
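A minimal sketch of this suggestion (assumed helper and class names, not the actual Hudi code): only the mandatory columns are checked against the table schema before being appended, while the query columns are passed through untouched, so a non-existent query column still fails as before.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.avro.Schema;

final class MandatoryColumnsSketch {
  // Hypothetical helper: append only those mandatory columns (e.g. record key,
  // preCombine field) that actually exist in the table schema.
  static List<String> appendMandatoryColumns(Schema tableSchema,
                                             List<String> queryColumns,
                                             List<String> mandatoryColumns) {
    Set<String> existing = tableSchema.getFields().stream()
        .map(Schema.Field::name)
        .collect(Collectors.toSet());
    List<String> result = new ArrayList<>(queryColumns);
    for (String col : mandatoryColumns) {
      // Skip a mandatory column (e.g. a configured preCombine field) that was
      // never written to the table; query columns are not filtered here.
      if (existing.contains(col) && !result.contains(col)) {
        result.add(col);
      }
    }
    return result;
  }
}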

Contributor

alexey: I am not very sure about the amount of changes required for the proposal you have made, but let's try to make minimal changes so we can make progress without requiring more testing. Anyway, we will revisit the preCombine field setting altogether for 0.12 and put in some fixes.

Contributor (Author)

I'm going to rethink the minimal changes needed to unblock the 0.11 release. The changes in their current shape introduce the problem with non-existent query columns that you mentioned.

@yihua force-pushed the HUDI-3974-fix-projection-precombine-field branch from fe6cc9d to c71805c on April 26, 2022
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure (re-run the last Azure build)

    .filter(name -> {
      int id = schema.findIdByName(name);
      if (id < 0) {
        LOG.warn(String.format("cannot prune col: %s does not exist in hudi table", name));
Contributor

Prior to this patch we were throwing an exception, and now we are not. Is this change intended?

Contributor

@nsivabalan left a comment

A couple of comments/clarifications:

   * @param hudiFields projected field names required by Hudi merging.
   * @return a projected InternalSchema.
   */
  public static InternalSchema pruneInternalSchema(InternalSchema schema, List<String> queryFields, List<String> hudiFields) {
Contributor

With the addition of this new method, is the method at L59 called anywhere? I would expect all callers to use this one instead of that one.
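To illustrate the intent of the queryFields/hudiFields split in the signature above, here is a standalone sketch that uses a plain name-to-id map instead of the real InternalSchema API (all names are illustrative): a missing query field is still an error, while a missing Hudi merge field such as the preCombine key is simply skipped.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ProjectionIdsSketch {
  static List<Integer> resolveIds(Map<String, Integer> fieldIdsByName,
                                  List<String> queryFields,
                                  List<String> hudiFields) {
    List<Integer> ids = new ArrayList<>();
    for (String name : queryFields) {
      Integer id = fieldIdsByName.get(name);
      if (id == null) {
        // A query column that does not exist in the table is still an error.
        throw new IllegalArgumentException("Column not found in table schema: " + name);
      }
      ids.add(id);
    }
    for (String name : hudiFields) {
      Integer id = fieldIdsByName.get(name);
      if (id == null) {
        // A field needed only for Hudi merging (e.g. the preCombine key) may be
        // absent from the table; skip it instead of failing the query.
        continue;
      }
      ids.add(id);
    }
    return ids;
  }
}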

@alexeykudinkin (Contributor)

Given we're punting on this fix for 0.11, I think we can avoid making these changes in light of #5430 following up fairly soon.

WDYT?

@yihua (Contributor, Author) commented on Apr 26, 2022

Agreed. This is more of a band-aid fix for the 0.11.0 release. Since we don't need it for 0.11.0, we should close this one in favor of #5430, which is a proper fix and improvement. Closing this PR.

@yihua closed this on Apr 26, 2022