[HUDI-3974] Fix schema projection to skip non-existent preCombine field #5427
Conversation
    }).collect(Collectors.toList());
List<Integer> prunedIds = names.stream()
    .filter(name -> {
      int id = schema.findIdByName(name);
Can you help me understand something. I understand that if the non-existent preCombine field is part of the names, we ignore it.
But if someone runs a query like "select a,b,c from tbl" where b does not even exist in the table, we have to throw an exception. Can you confirm that case is not affected by this fix?
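The distinction the reviewer is drawing can be sketched as follows. This is a hedged illustration, not Hudi's actual API: the class, method, and field names (`ProjectionSketch`, `project`, `ts`, `_row_key`) are hypothetical. The point is that user-requested query columns must resolve strictly, while mandatory merge columns (such as the preCombine field) may be skipped leniently when absent.

```java
import java.util.*;

// Hypothetical sketch: query columns are validated strictly, while
// mandatory merge columns (e.g. the preCombine field) are skipped if absent.
public class ProjectionSketch {
    static List<String> project(Set<String> tableSchema,
                                List<String> queryColumns,
                                List<String> mandatoryColumns) {
        for (String c : queryColumns) {
            if (!tableSchema.contains(c)) {
                // strict: a column the user asked for must exist
                throw new IllegalArgumentException("Column not found: " + c);
            }
        }
        List<String> result = new ArrayList<>(queryColumns);
        for (String c : mandatoryColumns) {
            // lenient: a mandatory merge column missing from the table is skipped
            if (tableSchema.contains(c) && !result.contains(c)) {
                result.add(c);
            }
        }
        return result;
    }
}
```

Under this split, `select a,b,c` with a missing `b` still fails, while a configured-but-absent preCombine field is silently dropped from the projection.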
Actually, on second thought, the filtering should not happen here. I'm going to rethink how to make the fix.
@@ -324,7 +326,14 @@ object HoodieSparkUtils extends SparkAdapterSupport {
     val name2Fields = tableAvroSchema.getFields.asScala.map(f => f.name() -> f).toMap
     // Here have to create a new Schema.Field object
     // to prevent throwing exceptions like "org.apache.avro.AvroRuntimeException: Field already used".
-    val requiredFields = requiredColumns.map(c => name2Fields(c))
+    val requiredFields = requiredColumns.filter(c => {
Actually, we should not relax this here, because requiredColumns will also contain the query columns.
Instead we should provide HoodieMergeOnReadRDD with 2 parquet readers:
- Primed for merging (i.e. with a schema containing the record key and preCombine key)
- Primed for NO merging (i.e. whose schema could be essentially empty)
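The two-reader proposal above can be sketched as a schema-selection rule. This is a hedged illustration of the idea only: `ReaderSchemaSketch` and `readerSchema` are hypothetical names, not part of `HoodieMergeOnReadRDD`.

```java
import java.util.*;

// Hypothetical sketch of the two-reader idea: the reader primed for merging
// widens the query projection with the merge keys, while the reader primed
// for no merging projects only what the query asks for (possibly nothing,
// e.g. for COUNT(*)).
public class ReaderSchemaSketch {
    static List<String> readerSchema(boolean requiresMerging,
                                     List<String> queryColumns,
                                     List<String> mergeKeys) {
        if (!requiresMerging) {
            // no merging: the schema can be exactly the query projection
            return new ArrayList<>(queryColumns);
        }
        List<String> widened = new ArrayList<>(queryColumns);
        for (String k : mergeKeys) {
            // merging: record key / preCombine key must be read even if not queried
            if (!widened.contains(k)) {
                widened.add(k);
            }
        }
        return widened;
    }
}
```

The design benefit is that the lenient handling of merge keys never leaks into the strict validation of query columns, since the two concerns live in different readers.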
Can't we do this while appending the mandatory columns? I.e., compare with the table schema and drop missing fields, so that we apply this filtering only to the mandatory columns we are looking to add and do not touch the query columns.
alexey: I am not very sure about the amount of changes required for the proposal you have made, but let's try to make minimal changes to make progress without requiring more testing. In any case, we will revisit the preCombine field handling altogether for 0.12 and put in some fixes.
I'm going to rethink the minimal changes needed to unblock the 0.11 release. The changes in their current shape introduce the problem with non-existent query columns that you mentioned.
(branch updated from fe6cc9d to c71805c)
.filter(name -> {
  int id = schema.findIdByName(name);
  if (id < 0) {
    LOG.warn(String.format("cannot prune col: %s does not exist in hudi table", name));
Prior to this patch, we were throwing an exception here, and now we are not? Is this change intended?
A couple of comments/clarifications:
 * @param hudiFields projection names required by Hudi merging.
 * @return a projected InternalSchema.
 */
public static InternalSchema pruneInternalSchema(InternalSchema schema, List<String> queryFields, List<String> hudiFields) {
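A hedged sketch of the contract this two-argument overload appears to encode, mirroring the `findIdByName` snippet above. The real `InternalSchema` API is richer than this; the map-based lookup and the `PruneSketch` name here are illustrative only. Query fields must resolve, while Hudi merge fields that don't resolve are warned about and skipped.

```java
import java.util.*;

// Hypothetical sketch of the pruning contract: strict on queryFields,
// lenient on hudiFields (e.g. a configured but absent preCombine field).
public class PruneSketch {
    static List<Integer> prune(Map<String, Integer> nameToId,
                               List<String> queryFields,
                               List<String> hudiFields) {
        List<Integer> ids = new ArrayList<>();
        for (String name : queryFields) {
            Integer id = nameToId.get(name);
            if (id == null) {
                // a query field missing from the table is still an error
                throw new IllegalArgumentException("query field not in table: " + name);
            }
            ids.add(id);
        }
        for (String name : hudiFields) {
            Integer id = nameToId.get(name);
            if (id == null) {
                // mirrors the LOG.warn in the snippet above: skip, don't fail
                System.err.println("cannot prune col: " + name + " does not exist in hudi table");
                continue;
            }
            if (!ids.contains(id)) {
                ids.add(id);
            }
        }
        return ids;
    }
}
```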
With the addition of this new method, is the method at L59 called anywhere? I would expect all callers to use this one instead of that one.
Given we're punting on this fix for 0.11, I think we can avoid making these changes in light of #5430 following up fairly soon. WDYT?
Agree. This is more of a band-aid fix for the 0.11.0 release. Since we don't need this for 0.11.0, we should close this one in favor of #5430, which is a proper fix and improvement. Closing this PR.