Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-8507, DRILL-8508 Better handling of partially missing parquet columns #2937
base: master
Are you sure you want to change the base?
DRILL-8507, DRILL-8508 Better handling of partially missing parquet columns #2937
Changes from all commits
c385b6e
5a775e3
c3f7a72
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general rule has to be:
IIRC, EVF handles all the above for dynamic columns. If Drill had type logic in the Calcite planner, it should handle these same rules.
Again, this kind of logic requires extensive unit tests of all the cases above, plus any others you can think up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first item is about resolving different data types even if there are no missing columns, which I didn't cover.
but only if the other types are REQUIRED
- is this condition necessary?Regarding REPEATED - I haven't covered it in any way.
In theory, implementing these should not be that hard...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, if we are enforcing a planner-provided schema, the job is to map whatever the Parquet type is into the given, fixed Drill type. There is only one right answer when the schema is provided. Again, see EVF for how this works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start! See the other cases described above.
Also, I seem to remember creating code to handle evolving column types as part of EVF. Perhaps you can find that code. The code likely has a large number of unit tests (I'm a test-driven kinda guy) which you can reuse to test your parallel implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good change: we are propagating the old column type. This is consistent with EVF.
However, this stuff is horribly complex. If we reuse the column, we must reuse the actual value vector. Otherwise, you'll get crashes in the downstream operators that are bound to that vector. The binding is redone only on a schema change. But, your fix avoids the schema change, and hence prevents the rebinding.
Also, note that this fix works ONLY in one direction (column appears, then disappears), and ONLY within a single thread: it can't solve the same problem if the two files are read in different threads and sent to the SORT to reconcile.
Further, we are changing the mode to
OPTIONAL
as required so we can fill the vector withNULL
values. However, change of mode (i.e. nullability) is a schema change and will cause the SORT to fail. We have to have known, on the previous file, that the column will be missing in this file, so that we can create the original column asOPTIONAL
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see this comment.