-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade to DataFusion 13 (784f10bb) / Arrow 25.0.0 #176
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The actual 13.0.0 DF release uses Arrow 24.0.0, but we need to pick up 25.0.0, since it brings back the Arrow Schema/Field-to-JSON serialization code (albeit in a different crate for integration tests). apache/arrow-rs#2868 apache/arrow-rs#2724
It's now the default HashMap implementation and DF's planner uses it as well, so we can use std::HashMap everywhere.
Arrow file hash changes and minor changes in the query plan output
Make the `Update`/`Delete` nodes expose `inputs` and `expressions` in order to let the DF query optimizer work on the `WHERE ...` / `SET col = expr` expressions. This is slightly hacky: - as an "input", we return a `TableScan` node that we don't use after that (this is just so that the optimizer knows the input schema for all the expressions) - return the expressions used by the node and add code to pack/unpack them into a list The point of this is to let DataFusion run the `TypeCoercion` optimization, without which something like `WHERE float_col > 42` will raise an error (as after DF 13 these type coercions got removed from other places and moved into optimizations) (NB this doesn't work yet, we still get type coercion errors)
(normally it's run only by DataFusion's `create_physical_plan`, but we don't run that, so we have to execute it manually to get auto type coercion working)
Include `SET` expressions and the predicate if it exists to aid debugging.
These expressions are similar to what DataFusion uses in the `Filter` node and not doing this seems to break partition pruning (perhaps it stops at the `Alias` node and doesn't prone anything, didn't investigate in depth). Copy the `ExprRewriter` visitor from https://github.com/apache/arrow-datafusion/blob/c50573939d21de40e591c04915d41f7c46a51d0d/datafusion/expr/src/utils.rs#L384-L428 and adapt it to remove aliases from all expressions that the query optimizer gives back to `Update`/`Delete` nodes.
Make sure the constants are correctly cast and let us detect changes to the optimizer faster with new DF updates.
gruuya
approved these changes
Oct 27, 2022
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #173
TypeCoercion
optimizer rule. This means we had to make the user-definedUpdate
/Delete
nodes liable to be optimized (otherwise something likeDELETE FROM some_table WHERE some_float_value > 42
errors out because 42 isn't a float).expressions()
andchildren()
(return a placeholderTableScan
node that we don't use downstream) and reinitialize the nodes infrom_template()
(also stripping aliases like theFilter
node does: https://github.com/apache/arrow-datafusion/blob/c50573939d21de40e591c04915d41f7c46a51d0d/datafusion/expr/src/utils.rs#L384-L428)CREATE EXTERNAL TABLE
(we had to copypaste this and the parser code, so we don't pick them up automatically)