-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Utf8View for date/temporal codepaths #11518
Conversation
// TODO(aduffy): just return a manipulated Utf8View array by modifying the view only. | ||
pub fn substr_view(args: &[ArrayRef]) -> Result<ArrayRef> { | ||
match args.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to revert this since it's outside the scope of this PR.
Ideally, we'd be able to defer to the substring
kernel in arrow_string, but that kernel needs to be fixed to support utf8views
I see @XiangpengHao that you've got some of this in #11517, I can rebase this after your PR merges |
Marking as draft as I think this is waiting on some other PRs |
Thanks @a10y Also, it might help to target this PR to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't see too much replication in this PR -- I think we could target it at the string-view2
branch and it would be good to go
Thank you @a10y (and 👋 )
Hey! Just changing the base branch in the PR made a massive diff, but did some quick git surgery to cp these changes to be on top of existing string-view2 branch instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @a10y
I think the only thing missing from this PR are some tests -- maybe we can adapt some existing sqllogictests and use StringView rather Strings 🤔
datafusion/common/src/hash_utils.rs
Outdated
@@ -369,6 +369,7 @@ pub fn create_hashes<'a>( | |||
DataType::Null => hash_null(random_state, hashes_buffer, rehash), | |||
DataType::Boolean => hash_array(as_boolean_array(array)?, random_state, hashes_buffer, rehash), | |||
DataType::Utf8 => hash_array(as_string_array(array)?, random_state, hashes_buffer, rehash), | |||
DataType::Utf8View => hash_array(as_string_view_array(array)?, random_state, hashes_buffer, rehash), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be the only thing that is already present
Hey @alamb, I added some sqllogictest cases. However, they will fail currently, I'd forgotten this change is dependent on the fixes in apache/arrow-rs#6077. I was doing testing over in spiral bringing in both of these patchsets, and had not tested this one in isolation. |
No worries -- I'll try and get apache/arrow-rs#6077 reviewed and merged today |
Ok, sorry for the delay -- I just merged apache/arrow-rs#6077 I think if you updated the pin to Cargo.toml here https://github.com/apache/datafusion/blob/string-view2/Cargo.toml To be the appropriate commit in arrow-rs we could get the PR to pass |
Just rebased to get #11517 changes and to bump arrow-rs patch version. sqllogictests are passing for me locally now! |
Thank you very much @a10y I took the liberty of pushing some commits to this branch to get CI to pass. Also I think by doing so the CI will run automatically for this PR from now on (and will run automatically for you in any subsequent PRs) |
🚀 |
… some ClickBench queries (not on by default) (#11667) * Pin to pre-release version of arrow 52.2.0 * Update for deprecated method * Add a config to force using string view in benchmark (#11514) * add a knob to force string view in benchmark * fix sql logic test * update doc * fix ci * fix ci only test * Update benchmarks/src/util/options.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/common/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * update tests --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Add String view helper functions (#11517) * add functions * add tests for hash util * Add ArrowBytesViewMap and ArrowBytesViewSet (#11515) * Update `string-view` branch to arrow-rs main (#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * merge * update cast * consistent dep * fix ci * add more tests * make doc happy * update new implementation * fix bug * avoid unused dep * update dep * update * fix cargo check * update doc * pick up the comments change again --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Enable `GroupValueBytesView` for aggregation with StringView types (#11519) * add functions * Update `string-view` branch to arrow-rs main (#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * merge * update cast * consistent dep * fix ci * avoid unused dep * update dep * update * fix cargo check * better group value view aggregation * update --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Initial support for regex_replace on `StringViewArray` (#11556) * initial support for string view regex * update tests * Add support for Utf8View for date/temporal codepaths (#11518) * Add StringView support for date_part and make_date funcs * run cargo update in datafusion-cli * cargo fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * GC `StringViewArray` in `CoalesceBatchesStream` (#11587) * gc string view when appropriate * make clippy happy * address comments * make doc happy * update style * Add comments and tests for gc_string_view_batch * better herustic * update test * Update datafusion/physical-plan/src/coalesce_batches.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * [Bug] fix bug in return type inference of `utf8_to_int_type` (#11662) * fix bug in return type inference * update doc * add tests --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix clippy * Increase ByteViewMap block size to 2MB (#11674) * better default block size * fix related test * Change `--string-view` to only apply to parquet formats (#11663) * use inferenced schema, don't load schema again * move config to parquet-only * update * update * better format * format * update * Implement native support StringView for character length (#11676) * native support for character length * Update datafusion/functions/src/unicode/character_length.rs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Remove uneeded patches * cargo fmt --------- Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: Xiangpeng Hao <me@haoxp.xyz> Co-authored-by: Andrew Duffy <a10y@users.noreply.github.com>
(targets string-view2)
Which issue does this PR close?
More Utf8View support, per #10918
What changes are included in this PR?
Implements Utf8View support for folllowing:
date_part
functionmake_date
functionAre these changes tested?
Currently just using existing tests, I've tested this on the https://github.com/spiraldb/vortex/ TPC-H benchmarks.
Are there any user-facing changes?
Makes several things that used to be errors not error.