Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Utf8View for date/temporal codepaths #11518

Merged
merged 3 commits into from
Jul 22, 2024

Conversation

a10y
Copy link
Contributor

@a10y a10y commented Jul 17, 2024

(targets string-view2)

Which issue does this PR close?

More Utf8View support, per #10918

What changes are included in this PR?

Implements Utf8View support for folllowing:

  • Coercion between Utf8View and other string-y types
  • Coercion between Utf8View and temporal types
  • Sending a Utf8View into the date_part function
  • Sending a Utf8View into the make_date function

Are these changes tested?

Currently just using existing tests, I've tested this on the https://github.com/spiraldb/vortex/ TPC-H benchmarks.

Are there any user-facing changes?

Makes several things that used to be errors not error.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jul 17, 2024
@a10y a10y changed the title Fixes to incorporate stringview Add support for Utf8View for date/temporal codepaths Jul 17, 2024
Comment on lines 103 to 105
// TODO(aduffy): just return a manipulated Utf8View array by modifying the view only.
pub fn substr_view(args: &[ArrayRef]) -> Result<ArrayRef> {
match args.len() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to revert this since it's outside the scope of this PR.

Ideally, we'd be able to defer to the substring kernel in arrow_string, but that kernel needs to be fixed to support utf8views

@a10y
Copy link
Contributor Author

a10y commented Jul 17, 2024

I see @XiangpengHao that you've got some of this in #11517, I can rebase this after your PR merges

@alamb alamb marked this pull request as draft July 18, 2024 19:59
@alamb
Copy link
Contributor

alamb commented Jul 18, 2024

I see @XiangpengHao that you've got some of this in #11517, I can rebase this after your PR merges

Marking as draft as I think this is waiting on some other PRs

@alamb
Copy link
Contributor

alamb commented Jul 18, 2024

Thanks @a10y

Also, it might help to target this PR to the string-view2 branch where we are going to accumulate various StringView code in DataFusion that might depend on features that are not yet released in arrow-rs (will be released in arrow-rs 52.2.0 soon)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't see too much replication in this PR -- I think we could target it at the string-view2 branch and it would be good to go

Thank you @a10y (and 👋 )

@a10y a10y changed the base branch from main to string-view2 July 18, 2024 21:04
@a10y a10y changed the base branch from string-view2 to main July 18, 2024 21:05
@a10y a10y force-pushed the aduffy/utf8view branch from 2bc8b54 to 7ddd7e6 Compare July 18, 2024 21:09
@github-actions github-actions bot added the core Core DataFusion crate label Jul 18, 2024
@a10y a10y changed the base branch from main to string-view2 July 18, 2024 21:10
@a10y
Copy link
Contributor Author

a10y commented Jul 18, 2024

Hey! Just changing the base branch in the PR made a massive diff, but did some quick git surgery to cp these changes to be on top of existing string-view2 branch instead

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @a10y

I think the only thing missing from this PR are some tests -- maybe we can adapt some existing sqllogictests and use StringView rather Strings 🤔

@@ -369,6 +369,7 @@ pub fn create_hashes<'a>(
DataType::Null => hash_null(random_state, hashes_buffer, rehash),
DataType::Boolean => hash_array(as_boolean_array(array)?, random_state, hashes_buffer, rehash),
DataType::Utf8 => hash_array(as_string_array(array)?, random_state, hashes_buffer, rehash),
DataType::Utf8View => hash_array(as_string_view_array(array)?, random_state, hashes_buffer, rehash),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be the only thing that is already present

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) and removed core Core DataFusion crate labels Jul 18, 2024
@a10y
Copy link
Contributor Author

a10y commented Jul 18, 2024

Hey @alamb, I added some sqllogictest cases. However, they will fail currently, I'd forgotten this change is dependent on the fixes in apache/arrow-rs#6077.

I was doing testing over in spiral bringing in both of these patchsets, and had not tested this one in isolation.

@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

Hey @alamb, I added some sqllogictest cases. However, they will fail currently, I'd forgotten this change is dependent on the fixes in apache/arrow-rs#6077.

I was doing testing over in spiral bringing in both of these patchsets, and had not tested this one in isolation.

No worries -- I'll try and get apache/arrow-rs#6077 reviewed and merged today

@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

Ok, sorry for the delay -- I just merged apache/arrow-rs#6077

I think if you updated the pin to Cargo.toml here

https://github.com/apache/datafusion/blob/string-view2/Cargo.toml
https://github.com/apache/datafusion/blob/string-view2/datafusion-cli/Cargo.toml

To be the appropriate commit in arrow-rs we could get the PR to pass

@a10y a10y force-pushed the aduffy/utf8view branch from deb7351 to 580e64b Compare July 20, 2024 15:38
@a10y a10y marked this pull request as ready for review July 20, 2024 15:38
@a10y
Copy link
Contributor Author

a10y commented Jul 20, 2024

Just rebased to get #11517 changes and to bump arrow-rs patch version. sqllogictests are passing for me locally now!

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

Thank you very much @a10y

I took the liberty of pushing some commits to this branch to get CI to pass. Also I think by doing so the CI will run automatically for this PR from now on (and will run automatically for you in any subsequent PRs)

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

🚀

@alamb alamb merged commit bb780b3 into apache:string-view2 Jul 22, 2024
25 checks passed
@a10y a10y deleted the aduffy/utf8view branch July 22, 2024 20:19
alamb added a commit that referenced this pull request Jul 29, 2024
… some ClickBench queries (not on by default) (#11667)

* Pin to pre-release version of arrow 52.2.0

* Update for deprecated method

* Add a config to force using string view in benchmark (#11514)

* add a knob to force string view in benchmark

* fix sql logic test

* update doc

* fix ci

* fix ci only test

* Update benchmarks/src/util/options.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/common/src/config.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* update tests

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Add String view helper functions (#11517)

* add functions

* add tests for hash util

* Add ArrowBytesViewMap and ArrowBytesViewSet (#11515)

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* merge

* update cast

* consistent dep

* fix ci

* add more tests

* make doc happy

* update new implementation

* fix bug

* avoid unused dep

* update dep

* update

* fix cargo check

* update doc

* pick up the comments change again

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Enable `GroupValueBytesView` for aggregation with StringView types (#11519)

* add functions

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* merge

* update cast

* consistent dep

* fix ci

* avoid unused dep

* update dep

* update

* fix cargo check

* better group value view aggregation

* update

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Initial support for regex_replace on `StringViewArray` (#11556)

* initial support for string view regex

* update tests

* Add support for Utf8View for date/temporal codepaths (#11518)

* Add StringView support for date_part and make_date funcs

* run cargo update in datafusion-cli

* cargo fmt

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* GC `StringViewArray` in `CoalesceBatchesStream` (#11587)

* gc string view when appropriate

* make clippy happy

* address comments

* make doc happy

* update style

* Add comments and tests for gc_string_view_batch

* better herustic

* update test

* Update datafusion/physical-plan/src/coalesce_batches.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* [Bug] fix bug in return type inference of `utf8_to_int_type` (#11662)

* fix bug in return type inference

* update doc

* add tests

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Fix clippy

* Increase ByteViewMap block size to 2MB (#11674)

* better default block size

* fix related test

* Change `--string-view` to only apply to parquet formats (#11663)

* use inferenced schema, don't load schema again

* move config to parquet-only

* update

* update

* better format

* format

* update

* Implement native support StringView for character length (#11676)

* native support for character length

* Update datafusion/functions/src/unicode/character_length.rs

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Remove uneeded patches

* cargo fmt

---------

Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: Xiangpeng Hao <me@haoxp.xyz>
Co-authored-by: Andrew Duffy <a10y@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants