Add String view helper functions #11517

Merged
merged 2 commits into from
Jul 19, 2024

Contributor

@XiangpengHao XiangpengHao commented Jul 17, 2024

Note: targets string-view2 branch

Which issue does this PR close?

Let's wait until #11514 is merged

Part of #10918

Rationale for this change

Add a few helper functions to handle string views.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added documentation Improvements or additions to documentation physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jul 17, 2024
Contributor

@alamb alamb left a comment

Thank you @XiangpengHao (for the code as well as for breaking apart these PRs)

I gave this an early review as its contents don't really depend on #11514 (though I see why you want to delay the review)

I feel bad that I am slowing down the process with all the stacked PRs


Ok(())
// Same values should map to same hash values
assert_eq!(binary[0], binary[5]);
Contributor

👍
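
The property this assertion checks (equal values produce equal hashes, whatever their position in the array) can be sketched without Arrow. The following is a minimal illustration using only the Rust standard library. `hash_values` is a hypothetical helper, not the hash utility from this PR, and mapping nulls to 0 is an assumption made for the sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: hash each optional string independently.
/// Nulls (`None`) are mapped to 0 here; that choice is an assumption of this sketch.
fn hash_values(values: &[Option<&str>]) -> Vec<u64> {
    values
        .iter()
        .map(|v| match v {
            Some(s) => {
                // A fresh hasher per value, so the result depends only on the value.
                let mut hasher = DefaultHasher::new();
                s.hash(&mut hasher);
                hasher.finish()
            }
            None => 0,
        })
        .collect()
}
```

With an input whose first and last entries are the same string, the first and last hashes are equal, which is the shape of the check in the test above.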

@@ -453,6 +454,14 @@ fn min_batch(values: &ArrayRef) -> Result<ScalarValue> {
DataType::LargeUtf8 => {
typed_min_max_batch_string!(values, LargeStringArray, LargeUtf8, min_string)
}
DataType::Utf8View => {
typed_min_max_batch_string!(
Contributor

Presumably this is using the fast kernel from apache/arrow-rs#6053

Contributor Author

yes
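
For readers unfamiliar with what a string min kernel computes, here is a rough sketch of the semantics only: the smallest non-null value in byte-wise lexicographic order, or null when every slot is null. `min_ignoring_nulls` is a hypothetical name, and the sketch works on a plain slice rather than a `StringViewArray`; the arrow-rs kernel referenced above is implemented differently.

```rust
/// Hypothetical stand-in for a string "min" kernel: returns the smallest
/// non-null value, or `None` if the input is empty or every slot is null.
fn min_ignoring_nulls<'a>(values: &[Option<&'a str>]) -> Option<&'a str> {
    // `flatten` skips the `None` slots; `min` compares `&str` byte-wise.
    values.iter().flatten().copied().min()
}
```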

@XiangpengHao XiangpengHao marked this pull request as ready for review July 19, 2024 17:53
@XiangpengHao XiangpengHao force-pushed the string-view-helper-func branch from 8210eae to 0859c9a Compare July 19, 2024 17:55
@github-actions github-actions bot removed documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jul 19, 2024
@XiangpengHao
Contributor Author

I rebased the branch so that it only contains the relevant changes

@XiangpengHao XiangpengHao requested a review from alamb July 19, 2024 17:56
Contributor

@alamb alamb left a comment

Thank you @XiangpengHao

@alamb alamb merged commit 8e0ca1a into apache:string-view2 Jul 19, 2024
24 checks passed
alamb added a commit that referenced this pull request Jul 29, 2024
… some ClickBench queries (not on by default) (#11667)

* Pin to pre-release version of arrow 52.2.0

* Update for deprecated method

* Add a config to force using string view in benchmark (#11514)

* add a knob to force string view in benchmark

* fix sql logic test

* update doc

* fix ci

* fix ci only test

* Update benchmarks/src/util/options.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/common/src/config.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* update tests

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Add String view helper functions (#11517)

* add functions

* add tests for hash util

* Add ArrowBytesViewMap and ArrowBytesViewSet (#11515)

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* merge

* update cast

* consistent dep

* fix ci

* add more tests

* make doc happy

* update new implementation

* fix bug

* avoid unused dep

* update dep

* update

* fix cargo check

* update doc

* pick up the comments change again

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Enable `GroupValueBytesView` for aggregation with StringView types (#11519)

* add functions

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* merge

* update cast

* consistent dep

* fix ci

* avoid unused dep

* update dep

* update

* fix cargo check

* better group value view aggregation

* update

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Initial support for regex_replace on `StringViewArray` (#11556)

* initial support for string view regex

* update tests

* Add support for Utf8View for date/temporal codepaths (#11518)

* Add StringView support for date_part and make_date funcs

* run cargo update in datafusion-cli

* cargo fmt

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* GC `StringViewArray` in `CoalesceBatchesStream` (#11587)

* gc string view when appropriate

* make clippy happy

* address comments

* make doc happy

* update style

* Add comments and tests for gc_string_view_batch

* better herustic

* update test

* Update datafusion/physical-plan/src/coalesce_batches.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* [Bug] fix bug in return type inference of `utf8_to_int_type` (#11662)

* fix bug in return type inference

* update doc

* add tests

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Fix clippy

* Increase ByteViewMap block size to 2MB (#11674)

* better default block size

* fix related test

* Change `--string-view` to only apply to parquet formats (#11663)

* use inferenced schema, don't load schema again

* move config to parquet-only

* update

* update

* better format

* format

* update

* Implement native support StringView for character length (#11676)

* native support for character length

* Update datafusion/functions/src/unicode/character_length.rs

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Remove uneeded patches

* cargo fmt

---------

Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: Xiangpeng Hao <me@haoxp.xyz>
Co-authored-by: Andrew Duffy <a10y@users.noreply.github.com>
Labels
physical-expr Physical Expressions