-
Notifications
You must be signed in to change notification settings - Fork 789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement native support StringViewArray for regexp_is_match
and regexp_is_match_scalar
function, deprecate regexp_is_match_utf8
and regexp_is_match_utf8_scalar
#6376
Conversation
Thanks @tlm365 ❤️ I am running the benchmarks on this PR now and will report back when they are complete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tlm365 -- this is looking really nice
I wonder if you might also be willing to add StringView to the benchmarks as well, specifically
arrow-rs/arrow/benches/comparison_kernels.rs
Lines 56 to 62 in 704f90b
fn bench_regexp_is_match_utf8_scalar(arr_a: &StringArray, value_b: &str) { | |
regexp_is_match_utf8_scalar( | |
criterion::black_box(arr_a), | |
criterion::black_box(value_b), | |
None, | |
) | |
.unwrap(); |
So that if this code is changed in the future we can ensure it doesn't regress in performance
arrow-string/src/regexp.rs
Outdated
); | ||
test_flag_utf8!( | ||
test_utf8_array_regexp_is_match_insensitive_2, | ||
StringViewArray::from(vec!["arrow", "arrow", "arrow", "arrow", "arrow", "arrow"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringViewArray has special case handling for strings that are more than 12 bytes long (the string data is stored out of band in those cases)
Can you please add tests that have some strings that are longer than 12 bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add tests that have some strings that are longer than 12 bytes?
Yes, noted. I will review and update test cases for this scenario.
arrow-string/src/regexp.rs
Outdated
/// | ||
/// See the documentation on [`regexp_is_match_utf8`] for more details. | ||
pub fn regexp_is_match_utf8_scalar<OffsetSize: OffsetSizeTrait>( | ||
array: &GenericStringArray<OffsetSize>, | ||
pub fn regexp_is_match_utf8_scalar<'a, S>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I think this is a API change (as is the above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an idea of how to update this PR to avoid an API change -- the reason this is important is that a breaking API change would need to wait until the next major release (Dec 2024) per the release schedule: https://github.com/apache/arrow-rs?tab=readme-ov-file#release-versioning-and-schedule
TLDR is I think if we introduced a new function like the following:
fn regexp_is_match(
array: &dyn Array,
regex_array: &dyn Array,
flags_array: Option<&dyn Array, >,
) -> Result<BooleanArray, ArrowError> {
..
}
``
We could then support StringView and StringArray and LargeStringArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is I think if we introduced a new function like the following:
@alamb Sounds good 👍 But why do we use &dyn Array
for the new regex_is_match
function instead of keeping the current implementation?
Or am I misunderstanding you? I understand that we will provide a new regex_is_match
function, and mark the current regex_is_match_utf8
function as:
#[deprecated(since="54.0.0", note="please use `regex_is_match` instead")]
pub fn regexp_is_match_utf8(...) { ... }
Is that right? 🤔
@alamb Thanks for reviewing, willing to add benchmark for this one. I will update it soon. |
Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>
514847f
to
e80deea
Compare
Here are the benchmark results (aka this PR doesn't slow down the existing implementation)
|
c4763a9
to
65e6839
Compare
regexp_is_match_utf8
and regexp_is_match_utf8_scalar
functionregexp_is_match
and regexp_is_match_scalar
function
Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>
65e6839
to
4fed56b
Compare
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @tlm365 -- this looks great.
I was reviewing this PR and I had the code checked out locally, so I took the liberty of making a few changes:
- I fixed clippy (was failing due to using deprecated functions)
- I updated the comments / added an example to ease the transition (passing
None
as the flags argument results in type inference errors without some type help) - Improved the comments in general making it clearer what regexp_is_match does and how it is related to
regexp_match
/// * [`regexp_is_match_scalar`] for matching a single regular expression against an array of strings | ||
/// * [`regexp_match`] for extracting groups from a string array based on a regular expression | ||
/// | ||
/// # Example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn regexp_is_match_utf8<OffsetSize: OffsetSizeTrait>( | ||
array: &GenericStringArray<OffsetSize>, | ||
regex_array: &GenericStringArray<OffsetSize>, | ||
flags_array: Option<&GenericStringArray<OffsetSize>>, | ||
) -> Result<BooleanArray, ArrowError> { | ||
regexp_is_match(array, regex_array, flags_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched the implementation to just call the new function to avoid duplication
regexp_is_match
and regexp_is_match_scalar
functionregexp_is_match
and regexp_is_match_scalar
function, deprecate regexp_is_match_utf8
and regexp_is_match_utf8_scalar
@tlm365 I wonder if you have a few minutes to review the changes I pushed to this PR. I again I am sorry about the review delays |
@alamb it looks very nice 👍 thank you so much for this update! ❤️ |
We updated this PR so it was not an API change so removing the label |
Which issue does this PR close?
Closes #6370.
Rationale for this change
What changes are included in this PR?
Introduce
regexp_is_match
andregexp_is_match_scalar
(which can replaceregexp_is_match_utf8
andregexp_is_match_utf8_scalar
) can perform onStringArray
/LargeStringArray
/StringViewArray
arguments.Are there any user-facing changes?
No.