-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve StringView support for SUBSTR #12044
Conversation
The following is my latest bench report of
|
let mut group = c.benchmark_group("SHORTER THAN 12"); | ||
group.sampling_mode(SamplingMode::Flat); | ||
group.sample_size(10); | ||
group.measurement_time(Duration::from_secs(10)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we tune the sampling parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually simply copied these settings from benches/repeat. I'm not very sure about this.
if length == 0 { | ||
builder.append_null(); | ||
} else if length > 12 { | ||
let buffer_index = (*raw >> 64) as u32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use ByteView
from arrow-rs:
https://github.com/apache/arrow-rs/blob/27789d7c9abb50796a4042e7e193703efe3c95b3/arrow-data/src/byte_view.rs#L44-L54
But ByteView is behind arrow-data
, which is not explicitly depended by DataFusion, what's your opinion? @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add an explclit dependence on arrow-data
similar to
Line 86 in 950dc73
arrow-ord = { version = "52.2.0", default-features = false } |
In general I think it would be better if the arrow
crate re-exported these various structures (so DataFusion could depend only on arrow
rather than all the sub crates)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made apache/arrow-rs#6275 to export this structure publically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also kind of annoying that the use of inline_value
always comes with an immediate call to from_utf8_unchecked
let bytes =
StringViewArray::inline_value(raw, length as usize);
let str = std::str::from_utf8_unchecked(
&bytes[..length as usize],
);
I'll try and make a PR in arrow-rs to do that too
I think the PR is looking good, left some comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Kev1n8 and @XiangpengHao -- I agree this PR is very close
This is a really neat PR
The only thing I think is needed is using ByteView
, as suggested by @XiangpengHao in https://github.com/apache/datafusion/pull/12044/files#r1720861699
I left some smaller suggestions, that is the big one
/// substr('alphabet', 3) = 'phabet' | ||
/// substr('alphabet', 3, 2) = 'ph' | ||
/// The implementation uses UTF-8 code points as characters | ||
fn calculate_substr<'a, V, T>(string_array: V, args: &[ArrayRef]) -> Result<ArrayRef> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function adds an unecessary level of indirection
Rather than havingsubstr
dispatch to calculate_substr
which then dispatches to calculate_string
or calculate_string_view
I suspect the code would be easier to follow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. I didn't realize it earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Kev1n8 and @XiangpengHao -- this is quite cool and I think looking close
As @XiangpengHao mentions in https://github.com/apache/datafusion/pull/12044/files#r1720861699 , I think we should be using ByteView instead of inlined comparison. If the stuff about using arrow-data
isn't clear, I would be happy to work on it
I'll make a PR with some proposed changes to ths one shortly
datafusion/functions/src/utils.rs
Outdated
pub(crate) fn optimized_utf8_to_str_type( | ||
arg_type: &DataType, | ||
name: &str, | ||
) -> Result<DataType> { | ||
let support_list = ["substr"]; | ||
if support_list.contains(&name) { | ||
Ok(DataType::Utf8View) | ||
} else { | ||
utf8_to_str_type(arg_type, name) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 since this function simply returns Utf8View
regardless of the input types, maybe we should just change substr()
to directly return Ok(DataType::Utf8View)
-- that would probably be the simplest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think so too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @Kev1n8 and @XiangpengHao -- I went though this PR again and I pushed a few more commits. I think this exercise is exploring where using StringViewArray is valuable, though since it is early days it is taking a way.
I think I have two concerns with this PR:
- The logic for StringView is now entirely separate from the String/LargeString array
- There is some non trivial overlap for manipulating the views
Here is what I suggest to move forward:
- Add the benchmark in its own new PR (so we more easily compare just the code change in this PR)
- See if we can try and avoid some of the duplication (I left some specific comments)
// Safety: | ||
// 1. idx < string_array.views.size() | ||
// 2. builder is guaranteed to have corresponding blocks | ||
unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems like there is some non trivial duplication here and the clauses below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could use value_unchecked
to retrieve str
in both cases.
// (2) we do not slice on utf-8 codepoint | ||
unsafe { | ||
let bytes = | ||
StringViewArray::inline_value(raw, length as usize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that we never had to call append_value
-- as I think the substr
calculation can be done entirely in terms of the views (and never modify the values in the buffer). Reading this code carefully I do think is the case but it is quite implicit (it knows that calling append_value on a string that is less than 12 bytes) will not push values values into the buffer
I wonder if we could refactor some of this code into functions with names
Perhaps something like
/// Modify a `view` with length <= 12 bytes so it reflects the substring [start..end]
fn substr_small_view(len: usize, view: u128, start: usize, end: size) -> u128 { ... }
/// Modify a `view` with length > 12 bytes so it reflects the substring [start..end]
fn substr_large_view(len: usize, view: u128, start: usize, end: size) -> u128 { ... }
Then we could simply updated the views directly
However, the StringViewBuilder
doesn't seem to allow for that at the moment....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could add a function like append_view_u128_unchecked(view: u128)
in arrow/src/builder/generic_bytes_view_builder.rs to simply add a view with a given u128
. Then the whole process would be:
- Get the str of the view by
value_unchecked
, then get the [start, end) - sub_view =
if end-start>12 substr_large_view() else substr_small_view()make_view() - call
appned_view_u128_unchecked(sub_view)
on the builder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading this code carefully I do think is the case but it is quite implicit (it knows that calling append_value on a string that is less than 12 bytes)
Yes
Maybe we could add a function like append_view_u128_unchecked(view: u128)
I like this direction.
The current append_view_unchecked
is basically only for long strings (as it asks for a block id), and we don't have a similar method for short strings, what's why we call append_value
as a workaround.
We can probably have a function called append_inlined_view_unchecked(len: usize, prefix: &[u8])
to make it the intention clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that I think retrieving the &str
by value_unchecked()
is a must-do to calculate the [start, end)
, then the sub_view: u128
could be created based on the original view using make_view
. I'll make some commits.
I can open a PR for that Refactor the code again into the following behavior: |
Thanks @Kev1n8 ad @XiangpengHao -- I am running some benchmarks on this now. BTW I was thinking that once we have completed #12119 we could potentially make |
Sounds great. And I think we can also improve the |
Sorry -- here are the results of my benchmark run -- TLDR is this looks great to me
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Kev1n8 and @XiangpengHao - I am sorry for the delay in review / approval / testing of this PR
I also merged up from main for this PR to get the arrow 53 release (and thus the changes to add arrow-data as a dependency were removed) |
Filed #12338 to track the idea of using StringViewArray always |
🚀 |
Which issue does this PR close?
Closes #12031
Closes #12033
Rationale for this change
What changes are included in this PR?
StringVIew
inSUBSTR
, operate directly on the views instead of generating newString
s.Utf8View
column asText
.One is both the input and output of
SUBSTR
is larger or smaller than 12B, for another the input is larger while the output is smaller than 12B. Also, compareUtf8View
,Utf8
, andLargeUtf8
with each other.Are these changes tested?
yes
Are there any user-facing changes?
no