-
Notifications
You must be signed in to change notification settings - Fork 831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: initial support string_view and binary_view, supports layout and basic construction + tests #5481
Conversation
7e11917
to
a6410ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good, made some suggestions.
I also wonder if this should be ByteView
and not BytesView
for consistency with things like GenericByteArray
, although I do agree in hindsight it probably should be GenericBytesArray
FWIW I rebased #4585 in https://github.com/tustvold/arrow-rs/tree/array-view and compared the changes this makes on top to speed up my review.
49dc2a1
to
1fd109b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @ariesdevil -- I think this code is looking basically ready to me. I have some API suggestions that might be worth considering but we can do it as a follow on PR.
The only thing missing from this PR are a few more tests and then it will be ready to go
|
||
/// Returns the views buffer | ||
#[inline] | ||
pub fn views(&self) -> &ScalarBuffer<u128> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but I wonder if we should consider implementing a ByteViewBuffer
type, similarly to the OffsetBuffer
used for GenericBinaryView
-- https://docs.rs/arrow/latest/arrow/buffer/struct.OffsetBuffer.html
I thought that the introduction of OffsetBuffer
made working with StringArray/BinaryArray much easier.
I can imagine ByteViewBuffer
encapsulating the 12 byte inline string calculation, as well as building such values up as well as hosting documentation explaining what types are present.
If this seems like a reasonable idea, I can write up a ticket / maybe whack up a PR to show what it might look like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea, I'll do it in the next PR.
pub fn append_value(&mut self, value: impl AsRef<T::Native>) { | ||
let v: &[u8] = value.as_ref().as_ref(); | ||
let length: u32 = v.len().try_into().unwrap(); | ||
if length <= 12 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a views builder maybe this could look like
self.views_builder.append_inline(v, length)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
Hi @alamb , I added more tests for your kindly comments, PTAL again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again @ariesdevil -- I think a few more tests and some cleanups and this PR will be good to go from my perspective
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ariesdevil -- I think this PR is ready to go from my perspective.
All remaining issues / comments I think we can address as follow on PRs.
I plan to merge this tomorrow morning (around 18 hours from now) unless anyone else would like additional time to review.
Thanks again @ariesdevil
…s padded with `0` (#40512) ### Rationale for this change While implementing `Variable-size Binary View Layout` (thanks @ ariesdevil !) in apache/arrow-rs#5481 it was not 100% clear if the inlined string was zero padded. @ bkietz noted that > The spec does say "padded with zero" https://github.com/apache/arrow/blob/main/docs/source/format/Columnar.rst?plain=1#L384 but it could be repeated in the surrounding paragraph. In any case, padded with zero is definitely the intent ``` * Short strings, length <= 12 | Bytes 0-3 | Bytes 4-15 | |------------|---------------------------------------| | length | data (padded with 0) | ``` ### What changes are included in this PR? Add a sentence in the surrounding text to make it clear the inlined strings values are zero padded Note I do not think this is a specification change (and therefore doesn't need a vote on the mailing list) as the spec already specifies the padding is zero (in the diagram). This simply clarifies the text to emphasize this point for ease of understanding ### Are these changes tested? ### Are there any user-facing changes? Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@@ -1027,6 +1028,44 @@ fn test_extend_nulls_panic() { | |||
mutable.extend_nulls(2); | |||
} | |||
|
|||
#[test] | |||
fn test_string_view() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Which issue does this PR close?
Closes #5469
Rationale for this change
Initially support
StringViewArray
andBinaryViewArray
, mainly for adding layout and basic construction and tests for these two new types of array.Note: This implementation is primarily based on these two PRs [#4585 databendlabs/databend/pull/14662]
What changes are included in this PR?
Add two new types of arrays.
Are there any user-facing changes?
Yes