Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format #13333
ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format #13333
Changes from 1 commit
8186f56
70ea2fa
2ef0a24
3486009
7c348ad
cba824e
f1e1a16
e3d3e9b
77bb500
b501674
7211923
86e12d1
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This formulation of null count is a little surprising, it implies that when computing the parent ArrayData from a set of children, it must iterate the null mask in conjunction with the run ends. This is also inconsistent with how null counts work for other nested types.
I've asked for clarification of this on the mailing list - https://lists.apache.org/thread/4x14b0h3fcfwzk68jpoq3n5xvr241qz5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it should actually say 0, since the array has no validity bitmap
(I tried to reply via the reply button in the ml web ui but that does not seem to have come through)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a general restriction, or is it just in this case there is no validity mask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a general restriction (at least I think so, and that's how the code works). The idea so far is that Null is just one of the possible values for a run.
The is somewhat consistent with how union types work.
(if we were to allow the RLE array parent to have an additional null mask, the null count field would represent that - there seems to be a generall assumption in Arrow code that a non-zero (or array length for the NULL) null count means the presence of the standard null mask)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably document that if so, personally it seems a shame as it complicates things like nullif kernels which now have to explicitly handle the case of RLE data, whereas normally they don't need concern themselves with anything but the array data.
Yeah, union types are a bit of a special snowflake though 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is actually documented by saying there are no buffers, but that's a bit indirect, so guess mentioning it explicitly won't hurt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I submitted a PR for this here, please take a look:
#33831
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it would be best to be explicit here that no validity bitmap is allowed to avoid ambiguity