-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format #13333
Conversation
docs/source/format/Columnar.rst
Outdated
describing how often this value is repeated. | ||
|
||
Any array can be run-length-encoded. A run-length encoded array has a single | ||
buffer holding as many signed 32-bit integers, as there are runs. The actual |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually not true if we use 32 bits. An array whose length is larger than a 32 bit integer cannot be represented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is very readable.
Co-authored-by: Weston Pace <weston.pace@gmail.com>
(code already works like this)
This looks reasonable to me, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thank you for this @zagto
|
||
* run ends (Int32): | ||
* Length: 3, Null count: 0 | ||
* Validity bitmap buffer: Not required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Validity bitmap buffer: Not required | |
* Validity bitmap buffer: Not present (not allowed) |
See comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it would be best to be explicit here that no validity bitmap is allowed to avoid ambiguity
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Closing as this has been incorporated into #14176 |
|
||
:: | ||
|
||
* Length: 7, Null count: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This formulation of null count is a little surprising, it implies that when computing the parent ArrayData from a set of children, it must iterate the null mask in conjunction with the run ends. This is also inconsistent with how null counts work for other nested types.
I've asked for clarification of this on the mailing list - https://lists.apache.org/thread/4x14b0h3fcfwzk68jpoq3n5xvr241qz5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it should actually say 0, since the array has no validity bitmap
(I tried to reply via the reply button in the ml web ui but that does not seem to have come through)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the array has no validity bitmap
Is this a general restriction, or is it just in this case there is no validity mask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a general restriction (at least I think so, and that's how the code works). The idea so far is that Null is just one of the possible values for a run.
The is somewhat consistent with how union types work.
(if we were to allow the RLE array parent to have an additional null mask, the null count field would represent that - there seems to be a generall assumption in Arrow code that a non-zero (or array length for the NULL) null count means the presence of the standard null mask)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a general restriction
We should probably document that if so, personally it seems a shame as it complicates things like nullif kernels which now have to explicitly handle the case of RLE data, whereas normally they don't need concern themselves with anything but the array data.
union types work
Yeah, union types are a bit of a special snowflake though 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is actually documented by saying there are no buffers, but that's a bit indirect, so guess mentioning it explicitly won't hurt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I submitted a PR for this here, please take a look:
#33831
No description provided.