Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format #13333

Closed
wants to merge 12 commits into from
Prev Previous commit
Next Next commit
update columnar format doc
  • Loading branch information
zagto committed Aug 25, 2022

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
commit 77bb500871711ba0f8d861bc50ed53168b42b06f
38 changes: 20 additions & 18 deletions docs/source/format/Columnar.rst
Original file line number Diff line number Diff line change
@@ -774,22 +774,20 @@ Run-Length is a data representation that represents data as sequences of the
same value, called runs. Each run is represented as a value, and an integer
describing how often this value is repeated.

Any array can be run-length encoded. A run-length encoded array has a single
buffer holding a signed 32-bit integer for each run. The actual
values are held in a child array, which is just a regular array.
Any array can be run-length encoded. A run-length encoded array has no buffers
by itself, but has two child arrays. The first one holds a signed 32-bit integer
zagto marked this conversation as resolved.
Show resolved Hide resolved
for each run. The actual values of each run are held the second child array.

The values in the parent array buffer represent the length of each run. They do
The values in the first child array represent the length of each run. They do
not hold the length of the respective run directly, but the accumulated length
of all runs from the first to the current one. This allows relatively efficient
random access from a logical index using binary search. The length of an
individual run can be determined by subtracting two adjacent values.
of all runs from the first to the current one, i.e. the logical index where the
current run ends. This allows relatively efficient random access from a logical
index using binary search. The length of an individual run can be determined by
subtracting two adjacent values.

A run has to have a length of at least 1. This means the values in the
zagto marked this conversation as resolved.
Show resolved Hide resolved
accumulated run lengths buffer are all positive and in strictly ascending
order.

An accumulated run length cannot be null, therefore the parent array has no
validity buffer.
run ends array all positive and in strictly ascending order. A run end cannot be
null.
alamb marked this conversation as resolved.
Show resolved Hide resolved

As an example, you could have the following data: ::

@@ -801,15 +799,18 @@ In Run-length-encoded form, this could appear as:
::

* Length: 7, Null count: 2
Copy link
Contributor

@tustvold tustvold Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formulation of null count is a little surprising, it implies that when computing the parent ArrayData from a set of children, it must iterate the null mask in conjunction with the run ends. This is also inconsistent with how null counts work for other nested types.

I've asked for clarification of this on the mailing list - https://lists.apache.org/thread/4x14b0h3fcfwzk68jpoq3n5xvr241qz5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it should actually say 0, since the array has no validity bitmap

(I tried to reply via the reply button in the ml web ui but that does not seem to have come through)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the array has no validity bitmap

Is this a general restriction, or is it just in this case there is no validity mask?

Copy link
Contributor Author

@zagto zagto Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a general restriction (at least I think so, and that's how the code works). The idea so far is that Null is just one of the possible values for a run.

The is somewhat consistent with how union types work.

(if we were to allow the RLE array parent to have an additional null mask, the null count field would represent that - there seems to be a generall assumption in Arrow code that a non-zero (or array length for the NULL) null count means the presence of the standard null mask)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a general restriction

We should probably document that if so, personally it seems a shame as it complicates things like nullif kernels which now have to explicitly handle the case of RLE data, whereas normally they don't need concern themselves with anything but the array data.

union types work

Yeah, union types are a bit of a special snowflake though 😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is actually documented by saying there are no buffers, but that's a bit indirect, so guess mentioning it explicitly won't hurt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted a PR for this here, please take a look:
#33831

* Accumulated run lengths buffer:
* Children arrays:

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 |
|-------------|-------------|-------------|-----------------------|
| 4 | 6 | 7 | unspecified (padding) |
* run ends (Int32):
* Length: 3, Null count: 0
* Validity bitmap buffer: Not required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Validity bitmap buffer: Not required
* Validity bitmap buffer: Not present (not allowed)

See comment above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it would be best to be explicit here that no validity bitmap is allowed to avoid ambiguity

* Values buffer

* Children arrays:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 |
|-------------|-------------|-------------|-----------------------|
| 4 | 6 | 7 | unspecified (padding) |

* values (Float32):
* values (Float32):
* Length: 3, Null count: 1
* Validity bitmap buffer:

@@ -843,6 +844,7 @@ of memory buffers for each layout.
"Dense Union",type ids,offsets,
"Null",,,
"Dictionary-encoded",validity,data (indices),
"Run-length encoded",,,

Logical Types
=============