Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format #13333

Closed
wants to merge 12 commits into from
61 changes: 60 additions & 1 deletion docs/source/format/Columnar.rst
Original file line number Diff line number Diff line change
Expand Up @@ -765,6 +765,65 @@ application.
We discuss dictionary encoding as it relates to serialization further
below.

.. _run-length-encoded-layout:

Run-Length-encoded Layout
zagto marked this conversation as resolved.
Show resolved Hide resolved
-------------------------

Run-Length is a data representation that represents data as sequences of the
same value, called runs. Each run is represented as a value, and an integer
describing how often this value is repeated.

Any array can be run-length-encoded. A run-length encoded array has a single
zagto marked this conversation as resolved.
Show resolved Hide resolved
buffer holding as many signed 32-bit integers, as there are runs. The actual
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not true if we use 32 bits. An array whose length is larger than a 32 bit integer cannot be represented

zagto marked this conversation as resolved.
Show resolved Hide resolved
values are held in a child array, which is just a regular array.

The values in the parent array buffer represent the length of each run. They do
not hold the length of the respective run directly, but the accumulated length
of all runs from the first to the current one. This allows relatively efficient
random access from a logical index using binary search. The length of an
individual run can be determined by subtracting two adjacent values.

A run has to have a length of at least 1. This means the values in the
zagto marked this conversation as resolved.
Show resolved Hide resolved
accumulated run lengths buffer are all positive and in strictly ascending
order.

An accumulated run length cannot be null, therefore the parent array has no
validity buffer.

As an example, you could have the following data: ::

type: Float32
[1.0, 1.0, 1.0, 1.0, null, null, 2.0]

In Run-length-encoded form, this could appear as:

::

* Length: 3, Null count: 2
zagto marked this conversation as resolved.
Show resolved Hide resolved
* Accumulated run lengths buffer:

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 |
|-------------|-------------|-------------|-----------------------|
| 4 | 6 | 7 | unspecified (padding) |
zagto marked this conversation as resolved.
Show resolved Hide resolved

* Children arrays:

* values (Float32):
* Length: 3, Null count: 1
* Validity bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00000101 | 0 (padding) |

* Values buffer

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 |
|-------------|-------------|-------------|-----------------------|
| 1.0 | unspecified | 2.0 | unspecified (padding) |


Buffer Listing for Each Layout
------------------------------

Expand Down Expand Up @@ -957,7 +1016,7 @@ The ``Buffer`` Flatbuffers value describes the location and size of a
piece of memory. Generally these are interpreted relative to the
**encapsulated message format** defined below.

The ``size`` field of ``Buffer`` is not required to account for padding
The ``size`` field of ``Buffer`` is not required to account for paddingeng-career-mgmt
zagto marked this conversation as resolved.
Show resolved Hide resolved
bytes. Since this metadata can be used to communicate in-memory pointer
addresses between libraries, it is recommended to set ``size`` to the actual
memory size rather than the padded size.
Expand Down