Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format #13333

Closed
wants to merge 12 commits into from

Conversation

zagto
Copy link
Contributor

@zagto zagto commented Jun 7, 2022

No description provided.

@github-actions
Copy link

github-actions bot commented Jun 7, 2022

docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
describing how often this value is repeated.

Any array can be run-length-encoded. A run-length encoded array has a single
buffer holding as many signed 32-bit integers, as there are runs. The actual
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not true if we use 32 bits. An array whose length is larger than a 32 bit integer cannot be represented

@westonpace westonpace self-requested a review June 23, 2022 15:00
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is very readable.

docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
zagto and others added 3 commits June 27, 2022 17:40
Co-authored-by: Weston Pace <weston.pace@gmail.com>
(code already works like this)
@lidavidm
Copy link
Member

This looks reasonable to me, thanks.

@zagto zagto marked this pull request as ready for review September 1, 2022 18:03
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you for this @zagto

docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
docs/source/format/Columnar.rst Show resolved Hide resolved

* run ends (Int32):
* Length: 3, Null count: 0
* Validity bitmap buffer: Not required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Validity bitmap buffer: Not required
* Validity bitmap buffer: Not present (not allowed)

See comment above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it would be best to be explicit here that no validity bitmap is allowed to avoid ambiguity

@alamb
Copy link
Contributor

alamb commented Sep 14, 2022

FYI @tustvold @viirya @nevi-me in case you are interested in this proposed format addition

zagto and others added 3 commits September 20, 2022 18:48
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@zeroshade
Copy link
Member

Closing as this has been incorporated into #14176

@zeroshade zeroshade closed this Jan 3, 2023

::

* Length: 7, Null count: 2
Copy link
Contributor

@tustvold tustvold Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formulation of null count is a little surprising, it implies that when computing the parent ArrayData from a set of children, it must iterate the null mask in conjunction with the run ends. This is also inconsistent with how null counts work for other nested types.

I've asked for clarification of this on the mailing list - https://lists.apache.org/thread/4x14b0h3fcfwzk68jpoq3n5xvr241qz5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it should actually say 0, since the array has no validity bitmap

(I tried to reply via the reply button in the ml web ui but that does not seem to have come through)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the array has no validity bitmap

Is this a general restriction, or is it just in this case there is no validity mask?

Copy link
Contributor Author

@zagto zagto Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a general restriction (at least I think so, and that's how the code works). The idea so far is that Null is just one of the possible values for a run.

The is somewhat consistent with how union types work.

(if we were to allow the RLE array parent to have an additional null mask, the null count field would represent that - there seems to be a generall assumption in Arrow code that a non-zero (or array length for the NULL) null count means the presence of the standard null mask)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a general restriction

We should probably document that if so, personally it seems a shame as it complicates things like nullif kernels which now have to explicitly handle the case of RLE data, whereas normally they don't need concern themselves with anything but the array data.

union types work

Yeah, union types are a bit of a special snowflake though 😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is actually documented by saying there are no buffers, but that's a bit indirect, so guess mentioning it explicitly won't hurt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted a PR for this here, please take a look:
#33831

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants