[C++] Add ExtensionType implementation for 8-bit boolean values #17682

Closed
asfimport opened this issue Oct 14, 2017 · 6 comments · Fixed by #43234

Comments

@asfimport

Some libraries (e.g. NumPy) represent boolean values as an array of int8 or uint8 values holding 1s and 0s. Receiving such memory without copying can be a challenge.

Now that we have ExtensionType capabilities, we could define an extension type to distinguish UInt8/Int8 data annotated as boolean, so that such data can flow through applications.

A discussion about introducing a new logical type didn't go anywhere, so having a custom container that can be used for these specialized applications is one way to unblock the use case. If we develop some endogenous use of such data in C++, we would need to be mindful to sanitize it to bit-packed boolean before sending it to another Arrow application.
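
To make the "sanitize to bit-packed boolean" step concrete, here is a minimal pure-Python sketch of converting a byte-per-value boolean buffer into a bitmap and back. The function names `pack_bool8` and `unpack_bits` are hypothetical and are not part of any Arrow library. The sketch assumes least-significant-bit-first ordering within each byte and treats any nonzero byte as true.

```python
def pack_bool8(values: bytes) -> bytes:
    """Pack byte-per-value booleans into an LSB-first bitmap.

    Hypothetical helper for illustration. This always allocates new
    memory, which is why the conversion cannot be zero-copy.
    """
    packed = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v != 0:  # assumption: any nonzero byte counts as true
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed: bytes, length: int) -> bytes:
    """Expand an LSB-first bitmap back into one byte (0 or 1) per value."""
    return bytes((packed[i // 8] >> (i % 8)) & 1 for i in range(length))


# Ten boolean values stored one per byte, as a NumPy-style bool buffer would be.
bool8 = bytes([1, 0, 1, 1, 0, 0, 0, 1, 1, 0])
bitmap = pack_bool8(bool8)
roundtrip = unpack_bits(bitmap, len(bool8))
```

Ten values occupy ten bytes in the 8-bit layout and two bytes once packed, but producing the packed form requires touching and rewriting every value.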

Reporter: Wes McKinney / @wesm

PRs and other links:

Note: This issue was originally created as ARROW-1674. Please see the migration documentation for further details.

@asfimport

Uwe Korn / @xhochy:
This is only a hint that the data was initially 8-bit, but we won't support 8-bit booleans? (My preferred answer would be "yes" here, to keep the implementation of the Arrow spec as simple as possible.)

@asfimport

Wes McKinney / @wesm:
The goal is to have enough metadata to support zero-copy transport of memory to or from other runtimes. As a primary representation for computation, we would use the 1-bit variety. Right now there is no way to describe an 8-bit boolean in the metadata, and some applications that are only transporting memory (e.g. to/from Plasma) will not want to convert to bit-packed form.
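
The difference between transporting memory and converting it can be illustrated with a small pure-Python sketch. It uses a `bytearray` as a stand-in for a buffer owned by another runtime and a `memoryview` as a stand-in for zero-copy transport. All variable names are illustrative, and nothing here uses an actual Arrow API.

```python
# Stand-in for a byte-per-value boolean buffer owned by another runtime.
producer_buf = bytearray([1, 0, 1, 1, 0, 0, 0, 1])

# Zero-copy "transport": the view refers to the producer's memory.
transported = memoryview(producer_buf)

# Any conversion (bit-packing included) has to materialize new memory;
# a plain copy is used here as the simplest stand-in for that step.
converted = bytes(producer_buf)

# A later write by the producer is visible through the view,
# but not in the converted copy.
producer_buf[1] = 1

bool8_nbytes = transported.nbytes             # one byte per value
bitmap_nbytes = (len(transported) + 7) // 8   # size if bit-packed
```

The view costs nothing to create regardless of length, while the converted form is smaller once bit-packed but requires a pass over the data.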

@asfimport

Philipp Moritz / @pcmoritz:
I'm giving this a shot now; one question here is if we want a separate type on the C++ side or one type with a boolean flag. I'm leaning towards a separate type BOOL8 right now.

@asfimport

Wes McKinney / @wesm:
I think we should probably define a metadata annotation for uint8/int8 to indicate that the data is semantically boolean. This will enable numpy.bool_ to be round-tripped more gracefully. It doesn't necessarily need to be a formal part of the Arrow format.
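
As a rough sketch of what such an annotation could look like, the following pure-Python example attaches a key/value marker to a field description. The `Field` class and both helper functions are hypothetical and are not pyarrow classes. The metadata key and the name `arrow.bool8` are assumed here to follow Arrow's extension-type metadata convention.

```python
from dataclasses import dataclass, field

# Assumed to mirror Arrow's extension-type metadata convention.
EXTENSION_NAME_KEY = "ARROW:extension:name"
BOOL8_NAME = "arrow.bool8"


@dataclass(frozen=True)
class Field:
    """Hypothetical field description: a name, a storage type, and metadata."""
    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def annotate_as_bool8(f: Field) -> Field:
    """Return a copy of the field marked as semantically boolean."""
    if f.storage_type not in ("int8", "uint8"):
        raise TypeError(f"cannot annotate {f.storage_type} as 8-bit boolean")
    return Field(f.name, f.storage_type,
                 {**f.metadata, EXTENSION_NAME_KEY: BOOL8_NAME})


def is_semantically_boolean(f: Field) -> bool:
    """Check for the annotation without inspecting the data itself."""
    return f.metadata.get(EXTENSION_NAME_KEY) == BOOL8_NAME


flags = Field("flags", "int8")
annotated = annotate_as_bool8(flags)
```

A consumer that does not recognize the annotation still sees an ordinary int8 field, which is what lets the scheme live outside the formal format.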

@asfimport

Antoine Pitrou / @pitrou:
Is there still an actual need for this?

@asfimport

Weston Pace / @westonpace:
Yes, it is still needed for zero-copy compatibility with NumPy, which can be useful in a few situations.

joellubi added a commit that referenced this issue Aug 8, 2024
### Rationale for this change

Closes: #17682

Arrow Boolean arrays store values as individual bits, which is a very compact representation but does not match the layout of many systems with which it interoperates. By adding an 8-bit Boolean extension type, zero-copy compatibility with many systems can be improved at the cost of a larger physical representation.

Go implementation: #43323
C++ / Python implementation: #43488

### What changes are included in this PR?

Proposal and documentation for `Bool8` canonical extension type.

### Are these changes tested?

N/A

### Are there any user-facing changes?

N/A

* GitHub Issue: #17682

Lead-authored-by: Joel Lubinitsky <joellubi@gmail.com>
Co-authored-by: Joel Lubinitsky <33523178+joellubi@users.noreply.github.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Joel Lubinitsky <joellubi@gmail.com>
joellubi added this to the 18.0.0 milestone Aug 8, 2024
joellubi added a commit that referenced this issue Aug 12, 2024
### Rationale for this change

Go implementation of #43234

### What changes are included in this PR?

- Go implementation of the `Bool8` extension type
- Minor refactor of existing extension builder interfaces

### Are these changes tested?

Yes, unit tests and basic read/write benchmarks are included.

### Are there any user-facing changes?

- A new extension type is added
- Custom extension builders no longer need another builder created and released separately.

* GitHub Issue: #17682

Authored-by: Joel Lubinitsky <joellubi@gmail.com>
Signed-off-by: Joel Lubinitsky <joellubi@gmail.com>
felipecrv pushed a commit that referenced this issue Aug 21, 2024
### Rationale for this change

C++ and Python implementations of #43234

### What changes are included in this PR?

- Implement C++ `Bool8Type`, `Bool8Array`, `Bool8Scalar`, and tests
- Implement Python bindings to C++, as well as zero-copy numpy conversion methods
- TODO: docs waiting for rebase on #43458

### Are these changes tested?

Yes

### Are there any user-facing changes?

Bool8 extension type will be available in C++ and Python libraries

* GitHub Issue: #17682

Authored-by: Joel Lubinitsky <joellubi@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>