Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust): BinaryView/Utf8View IPC support #13464

Merged
merged 5 commits into from
Jan 6, 2024
Merged

feat(rust): BinaryView/Utf8View IPC support #13464

merged 5 commits into from
Jan 6, 2024

Conversation

ritchie46
Copy link
Member

@ritchie46 ritchie46 commented Jan 5, 2024

Implements IPC for BinaryView and Utf8View. The project isn't certain yet where to store buffer lengths. https://lists.apache.org/thread/5c7lzg9x1t4bbtvd3sxq4pwxtbkw5pxq

For now we store the buffers like this:

[validity, views, buffer_lengths, *variadic_buffers]

I believe the buffer_lengths should always be before the variadic_buffers otherwise we don't know the length with which to read them and the we don't know at which offset we should read the buffer_lengths if they were last. If the arrow projects comes up with something else, we can always change. For now we will support this as an opt-in Polars flavored IPC.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 5, 2024
@ritchie46 ritchie46 changed the title feat: BinaryView IPC support [skip ci] feat(rust): BinaryView IPC support Jan 5, 2024
@ritchie46 ritchie46 removed the python Related to Python Polars label Jan 5, 2024
@ritchie46 ritchie46 changed the title feat(rust): BinaryView IPC support feat(rust): BinaryView/Utf8View IPC support Jan 6, 2024
@ritchie46 ritchie46 merged commit f22e7fd into main Jan 6, 2024
17 checks passed
@ritchie46 ritchie46 deleted the binview branch January 6, 2024 10:15
@jorisvandenbossche
Copy link

The project isn't certain yet where to store buffer lengths. https://lists.apache.org/thread/5c7lzg9x1t4bbtvd3sxq4pwxtbkw5pxq

FWIW, this has been decided (and implemented in Arrow C++) for a while: see apache/arrow#38443 (but I see that update wasn't shared on the mailing list thread ..)

So the Arrow spec decided to store the lengths as the last buffer, instead of before the variadic buffers (https://arrow.apache.org/docs/dev/format/CDataInterface.html#binary-view-arrays)

I believe the buffer_lengths should always be before the variadic_buffers otherwise we don't know the length with which to read them and the we don't know at which offset we should read the buffer_lengths if they were last.

The struct stores the number of buffers (ArrowArray.n_buffers), so I think you can always easily retrieve the last buffer to access the lengths, before accessing the variadic data buffers?

@jorisvandenbossche
Copy link

Also note that this is for the C Data Interface, not IPC. In the IPC message, my understanding is that there shouldn't be a additional buffer with lengths (the Buffer flatbuffer struct itself contains a length), there is only a variadicBufferCounts field in the RecordBatch message (https://arrow.apache.org/docs/dev/format/Columnar.html#variadic-buffers)

Now, didn't check the diff of the PR in detail to see if that's what you already did.
(but because you speak about "polars-flavored IPC" I am a bit confused)

@ritchie46
Copy link
Member Author

The struct stores the number of buffers (ArrowArray.n_buffers), so I think you can always easily retrieve the last buffer to access the lengths, before accessing the variadic data buffers?

Hey Joris, Good to know there has been a decision. The polars flavored IPC is only a intermediate step until we know how Apache arrow will deal with it. I will see if I can implement it in the way you describe (I wasn't aware the flatbuffer had the length per buffer).

Anywhere we can chat if I encounter anything?

@jorisvandenbossche
Copy link

Anywhere we can chat if I encounter anything?

There is a zulip instance where several of the Arrow (C++) developers are active: https://ursalabs.zulipchat.com/ (I think there is also a general Apache slack, but I don't know if there are many Arrow developers on there)

I am also on discord if you want to reach me personally.

@ritchie46
Copy link
Member Author

I am also on discord if you want to reach me personally.

Can you give me a ping? I don't know your handle. ^^. My handle is ritchie46.

PS. Works great with the buffers, thanks for the nudge: #13842

@jorisvandenbossche
Copy link

I should have sent a friend request yesterday (but not super familiar with how discord works ;))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants