-
Notifications
You must be signed in to change notification settings - Fork 787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optionally require alignment when reading IPC, respect alignment when writing #5554
Conversation
arrow-ipc/src/writer.rs
Outdated
|
||
#[test] | ||
fn test_decimal128_alignment8_is_unaligned() { | ||
const IPC_ALIGNMENT: u8 = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is equivalent to the current behavior without the changes in this PR, it's equivalent to using pad_to_8
everywhere.
Because the Decimal128Array is insufficiently aligned, this prevents readers from doing a true zero-copy read (e.g. mmapping), because they need to copy the block into a more-aligned buffer.
} | ||
|
||
/// Calculate an 8-byte boundary and return the number of bytes needed to pad to 8 bytes | ||
const PADDING: [u8; 64] = [0; 64]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously a Vec was being allocated every time padding was needed
I'm travelling for the next few days, but will try to get to this on my return if nobody else gets there first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow-ipc/src/reader.rs
Outdated
/// Specify whether to enforce zero copy, which means the array data inside the record batches | ||
/// produced by this decoder will always reference the input buffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about calling this concept / field require_alignment
which describes what is required (rather than how it is enforced)?
It might also help make the intent clearer to users (that the IPC decoder will error if the buffers are not aligned properly)
I don't feel super strongly about this, but I was a little confused about what enforce_zero_copy
meant until I found this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree, will make this change.
My original thinking was that there might be a variety of reasons why IPC decoders may need to create a copy of the input data, and this would be a catch-all so that the users don't have to understand the intricacies of why or why not copies need to be made.
But agree that it ends up being kinda vague, and in the future if there are other reasons copies need to be made, probably better to expose each of the "reasons" as a granular "requirement" - if there ends up being a ton, we can expose a catch-all, but no reason to anticipate that today at the expense of clarity.
arrow-integration-testing/src/flight_server_scenarios/integration_test.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Thanks again @hzuo
so close! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, my only concern is the breaking API change. We have had some community feedback requesting we reduce the frequency of these - #5368
## Description of changes *Summarize the changes made by this PR.* - Improvements & Bug fixes - Arrow will overreport the size of a buffer if the underlying buffers are shared (apache/arrow-rs#5969). - apache/arrow-rs#5554 Exposes the ability to enforce alignment at write time. This PR enables this option explicitly and upgrades arrow to take advantage of it. We don't change from the default alignment but this is defensive. - See the comments in get_size() for further understanding of this PR (https://github.com/chroma-core/chroma/pull/2426/files#diff-03bcd4f01acfa68c46fcc974d3722fa621056b4e0f908d708a2d15028b0e99b1R410) - Cleans up various error handling and documentation - adding explicit errors and removing panics as needed - New functionality - None ## Test plan *How are these changes tested?* Updated all tests in delta.rs to save, load and check the sizes match - [x] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust ## Documentation Changes None
Which issue does this PR close?
Closes #5553
Rationale for this change
FileWriter
andStreamWriter
should ensure that the data is written with appropriate alignment such that arrays can be used without copying to a more-aligned buffer.In particular, as of Rust 1.77.0 and LLVM 18,
i128
now has a 16-byte alignment requirement even on x86 (ARM always had this requirement), i.e.std::mem::align_of::<i128> == 16
. SoDecimal128Array
s must be aligned to a 16-byte boundary when serialized into an IPC buffer. Thepad_to_8
used everywhere in the IPC code causes it to pad insufficiently.This prevents readers of the IPC data generated by this crate from doing true zero-copy reads (e.g. mmapping) since the data may be insufficiently aligned. See #5165, #5153.
On some platforms, SIMD may also be significantly slower if the beginning of the IPC block isn't aligned to a 16-, 32-, or 64- byte boundary (as discussed in the Arrow spec document).
(copied from issue)
What changes are included in this PR?
alignment
specified inIpcWriteOptions
require_alignment
option toFileDecoder
,read_record_batch
, andread_dictionary
, which enforces that the read is a zero-copy read. Previously if the buffer was unaligned then the code automatically allocates an aligned buffer and performs a memcpy (build_aligned
=>align_buffers
), which users of Arrow as a zero-copy format may not expect / may want to enforce does not happen.Are there any user-facing changes?
Yes:
alignment
inIpcWriteOptions::try_with
cannot be larger than 64 bytes - previously unbounded (breaking but minor)require_alignment
argument onread_record_batch
andread_dictionary
(breaking but minor)with_require_alignment
to FileDecoder (non-breaking)