Add range and ObjectMeta to GetResult (#4352) (#4495) #4677

Merged (4 commits) on Aug 14, 2023

Conversation

tustvold
Contributor

Which issue does this PR close?

Closes #4352
Relates to #4495

Rationale for this change

Not including the byte range results in unexpected behaviour of GetResult::bytes.

Additionally, it is beneficial to return the ObjectMeta alongside the returned data, as this is effectively free and can be useful for additional data validation.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the object-store Object Store Interface label Aug 10, 2023
/// The [`ObjectMeta`] for this object
pub meta: ObjectMeta,
/// The range of bytes returned by this request
pub range: Range<usize>,
Contributor Author

I opted to make this required, as it allows for accurate buffer sizing among other things

Contributor

The alternative would be to make it optional (and read the file length directly from the metadata if it was set to None)?

If that is the tradeoff, I agree that always including the sometimes-redundant range is a good choice.
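
To make the tradeoff above concrete, here is a minimal sketch of why a required range helps with buffer sizing. The `ObjectMeta` and `GetResult` types below are simplified stand-ins written for illustration (the payload is omitted and only the object size is modelled), and `buffer_for` and `full_range` are hypothetical helpers, not part of object_store.

```rust
use std::ops::Range;

/// Simplified stand-in for `ObjectMeta`; only the object size is modelled.
#[derive(Debug, Clone)]
pub struct ObjectMeta {
    pub size: usize,
}

/// Simplified stand-in for `GetResult`; the payload is omitted.
#[derive(Debug)]
pub struct GetResult {
    /// Metadata for the whole object
    pub meta: ObjectMeta,
    /// The byte range actually returned by the request
    pub range: Range<usize>,
}

/// Hypothetical helper: allocate a buffer sized for exactly the returned bytes,
/// without having to inspect the request options again.
pub fn buffer_for(result: &GetResult) -> Vec<u8> {
    Vec::with_capacity(result.range.end - result.range.start)
}

/// Hypothetical helper: the (redundant) range reported by a full-object read.
pub fn full_range(meta: &ObjectMeta) -> Range<usize> {
    0..meta.size
}
```

With an optional range, every consumer would have to fall back to `meta.size` itself; with a required range, a full read simply reports `0..meta.size`.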

Contributor

@alamb alamb left a comment

The code in this PR looks very good to me. Thank you @tustvold

Is this a breaking API change to object_store (as it changes GetResult)?

I also didn't see any tests for this new feature -- I think we should add some, both to cover the chunking as I mentioned inline as well as to ensure that we don't accidentally break this API or its implementation during future refactors

@@ -729,54 +719,64 @@ impl GetOptions {
}

/// Result for a get request
#[derive(Debug)]
pub struct GetResult {
Contributor

What is the reason for making the fields in this struct pub? If they are all pub we can't add fields to GetResult in the future (such as optional object store specific metadata, for example) without it being a breaking change.

What do you think about leaving the fields as non-pub and adding accessors and a

fn into_parts(self) -> (GetResultPayload, ObjectMeta) {
...
}

🤔

Contributor Author

The problem is that the various implementations need to be able to construct this, so making the fields pub just seemed simpler.
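
For reference, the non-pub alternative discussed in this thread could look roughly like the sketch below. This is not what the PR merged (the PR keeps the fields pub). The types are simplified stand-ins, and `new`, `meta`, and `range` are hypothetical names chosen for illustration; only the `into_parts` signature comes from the review comment.

```rust
use std::ops::Range;

/// Simplified stand-in for `ObjectMeta`.
#[derive(Debug, Clone, PartialEq)]
pub struct ObjectMeta {
    pub size: usize,
}

/// Simplified stand-in for `GetResultPayload`; only an in-memory variant is modelled.
#[derive(Debug, PartialEq)]
pub enum GetResultPayload {
    Bytes(Vec<u8>),
}

/// Fields are private, so new ones can be added later without breaking callers.
#[derive(Debug)]
pub struct GetResult {
    payload: GetResultPayload,
    meta: ObjectMeta,
    range: Range<usize>,
}

impl GetResult {
    /// Hypothetical constructor so store implementations can still build a result.
    pub fn new(payload: GetResultPayload, meta: ObjectMeta, range: Range<usize>) -> Self {
        Self { payload, meta, range }
    }

    /// Accessor for the object metadata.
    pub fn meta(&self) -> &ObjectMeta {
        &self.meta
    }

    /// Accessor for the returned byte range.
    pub fn range(&self) -> Range<usize> {
        self.range.clone()
    }

    /// Consume the result, as suggested in the review.
    pub fn into_parts(self) -> (GetResultPayload, ObjectMeta) {
        (self.payload, self.meta)
    }
}
```

The cost is the one raised in the reply: every store implementation has to go through the constructor, and any future field either needs a default or a new constructor.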

)
.boxed()
GetResultPayload::File(file, path) => {
local::chunked_stream(file, path, self.range, 8 * 1024)
Contributor

I think keeping the CHUNK_SIZE name for the 8 * 1024 value would make this code more readable.

range: Range<usize>,
chunk_size: usize,
) -> BoxStream<'static, Result<Bytes, super::Error>> {
futures::stream::once(async move {
Contributor

I was wondering about using tokio::fs but it seems like the warnings on that page are still fairly significant

Contributor Author

Yeah, at least currently tokio::fs has pretty terrible performance characteristics; I would not recommend using it for anything really. Perhaps at some point io_uring will get sufficiently stable, but that will be Linux specific

})
.await?;

let stream = futures::stream::try_unfold(
Contributor

Do you know if the object_store tests have coverage for files that are greater than 8KB in size? In other words, is this code covered by tests?

Contributor Author

The chunked store tests should provide good coverage of this and make use of various chunk sizes smaller than 8KB - https://github.com/apache/arrow-rs/blob/master/object_store/src/chunked.rs#L210
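
As an illustration of the behaviour being asked about, the sketch below reads a byte range in fixed-size chunks using only the standard library. It is a synchronous simplification, not the async `local::chunked_stream` from the PR; `read_range_chunked` is a hypothetical name, and `CHUNK_SIZE` follows the naming suggestion from the earlier review comment.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;

/// Named constant instead of an inline `8 * 1024`.
pub const CHUNK_SIZE: usize = 8 * 1024;

/// Read `range` from `reader`, returning chunks of at most `chunk_size` bytes.
/// Fails with `UnexpectedEof` if the reader holds fewer bytes than `range.end`.
pub fn read_range_chunked<R: Read + Seek>(
    reader: &mut R,
    range: Range<usize>,
    chunk_size: usize,
) -> io::Result<Vec<Vec<u8>>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");

    // Start at the beginning of the requested range rather than at offset 0.
    reader.seek(SeekFrom::Start(range.start as u64))?;

    let mut remaining = range.end.saturating_sub(range.start);
    let mut chunks = Vec::new();
    while remaining > 0 {
        // The final chunk may be shorter than `chunk_size`.
        let to_read = remaining.min(chunk_size);
        let mut buf = vec![0u8; to_read];
        reader.read_exact(&mut buf)?;
        remaining -= to_read;
        chunks.push(buf);
    }
    Ok(chunks)
}
```

A test that exercises more than one chunk needs either an input larger than the chunk size or a deliberately small `chunk_size`, which is the approach the reply above describes for the chunked store tests.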


let (range, data) = match options.range {
Some(range) => {
ensure!(range.end <= data.len(), OutOfRangeSnafu);
Contributor

It occurs to me that these errors would be improved if they included the ranges and lengths as values. I understand that this PR doesn't change the behavior

@tustvold tustvold added the api-change Changes to the arrow API label Aug 11, 2023
@tustvold tustvold requested a review from alamb August 11, 2023 15:19
Contributor

@alamb alamb left a comment

LGTM -- thank you @tustvold

#[snafu(display(
"Requested range {}..{} is out of bounds for object with length {}", range.start, range.end, len
))]
OutOfRange { range: Range<usize>, len: usize },
Contributor

❤️
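
A rough sketch of how such an error reads in practice, written without snafu so it stands alone: `RangeError` and `slice_range` are hypothetical names, the display message is copied from the variant above, and the extra `start > end` check is an assumption added here so that slicing cannot panic.

```rust
use std::fmt;
use std::ops::Range;

/// Hand-written equivalent of the snafu variant, carrying the range and length.
#[derive(Debug, PartialEq)]
pub enum RangeError {
    OutOfRange { range: Range<usize>, len: usize },
}

impl fmt::Display for RangeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RangeError::OutOfRange { range, len } => write!(
                f,
                "Requested range {}..{} is out of bounds for object with length {}",
                range.start, range.end, len
            ),
        }
    }
}

impl std::error::Error for RangeError {}

/// Return the effective range and the matching slice of `data`.
/// With no requested range, the whole object is returned as `0..data.len()`.
pub fn slice_range(
    data: &[u8],
    range: Option<Range<usize>>,
) -> Result<(Range<usize>, &[u8]), RangeError> {
    match range {
        Some(range) => {
            // Assumption: inverted ranges are also rejected here.
            if range.start > range.end || range.end > data.len() {
                return Err(RangeError::OutOfRange { range, len: data.len() });
            }
            Ok((range.clone(), &data[range]))
        }
        None => Ok((0..data.len(), data)),
    }
}
```

Requesting `2..9` from a 5-byte object then reports both the offending range and the object length, instead of a bare out-of-range failure.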

Labels
api-change Changes to the arrow API object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Range to GetResult::File
2 participants