object_store: range request with suffix #4611
I seem to remember GCS being the only cloud store that supports this, but I could be mistaken. Have you done any research into this?
The S3 docs say they support the Range header, and specifically state that they don't support […]
Have you tested this? I seem to remember it did not, but my memory is hazy.
Suffix range requests are supported by all major cloud providers. Here is an example against S3 (proxied via CloudFront):

```
curl -H 'Range: bytes=-524292' https://static.webknossos.org/data/zarr_v3/l4_sample/color/1/c/0/3/3/1 | wc -c
```
I would be happy to review a PR making this change. The use of […]
Good to know, I'm not sure where I got the impression otherwise. Ultimately it isn't a massive problem if they don't support them; we just report this as an error to the user.
Sounds like users might consider that a problem 😁
Well yes, but if the store itself doesn't support them there isn't really all that much we can do 😄
It would appear that Azure Blob Storage does not support suffix range headers, although it does support prefix range headers.
I think I'm closing in on an implementation here. Allowing suffixes could mean that some bytes are fetched twice, but I don't think that's too much of a problem, as it's on the user to have knowledge of the resource they're looking into. I've been using […] I think that replacing references to […] What we do with suffixes on Azure is another question - probably just a […]
This would still be a breaking API change, something I am rather hesitant about. It would also not be object safe, which is likely more problematic. Did you see? It would appear at least one of the major cloud providers does not support this, and I would not be surprised if there are other implementations that do not - #4611 (comment)
Understood. If we added […] The difficult bit would be getting 3rd party implementations to implement […]
Yes, I tried to address that in my comment above. Sometimes people need the suffix, and on Azure the best those people can do is a HEAD and then a ranged GET, but logging it so that users can change their access pattern if necessary is probably the right thing to do.
Actually, just noticed we control this too - it's just a case of updating […]
Perhaps you could expand upon why you do not know the sizes of the files? I would have expected the files to have been identified by either a catalog or a listing call, both of which could provide this information. I dunno, generally the approach of this crate is to encourage people towards patterns that behave equally well across all backends, as opposed to ones that will have store-specific performance pitfalls. Perhaps I am just trying to avoid another breaking API change, as they end up taking up an inordinate amount of my and others' time, something I am somewhat struggling to justify here...
I mentioned our use case in another issue; copied below
We don't want to list all existing chunks ahead of time as there could easily be many millions, and this could even change under our feet if we're writing the tensor as we go. As chunks may be compressed with arbitrary codecs, we can't predict how many bytes they'll be even if we know how large the chunks are; we just need to read the footer (which indexes sub-chunks) so that we then know which bits of the object to read. I suppose in this use case we never need to read the suffix at the same time as the rest of the chunk, so we could have a separate method for suffix-getting with a default implementation of using a HEAD then GET which is documented as possibly being slow.
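The footer-then-range access pattern described above can be sketched in plain Rust. This is an illustrative in-memory model, not object_store API: the object ends in a fixed-size footer that records where a sub-chunk lives, so a reader fetches the suffix first and only then issues a ranged read. All names and the 8-byte footer layout are assumptions for the sketch.

```rust
// Ranged GET: read an explicit byte window of the object.
fn get_range(object: &[u8], start: usize, end: usize) -> &[u8] {
    &object[start..end]
}

// Suffix GET: read the last `nbytes` without knowing the object's size
// up front (what a suffix range request gives you in one round trip).
fn get_suffix(object: &[u8], nbytes: usize) -> &[u8] {
    &object[object.len() - nbytes..]
}

fn main() {
    // Object layout: [sub-chunk "hello"][other data][footer: offset u32 | len u32].
    let mut object = b"hello-world-padding".to_vec();
    object.extend_from_slice(&0u32.to_le_bytes()); // sub-chunk offset
    object.extend_from_slice(&5u32.to_le_bytes()); // sub-chunk length

    // 1. Read the 8-byte footer via a suffix request.
    let footer = get_suffix(&object, 8);
    let offset = u32::from_le_bytes(footer[0..4].try_into().unwrap()) as usize;
    let len = u32::from_le_bytes(footer[4..8].try_into().unwrap()) as usize;

    // 2. Ranged read of just the sub-chunk the footer points at.
    let chunk = get_range(&object, offset, offset + len);
    assert_eq!(chunk, &b"hello"[..]);
}
```

Without suffix support, step 1 costs a HEAD request plus a ranged GET; with it, the whole read is two requests total.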
Patterns, yes, but I hope we've demonstrated that sometimes people actually need a suffix. All stores can do it (with 2 requests), some stores can just do it better (with 1) - should we refuse to use optimisations which are only available to certain stores? If people already know the length (from listing or whatever), then they don't need to use the method documented as being possibly slow.
Ah, yes, that's unfortunate. So 3rd party stores currently just wrangle their own options? I think the minimal-impact course is to keep everything as it is and just add something like:

```rust
pub trait ObjectStore: std::fmt::Display + Send + Sync + Debug + 'static {
    ...

    /// Get the last `nbytes` of an object.
    ///
    /// If the object size is not known, the default implementation first finds out with a HEAD request.
    /// Stores which support suffix requests directly should override this behaviour.
    async fn get_suffix(&self, location: &Path, nbytes: usize, object_size: Option<usize>) -> Result<GetResult> {
        // if size is None, find out with a head request
        // then do self.get_range
    }
}
```

Instantly works for everyone, the performance concerns are well-documented, there's an ergonomic path for people who need a suffix and already know the size, and an easy optimisation path for stores which do support it.
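The proposed default can be modelled synchronously to show the control flow. This is a hedged sketch with illustrative names (`Store`, `InMemory`); the real trait is async and returns `Result<GetResult>`:

```rust
trait Store {
    /// HEAD: report the object's size.
    fn head(&self) -> usize;
    /// GET with an explicit byte range.
    fn get_range(&self, start: usize, end: usize) -> Vec<u8>;

    /// Default: two round trips (HEAD, then GET) unless the caller
    /// already knows the size, in which case it is a single GET.
    fn get_suffix(&self, nbytes: usize, object_size: Option<usize>) -> Vec<u8> {
        let size = object_size.unwrap_or_else(|| self.head());
        self.get_range(size - nbytes, size)
    }
}

struct InMemory(Vec<u8>);

impl Store for InMemory {
    fn head(&self) -> usize { self.0.len() }
    fn get_range(&self, start: usize, end: usize) -> Vec<u8> { self.0[start..end].to_vec() }
    // A store with native suffix ranges would also override `get_suffix`
    // to issue a single `Range: bytes=-{nbytes}` request instead.
}

fn main() {
    let store = InMemory(b"hello world".to_vec());
    assert_eq!(store.get_suffix(5, None), b"world");     // HEAD + GET
    assert_eq!(store.get_suffix(5, Some(11)), b"world"); // size known: GET only
}
```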
Thank you for the context. The `get_suffix` idea makes sense to me, seems like a pragmatic way to avoid an API break, and is consistent with methods like `rename` or `list_with_offset`. We can always revisit this at a later date if/when looking to make other breaking changes. I'm curious why to include the size, though; if the size is known they could just use `get_range`, no? Perhaps it could just be […]?
Yes, and that's all we'd do internally. I suppose it's a bit defensive, affording the same ergonomics/aesthetics to users who do or don't know the size, so that they can think about it in terms of the suffix length rather than needing […]
I can see your point, but I think I would prefer to keep it simple and provide docs to encourage people to use `get_range` if available.
Just for future reference: the fact that […] The […]
Agreed, I view this as a potentially short-term solution to avoid an API break at this time. I suspect when we next look to do a major release we might revisit this.
With default implementation using a HEAD and then GET request. See apache#4611.
We need to read the last `n` bytes of a file whose size we do not know. This is supported by HTTP range requests, and by access to local files (via `Seek`); can `object_store` support it?

`object_store::GetOptions::range` could take, instead of a `core::ops::Range`, something like this: https://github.com/clbarnes/byteranges-rs/blob/0a953e7c580e96b65fe28e61ed460d6e221dcd8d/src/request.rs#L6-L51 . So long as `From<RangeBounds>` is supported, this may not have to break any existing code.

The alternative is a HEAD request to find the length and then a range request using the offset from 0, which is twice the round trips.
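The kind of range type proposed here can be sketched as an enum with a suffix variant plus a `RangeBounds` conversion so existing `a..b` call sites keep working. `GetRange`, its variants, and `from_bounds` are hypothetical names for illustration, not the `object_store` API:

```rust
use std::ops::{Bound, RangeBounds};

#[derive(Debug, PartialEq)]
enum GetRange {
    /// Explicit half-open range `start..end`.
    Bounded(usize, usize),
    /// Everything from `offset` to the end of the object.
    Offset(usize),
    /// The last `nbytes` of the object (the case `core::ops::Range` cannot express).
    Suffix(usize),
}

impl GetRange {
    /// Render as an HTTP `Range` header value (the wire format is end-inclusive).
    fn to_header(&self) -> String {
        match self {
            GetRange::Bounded(start, end) => format!("bytes={}-{}", start, end - 1),
            GetRange::Offset(start) => format!("bytes={start}-"),
            GetRange::Suffix(n) => format!("bytes=-{n}"),
        }
    }
}

/// A `From<RangeBounds>`-style conversion so `5..10`, `5..`, `..10` all map over.
fn from_bounds<R: RangeBounds<usize>>(r: R) -> GetRange {
    match (r.start_bound(), r.end_bound()) {
        (Bound::Included(&s), Bound::Excluded(&e)) => GetRange::Bounded(s, e),
        (Bound::Included(&s), Bound::Unbounded) => GetRange::Offset(s),
        (Bound::Unbounded, Bound::Excluded(&e)) => GetRange::Bounded(0, e),
        _ => unimplemented!("other bound forms omitted from this sketch"),
    }
}

fn main() {
    assert_eq!(from_bounds(5..10).to_header(), "bytes=5-9");
    assert_eq!(from_bounds(5..), GetRange::Offset(5));
    assert_eq!(GetRange::Suffix(16).to_header(), "bytes=-16");
}
```

The suffix variant maps directly onto the `bytes=-n` form used in the curl example earlier in the thread, which is why no prior HEAD request is needed.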