-
Notifications
You must be signed in to change notification settings - Fork 823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
object_store: full HTTP range support #5222
Conversation
- Support suffix and offset ranges in GetOptions and get_opts - Ensure that, if a range is requested, the response contains exactly that range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broadly speaking I like where this is headed, nice work 👍
seemed like they could be useful but are not currently used, and could be deleted for simplicity.
I think my preference would be to keep the scope of this change down, we can always add methods down the line, removing them is much harder 😄
A quick note: some storage services like |
Yes, we mentioned azure's lack of support for this in the linked issue. This method has a default implementation of making 2 requests: one to find the length and one to get the suffix, which is then overridden in all of the stores which directly support suffix requests. |
Great, seems fine 👍 |
As we're now making a breaking change to support this use-case, I think I would prefer us to return a NotImplemented error in such cases. This is consistent with what we do for preconditions, etc... |
Is it also the case with preconditions that it's possible to get that information in an inefficient way, and we choose not to? I'm just thinking about the zarr use case where we want to use object_store so that we can write a library which is generic over the users' stores. If we don't build in the (inefficient) fallback behaviour, then all library authors wanting to make use of generic stores will have to do a lot of re-implementing that exact behaviour with matches on the response and make 3 requests (one to fail, one to get the length, one to actually get the range). An error is the current behaviour (but returned by the server rather than on our side). |
The aim of this crate is to be faithful to the APIs exposed by the object stores themselves, and I therefore think returning an error for something that isn't supported by the store is the correct thing to do. Yes workloads that wish to use suffix ranges against any store will have to do more work to support this, and will need to ascertain how it is they wish to achieve this, even if the solution is just to not use suffix ranges and to ensure they always have the object sizes on hand. FWIW this is the approach that DataFusion takes, and aligns with the fact catalogs store object sizes as these are important for correctly optimizing queries. Edit: perhaps we could make the fallback configurable and disabled by default 🤔 |
- Use idiomatic snafu error handling - fast-fail on azure suffix requests - remove unused GetRange utilities
Implemented all requested changes. For azure, I've checked whether a suffix was requested and then failed before sending the request. It would be valid to just send the suffix request, as the response will trigger an error anyway, and then it would immediately start working if/when microsoft pulls their socks up and implements it, but the error wouldn't be as informative and it would be slower. |
Makes sense, I suspect we would need to update the API version anyway for this to work if they added support for this.
It is EOD for me here, but I'll take a look first thing tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is almost ready to go, I pushed some commits to cleanup some stuff, I think the only question concerns if Display for GetRange should also include the bytes =
?
It would appear this is falling foul of #5272 |
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@clbarnes Ok I have now gotten this updated to the point where CI passes, could you perhaps take a look and check the changes I've made make sense |
Interestingly the spec has the following to say
Which would imply that the logic in this PR is still overly strict |
Ah, because I'm the author it doesn't let me suggest changes, it just puts them right in.
should be a
should also be a |
These were changed because they were resulting in failures of existing tests, running against actual object stores. |
True! If we want to allow the EndTooLarge case then we also need to update the
What it's running against shouldn't matter, should it? As it just catches a few extra cases where where we have requested the wrong thing. It's just a case of whether we want to allow 0-length requests, and whether we want to allow a range where the first byte is after the end of the resource. Are those tests fuzzed or are we explicitly making such requests in the tests? |
My point was we shouldn't break things that currently work, regardless of opinions on their relative correctness 😄 We should therefore establish what exactly the current behaviour is and look to preserve that, perhaps you might be able to look into this, as I am not likely to have time today, and I intend to cut the release tomorrow. |
I think both should result in a 416 RangeNotSatisfiable, because in one the offset exceeds the resource length, and in the other a |
- Raise an error before the request is made if the range has <= 0 bytes in it - `GetRange::as_range` now handles more out-of-bounds cases, although in most cases these should result in a 416 from the server anyway.
Would it be possible to allow the workflows on this PR to run without approval? As the failing tests are dependent on complex and possibly secret CI setup.
I think it's OK to be more strict than the HTTP spec on this check, because |
It doesn't require approvals once you have had code merged, I'm not sure there is a way to turn it off, it's Github enforced IIRC to stop bitcoin miners
So long as this is what it currently does that sounds fine to me |
The tests which were failing were all tests written as part of this PR, as far as I can tell. I think there were a couple of places where the errors' fields were ambiguously named and so some of them were the wrong way round, so I've improved the names and updated the tests for the new, stricter failure cases.
|
Does this case currently work, as if it doesn't that would be a pretty compelling argument for "it doesn't matter". |
It currently works in that the server returns some bytes, but it doesn't return the range of bytes which the user requested, and then the reported
The middle ground would be to return an error, but have it contain the |
I think the use-case of "give me up to the first 100 bytes" is a perfectly valid use-case, and is even called out in the HTTP spec. I therefore don't think we can/should break this if it is currently supported. We should also add additional tests for this, if they don't already exist.
Although one could argue the lack of support for zero-length ranges also breaks this 😅
Only if they don't know the size of the object they are requesting data from, otherwise the behaviour is no different from master? Sure an API could return less, much like it could also return gibberish, at some point you just have to trust that the servers implement the spec correctly |
Got it, I'll see if I can get that done this afternoon.
True, although there is no way to express "return a zero-length slice of bytes" in the HTTP spec because it's right-inclusive. We could bodge it with a HEAD request to fill out the |
Thank you, sorry to go back and forth on this, but whenever I've assumed "nobody could possibly rely on this", I'm proven wrong and everyone then has a bad day 😅 |
Relevant xkcd, of course! That's now implemented and tested, and the docs for |
I took the liberty of pushing e20b130 which makes some minor copy tweaks and also loosens the suffix restrictions. Aside from following the spec, this is important for suffix requests to be useful for systems, such as when dealing with parquet, where the exact footer size isn't necessarily known ahead of time and instead a guess is made. Edit: updating the tests |
(start..len, entry.data.slice(start..len)) | ||
} | ||
} | ||
let r = range.as_range(entry.data.len()).context(RangeSnafu)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seemed simpler to just have this logic in one place, as opposed to duplicating it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is now good go, unless you have any objections
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I think we are finally good to go 😅
I'll merge this tomorrow morning in case anyone has any further comments
Yes, this looks good to me! Thank you so much for your patience and willingness to wade into this. |
} | ||
|
||
/// Convert to a [`Range`] if valid. | ||
pub(crate) fn as_range(&self, len: usize) -> Result<Range<usize>, InvalidGetRange> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make this public? It looks like this is needed by anyone implementing an ObjectStore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we try to keep the public API as small as possible to allow for evolution
Which issue does this PR close?
Closes #4611 . Supercedes #5206 .
Rationale for this change
Motivated by fetching file suffixes without needing 2 requests to find the length first.
What changes are included in this PR?
GetRange
which represents bounded ranges, offsets, and suffixesGetOptions
now stores aGetRange
rather than aRange
ObjectStore
are updated with these changesAre there any user-facing changes?
GetOptions::range
is now acrate::utils::GetRange
rather than astd::ops::Range<usize>
. It implementsFrom<RangeBounds<usize>>
so this simply means adding.into()
to current code.There are some additional error cases which previously went undetected if a server didn't return exactly the requested range.
Open questions
LocalFileSystem is a special-case store whose
get_opts
returns a whole file which then must be sliced using theGetResult::range
. Other stores return a response containing only the requested bytes, with the range returned only for informational purposes. There is currently little need for users to deal with this complexity asget_range(s)
handles it internally, but this PR may guide more users towards having to do so.There are a bunch of utilities in the
GetRange
impl and associated free functions which seemed like they could be useful but are not currently used, and could be deleted for simplicity.