-
Notifications
You must be signed in to change notification settings - Fork 823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support S3 Express One Zone #5268
Conversation
da0603b
to
b76f8e7
Compare
use std::collections::BTreeMap; | ||
use std::sync::Arc; | ||
use std::time::{Duration, Instant}; | ||
use tracing::warn; | ||
use url::Url; | ||
|
||
#[derive(Debug, Snafu)] | ||
#[allow(clippy::enum_variant_names)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise clippy complains about the same prefix for the variants, this is future proofing for when we add further variants for the other error propagation in this module that is currently very ad-hoc
.with_sign_payload(self.config.sign_payload); | ||
|
||
if self.session_token { | ||
let token = HeaderName::from_static("x-amz-s3session-token"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not clearly documented anywhere I could find, I reverse engineered it from the boto library - https://github.com/boto/botocore/blob/develop/botocore/auth.py#L547
@Xuanwo perhaps you might be able to give this one a once over? |
Hey, I wanted to play around with some "real world" use cases of S3 Express 1Z serving Parquet, and this seemed like a good place to start. When I got a test project up and running with this branch, I noticed that the same datafusion queries executed against the same files in a normal bucket and a express1z bucket were taking roughly 2x as long with the express1z bucket. When I started to dive into why this was, I noticed that (in my code at least), the I'm new to this project, and I know this code is WIP, so it's very possible I was holding it wrong (does the caller need to set up credential caching somehow when initializing the object store?), and also very possible that I misunderstand how this code is supposed to be working. That said, I wanted to flag here in case it's unexpected that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM, but there are a few decision details I'd like to discuss.
Which naming to use in our project?
AWS S3 Express
is the product name of this feature, but all it's docs are using directory bucket
.
I feel that directory bucket
is more easy to understand and much close to it's feature set: directory based, /
as delimiter only, no start-after, ...
Should we expose s3 express
to users directly?
AWS S3 Express is now exclusive to AWS. Its unique designs, such as the --x-s3
endpoint modifier, ensure that other vendors will not adopt a similar approach.
Should we offer s3 express as a configurable option to users, or simply provide additional configurations along with documentation to assist users in setting up a directory bucket?
We can internally determine if is_s3_directory_bucket
by checking for --x-s3
in the endpoint and managing credential-related matters.
object_store/src/aws/credential.rs
Outdated
static DATE_HEADER: HeaderName = HeaderName::from_static("x-amz-date"); | ||
static HASH_HEADER: HeaderName = HeaderName::from_static("x-amz-content-sha256"); | ||
static TOKEN_HEADER: HeaderName = HeaderName::from_static("x-amz-security-token"); | ||
static AUTH_HEADER: HeaderName = HeaderName::from_static("authorization"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe we can use http::header::AUTHORIZATION
?
@@ -555,20 +581,20 @@ struct AssumeRoleResponse { | |||
#[derive(Debug, Deserialize)] | |||
#[serde(rename_all = "PascalCase")] | |||
struct AssumeRoleResult { | |||
credentials: AssumeRoleCredentials, | |||
credentials: SessionCredentials, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice change, since AssumeRoleWithWebIdentity
returns the same struct.
Ok(TemporaryToken { | ||
token: Arc::new(creds.into()), | ||
// Credentials last 5 minutes | ||
expiry: Some(Instant::now() + Duration::from_secs(5 * 60)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused. Is there another consideration preventing us from using something like SessionCredentials.expiration - Duration::from_secs(600)
? Default of the expiration could last 3600 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporary security credentials are scoped to the bucket and expire after 5 minutes.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateSession.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lesson learnt!
object_store/src/aws/mod.rs
Outdated
@@ -255,6 +255,16 @@ impl ObjectStore for AmazonS3 { | |||
prefix: Option<&Path>, | |||
offset: &Path, | |||
) -> BoxStream<'_, Result<ObjectMeta>> { | |||
if self.client.config.session_provider.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The relationship between session_provider
and S3 Express is not immediately clear at first glance. I'm not sure if it worth to add an is_s3_express_bucket
, or more formally, is_s3_directory_bucket
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a private member function to make this more clear
Naming is a good point, it is certainly unhelpful how AWS have chosen to document this. IMO it seems wrong to call something a directory bucket that doesn't actually have a concept of directories... That aside, I think it important to make it clear it is an AWS specific thing, given the number of S3-compatible stores. This is also the reason I'm hesitant to automatically match based on the bucket name, in case you aren't talking to S3. Edit: although you raise a good point about the --x-s3 suffix... |
Credentials are cached for 5 minutes and reused, is it possible you are creating a new object store instance per request, or making requests less frequently than this? Edit: In fact you are totally correct, the timeout the TokenProvider is using to decide when to get new credentials is the same 5 minutes that the credentials are valid for... - nice spot 👍 |
Makes sense, LGTM to me now. |
Which issue does this PR close?
Closes #5140
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?