
Replace azure sdk with custom implementation #2509

Merged: 20 commits, Aug 24, 2022

Conversation

@roeap (Contributor) commented Aug 19, 2022

Which issue does this PR close?

closes #2176

Rationale for this change

See #2176

What changes are included in this PR?

Replaces the azure sdk with a custom implementation based on reqwest. So far this is a rough draft that surely needs cleanup and some more work on the auth part. I tried to make the AWS and Azure implementations look as comparable as possible. I also pulled in a new dependency on the oauth2 crate. I will evaluate a bit more when cleaning up auth, but my feeling was that implementing the OAuth flows manually could be another significant piece of work.

Any feedback is highly welcome.

cc @tustvold @alamb

Are there any user-facing changes?

Not that I'm aware of, but there is a possibility.

@github-actions bot added the object-store (Object Store Interface) label Aug 19, 2022
@tustvold (Contributor) commented:

I intend to review this later today or tomorrow, very exciting! Re: OAuth, I've already added an OAuth implementation to this crate for GCP; it would be nice if we could use that. I'd be happy to help out with that if you like?

@roeap (Contributor, Author) commented Aug 19, 2022

After chasing a bug in signing via access key for a while, the implementation turned out quite straightforward, and I could lean on most of the logic from the AWS implementation. Generally, authorization still needs to be cleaned up. Aside from getting rid of the oauth2 crate, we have some fallible operations in the with_azure_authorization function, but I think we can either be fairly confident they succeed or validate them on build, so that we can "unwrap" them?

I also tried to activate the stream_get tests for Azure; there are some bugs to be fixed, but that should be doable.

The implementations of the ObjectStore trait are very similar for Azure and AWS, the main difference being if-not-exists support. I thought about whether these two should share code, but maybe that's better left for a follow-up?

> I'd be happy to help out with that if you like?

So far I have only worked with OAuth via higher-level APIs, so help is highly welcome :)
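
For reference, the shared-key scheme discussed here boils down to an HMAC-SHA256 over a canonical string-to-sign, keyed with the base64-decoded account key. A minimal sketch, assuming the ring and base64 crates and a string-to-sign assembled elsewhere (not the PR's exact code):

use ring::hmac;

// Sketch of Azure shared-key authorization. The real string-to-sign
// covers the HTTP verb, a fixed set of standard headers, all x-ms-*
// headers, and the canonicalized resource.
fn authorization_header(account: &str, key_b64: &str, string_to_sign: &str) -> String {
    // Azure distributes the account key base64-encoded
    let key = base64::decode(key_b64).expect("account key must be valid base64");
    let key = hmac::Key::new(hmac::HMAC_SHA256, &key);
    let signature = base64::encode(hmac::sign(&key, string_to_sign.as_bytes()));
    format!("SharedKey {}:{}", account, signature)
}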

@tustvold (Contributor) commented:

> So far I have only worked with OAuth via higher-level APIs, so help is highly welcome :)

OK, I'll try to find some time to have a play over the weekend.

@roeap (Contributor, Author) commented Aug 19, 2022

Turns out getting an OAuth token, especially with a client secret, is not all that complex... we have a working implementation now. That is to say, it works; how we go about it can likely be improved quite a bit.

The important part, though: the oauth2 crate is also no longer needed. One more auth flow would be interesting for production deployments - managed identities - but that should work very similarly to the AWS credentials on EC2 and can certainly wait for a follow-up :). Implementing interactive flows is, from what I understand now, quite a different thing, but those are also likely not needed all that much?
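
For illustration: the client-secret flow mentioned above is a single form POST against the Azure AD token endpoint. A rough sketch with reqwest (assuming the json feature and serde; names are hypothetical, not the PR's code):

use reqwest::Client;
use serde::Deserialize;

#[derive(Deserialize)]
struct TokenResponse {
    access_token: String,
    expires_in: u64,
}

// Fetch a token for Azure Storage via the client-credentials grant
async fn fetch_token(
    client: &Client,
    tenant_id: &str,
    client_id: &str,
    client_secret: &str,
) -> reqwest::Result<TokenResponse> {
    let url = format!(
        "https://login.microsoftonline.com/{}/oauth2/v2.0/token",
        tenant_id
    );
    client
        .post(&url)
        .form(&[
            ("grant_type", "client_credentials"),
            ("client_id", client_id),
            ("client_secret", client_secret),
            // Scope the token to Azure Storage
            ("scope", "https://storage.azure.com/.default"),
        ])
        .send()
        .await?
        .error_for_status()?
        .json()
        .await
}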

@roeap marked this pull request as ready for review August 20, 2022 11:38
@tustvold (Contributor) left a comment

This is looking really nice - fantastic work! Mostly just some minor nits.

I want to validate this against Azure and give it another pass (as it is rather large 😆). Unfortunately I am out for the next two days, but I will get back to this on my return. I definitely intend for this to make the next object_store release.

Resolved review threads on: object_store/src/azure/client.rs, object_store/src/azure/credential.rs, object_store/src/azure/mod.rs
@Xuanwo (Member) commented Aug 21, 2022

Hello, I'm building a lib, reqsign, to help sign API requests to different services - maybe you'd be interested in taking a look?

So far, reqsign supports the following services:

  • AWS services (SigV4): reqsign::services::aws::v4::Signer
  • Azure Storage services: reqsign::services::azure::storage::Signer
  • Google services: reqsign::services::google::Signer
  • Huawei Cloud OBS: reqsign::services::huaweicloud::obs::Singer

A quick example for azure storage:

use reqsign::services::azure::storage::Signer;
use reqwest::{Client, Method, Request, Url};
use anyhow::Result;

#[tokio::main]
async fn main() -> Result<()> {
    // The signer loads credentials from the environment by default;
    // they can also be provided explicitly, as here.
    let signer = Signer::builder()
        .account_name("account_name")
        .account_key("YWNjb3VudF9rZXkK")
        .build()?;
    // Construct the request
    let url = Url::parse("https://test.blob.core.windows.net/testbucket/testblob")?;
    let mut req = Request::new(Method::GET, url);
    // Sign the request in place
    signer.sign(&mut req)?;
    // Send the signed request
    let resp = Client::new().execute(req).await?;
    println!("resp got status: {}", resp.status());
    Ok(())
}

@tustvold (Contributor) commented:

Hi @Xuanwo, very cool idea. That being said, I think I would prefer to keep this logic in-tree to keep the dependency burden low - we've been stung repeatedly by this in the past. The implementation here has no dependencies beyond ring, which is already a dependency of rustls.

roeap and others added 2 commits August 21, 2022 09:27
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@tustvold (Contributor) commented:

Testing this out now

@tustvold (Contributor) commented Aug 23, 2022

So I've spent a decent length of time trying to get this to work. I couldn't get token auth to work - I was just getting 403 errors - but removing the DATE header makes it work. However, stream_get hangs indefinitely. I haven't gotten to testing anything else, but perhaps you could take a look?

@roeap (Contributor, Author) commented Aug 23, 2022

Yes, I will look into it and make sure the oauth example is working... I did run it before, but it likely regressed since.

Also, the infinite hang is a known issue; this is why the is_block_op parameter in put_request was introduced. But I am not sure I tested this against an actual account - then again, I may have broken it along the way... I'll make sure!

@roeap (Contributor, Author) commented Aug 23, 2022

I have the service principal auth working again - turns out the custom date parsing did make a difference after all. From what I had read, though, your comment was correct in that RFC 2822 and RFC 1123 should be equivalent when it comes to parsing dates.

As for the hangs: locally it works, but I never seem to have actually tested it against an account. Will keep on digging :)
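
For context: the x-ms-date header uses the RFC 1123 format; RFC 2822 differs mainly in also permitting numeric zone offsets such as +0000 where HTTP-style dates use the literal GMT. An illustrative formatter with chrono (not the PR's code):

use chrono::Utc;

// Produce an RFC 1123 timestamp such as "Tue, 23 Aug 2022 09:27:00 GMT"
fn rfc1123_now() -> String {
    Utc::now().format("%a, %d %b %Y %H:%M:%S GMT").to_string()
}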

@roeap (Contributor, Author) commented Aug 23, 2022

@tustvold, the latest version I pushed worked for me, including service principal auth and multipart against an actual service. However, the test against the service ran much longer than I expected - well over a minute.

Also, while playing around with chunk sizes I got some intermittent failures of individual parts. Not sure, but we may also have to check for a maximum part size; right now I think we just enforce a minimum.

@tustvold (Contributor) commented:

Awesome, I'm wrapping up for today but will check it out tomorrow

@tustvold (Contributor) commented:

The logic for SAS does not appear to be correct; in particular, it seems to need some sort of signature setup - https://docs.microsoft.com/en-us/rest/api/storageservices/service-sas-examples. It seems similar to S3's concept of a pre-signed URL, which makes me think it might not be a good fit for this crate anyway?

Perhaps we could remove this functionality for now, and get this in without it? Everything else appears to be working now 💪

@roeap (Contributor, Author) commented Aug 24, 2022

Thanks! W.r.t. SAS keys: the main way I know this to be used is that the actual, already-signed query pairs are shared with the consumer, i.e. it usually looks like sv=<signed-key>&st=.... So the idea was not that we generate the SAS, but rather that these query pairs can be used directly and then just need to be added to every request. The signatures you get can also be fairly long-lived. In the end, though, you are right, it is very similar to pre-signed URLs. As I have never worked with those I am not sure if this is a difference, but SAS keys can be scoped to whole containers as well.

I am happy to remove this; we do, however, have an ask in delta-rs to support this kind of auth mechanism.

@tustvold (Contributor) commented Aug 24, 2022

The part that isn't clear to me: if the path is encoded in the SAS token - which I think it is, as part of the signature - how could you use this with an API such as ObjectStore, which addresses multiple paths?

To be clear, trying to use the SAS support as currently implemented just returns invalid signature errors. Although it is possible I am doing something wrong.

@roeap (Contributor, Author) commented Aug 24, 2022

SAS tokens can be issued for a specific blob or for a whole container; in the latter case they are valid for all requests to that container. When scoped to a single blob/path there is not much use for them in this crate.

The way I tested it was to use the Azure Storage Explorer to generate a query string, which comes with multiple values. It would be up to the user to split the total query string provided by e.g. Storage Explorer into the key/value pairs; this is then what the builder expects. Providing these as separate pairs rather than a single string is something I adopted from the SDK - and a little bit of laziness around making sure the string is formatted correctly :)

If you want, I can set up a short-lived example and share the query pairs here...
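
To make the splitting concrete, here is a sketch of turning a raw SAS query string into the key/value pairs described above, using the url crate so values are percent-decoded exactly once (a hypothetical helper, not the PR's code):

// Split a SAS query string such as "sv=...&st=...&sig=..." into pairs.
// form_urlencoded handles percent-decoding; naive splitting on '&' and
// '=' would leave the signature percent-encoded.
fn split_sas(query: &str) -> Vec<(String, String)> {
    url::form_urlencoded::parse(query.as_bytes())
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect()
}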

@tustvold (Contributor) commented:

> The way I tested it was to use the Azure Storage Explorer to generate a query string, which comes with multiple values. It would be up to the user to split the total query string provided by e.g. Storage Explorer into the key/value pairs; this is then what the builder expects.

I tried this and it didn't work... 🙁 Let me double-check I'm not doing something stupid.

@roeap (Contributor, Author) commented Aug 24, 2022

> Let me double-check I'm not doing something stupid.

Or I somehow broke it along the way... much like what happened with the SP support :). I'll make sure again...

@tustvold (Contributor) commented Aug 24, 2022

Apologies, it was PEBCAC; I needed to URL-parse the string instead of splitting it manually - I was percent-encoding the signature by accident 😅

Edit: There is something interesting going on with it returning an error that a file already exists, which is fun though...

) -> Result<()> {
    let credential = self.get_credential().await?;
    let url = self.config.path_url(to);
    let source = self.config.path_url(from);
@tustvold (Contributor) commented Aug 24, 2022

> The read operation on a source blob in the same storage account can be authorized via shared key. Beginning with version 2017-11-09, you can also use Azure Active Directory (Azure AD) to authorize the read operation on the source blob. However, if the source is a blob in another storage account, the source blob must be public, or access to it must be authorized via a shared access signature. If the source blob is public, no authorization is required to perform the copy operation.
>
> When the source object is a file in Azure Files, the source URL uses the following format. Note that the URL must include a valid SAS token for the file.

From https://docs.microsoft.com/en-us/rest/api/storageservices/copy-blob#request-headers

TL;DR - we must include the SAS credentials in the copy source, if we are using SAS credentials.

@roeap (Contributor, Author) replied:

Makes sense :D - will take care of this.

Resolved review thread on: object_store/src/azure/client.rs
Comment on lines 313 to 317
let query = pairs
.iter()
.map(|pair| format!("{}={}", pair.0, pair.1))
.join("&");
source = format!("{}?{}", source, query);
@tustvold (Contributor) commented:

I think this has issues with escaping; perhaps we could use query_pairs_mut.
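
A sketch of that suggestion: parse the source as a Url and let query_pairs_mut handle the encoding (a hypothetical helper, not necessarily the final PR code):

use url::Url;

// Append SAS query pairs to the copy-source URL; the serializer
// returned by query_pairs_mut percent-encodes keys and values.
fn append_query(source: &str, pairs: &[(String, String)]) -> Result<Url, url::ParseError> {
    let mut url = Url::parse(source)?;
    url.query_pairs_mut()
        .extend_pairs(pairs.iter().map(|(k, v)| (k.as_str(), v.as_str())));
    Ok(url)
}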

@roeap (Contributor, Author) replied:

I wanted to avoid parsing a URL, but yes, better safe than sorry :)

@tustvold (Contributor) replied:

Oh, I thought path_url was a URL 😅

I think you should be able to just percent-encode the inputs.

@tustvold (Contributor) replied:

I can confirm that without this change SAS credentials don't work correctly.

@tustvold (Contributor) replied:

I will clean up and push the code that I have working.

roeap and others added 2 commits August 24, 2022 14:02
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@@ -713,7 +713,7 @@ mod tests {
     let location = Path::from("test_dir/test_upload_file.txt");

     // Can write to storage
-    let data = get_vec_of_bytes(5_000_000, 10);
+    let data = get_vec_of_bytes(5_000, 10);
@tustvold (Contributor) commented:

I dropped this as the tests were otherwise incredibly slow.

@tustvold (Contributor) left a comment

I think this is ready to go now; there is some further cleanup possible, but let's do that in follow-up PRs. I will merge this once CI finishes. Fantastic work on this 👍

@roeap (Contributor, Author) commented Aug 24, 2022

Agreed! There is still some work to be done, but finally the SDKs are gone 🎉.

There are two things I was hoping to look at in follow-ups. One is managed identity auth and certificate support for Azure, but much of that should almost already work given the GCS and S3 pieces in the crate.

The other topic is the URL parsing and ObjectStoreProvider we discussed some time ago. I experimented in other crates, and I may have something as a starting point for continuing the discussion...

@tustvold merged commit d36f072 into apache:master Aug 24, 2022
@ursabot commented Aug 24, 2022

Benchmark runs are scheduled for baseline = f11fc1f and contender = d36f072. d36f072 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb (Contributor) commented Aug 31, 2022

🎉

@alamb mentioned this pull request Aug 31, 2022
@tustvold added the api-change (Changes to the arrow API) label Sep 8, 2022
Labels: api-change (Changes to the arrow API), object-store (Object Store Interface)

Successfully merging this pull request may close: object_store: Move Away From SDKs

5 participants