Fix double encoding of spaces in storage prefix #1274
Conversation
Seems I'll have to get my hands on a Windows machine to verify the fix.
@mrjoe7 as far as I can tell this is using the
@rtyler This is what I got when testing with my local S3. It shows the files before and after my proposed fix.
My code to test with localstack S3:

```rust
#[tokio::test]
async fn test_spaces_in_s3() {
    let store_url = Url::parse("s3://test-bucket/delta test directory").unwrap();
    dbg!(store_url.path());
    let store = DeltaObjectStore::try_new(
        store_url,
        HashMap::from([
            (
                "AWS_ACCESS_KEY_ID".to_string(),
                "TESTACCESSKEY12345".to_string(),
            ),
            (
                "AWS_SECRET_ACCESS_KEY".to_string(),
                "ABCSECRETKEY".to_string(),
            ),
            ("AWS_REGION".to_string(), "us-east-1".to_string()),
            (
                "AWS_ENDPOINT_URL".to_string(),
                "http://localhost:4566".to_string(),
            ),
            ("AWS_STORAGE_ALLOW_HTTP".to_string(), "TRUE".to_string()),
        ]),
    )
    .unwrap();
    assert!(store
        .put(
            &"test with spaces.txt".to_string().into(),
            Bytes::from("0123456789")
        )
        .await
        .is_ok())
}
```
Thank you for working on this.

Since this affects the object stores in addition to the local filesystem, I'd like for us to have integration tests that validate the behavior for each of them. It should be sufficient to create an object in each one whose path contains special characters (spaces, but ideally other characters as well) via the CLI utilities, and then verify we can access them with the resolved object store. There are utilities in delta-rs/rust/src/test_utils.rs (line 289 in c8371b3, `pub mod s3_cli {`).
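A rough sketch of what such an integration test could look like for the S3 case, assuming a localstack setup like the one in the earlier comment; the bucket name, endpoint, credentials, the `aws` CLI invocation standing in for the `s3_cli` helpers, and the `deltalake::storage` module path are all assumptions:

```rust
use std::collections::HashMap;

use deltalake::storage::DeltaObjectStore; // module path assumed
use object_store::ObjectStore;
use url::Url;

#[tokio::test]
async fn special_characters_visible_through_resolved_store() {
    let names = ["file with space.txt", "file~with~tilde.txt"];

    // Create the objects out-of-band so the keys on S3 are exactly what the CLI writes.
    for name in names {
        let status = std::process::Command::new("aws")
            .args(["s3", "cp", "Cargo.toml"])
            .arg(format!("s3://test-bucket/special chars/{name}"))
            .args(["--endpoint-url", "http://localhost:4566"])
            .status()
            .unwrap();
        assert!(status.success());
    }

    // Resolve the store for a prefix containing a space, as in the earlier test.
    let store = DeltaObjectStore::try_new(
        Url::parse("s3://test-bucket/special chars").unwrap(),
        HashMap::from([
            ("AWS_ACCESS_KEY_ID".to_string(), "TESTACCESSKEY12345".to_string()),
            ("AWS_SECRET_ACCESS_KEY".to_string(), "ABCSECRETKEY".to_string()),
            ("AWS_REGION".to_string(), "us-east-1".to_string()),
            ("AWS_ENDPOINT_URL".to_string(), "http://localhost:4566".to_string()),
            ("AWS_STORAGE_ALLOW_HTTP".to_string(), "TRUE".to_string()),
        ]),
    )
    .unwrap();

    // Verify the resolved store can see what the CLI created.
    for name in names {
        let path = object_store::path::Path::from(name);
        assert!(store.head(&path).await.is_ok());
    }
}
```

Note that `Path::from` itself percent-encodes characters like `~` (as shown later in this thread), so the tilde case exercises exactly the mismatch being discussed here.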
Windows build failed because
That's usually caused by caches taking up too much space. I cleared some caches and re-triggered that task, so hopefully it should succeed now.
Seems like Windows tests are actually failing now due to an illegal character in the working directory.
object_store does not allow short Windows directory names to be used in
Apparently, this fails not only on Windows, but on Linux too.
With this error:
Though it worked in previous versions (e.g. 0.7.0).
Looks like `object_store::path::Path::from_url_path` will not accept any character that does not match the following criteria: `if !b.is_ascii() || should_percent_encode(b) {`. Wrote a test to illustrate the options:

```rust
#[test]
fn test_paths() {
    assert_eq!("space space", object_store::path::Path::from("space space").to_string());
    assert_eq!("tilda%7Etilda", object_store::path::Path::from("tilda~tilda").to_string());
    assert!(object_store::path::Path::from_url_path("space space").is_ok());
    assert_eq!("space space", object_store::path::Path::from_url_path("space space").unwrap().to_string());
    assert!(object_store::path::Path::from_url_path("tilda~tilda").is_err());
}
```
I was playing around with the object_store crate and found that its support for special characters in S3 is broken. First I uploaded a file using the AWS CLI:

```sh
awslocal s3 cp localfile.txt s3://test-bucket/~test/uploaded-by-awscli.txt
```

Then I created a simple test to validate object_store functionality:

```rust
#[tokio::test]
async fn test_spaces_in_s3() {
    let options = HashMap::from([
        (
            "aws_access_key_id".to_string(),
            "TESTACCESSKEY12345".to_string(),
        ),
        (
            "aws_secret_access_key".to_string(),
            "ABCSECRETKEY".to_string(),
        ),
        ("aws_region".to_string(), "us-east-1".to_string()),
        (
            "aws_endpoint_url".to_string(),
            "http://localhost:4566".to_string(),
        ),
    ]);
    let amazon_s3 = AmazonS3Builder::from_env()
        .with_url("s3://test-bucket")
        .try_with_options(&options)
        .unwrap()
        .with_allow_http(true)
        .build()
        .unwrap();
    amazon_s3
        .put(
            &object_store::path::Path::from("~test/object_store_1.xml"),
            Bytes::from("0123456789"),
        )
        .await
        .unwrap();
    let store = PrefixStore::new(amazon_s3, "~test");
    store
        .put(
            &object_store::path::Path::from("object_store_2.xml"),
            Bytes::from("0123456789"),
        )
        .await
        .unwrap();
}
```

Then I checked what files were created in S3:
You can see the
Hmm ... I am not sure this isn't the intended behaviour. To my understanding, the aim of the object_store crate is to provide a uniform interface to all backends, so a path should always be valid in all backend implementations. The documented behaviour is that paths are always encoded according to RFC 1738.

There definitely is something off on our end as well, as shown with the double encoding of spaces fixed in this PR. Where I am less certain is how this should translate to the paths created in e.g. S3. Looking at the AWS docs, they do state that in principle UTF-8 encoded characters are allowed, but they also specify a set of recommended characters that are safe in all cases. There

To complicate things a bit further, I guess this also has implications for how partitioned tables are handled, as it seems other writers might handle encoding paths differently, and the Delta protocol just states "No translation required" when it comes to serializing string partition values.

To move forward, maybe it's best to focus on the double encoding of spaces in this PR and open some tickets for follow-ups?
Well, remember, partition values in Delta Lake are meant to be stored in the log JSON, which can be UTF-8 encoded. It doesn't dictate the format of the partition directory paths; they are just encoded like Hive tables by convention.
I think we need to have robust tests for these paths. I've started on that in #1278.
@mrjoe7 I think the encoding of S3 keys is intentional. S3 recommends always percent-encoding keys; otherwise you risk breaking many integrations. So I think the correct handling should be split between local filesystems and object stores:

Does this seem acceptable to you? I think this is consistent with how Spark works as well.
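A minimal sketch of how such a split could look, assuming the decision is made purely on the URL scheme; the `resolve_prefix` helper is hypothetical, not delta-rs code:

```rust
use percent_encoding::percent_decode_str;
use url::Url;

/// Hypothetical helper: turn a table URL into the prefix string handed to the backing store.
fn resolve_prefix(url: &Url) -> String {
    if url.scheme() == "file" {
        // Local filesystem: decode so the OS sees the real directory name,
        // e.g. "dir%20with%20spaces" -> "dir with spaces".
        percent_decode_str(url.path()).decode_utf8_lossy().into_owned()
    } else {
        // Object stores such as S3: keep the percent-encoded key as-is.
        url.path().trim_start_matches('/').to_string()
    }
}

fn main() {
    let local = Url::parse("file:///data/dir with spaces").unwrap();
    let remote = Url::parse("s3://bucket/dir with spaces").unwrap();
    assert_eq!(resolve_prefix(&local), "/data/dir with spaces");
    assert_eq!(resolve_prefix(&remote), "dir%20with%20spaces");
}
```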
Not sure if we need to separately check this, but the first segment, when referring to S3 buckets, has stricter rules: essentially only lowercase alphanumeric characters, dots, and hyphens are permitted in bucket names. I guess this underlines @wjones127's comment that local and remote paths should be treated separately.
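A minimal sketch of that bucket-segment check, based on the AWS bucket-naming rules (3 to 63 characters, starting and ending with a letter or digit):

```rust
// Validate a bucket name against the basic AWS naming rules described above.
fn is_valid_bucket_name(name: &str) -> bool {
    let bytes = name.as_bytes();
    bytes.len() >= 3
        && bytes.len() <= 63
        && bytes.first().map_or(false, |b| b.is_ascii_lowercase() || b.is_ascii_digit())
        && bytes.last().map_or(false, |b| b.is_ascii_lowercase() || b.is_ascii_digit())
        && bytes
            .iter()
            .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || *b == b'.' || *b == b'-')
}

fn main() {
    assert!(is_valid_bucket_name("test-bucket"));
    assert!(!is_valid_bucket_name("Delta Test Bucket")); // uppercase and spaces are rejected
}
```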
Not sure if that is even possible since object_store::path::Path will always either try to encode
My suggestion is to urldecode just the
Yeah, for local stores I think we need to not encode the table URI so we can pass it to
If we go for
In case the target directory is not created before new_with_prefix() is called, it will fail:
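A minimal sketch of working around that, with an assumed example directory: create the directory before constructing the store, since new_with_prefix fails when the prefix does not exist yet:

```rust
use object_store::local::LocalFileSystem;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed table directory; note the decoded space in the name.
    let table_dir = "/tmp/delta test directory";

    // Create it up front, otherwise new_with_prefix errors out.
    std::fs::create_dir_all(table_dir)?;
    let _store = LocalFileSystem::new_with_prefix(table_dir)?;
    Ok(())
}
```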
We have some code for this to also support relative file paths. It likely needs adjustment, but may serve as a starting point (lines 348 to 371 in 4f9665d).
Sorry it's been a while before I could take a look at this. This is getting close, but it needs tests. I've created a PR to merge some into this branch, but they don't all pass yet. I'm still looking into why.
When storage was built from the URL "file:///data/dir with spaces", all files were created in a directory named "/data/dir%2520with%2520spaces". The directory name was not only incorrect but also caused errors on some filesystems.
Description
The path in the URL is already URL-encoded when it is passed to the prefix store, where it is encoded a second time.
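A minimal illustration of where the "%2520" comes from, using the url and percent_encoding crates (the exact encode set applied by the prefix store may differ): parsing the URL encodes the space once, and encoding the result again turns '%' into "%25":

```rust
use percent_encoding::{utf8_percent_encode, NON_ALPHANUMERIC};
use url::Url;

fn main() {
    // Parsing the URL already percent-encodes the spaces once.
    let url = Url::parse("file:///data/dir with spaces").unwrap();
    assert_eq!(url.path(), "/data/dir%20with%20spaces");

    // Encoding an already-encoded segment a second time turns '%' into "%25".
    let twice = utf8_percent_encode("dir%20with%20spaces", NON_ALPHANUMERIC).to_string();
    assert_eq!(twice, "dir%2520with%2520spaces");
}
```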
Related
Fail to read delta table on mounted disk #1189