Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double encoding of spaces in storage prefix #1274

Closed
wants to merge 1 commit into from
Closed

Fix double encoding of spaces in storage prefix #1274

wants to merge 1 commit into from

Conversation

mrjoe7
Copy link
Contributor

@mrjoe7 mrjoe7 commented Apr 10, 2023

When storage was built from URL "file:///data/dir with spaces" all files were created in a directory named "/data/dir%2520with%2520spaces". Name of directory was not only incorrect, but also was causing errors on some filesystems.

Description

Path in URL is already url encoded when passed to prefix store where it is encoded one more time.

Related

Fail to read delta table on mounted disk #1189

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels Apr 10, 2023
@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 10, 2023

seems I'll have to get my hands on windows machine to verify the fix

@rtyler
Copy link
Member

rtyler commented Apr 10, 2023

@mrjoe7 as far as I can tell this is using the url_prefix_handler for all object stores correct? I didn't think we had any issues with spaces in URLs for S3 for example, to where I'm a little apprehensive that this might result in double-encoding of URLs for remote object stores 🤔

@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 10, 2023

@rtyler This is what I got when I was testing with my local s3. It's showing the files after and before my proposed fix.

> awslocal s3 ls s3://test-bucket/ --recursive
2023-04-10 18:28:21         10 delta test directory/test with spaces.txt
2023-04-10 18:20:01         10 delta%2520test%2520directory/test with spaces.txt

My code to test with localstack s3:

    #[tokio::test]
    async fn test_spaces_in_s3() {
        let store_url = Url::parse("s3://test-bucket/delta test directory").unwrap();
        dbg!(store_url.path());
        let store = DeltaObjectStore::try_new(
            store_url,
            HashMap::from([
                (
                    "AWS_ACCESS_KEY_ID".to_string(),
                    "TESTACCESSKEY12345".to_string(),
                ),
                (
                    "AWS_SECRET_ACCESS_KEY".to_string(),
                    "ABCSECRETKEY".to_string(),
                ),
                ("AWS_REGION".to_string(), "us-east-1".to_string()),
                (
                    "AWS_ENDPOINT_URL".to_string(),
                    "http://localhost:4566".to_string(),
                ),
                ("AWS_STORAGE_ALLOW_HTTP".to_string(), "TRUE".to_string()),
            ]),
        )
        .unwrap();

        assert!(store
            .put(&"test with spaces.txt".to_string().into(), Bytes::from("0123456789"))
            .await
            .is_ok())
    }

Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this.

Since this affects the object stores in addition to the local filesystem, I'd like for us to have integration tests that validate the behavior for each of those. It should be sufficient to create an object in each one whose path contains special characters (spaces, but also ideally other characters as well) via the CLI utilities, and then verifies we can access them with the resolved object store. There are utilities in

pub mod s3_cli {

rust/Cargo.toml Outdated Show resolved Hide resolved
rust/src/storage/config.rs Show resolved Hide resolved
rust/src/storage/config.rs Outdated Show resolved Hide resolved
rust/src/storage/config.rs Outdated Show resolved Hide resolved
@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 11, 2023

Windows build failed because

 LLVM ERROR: IO failure on output stream: no space on device

@roeap
Copy link
Collaborator

roeap commented Apr 11, 2023

Thats usually caused by caches taking up too much space. Cleared some caches and re-triggered that task, so hopefully should succeed now.

@roeap
Copy link
Collaborator

roeap commented Apr 11, 2023

seems like windows tests are actually failing now due to an illegal character in the working directory.

@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 11, 2023

seems like windows tests are actually failing now due to an illegal character in the working directory.

object_store does not allow short windows directory names to be used in Path::from_url_path because they contain ~ character. I do not want to complicate code in url_prefix_handler function so I will update my test case to be more windows friendly.
I used a short name for my windows user so I was not running into the directory name error.

@kuksag
Copy link

kuksag commented Apr 12, 2023

seems like windows tests are actually failing now due to an illegal character

Apparently, this fails not only on windows, but on linux too.
For example, the following code fails (same with other chars like #):

write_deltalake('illegal~char/table', df)

With this error:

deltalake.PyDeltaTableError: Generic error: Error parsing Path "/data/home/gkuksa/PycharmProjects/db-fetcher/illegal~char/table/": Encountered illegal character sequence "~" whilst parsing path segment "illegal~char"

Though it worked in previous versions (e.g. 0.7.0)

@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 12, 2023

seems like windows tests are actually failing now due to an illegal character

Apparently, this fails not only on windows, but on linux too. For example, the following code fails (same with other chars like #):

write_deltalake('illegal~char/table', df)

With this error:

deltalake.PyDeltaTableError: Generic error: Error parsing Path "/data/home/gkuksa/PycharmProjects/db-fetcher/illegal~char/table/": Encountered illegal character sequence "~" whilst parsing path segment "illegal~char"

Though it worked in previous versions (e.g. 0.7.0)

Looks like object_store::path::Path::from_url_path will not accept any character that does not match the following criteria:

if !b.is_ascii() || should_percent_encode(b) {

Wrote a test to illustrate the options:

    #[test]
    fn test_paths() {
        assert_eq!("space space", object_store::path::Path::from("space space").to_string());
        assert_eq!("tilda%7Etilda", object_store::path::Path::from("tilda~tilda").to_string());

        assert!(object_store::path::Path::from_url_path("space space").is_ok());
        assert_eq!("space space", object_store::path::Path::from_url_path("space space").unwrap().to_string());
        assert!(object_store::path::Path::from_url_path("tilda~tilda").is_err());
    }

@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 12, 2023

I was playing around with object_store crate and found that their support for special characters in S3 is broken.

First I uploaded a file using aws cli:

awslocal s3 cp localfile.txt s3://test-bucket/~test/uploaded-by-awscli.txt

Then I craeted a simple test to validate object_store functionality:

    #[tokio::test]
    async fn test_spaces_in_s3() {
        let options = HashMap::from([
            (
                "aws_access_key_id".to_string(),
                "TESTACCESSKEY12345".to_string(),
            ),
            (
                "aws_secret_access_key".to_string(),
                "ABCSECRETKEY".to_string(),
            ),
            ("aws_region".to_string(), "us-east-1".to_string()),
            (
                "aws_endpoint_url".to_string(),
                "http://localhost:4566".to_string(),
            ),
        ]);
        let amazon_s3 = AmazonS3Builder::from_env()
            .with_url("s3://test-bucket")
            .try_with_options(&options).unwrap()
            .with_allow_http(true)
            .build().unwrap();
        amazon_s3.put(&object_store::path::Path::from("~test/object_store_1.xml"), Bytes::from("0123456789")).await.unwrap();

        let store = PrefixStore::new(amazon_s3, "~test");
        store.put(&object_store::path::Path::from("object_store_2.xml"), Bytes::from("0123456789")).await.unwrap();
    }

Then I checked what files were created in S3:

> awslocal s3 ls s3://test-bucket/ --recursive
2023-04-12 21:48:56         10 %7Etest/object_store_1.xml
2023-04-12 21:48:56         10 %7Etest/object_store_2.xml
2023-04-12 21:47:20       4852 ~test/uploaded-by-awscli.txt
> awslocal s3 ls s3://test-bucket/~test --recursive
2023-04-12 21:47:20       4852 ~test/uploaded-by-awscli.txt
> awslocal s3 ls s3://test-bucket/%7Etest --recursive
2023-04-12 21:48:56         10 %7Etest/object_store_1.xml
2023-04-12 21:48:56         10 %7Etest/object_store_2.xml

You can see the ~ is incorrectly encoded as %7E.

@roeap
Copy link
Collaborator

roeap commented Apr 12, 2023

Hmm ... I am not sure if this is not the intended behaviour. To my understanding the aim of the object_store crate is to provide a uniform interface to all backends, so a path should always be valid in all backend implementations. The documented behaviour is that paths are always encoded according to RFC 1738.

There definitely is something off on our end as well, as shown with the double encoding of spaces fixed in this PR. Where I am less certain is how this should translate to the paths created in e.g. S3. Looking at the AWS docs, they do state that in principle UTF-8 encoded characters are allowed, but also specify a set of recommended characters to be safe in all cases. There ~ is not included, but rather explicitly listed as a character to avoid. So here object_store takes the conservative, yet generally save approach.

To complicate things a bit further, I guess this also has implications on how partitioned tables are handled, as it seems other writers might handle encoding paths differently, and the delta protocol just states "No translation required" when it comes to serializing string partition values.

To move forward, maybe its best to focus on the double encoding of spaces in this PR and open some tickets for follow ups?

@wjones127
Copy link
Collaborator

the delta protocol just states "No translation required" when it comes to serializing string partition values.

Well remember, partition values in Delta Lake are meant to be stored in the log JSON, which can be UTF-8 encoded. It doesn't dictate the format of the partition directory paths; they are just encoded live HIVE tables by convention.

@wjones127
Copy link
Collaborator

I think we need to have robust tests for these paths. I've started on that in #1278

@wjones127
Copy link
Collaborator

wjones127 commented Apr 13, 2023

@mrjoe7 I think the encoding of S3 keys is intentional. S3 recommends always percent encoding keys, otherwise you risk breaking many integrations.

So I think the correct handling looks like it should be split between local filesystems and object stores:

  • Load table at "file://my table" -> looks for "/my table/_delta_log",
  • Load table at "file://my-table/with:$?/" -> looks for "/my-table/with:$?/_delta_log",
  • Load table at "s3://my table" -> looks for "s3://my%20table/_delta_log"
  • Load table at "s3://my-table/with:$?/" -> looks for "s3://my-table/with%3A%24%3F/_delta_log"

Does this seem acceptable to you? I think this is consistent with how Spark works as well.

@roeap
Copy link
Collaborator

roeap commented Apr 13, 2023

Not sure if we need to separately check this, but the first segment, when referring to S3 buckets has stricter rules, essentially just lowercase alphanumeric characters, dots and hyphens are permitted in bucket names. I guess this underlines @wjones127 comment, that local and remote paths should be treated separately.

@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 13, 2023

@WJones

* Load table at `"file://my-table/with:$?/"` -> looks for `"/my-table/with:$?/_delta_log"`,

Not sure if that is even possible since object_store::path::Path will always either try to encode ? as %3F or will fail with BadSegment error depending which api function you use.

@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 13, 2023

My suggestion is to urldecode just the %20 into the space character. That should work because object_store::path::Path won't encode it back to %20.

@wjones127
Copy link
Collaborator

wjones127 commented Apr 13, 2023

Yea, for local stores I think we need to not encode the table Uri so we can pass it to LocalFilesystem::new_from_prefix() (docs), which avoids encoding the prefix entirely. Then we can support arbitrary paths on local file systems.

@mrjoe7
Copy link
Contributor Author

mrjoe7 commented Apr 13, 2023

If we go for LocalFileSystem::new_with_prefix then code in this repo will have to ensure the target directory exists. If you are Ok with that I can update my PR.

In case the target directory is not created before new_with_prefix() is called it will fail:

panicked at 'called `Result::unwrap()` on an `Err` value: Generic { store: "LocalFileSystem", source: UnableToCanonicalize { path: "file:///tmp/delta/~test", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } } }

@roeap
Copy link
Collaborator

roeap commented Apr 14, 2023

If we go for LocalFileSystem::new_with_prefix then code in this repo will have to ensure the target directory exists.

We have some code for this to also support relative file paths. It likely needs adjustment, but may serve as a starting point.

pub(crate) fn ensure_table_uri(table_uri: impl AsRef<str>) -> DeltaResult<Url> {
let table_uri = table_uri.as_ref();
if let Ok(path) = std::fs::canonicalize(table_uri) {
return Url::from_directory_path(path)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()));
}
if let Ok(url) = Url::parse(table_uri) {
return Ok(match url.scheme() {
"file" => url,
_ => {
let mut new_url = url.clone();
new_url.set_path(url.path().trim_end_matches('/'));
new_url
}
});
}
// The table uri still might be a relative paths that does not exist.
std::fs::create_dir_all(table_uri)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))?;
let path = std::fs::canonicalize(table_uri)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))?;
Url::from_directory_path(path)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))
}

When storage was built from URL "file:///data/dir with spaces" all files were created in a directory named "/data/dir%2520with%2520spaces".
Name of directory was not only incorrect, but also was causing errors on some filesystems.
@mrjoe7 mrjoe7 requested a review from wjones127 April 17, 2023 22:02
@wjones127
Copy link
Collaborator

Sorry it's been a bit before I could take a look at this. This is getting close, but it needs tests.

I've created a PR to merge some into this branch; but they don't all pass yet. I'm still looking into why.

https://github.com/mrjoe7/delta-rs/pull/10

wjones127 added a commit that referenced this pull request Apr 30, 2023
# Description

Continues work in #1274. Adds tests, and handles characters besides
spaces.

# Related Issue(s)

- closes #1189


# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Tomas Sedlak <tomas.sedlak@masterminds.sk>
@wjones127
Copy link
Collaborator

Closing in favor of #1311, which included the commits from this PR. Thanks @mrjoe7

@wjones127 wjones127 closed this Apr 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants