Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: more consistent handling of partition values and file paths #1661

Merged
merged 15 commits into from
Sep 25, 2023

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Sep 24, 2023

Description

This PR does some more cleanup in how we handle encoding of file paths and partition values.

specifically, the following should be fixed

  • we were sometimes writing url-encoded strings as partition values in add actions
  • we were writing un-encoded paths to add actions

Building on the updates in #1613, this PR moves the encoding/decoding for the path fields in add / remove / cdc actions directly to serde,which covers all current code paths where create actions.

Last, we never explictly encoded the values when creating a partition path, but let object_store::Path handle that for us, which looks a bit different to what spark and pyarrow are doing - not saying they are doing the same thing :).#

While debugging #1591, I think I came across some situations where I don not see a clear path how we can make that fully work. I.e. object_store is more restrictive in the caracters it allows in a path then sparks implementation is. While file paths are somewhat arbitrary in delta, we cannot process paths, that contain characters object, store deems illegal. As a consequence - I think - there are certain paths that spark will write, that object store will not process, Some example characters include: }|<>~.

Similarly pyarrows partitioning can produce paths the we cannot process - i.e. if a partition value contains some of the chars above. If we create the same table using the rust writer, we can load it again, but for some reason spark cannot.

As a follow-up we should investigate a bit deeper. While we may not be able to cover all cases, I believe there is room for improvements by finetuning what chararcters we encode and where. When we do this we can also cover all of #1215 - at tlest I think we can.

Related Issue(s)

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate rust labels Sep 24, 2023
@roeap roeap marked this pull request as ready for review September 24, 2023 14:37
@roeap roeap added this to the Rust v0.16 milestone Sep 24, 2023
@ion-elgreco
Copy link
Collaborator

@roeap could https://github.com/apache/incubator-opendal perhaps be an alternative to object_store with regards to that url parsing?

@roeap
Copy link
Collaborator Author

roeap commented Sep 24, 2023

could https://github.com/apache/incubator-opendal perhaps be an alternative to object_store with regards to that url parsing?

Maybe :) - generally speaking though are we very happy with obejct_store and there is a high likelyhood that similar things would arise with opendal, as there are even more systems that are supposed to behave the same behind a sahred API ... so getting it fully integrated would likely be a quite massive amount of work.

@rtyler rtyler merged commit 56e1e87 into delta-io:main Sep 25, 2023
21 checks passed
@roeap roeap deleted the partition-values branch September 25, 2023 18:14
@natinimni
Copy link

Hi @roeap, it seems delta-rs does not encode file paths, at least not completely - when trying to add a file with path: /str_col=hello world/file.parquet, the space character in the path is not encoded (nor the '=' character), and unfortunately when the path is provided "pre encoded", i.e. /str_col=hello%20world/file.parquet the result is double encoding, as the '%' character is getting encoded to "%25", resulting with: /str_col=hello%2520world/file.parquet.
I am writing this in the context of this PR, because it seems like spaces and some other characters are omitted from the list of "INVALID" characters, and I wonder if that was somehow on purpose?
AFAIK, the delta protocol dictates that add/remove actions' 'path' field should be kept URI-encoded:

A relative path to a data file from the root of the table or an absolute path to a file that should be added to the table. The path is a URI as specified by RFC 2396 URI Generic Syntax, which needs to be decoded to get the data file path.

@roeap
Copy link
Collaborator Author

roeap commented Jan 29, 2024

Thanks for the report @natinimni. The handling of path encoding has recently been at least been consolidated a bit, but to improve testing I opened this issue.

One of the issues is, that the object_store crate which actually does the encoding took a different approach focussing on an encoding that would produce valid paths for all object store variants. In the next release with the next version of object store included, the encoding was moved a bit closed to at least what spark does...

That said, there is one thought I had that I'd appreciate your opinion on. Since delta actually does not use the partition values encoded in the file paths for anything, I planned on making the partitioning scheme configurable and defaulting to not including any partitions values in the file paths - thereby avoiding the issue all together...

@natinimni
Copy link

Thank you @roeap for your prompt reply and for opening a new issue on this.
Indeed I see that the "serde_path" code was moved to under "/core/src/kernel/models/actions.rs".

Would you consider approving a PR that adds a few key characters to the "INVALID" list (the characters that undergo encoding)?
I understand that it won't be a complete solution, mainly because "string" partition can assume any value, which may certainly contain specials characters that are ought to be encoded, but I think that even just adding the : and (whitespace) characters to that list would go a long way. WDYT?

As for you suggestion of not using partition values in paths - it absolutely makes sense. We actually use delta-rs only to read/write the delta log, while using other proprietary tools to write the data files and manage the table, so we won't directly benefit from such a change. I did however considered doing exactly that in our system, and decouple partition values from the path.
In any case, this would still be just a workaround as we do want users to be able to configure paths, and it sounds you intend to to the same in delta-rs.

@roeap
Copy link
Collaborator Author

roeap commented Jan 30, 2024

Hey @natinimni,

certainly would. Unfortunately also realized I may have made a mistake in a recent refactor w.r.t. path encoding. Essentially we also need to take care of these lines

pub fn object_store_path(&self) -> Path {
let path = self.path();
// Try to preserve percent encoding if possible
match Path::parse(path.as_ref()) {
Ok(path) => path,
Err(_) => Path::from(path.as_ref()),
}
}

If you are up for it two things would be great.

  1. running encoding through the same function for the Add action and in the location above.
  2. having a roundtrip tests with spark for these cases, since we had situations where we might end up not being able to read each others tables.

Since you mention : explicitly, I assume this may also be windows related where this is a legal (or illegal I don't remember) character in a path?

I am happy to help out if this is beyond what you were planning.

@natinimni
Copy link

Thanks @roeap - I can sure try to include these in a PR, but I would need some more context :)
Regarding # 1 - are you suggesting decorating path property of LogicalFile<'a> with serde_path.encode #[serde(with = "serde_path")] ?
Regarding # 2 - do we currently have similar spark-roundtrip tests I can base these new tests of?

My motivation of encoding : characters is actually for HTTP URI paths :)
By the way, do have an estimate on when the next delta-rs version would be available? the last one was published months ago..