Fix ADLSLocation file parsing #11395

Open
wants to merge 2 commits into
base: main

Conversation

mrcnc
Contributor

@mrcnc mrcnc commented Oct 25, 2024

After reviewing the concerns raised in #11344 about using java.net.URI for parsing in ADLSLocation, I contrived an example of a location that does not parse correctly. It also fails in the current implementation, so this PR adds a test and a fix for the parsing code. Additionally, it removes test cases that are invalid, since they don't test valid ABFS syntax.

Motivation

The main reason to avoid using java.net.URI is that it parses according to RFC 2396 but object storage providers do not strictly follow this specification. Specifically, in standard URI syntax, the question mark ? separates the path component from the query component. However, Azure Blob Storage allows question marks in blob/file names, making these names incompatible with the RFC 2396 URI specification.
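To make the mismatch concrete, here is a minimal sketch of how java.net.URI splits a location whose file name contains a question mark. The location string and the class and method names are made up for illustration and are not taken from the PR's test suite.

```java
import java.net.URI;

class UriQuestionMarkDemo {
    // Path component as java.net.URI reports it.
    static String pathSeenByUri(String location) {
        return URI.create(location).getPath();
    }

    // Query component as java.net.URI reports it (null if there is none).
    static String querySeenByUri(String location) {
        return URI.create(location).getQuery();
    }

    public static void main(String[] args) {
        // Hypothetical location: the file is literally named "file?name.txt".
        String location = "abfs://container@account.dfs.core.windows.net/dir/file?name.txt";

        // URI treats '?' as the start of the query, so the name is cut in two.
        System.out.println(pathSeenByUri(location));  // prints "/dir/file"
        System.out.println(querySeenByUri(location)); // prints "name.txt"
    }
}
```

The file name ends up divided between the path and the query, so any code that relies only on `getPath()` would address the wrong file.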

Another important point is that Azure Storage APIs are accessed via HTTP APIs, so the abfs and wasb location syntax serve as identifiers to blobs accessed through HTTP URLs. This is the motivation behind removing the tests that included query and fragment components, since they would only be used in the HTTP URLs and not in the ABFS URI-like syntax.
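As a rough illustration of treating the location as an identifier rather than a URI, the sketch below splits an ABFS-style string by hand and keeps everything after the authority as the path, including any `?` or `#`. The class `AbfsLocationSketch` and its field names are hypothetical; this is not the ADLSLocation code in this PR, and it assumes the `scheme://[container@]account/path` shape without validating the scheme or characters.

```java
class AbfsLocationSketch {
    final String container;      // null when the authority has no "container@" prefix
    final String storageAccount; // host part of the authority
    final String path;           // everything after the first '/', kept verbatim

    AbfsLocationSketch(String location) {
        int schemeEnd = location.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("Missing scheme: " + location);
        }
        String rest = location.substring(schemeEnd + 3);

        // The first '/' ends the authority; nothing after it is interpreted.
        int slash = rest.indexOf('/');
        String authority = slash < 0 ? rest : rest.substring(0, slash);
        this.path = slash < 0 ? "" : rest.substring(slash + 1);

        // An optional "container@" prefix precedes the storage account host.
        int at = authority.indexOf('@');
        this.container = at < 0 ? null : authority.substring(0, at);
        this.storageAccount = at < 0 ? authority : authority.substring(at + 1);
    }
}
```

With this approach a location such as `abfs://container@account.dfs.core.windows.net/dir/file?name.txt` keeps `dir/file?name.txt` as its path, because no character after the authority is given special meaning.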

@github-actions github-actions bot added the AZURE label Oct 25, 2024
@mrcnc mrcnc marked this pull request as ready for review October 25, 2024 14:46
@RussellSpitzer
Member

LGTM. @danielcweeks This adds in that test I was looking for where URI would fail, although looks like we have a bug in the current implementation anyway.

@danielcweeks
Contributor

Thanks @mrcnc, though overall it's really unfortunate that we have notably different behavior between S3 and ADLS in the URI handling. S3 allows for query params (though they're not considered part of the key), while ADLS appears to have non-standard handling.

The one thing I'm not clear about is that the linked documentation doesn't actually go into what the valid path characters are. Is that documented somewhere that we can reference? It would be great to include that in the javadoc for future reference.

3 participants