Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more s3 fs schemes (s3a://) #269

Merged
merged 3 commits into from
Nov 28, 2019

Conversation

fhoering
Copy link
Contributor

@fhoering fhoering commented Nov 26, 2019

s3a:// is the official schema used by Hadoop (s3 and s3n are deprecated in hadoop-aws jar)
https://cwiki.apache.org/confluence/display/HADOOP2/AmazonS3

Motivation: Be able to keep a consistent fs scheme across different PySpark, s3fs/PyArrow calls

s3a:// is the official schema used by Hadoop
https://cwiki.apache.org/confluence/display/HADOOP2/AmazonS3

Be able to keep a consistent fs scheme across different PySpark, s3fs/PyArrow calls
@martindurant
Copy link
Member

I am not opposed to this idea, but it should be pointed out that the protocols have specific meanings for the java runtime, so I'm not sure that they should be treated as strictly identical

@fhoering
Copy link
Contributor Author

fhoering commented Nov 26, 2019

To be honest I'm not a big s3 expert but there seems to be some inconsistency somewhere.

Seems like Amazon always uses s3:// now
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

And Hadoop always uses s3a://. Not sure why

Are those different meanings still true in 2019 ?
I have the impression that those meanings are mostly obsolete now. Both (hdfs & s3) seem to behave the same way - block storage as a backend and access via a filesystem api. And S3 consistency issue is being fixed too (https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3guard.md)

Obviously I could convert those paths on my side all the time but it would be nice to be able to use Spark and Pyarrow with hdfs/s3fs filesystem in a consistent way (which could also mean s3:// all the time but implies fixing hadoop s3 connectors)

s3fs/core.py Outdated
@@ -31,6 +31,14 @@

_VALID_FILE_MODES = {'r', 'w', 'a', 'rb', 'wb', 'ab'}

S3_SCHEMES = ['s3://', 's3a://', 's3n://']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can instead set the S3FileSystem protocol attribute and use the ._strip_protocol() class method. This nicely removes duplicate code which you have found in this PR.

(@TomAugspurger , did the change to a .protocols property go in?)

Copy link
Contributor Author

@fhoering fhoering Nov 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
I Get the idea. Change .protocol to a list. Indeed it would be better. It would also permit to support viewfs for example which is currently missing.
Currently this is not merged in https://github.com/intake/filesystem_spec and there is no PR.
Are you currently working on that ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question about the code factorization.

What about split_path function ? The best would be to move this inside S3FileSystem and also use ._strip_protocol but this would break backward compatibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought @TomAugspurger has written this code somewhere already; I am not working on it.

split_path is a little different, since it cares about bucket / key, but principled no opinion on whether it is a function or method of S3FileSystem. It would make sense for it to make use of ._strip_protocol, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.
Turns out setting a list is actually working on .protocol (I was looking for the .protocols property before). I pushed the review again. Also removed s3n as I'm not using this and it is deprecated.

@martindurant
Copy link
Member

lgtm

@martindurant martindurant merged commit 87e5149 into fsspec:master Nov 28, 2019
@fhoering fhoering changed the title Add more s3 fs schemes (s3a://, s3n://) Add more s3 fs schemes (s3a://) Nov 28, 2019
@fhoering
Copy link
Contributor Author

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants