Add more s3 fs schemes (s3a://) #269
Conversation
s3a:// is the official scheme used by Hadoop: https://cwiki.apache.org/confluence/display/HADOOP2/AmazonS3. Motivation: be able to keep a consistent fs scheme across different PySpark and s3fs/PyArrow calls.
I am not opposed to this idea, but it should be pointed out that the protocols have specific meanings for the Java runtime, so I'm not sure they should be treated as strictly identical.
To be honest I'm not a big S3 expert, but there seems to be some inconsistency somewhere. It seems like Amazon now always uses s3://, and Hadoop always uses s3a://; I'm not sure why. Are those different meanings still true in 2019? Obviously I could convert those paths on my side all the time, but it would be nice to be able to use Spark and PyArrow with the hdfs/s3fs filesystems in a consistent way (which could also mean s3:// everywhere, but that implies fixing the Hadoop S3 connectors).
s3fs/core.py (Outdated)
@@ -31,6 +31,14 @@
_VALID_FILE_MODES = {'r', 'w', 'a', 'rb', 'wb', 'ab'}
S3_SCHEMES = ['s3://', 's3a://', 's3n://']
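The diff above introduces a list of accepted scheme prefixes. A minimal sketch of how such a list might be used to normalize paths (the helper name `strip_s3_scheme` is hypothetical, not part of the PR):

```python
# The scheme list from the PR diff; the stripping helper below is a
# hypothetical illustration, not code from s3fs itself.
S3_SCHEMES = ['s3://', 's3a://', 's3n://']

def strip_s3_scheme(path):
    """Remove a recognized S3 scheme prefix from a path, if present."""
    for scheme in S3_SCHEMES:
        if path.startswith(scheme):
            return path[len(scheme):]
    return path
```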
You can instead set the S3FileSystem protocol attribute and use the ._strip_protocol() class method. This nicely removes the duplicate code you have found in this PR. (@TomAugspurger, did the change to a .protocols property go in?)
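The mechanism being suggested can be sketched with a simplified stand-in for the fsspec base class (this mock is an assumption for illustration; the real fsspec implementation handles more cases, such as bare paths and chained protocols):

```python
class AbstractFileSystem:
    """Simplified mock of the fsspec base class, for illustration only."""
    protocol = "abstract"

    @classmethod
    def _strip_protocol(cls, path):
        # Accept either a single protocol string or a list/tuple of them.
        protos = (cls.protocol,) if isinstance(cls.protocol, str) else cls.protocol
        for proto in protos:
            if path.startswith(proto + "://"):
                return path[len(proto) + 3:]
        return path


class S3FileSystem(AbstractFileSystem):
    # Listing several schemes lets one class accept both s3:// and s3a:// paths,
    # which is the idea discussed in this review thread.
    protocol = ("s3", "s3a")
```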
OK, I get the idea: change .protocol to a list. Indeed that would be better, and it would also make it possible to support viewfs, for example, which is currently missing.
Currently this is not merged in https://github.com/intake/filesystem_spec and there is no PR. Are you currently working on that?
One question about the code factorization: what about the split_path function? The best option would be to move it inside S3FileSystem and also use ._strip_protocol, but that would break backward compatibility.
I thought @TomAugspurger had written this code somewhere already; I am not working on it.
split_path is a little different, since it cares about bucket / key, but in principle I have no opinion on whether it is a function or a method of S3FileSystem. It would make sense for it to use ._strip_protocol, though.
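The bucket/key split being discussed can be sketched as follows (a standalone illustration under stated assumptions, not s3fs's exact implementation; the inlined scheme-stripping loop stands in for what ._strip_protocol would do):

```python
def split_path(path):
    """Split an S3 path into (bucket, key). Illustrative sketch only."""
    # Strip any recognized scheme first, mirroring ._strip_protocol.
    for scheme in ('s3://', 's3a://'):
        if path.startswith(scheme):
            path = path[len(scheme):]
            break
    path = path.lstrip('/')
    if '/' not in path:
        # A bare bucket name has an empty key.
        return path, ''
    bucket, key = path.split('/', 1)
    return bucket, key
```

With a tuple-valued protocol, both s3:// and s3a:// paths would resolve to the same (bucket, key) pair, which is the consistency goal of this PR.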
Thanks for the feedback.
It turns out that setting a list actually works on .protocol (I was looking for a .protocols property before). I pushed the revision again. I also removed s3n, as I'm not using it and it is deprecated.
lgtm
Thanks
s3a:// is the official scheme used by Hadoop (s3 and s3n are deprecated in the hadoop-aws jar):
https://cwiki.apache.org/confluence/display/HADOOP2/AmazonS3
Motivation: be able to keep a consistent fs scheme across different PySpark and s3fs/PyArrow calls.