Add more s3 fs schemes (s3a://) #269
Conversation
s3a:// is the official scheme used by Hadoop: https://cwiki.apache.org/confluence/display/HADOOP2/AmazonS3. Motivation: be able to keep a consistent fs scheme across different PySpark and s3fs/PyArrow calls.
I am not opposed to this idea, but it should be pointed out that the protocols have specific meanings for the Java runtime, so I'm not sure they should be treated as strictly identical.
To be honest I'm not a big S3 expert, but there seems to be some inconsistency somewhere. It seems like Amazon now always uses s3://, and Hadoop always uses s3a://; I'm not sure why. Are those different meanings still true in 2019? Obviously I could convert those paths on my side all the time, but it would be nice to be able to use Spark and PyArrow with the hdfs/s3fs filesystems in a consistent way (which could also mean s3:// everywhere, but that implies fixing the Hadoop S3 connectors).
s3fs/core.py (Outdated)
@@ -31,6 +31,14 @@
_VALID_FILE_MODES = {'r', 'w', 'a', 'rb', 'wb', 'ab'}
S3_SCHEMES = ['s3://', 's3a://', 's3n://']
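The diff above introduces a list of accepted scheme prefixes. A minimal sketch of how such a list might be used to normalize paths (the helper name `strip_s3_scheme` is hypothetical, not part of the PR):

```python
# The scheme list from the PR diff; the stripping helper below is a
# hypothetical illustration, not code from s3fs itself.
S3_SCHEMES = ['s3://', 's3a://', 's3n://']

def strip_s3_scheme(path):
    """Remove a recognized S3 scheme prefix from a path, if present."""
    for scheme in S3_SCHEMES:
        if path.startswith(scheme):
            return path[len(scheme):]
    return path
```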
You can instead set the S3FileSystem protocol attribute and use the ._strip_protocol() class method. This nicely removes the duplicate code you have found in this PR. (@TomAugspurger, did the change to a .protocols property go in?)
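The mechanism being suggested can be sketched with a simplified stand-in for the fsspec base class (this mock is an assumption for illustration; the real fsspec implementation handles more cases, such as bare paths and chained protocols):

```python
class AbstractFileSystem:
    """Simplified mock of the fsspec base class, for illustration only."""
    protocol = "abstract"

    @classmethod
    def _strip_protocol(cls, path):
        # Accept either a single protocol string or a list/tuple of them.
        protos = (cls.protocol,) if isinstance(cls.protocol, str) else cls.protocol
        for proto in protos:
            if path.startswith(proto + "://"):
                return path[len(proto) + 3:]
        return path


class S3FileSystem(AbstractFileSystem):
    # Listing several schemes lets one class accept both s3:// and s3a:// paths,
    # which is the idea discussed in this review thread.
    protocol = ("s3", "s3a")
```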
OK, I get the idea: change .protocol to a list. Indeed that would be better, and it would also make it possible to support viewfs, for example, which is currently missing.
Currently this is not merged in https://github.com/intake/filesystem_spec and there is no PR. Are you currently working on that?
One question about the code factorization: what about the split_path function? The best option would be to move it inside S3FileSystem and also use ._strip_protocol, but that would break backward compatibility.
I thought @TomAugspurger had written this code somewhere already; I am not working on it.
split_path is a little different, since it cares about bucket / key, but in principle I have no opinion on whether it is a function or a method of S3FileSystem. It would make sense for it to use ._strip_protocol, though.
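The bucket/key split being discussed can be sketched as follows (a standalone illustration under stated assumptions, not s3fs's exact implementation; the inlined scheme-stripping loop stands in for what ._strip_protocol would do):

```python
def split_path(path):
    """Split an S3 path into (bucket, key). Illustrative sketch only."""
    # Strip any recognized scheme first, mirroring ._strip_protocol.
    for scheme in ('s3://', 's3a://'):
        if path.startswith(scheme):
            path = path[len(scheme):]
            break
    path = path.lstrip('/')
    if '/' not in path:
        # A bare bucket name has an empty key.
        return path, ''
    bucket, key = path.split('/', 1)
    return bucket, key
```

With a tuple-valued protocol, both s3:// and s3a:// paths would resolve to the same (bucket, key) pair, which is the consistency goal of this PR.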
Thanks for the feedback.
It turns out that setting a list actually works on .protocol (I was looking for a .protocols property before). I pushed the revision again. I also removed s3n, as I'm not using it and it is deprecated.
lgtm
Thanks
s3a:// is the official scheme used by Hadoop (s3 and s3n are deprecated in the hadoop-aws jar):
https://cwiki.apache.org/confluence/display/HADOOP2/AmazonS3
Motivation: be able to keep a consistent fs scheme across different PySpark and s3fs/PyArrow calls.