Empty files with trailing slash are sometimes treated as directories and sometimes treated as regular files #439

Open
isidentical opened this issue Mar 15, 2021 · 2 comments

@isidentical
Member

from pprint import pprint

import boto3
from s3fs import S3FileSystem

# S3 endpoint listening on localhost.
TEST_AWS_S3_PORT = 5555
TEST_AWS_ENDPOINT_URL = f'http://127.0.0.1:{TEST_AWS_S3_PORT}/'

boto_client = boto3.client('s3', endpoint_url=TEST_AWS_ENDPOINT_URL)
fs = S3FileSystem(client_kwargs={'endpoint_url': TEST_AWS_ENDPOINT_URL})

# Create a zero-length object whose key ends with a trailing slash.
boto_client.create_bucket(Bucket='test-bucket')
boto_client.put_object(
    Bucket='test-bucket', Key='empty-dir/', Body='',
)

pprint(fs.ls('test-bucket', detail=True))   # list the parent
pprint(fs.info('test-bucket/empty-dir/'))   # inspect the object itself
print(fs.isdir('test-bucket/empty-dir/'))
print(fs.ls('test-bucket/empty-dir/'))      # list the object as if it were a directory

The code above first creates an empty file whose name ends with a trailing slash. It then runs s3fs's ls on the parent directory, which identifies that file as a directory:

[{'Key': 'test-bucket/empty-dir',
  'Size': 0,
  'StorageClass': 'DIRECTORY',
  'name': 'test-bucket/empty-dir',
  'size': 0,
  'type': 'directory'}]

The second and third calls (info() and isdir()) also claim it is a directory:

{'Key': 'test-bucket/empty-dir',
 'Size': 0,
 'StorageClass': 'DIRECTORY',
 'name': 'test-bucket/empty-dir',
 'size': 0,
 'type': 'directory'}
True

However, operations such as ls and walk treat it like a file. This is the result of fs.ls('test-bucket/empty-dir/'):

['test-bucket/empty-dir/']

I would instead have expected it to return an empty list.
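Until the behaviour is settled, one possible caller-side workaround is to drop the entry that refers to the listed path itself. The sketch below is only an illustration: `drop_self_entry` is a hypothetical helper, not part of s3fs, and it assumes the path being listed is meant to be a directory (applied to a regular file it would hide that file).

```python
def drop_self_entry(names, path):
    """Remove the entry that refers to `path` itself from an ls() result.

    `names` is a list of plain path strings, as returned by ls(..., detail=False).
    Trailing slashes are ignored when comparing.
    """
    target = path.rstrip("/")
    return [name for name in names if name.rstrip("/") != target]


# The listing reported above for 'test-bucket/empty-dir/'.
listing = ["test-bucket/empty-dir/"]
print(drop_self_entry(listing, "test-bucket/empty-dir/"))  # prints []
```

With a real filesystem object this would be called as `drop_self_entry(fs.ls(path, detail=False), path)`.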

@mvashishtha
Contributor

mvashishtha commented Oct 18, 2022

@martindurant today I was bitten by a similar issue in s3fs.core.S3FileSystem.isfile. I had an s3 bucket like the (currently existing) bucket modin-datasets and it had an empty file testing/ in it, i.e. an object at s3://modin-datasets/testing/. There were also objects like modin-datasets/testing/test_data.parquet.

When I list the contents of 'modin-datasets/testing/', I see my object at 'modin-datasets/testing/':

from fsspec.core import url_to_fs

fs, path = url_to_fs("s3://modin-datasets/testing/")
# this prints a list including 'modin-datasets/testing/', 'modin-datasets/testing/test_data.parquet', ...
fs.ls('modin-datasets/testing/')

but my filesystem doesn't recognize modin-datasets/testing/ as a file!

assert not fs.isfile('modin-datasets/testing/')

The consequence was that I spent a long time trying to debug why s3fs was trying to treat my directory as a file, until I finally realized it was just trying to open a file it correctly found, but then could no longer recognize as a file! Indeed, fs.open('modin-datasets/testing/').read() gives me valid contents, b''.

Is this a bug in s3fs? Is it a separate issue? How does it relate to #562?

@martindurant
Member

Several ideas conflict in this kind of situation, where a file and a directory have exactly the same name, including the trailing "/". This could not, of course, happen on a POSIX filesystem.

The ls method is designed to return a list of entries, so the same name can appear twice with different details. However, info fetches only one of these, and isfile/isdir rely on info.
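To see which names a detailed listing reports under more than one type, the entries can be grouped by name. This is a rough sketch: `types_by_name` and `ambiguous_names` are hypothetical helpers, and the sample records are hand-written in the shape of the entries shown earlier in this thread, not captured s3fs output (the 1024-byte size in particular is made up).

```python
from collections import defaultdict


def types_by_name(listing):
    """Map each name (trailing slash ignored) to the set of types reported for it."""
    seen = defaultdict(set)
    for entry in listing:
        seen[entry["name"].rstrip("/")].add(entry["type"])
    return dict(seen)


def ambiguous_names(listing):
    """Return the names that appear as both a file and a directory."""
    return sorted(
        name for name, types in types_by_name(listing).items() if len(types) > 1
    )


# Hand-written sample records (an assumption, not real output).
sample = [
    {"name": "modin-datasets/testing", "type": "directory", "size": 0},
    {"name": "modin-datasets/testing/", "type": "file", "size": 0},
    {"name": "modin-datasets/testing/test_data.parquet", "type": "file", "size": 1024},
]
print(ambiguous_names(sample))  # prints ['modin-datasets/testing']
```

A caller that only asks for a single info-style record for such a name will see just one of the two types, which is the mismatch described above.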

  • there is an assumption that a path must be one or the other, not both; that is wrong in this case
  • ls is often much slower; in this case, the directory cache should be doing its thing to make this not a problem.
  • s3fs's info is quite complex: it contains logic to fall back to HEAD when the user doesn't have list permission, and it may return versions. It finds files first and only checks for a directory if that fails.
  • the convention that a 0-length file with a trailing "/" marks a directory is used by the AWS console and accepted by the CLI tool, but it is not part of the official API. Should we accept this convention everywhere?
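The ambiguity in the points above can be modelled on a flat key space. The following is a simplified sketch, not s3fs's actual logic: `classify` and its `honor_marker` flag are hypothetical, bucket contents are represented as a plain dict of key to size, and the 1024-byte size is made up for illustration.

```python
def classify(objects, path, honor_marker=False):
    """Return (is_file, is_dir) for `path` within one bucket.

    `objects` maps object keys to sizes in bytes. With honor_marker=True,
    a zero-length key ending in "/" counts as a directory, not a file.
    """
    prefix = path if path.endswith("/") else path + "/"
    has_children = any(k.startswith(prefix) and k != prefix for k in objects)
    is_marker = objects.get(prefix) == 0

    is_file = path in objects
    is_dir = has_children
    if honor_marker and is_marker:
        is_dir = True
        # The marker object itself no longer counts as a file.
        is_file = is_file and path != prefix
    return is_file, is_dir


bucket = {
    "empty-dir/": 0,
    "testing/": 0,
    "testing/test_data.parquet": 1024,  # hypothetical size
}
print(classify(bucket, "testing/"))                       # prints (True, True)
print(classify(bucket, "testing/", honor_marker=True))    # prints (False, True)
print(classify(bucket, "empty-dir/"))                     # prints (True, False)
print(classify(bucket, "empty-dir/", honor_marker=True))  # prints (False, True)
```

Without the marker convention, "testing/" is both a file and a directory, which breaks the one-or-the-other assumption; accepting the convention everywhere removes that ambiguity at the cost of no longer treating the zero-length object as a readable file.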

3 participants