
gcsfs doesn't properly handle gzipped files, ignoring content-encoding #461

Open
jimmywan opened this issue Mar 25, 2022 · 4 comments · May be fixed by #635

Comments

@jimmywan
I have a file "foo.txt.gz" that has been uploaded with the following metadata:

Content-Type: text/plain
Content-Encoding: gzip

I'm trying to copy its contents to a new, uncompressed file in cloud storage to work around a bug where my tooling (gcloud) can't properly handle gzip input.

If I pass the compression flag on read, it complains that the file is not a gzip file, which implies that transcoding (server-side decompression) is occurring:

>>> import gcsfs, shutil
>>> fs = gcsfs.GCSFileSystem()
>>> with fs.open('gcs://jw-sandbox/uploads.txt.gz', 'rb', compression='gzip') as read_file:
...     with fs.open('gcs://jw-sandbox/uploads.txt', 'wb') as write_file:
...         shutil.copyfileobj(read_file, write_file)

gzip.BadGzipFile: Not a gzipped file (b'gs')

If I read the file without the compression flag and just dump its contents to stdout, I only get the first N bytes of the decompressed contents, where N is the compressed size:

>>> with fs.open('gcs://jw-sandbox/uploads.txt.gz', 'rb') as read_file:
...     for line in read_file:
...         print(line)
>>> print(gcsfs.__version__)
2022.02.0
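
For anyone reproducing this: an object with this metadata can be created with the plain google-cloud-storage client, roughly like this (illustrative sketch, not my exact upload code; names are placeholders):

import gzip

from google.cloud import storage

client = storage.Client()
blob = client.bucket("jw-sandbox").blob("uploads.txt.gz")

# Store gzip-compressed bytes and mark them with Content-Encoding: gzip so that
# GCS decompresses ("transcodes") them on download.
blob.content_encoding = "gzip"
blob.upload_from_string(gzip.compress(b"line one\nline two\n"), content_type="text/plain")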
@mhfrantz
Is this a duplicate of #233?

@martindurant
Member

Right, the linked issue suggests that the best thing to do is not to set the content-encoding, which describes the transfer rather than the state of the remote file. The transfer is compressed anyway, which for a gzip file makes little difference either way. I don't believe this is how GCS should handle this, but there is nothing we can do about that.
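
For example, with the plain google-cloud-storage client you could clear the flag on an existing object like this (sketch; the stored bytes stay gzip-compressed, so you would then decompress them client-side, e.g. with compression='gzip'):

from google.cloud import storage

client = storage.Client()
blob = client.bucket("jw-sandbox").blob("uploads.txt.gz")

# Clear the Content-Encoding metadata so GCS stops transcoding on download.
# The stored object content is untouched and is still gzip data.
blob.content_encoding = None
blob.patch()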

@martindurant
Member

I wonder, what happens if you .cat() your file with start=10, end=20?
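
i.e. something like this (sketch, using the path from the report):

import gcsfs

fs = gcsfs.GCSFileSystem()
fs.cat("gcs://jw-sandbox/uploads.txt.gz", start=10, end=20)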

the-xentropy added a commit to the-xentropy/gcsfs that referenced this issue Aug 8, 2024
This fixes fsspec#461 and fsspec#233 without requiring users to change their existing code.

The bugs were caused by fsspec's assumption that the 'size' reported by AbstractBufferedFile subclasses like GCSFile and AbstractFileSystem subclasses like GCSFileSystem is accurate and unambiguous (e.g. `self.size = self.details["size"]` in `AbstractBufferedFile`, which all base caches use to truncate requests and responses). When compression at rest / compressive transcoding is used in GCS, there is no way to retrieve the real size of the object's *content* without decompressing the whole thing either server- or client-side, so fixing these issues required overriding some behaviors in the underlying base classes. Care was taken to preserve behavior for storage objects that do not use compression at rest, however.

This commit:
1) adds a read() implementation in GCSFile which allows calls to succeed even when size isn't well-defined.
2) adds a TranscodingReadAheadCache, which is mostly identical to the readahead cache GCSFile already uses, but allows end = None to mean "read until the end of the file" while still handling cached data prefixes.
3) changes GCSFileSystem._info() to set size = None if contentEncoding is gzip.
4) changes _cat_file() to fetch information on the object being cat'ed and, if it uses compressive transcoding, gives the resulting GCSFile the GCSTranscodingReadAhead cache instead of the incompatible ReadAhead cache. We could probably use the new cache for everything, since it should behave identically for files with a well-defined size, but this lowers the risk of having missed an edge case.

The fix keeps data handling for GCS files that do not use compression at rest / compressive transcoding identical, while adding control flow to detect when transcoding is in effect and logic to handle those edge cases. This unfortunately meant implementing variant methods with only minor changes to how they perform underlying operations (e.g. read() in GCSFile), which were previously just inherited from AbstractBufferedFile.

It does introduce one new semantic to gcsfs. [In line with fsspec's ArchiveFileSystem](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.archive.AbstractArchiveFileSystem.info) semantics, gcsfs will return size = None when the size of the file cannot be determined in advance. This lets us distinguish a known zero size from an unknown size, which was a major issue.

The only new performance overhead for users who do not use compressive transcoding is a single info() call, resulting in one extra metadata (HEAD-style) request before we create the GCSFile object in GCSFileSystem: we need to swap in a cache that tolerates the lack of a concrete file size, but do not yet have the information to make that decision. This means we make two requests instead of one. We could probably switch to the new transcoding cache wholesale in the future, once we trust that it holds up, and eliminate this call; for now I kept the data and control flow unchanged for the base case where users are not using compressive transcoding. Since the compression-at-rest case was completely broken, even if these changes turn out badly (they shouldn't), they can only break something that was already broken.
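
From the caller's side, the new size semantics would look roughly like this (sketch against this branch; bucket/object names are made up):

import gcsfs

fs = gcsfs.GCSFileSystem()
info = fs.info("my-bucket/data.txt.gz")        # hypothetical object stored with Content-Encoding: gzip

if info["size"] is None:
    # Decompressed length is unknown up front; read to the end rather than to a byte count.
    data = fs.cat_file("my-bucket/data.txt.gz")
else:
    data = fs.cat_file("my-bucket/data.txt.gz", start=0, end=info["size"])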
@the-xentropy commented Aug 8, 2024

Hey friends, I created a pull request which fixes this issue, as well as the related one (#233), without requiring any code changes once it lands. The bug was caused by us inheriting from AbstractBufferedFile and reporting the metadata 'size' as the object size. AbstractBufferedFile assumes the size we return is definitely correct, and that size ends up being used in a few different places (like all fsspec caches) to fetch at most up to that byte range, which meant we truncated responses.

Basically, when GCSFileSystem opens a GCSFile it grabs the metadata size with an info() call. That size is propagated into the GCSFile's underlying readahead cache to limit fetch requests to the reported maximum size of the object (even the fall-back fsspec base cache, which is supposed to do nothing but pass requests through, has this issue). That maximum size, which is the compressed size, ultimately ends up in _cat_file as Range: bytes=0-<compressed size>, so we only ever fetch the compressed size's worth of the transparently decompressed data.
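
To make that concrete (sketch; hypothetical object stored with Content-Encoding: gzip, run against current gcsfs):

import gcsfs

fs = gcsfs.GCSFileSystem()
info = fs.info("my-bucket/logs.txt.gz")        # "size" here is the *compressed* byte count

with fs.open("my-bucket/logs.txt.gz", "rb") as f:
    data = f.read()                            # GCS sends decompressed bytes, but the cache
                                               # caps the request at the compressed size...

print(len(data), info["size"])                 # ...so these come out (roughly) equal, instead of
                                               # len(data) being the full decompressed length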

If you want to live on the edge in the meantime and are OK with a functional but hacky solution, you can jury-rig GCSFile's _fetch_range to pass end=None to cat_file() whenever end == self.size. This will break if you actually want exactly N decompressed bytes from a file that happens to be N bytes in compressed size, but in practice that is rare: almost all consumption of objects is relative to the reported object size rather than a hypothetical decompressed content size, and since most open() and read() calls grab the whole file, this keeps everything from pandas to fsspec-consuming filesystems chugging along happily.

def _fetch_range(self, start=None, end=None):
    """Get data from GCS

    start, end : None or integers
        if not both None, fetch only given range
    """
    try:
        # look away, horrible hack
        if self.size == end:
            end = None
        return self.gcsfs.cat_file(self.path, start=start, end=end)
    except RuntimeError as e:
        if "not satisfiable" in str(e):
            return b""
        raise
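
If you do go this route, the override can be applied as a monkey-patch at import time, e.g. (sketch; GCSFile lives in gcsfs.core):

import gcsfs.core

def _patched_fetch_range(self, start=None, end=None):
    # Same hack as above: "read up to the reported size" becomes "read to the end".
    if end is not None and end == self.size:
        end = None
    try:
        return self.gcsfs.cat_file(self.path, start=start, end=end)
    except RuntimeError as e:
        if "not satisfiable" in str(e):
            return b""
        raise

gcsfs.core.GCSFile._fetch_range = _patched_fetch_range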
