
Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header #18390

Open
kennknowles opened this issue Jun 3, 2022 · 14 comments

Comments

@kennknowles
Member

We have gzipped text files in Google Cloud Storage that have the following metadata headers set:


Content-Encoding: gzip
Content-Type: application/octet-stream

Trying to read these with apache_beam.io.ReadFromText yields the following error:


ERROR:root:Exception while fetching 341565 bytes from position 0 of gs://...-c72fa25a-5d8a-4801-a0b4-54b58c4723ce.gz: Cannot have start index greater than total size
Traceback (most recent call last):
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in _fetch_to_queue
    value = func(*args)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 610, in _get_segment
    downloader.GetRange(start, end)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 477, in GetRange
    progress, end_byte = self.__NormalizeStartEnd(start, end)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py", line 340, in __NormalizeStartEnd
    'Cannot have start index greater than total size')
TransferInvalidError: Cannot have start index greater than total size

WARNING:root:Task failed: Traceback (most recent call last):
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py", line 300, in __call__
    result = evaluator.finish_bundle()
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 206, in finish_bundle
    bundles = _read_values_to_bundles(reader)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 196, in _read_values_to_bundles
    read_result = [GlobalWindows.windowed_value(e) for e in reader]
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/concat_source.py", line 79, in read
    range_tracker.sub_range_tracker(source_ix)):
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 155, in read_records
    read_buffer)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 245, in _read_record
    sep_bounds = self._find_separator_bounds(file_to_read, read_buffer)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 190, in _find_separator_bounds
    file_to_read, read_buffer, current_pos + 1):
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py", line 212, in _try_to_ensure_num_bytes_in_buffer
    read_data = file_to_read.read(self._buffer_size)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 460, in read
    self._fetch_to_internal_buffer(num_bytes)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 420, in _fetch_to_internal_buffer
    buf = self._file.read(self._read_size)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 472, in read
    return self._read_inner(size=size, readline=False)
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 516, in _read_inner
    self._fetch_next_if_buffer_exhausted()
  File "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py", line 577, in _fetch_next_if_buffer_exhausted
    raise exn
TransferInvalidError: Cannot have start index greater than total size

After removing the Content-Encoding header the read works fine.
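For reference, the workaround above amounts to clearing the object's Content-Encoding metadata. A minimal sketch with the google-cloud-storage client (bucket and object names are placeholders):

from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").get_blob("path/to/file.gz")  # placeholder names

# Clear Content-Encoding so GCS serves the stored gzip bytes without
# decompressive transcoding; Beam can then decompress them itself.
blob.content_encoding = None
blob.patch()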

Imported from Jira BEAM-1874. Original Jira may contain additional context.
Reported by: smphhh.

@linamartensson

Is there an update on this? It looks like it has been an issue for years, and while there is a workaround, it's not very satisfying and we don't want to set the content-encoding to the wrong value on GCS.

@kennknowles
Member Author

Bringing over some context from https://cloud.google.com/storage/docs/transcoding, it seems like there are the following consistent situations:

  1. GCS transcodes and Beam works with this transparently.
    • Content-encoding: gzip
    • Content-type: X
    • Beam's IO reads it expecting contents to be X. I believe the problem is that GCS serves metadata that results in wrong splits.
  2. GCS does not transcode because the metadata is set to not transcode (current recommendation)
    • Content-encoding: <empty>
    • Content-type: gzip
    • Beam's IO reads and the user specifies gzip or it is autodetected by the IO
  3. GCS does not transcode because the Beam IO requests no transcoding
    • Content-encoding: gzip
    • Content-type: X
    • Beam's IO passes the header Accept-Encoding: gzip

I believe 2 is the only one that works today. I am not sure if 1 is possible. I do think that 3 should be able to work, but needs some implementation.
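To make option 3 concrete, this is roughly what skipping transcoding looks like outside Beam, assuming a recent google-cloud-storage client (bucket and object names are placeholders); Beam's GCS IO would need to do the equivalent internally:

import gzip

from google.cloud import storage

blob = storage.Client().bucket("my-bucket").get_blob("data.txt.gz")  # placeholders

# raw_download=True asks GCS for the object as stored (still compressed),
# bypassing decompressive transcoding even though Content-Encoding is gzip,
# so byte offsets refer to the stored size.
compressed = blob.download_as_bytes(raw_download=True)
text = gzip.decompress(compressed).decode("utf-8")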

@sqlboy

sqlboy commented Nov 26, 2022

Guys this is a major issue.

@daniels-cysiv

This is still an issue with 2.43.0. Does anyone have a workaround that does not require changing metadata in GCS, and isn't "use the Java SDK"?

@sqlboy

sqlboy commented Jan 10, 2023

The way to fix this is to use the official Python GCS library rather than the GCS client in Beam, assuming you can and it's not some internal usage by Beam. Also, unlike the Beam implementation, the official GCS client is thread safe; it looks like it has been moved off httplib2.
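One way to read this suggestion, sketched under the assumption that the file is small enough to fetch outside the pipeline (bucket and object names are placeholders):

import gzip

import apache_beam as beam
from google.cloud import storage

def read_gzipped_lines(bucket_name, blob_name):
    # Fetch the stored bytes with the official client, bypassing Beam's GCS IO,
    # then decompress locally.
    blob = storage.Client().bucket(bucket_name).get_blob(blob_name)
    raw = blob.download_as_bytes(raw_download=True)
    return gzip.decompress(raw).decode("utf-8").splitlines()

with beam.Pipeline() as p:
    lines = p | beam.Create(read_gzipped_lines("my-bucket", "data.txt.gz"))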

@kennknowles
Member Author

Thanks for the updates. Seems like the thing that would make this "just work", at some cost on the Dataflow side but saving bandwidth, would be option 3. This should be a fairly easy thing for someone to do as a first issue without knowing Beam too much.

@chavdaparas

  • you can upload the object to GCS with the Content-Type set to indicate compression and NO Content-Encoding at all, according to best practices.

Content-encoding: <not set>
Content-type: application/gzip

In this case the only thing immediately known about the object is that it is gzip-compressed, with no information regarding the underlying object type. Moreover, the object is not eligible for decompressive transcoding.
Reference: https://cloud.google.com/storage/docs/transcoding

Beam's ReadFromText with compression_type=CompressionTypes.GZIP works fine with the above option:

p | "Read GCS File" >> beam.io.ReadFromText(file_pattern=file_path,compression_type=CompressionTypes.GZIP, skip_header_lines=int(skip_header))

Ways to compress the file:

  1. Implicitly, by specifying gsutil cp -Z <filename> <bucket>
  2. Explicitly, by compressing the file first with gzip <filename> and then loading it to GCS

For more details on which combinations work, please see the table below:

[Screenshot: table of which metadata/compression combinations work, 2023-02-08]

@Murli16

Murli16 commented Feb 10, 2023

Hi @kennknowles @sqlboy ,

The option that works correctly so far is as below:

  1. Do an explicit compression of the file with gzip
  2. Upload the file to GCS with the correct content type (application/gzip):
gsutil -h "Content-Type:application/gzip" cp sample.csv.gz gs://gcp-sandbox-1-359004/scn4/
  3. Content-Encoding will not be set, which can be verified with:
gcloud storage objects describe gs://gcp-sandbox-1-359004/scn4/sample.csv.gz

bucket: gcp-sandbox-1-359004
contentType: application/gzip
crc32c: v1lNUQ==
etag: CLnDx+CIif0CEAE=
generation: '1675967308358073'
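For reference, the same upload can also be done with the google-cloud-storage client instead of gsutil (a pre-compressed file, Content-Type set to application/gzip, Content-Encoding left unset):

from google.cloud import storage

bucket = storage.Client().bucket("gcp-sandbox-1-359004")
blob = bucket.blob("scn4/sample.csv.gz")
# Upload the explicitly gzipped file with Content-Type application/gzip and
# no Content-Encoding, matching the gsutil command above.
blob.upload_from_filename("sample.csv.gz", content_type="application/gzip")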

The only caveat here is that the user will not get the benefit of transcoding: when they attempt to download from the bucket, they will get a .gz file.

While we explore this caveat with the client, we wanted to check if Option 1 mentioned in the comment (#18390 (comment)) can be fixed.

That option would give the best of both worlds: Dataflow would be able to read a compressed file and the user could still benefit from transcoding.

Please let me know if there is any alternate suggestion.

@BjornPrime
Contributor

.take-issue

@liferoad
Collaborator

@BjornPrime is working on fixing #25676, which might fix this issue as well.

@BjornPrime
Contributor

Having encountered this while migrating the GCS client, I do not believe the migration will resolve this issue on its own. It seems to be related to how GCSFileSystem handles compressed files.

@kennknowles
Member Author

I haven't thought about this in a while, but is there a problem with always passing Accept-Encoding: gzip?

@chaitanya1293

I am encountering a similar issue when uploading my SQL files from GitHub via CI; not sure if this issue has been fixed. I tried setting the parameter:

headers: |-
  content-type: application/octet-stream

but it didn't make any change to the error.

@liferoad
Collaborator

same as #31040
