Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header #18390
Comments
Is there an update on this? It looks like it has been an issue for years, and while there is a workaround, it's not very satisfying and we don't want to set the content-encoding to the wrong value on GCS.
Bringing over some context from https://cloud.google.com/storage/docs/transcoding, it seems like there are the following consistent situations:
I believe 2 is the only one that works today. I am not sure if 1 is possible. I do think that 3 should be able to work, but needs some implementation.
Guys, this is a major issue.
This is still an issue with 2.43.0. Does anyone have a workaround that does not require changing metadata in GCS, and isn't "use the Java SDK"?
The way to fix this is to just use the Python GCS library instead of the GCS client in Beam, assuming you can and it's not some internal usage by Beam. Also, unlike the Beam implementation, the official GCS client is thread safe; it looks like it has been moved off httplib2.
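A minimal sketch of that kind of workaround, assuming the google-cloud-storage package is available; the bucket and object names below are placeholders, not anything from this issue:

```python
import gzip

from google.cloud import storage


def read_gzipped_text(bucket_name, blob_name):
    # raw_download=True requests the stored (compressed) bytes even when the
    # object carries Content-Encoding: gzip, so GCS does not transcode it.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    compressed = blob.download_as_bytes(raw_download=True)
    return gzip.decompress(compressed).decode("utf-8")


# Placeholder names for illustration only.
lines = read_gzipped_text("my-bucket", "data/file.txt.gz").splitlines()
```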
Thanks for the updates. Seems like the thing that would make this "just work", at some cost on the Dataflow side but saving bandwidth, would be option 3. This should be a fairly easy thing for someone to do as a first issue without knowing Beam too much.
In this case, the only thing immediately known about the object is that it is gzip-compressed, with no information regarding the underlying object type. Moreover, the object is not eligible for decompressive transcoding. Beam's
Ways to compress the file
For more details on which combination works, please see the table below:
Hi @kennknowles @sqlboy, the option that works correctly so far is as below.
The only caveat here is that the user will not be able to benefit from transcoding: when they attempt to download from the bucket, they will get a .gz file. While we explore this caveat with the client, we wanted to check whether Option 1 mentioned in the comment (#18390 (comment)) can be fixed. That option would give the best of both worlds: Dataflow would be able to read a compressed file, and the user could still take advantage of transcoding. Please let me know if there is an alternative suggestion.
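For reference, a hedged sketch of how a pipeline can read a .gz object that carries no Content-Encoding: gzip metadata, letting Beam do the decompression itself (the bucket and path are placeholders):

```python
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    lines = (
        p
        # The object is stored as a plain .gz file with no Content-Encoding
        # metadata, so Beam decompresses it and GCS transcoding never kicks in.
        | beam.io.ReadFromText(
            "gs://my-bucket/data/file.txt.gz",
            compression_type=CompressionTypes.GZIP,
        )
    )
```

The trade-off is exactly the caveat above: a direct download from the bucket also returns the raw .gz bytes.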
.take-issue
@BjornPrime is working on fixing #25676, which might fix this issue as well.
Having encountered this while migrating the GCS client, I do not believe the migration will resolve this issue on its own. It seems to be related to how GCSFileSystem handles compressed files.
I haven't thought about this in a while, but is there a problem with always passing
I am encountering a similar issue when uploading my SQL files from GitHub via CI. Not sure if this issue has been fixed yet. I tried having the parameter: headers: |-
Same as #31040.
We have gzipped text files in Google Cloud Storage that have metadata headers set, including Content-Encoding: gzip.
Trying to read these with apache_beam.io.ReadFromText fails with an error.
After removing the Content-Encoding header, the read works fine.
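A minimal reproduction sketch under the same assumptions (placeholder bucket and object names; google-cloud-storage is used here only to set the metadata):

```python
import apache_beam as beam
from google.cloud import storage

# Upload a gzip-compressed file and mark it with Content-Encoding: gzip,
# which makes GCS transcode (decompress) it on normal downloads.
blob = storage.Client().bucket("my-bucket").blob("data/file.txt.gz")
blob.content_encoding = "gzip"
blob.upload_from_filename("local-file.txt.gz", content_type="text/plain")

# Reading the same object with TextIO then fails; removing the
# Content-Encoding header makes the read work again.
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText("gs://my-bucket/data/file.txt.gz")
```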
Imported from Jira BEAM-1874. Original Jira may contain additional context.
Reported by: smphhh.