[Bug]: ReadAllFiles does not fully read gzipped files from GCS #31040
Comments
Thanks for reporting. Agree this is a P1 bug as it causes data loss. |
Is it possible to provide a working example that reproduces the issue? That would help with triage. |
@shunping FYI |
@Abacn I don't have a working example; however, the steps to reproduce are:
EDIT: This issue will probably appear for any compression type. I just encountered it with gzip but did not test with other compression algorithms. |
I uploaded one test file here:
# standard libraries
import logging

# third party libraries
import apache_beam as beam
from apache_beam import Create, Map
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.combiners import Count

logger = logging.getLogger()
logger.setLevel(logging.INFO)

elements = [
    "gs://apache-beam-samples/gcs/bigfile.txt.gz",
]

options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | Create(elements)
        | "Read File from GCS" >> ReadAllFromText()
        | Count.Globally()
        | "Log" >> Map(lambda x: logging.info("Total lines %d", x))
    )
This shows:
|
So I double checked and there are differences between your example and our case.
Furthermore, after removing encoding type from our file and using |
As a quick patch, we use the following solution:
|
This is expected, as I mentioned earlier |
I see. We need to check decompressive transcoding for the GCS file to determine whether the content is compressed rather than relying on the file extension.
This only loads 75,601 lines. #19413 could be related, since it concerns uploading the file to GCS. |
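The metadata-based check suggested above could be sketched roughly as follows. This is only an illustration, not Beam's implementation: the helper name `is_decompressively_transcoded` is hypothetical, and it assumes the documented GCS behaviour that an object stored with `Content-Encoding: gzip` is served decompressed unless its `Cache-Control` contains `no-transform` (and assuming the client does not send `Accept-Encoding: gzip`).

```python
from typing import Optional


def is_decompressively_transcoded(
    content_encoding: Optional[str], cache_control: Optional[str]
) -> bool:
    """Guess whether GCS would serve this object decompressed.

    Hypothetical helper: the arguments are the object's Content-Encoding
    and Cache-Control metadata values (either may be missing).
    """
    # Transcoding only applies to objects stored with gzip content encoding.
    if (content_encoding or "").strip().lower() != "gzip":
        return False
    # A "no-transform" directive tells GCS to serve the stored bytes as-is.
    directives = [d.strip().lower() for d in (cache_control or "").split(",")]
    return "no-transform" not in directives
```

A reader could use a check like this to decide that the size in the object metadata is not a safe upper bound for the read range, instead of inferring compression from the `.gz` extension.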
.take-issue |
Have we reproduced this? |
Yes, see my above link: #31040 (comment) |
Is there a hope of a fix for 2.57.0 cherry pick? I would guess this is a longstanding issue so getting it fixed in a very thorough way for 2.58.0 is actually the best thing to do. I recall we had decompressive transcoding bugs in the past. So we should make sure we really get it right this time. And the user can mitigate by configuring GCS to not do the transcoding. |
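The mitigation mentioned here (configuring GCS not to transcode) might look roughly like the sketch below. It assumes the `google-cloud-storage` Python client, where a blob exposes a settable `cache_control` property and a `patch()` method; the helper name `disable_transcoding` and the bucket and object names in the usage comment are placeholders, not something taken from this thread.

```python
def disable_transcoding(blob):
    """Add "no-transform" to a blob's Cache-Control and persist it.

    `blob` is assumed to behave like google.cloud.storage.Blob: it has a
    settable `cache_control` attribute and a `patch()` method.
    """
    # Keep any directives that are already set on the object.
    existing = [
        d.strip() for d in (blob.cache_control or "").split(",") if d.strip()
    ]
    if "no-transform" not in (d.lower() for d in existing):
        existing.append("no-transform")
    blob.cache_control = ", ".join(existing)
    blob.patch()  # send the metadata change to GCS
    return blob.cache_control


# Hypothetical usage (names are placeholders):
# from google.cloud import storage
# blob = storage.Client().bucket("my-bucket").get_blob("path/file.txt.gz")
# disable_transcoding(blob)
```

With `no-transform` set, GCS serves the stored gzip bytes unchanged, so the size in the object metadata matches what the reader receives and decompression is left to the reader.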
Moved this to 2.58.0. Thanks! |
Has any progress been made on this? |
Not yet. We can move this to 2.59.0. |
Has any progress been made on this? |
Moved to 2.60.0 |
Based on this getting pushed from release to release, it is clearly not a true release-blocker. |
What happened?
Since the refactor of gcsio (2.52?), ReadAllFiles does not fully read gzipped files from GCS. Part of the file is returned correctly, but the rest goes missing.
I presume this happens because GCS performs decompressive transcoding while _ExpandIntoRanges uses the GCS object's metadata to determine the read range. This means that the content we receive is larger than the upper bound of the read range. For example, a gzip file on GCS might have a stored size of 1 MB, and this is the object size in the metadata, so the upper bound of the read range is 1 MB. However, when Beam opens the file, it has already been decompressed by GCS, so the content is 1.5 MB. We never read the last 0.5 MB, which causes data loss.
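The size mismatch described above can be reproduced locally without GCS or Beam. The following is a simulation only, under the assumption that the reader stops at the stored (compressed) size while the server hands back decompressed bytes; `read_with_metadata_bound` is a hypothetical name and does not correspond to any Beam function.

```python
import gzip


def read_with_metadata_bound(stored_bytes: bytes) -> bytes:
    """Simulate a range read bounded by the stored (compressed) object size."""
    # Size that the object metadata would report: the compressed size.
    range_end = len(stored_bytes)
    # What a transcoding server would actually return: decompressed content.
    served = gzip.decompress(stored_bytes)
    # The reader stops at the metadata-derived bound and drops the rest.
    return served[:range_end]


# Build a sample text file and its gzip-compressed stored form.
original = b"".join(b"line %d\n" % i for i in range(10000))
stored = gzip.compress(original)

got = read_with_metadata_bound(stored)
lost = len(original) - len(got)
print("stored size:", len(stored))
print("decompressed size:", len(original))
print("bytes read:", len(got), "bytes lost:", lost)
```

The bytes that are read form a correct prefix of the file, which matches the report that part of the file is returned correctly and the remainder goes missing.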
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components