
Python SDK unable to download file due to checksum mismatch #204

Closed
cloudryder opened this issue Mar 3, 2021 · 26 comments · Fixed by #403
Labels
api: storage Issues related to the googleapis/google-resumable-media-python API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@cloudryder

cloudryder commented Mar 3, 2021

Object download failed, complaining about a checksum mismatch. Downloading the object through gsutil works fine.

./gcs-download-object.py
Traceback (most recent call last):
  File "./gcs-download-object.py", line 29, in <module>
    download_blob('##REDACTED##',
  File "./gcs-download-object.py", line 20, in download_blob
    blob.download_to_filename(destination_file_name)
  File "/usr/local/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1184, in download_to_filename
    client.download_blob_to_file(
  File "/usr/local/lib/python3.8/site-packages/google/cloud/storage/client.py", line 719, in download_blob_to_file
    blob_or_uri._do_download(
  File "/usr/local/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 956, in _do_download
    response = download.consume(transport, timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/google/resumable_media/requests/download.py", line 171, in consume
    self._write_to_stream(result)
  File "/usr/local/lib/python3.8/site-packages/google/resumable_media/requests/download.py", line 120, in _write_to_stream
    raise common.DataCorruption(response, msg)
google.resumable_media.common.DataCorruption: Checksum mismatch while downloading:
  ##REDACTED##
The X-Goog-Hash header indicated an MD5 checksum of:
  lAhluFgTEwcNJDvTSap2fQ==
but the actual MD5 checksum of the downloaded contents was:
  61Kz/FQdqRvwqacGuwuFIA==
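For context, the check that raises this error compares the base64-encoded MD5 from the X-Goog-Hash header against a digest computed over the bytes the client received. A minimal sketch of that comparison (function and variable names here are illustrative, not the library's actual API):

```python
import base64
import hashlib

def md5_matches(data: bytes, expected_b64: str) -> bool:
    # Digest the received bytes and compare with the header value, which
    # GCS reports as a base64-encoded 16-byte MD5 digest.
    digest = hashlib.md5(data).digest()
    return base64.b64encode(digest).decode("ascii") == expected_b64

payload = b"example object contents"
header_value = base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")
print(md5_matches(payload, header_value))         # True: intact download
print(md5_matches(payload + b"x", header_value))  # False: corrupted download
```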

The code itself is pretty straightforward:

#!/usr/bin/env python3.8
from google.cloud import storage
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )
download_blob('##REDACTED##',
              'remedia/mezzanines/Live/2018-06-24/M31_POL-COL_ESFUHD_06_24.mov', 'M31_POL-COL_ESFUHD_06_24.mov')

The file size is 2.3TB if that matters.

Here are the installed package versions:

pip3.8 list
Package                  Version
------------------------ ---------
boto3                    1.17.13
botocore                 1.20.13
cachetools               4.2.1
certifi                  2020.12.5
cffi                     1.14.5
chardet                  4.0.0
google-api-core          1.26.0
google-auth              1.27.0
google-cloud-core        1.6.0
google-cloud-storage     1.36.0
google-crc32c            1.1.2
google-resumable-media   1.2.0
googleapis-common-protos 1.52.0
idna                     2.10
jmespath                 0.10.0
packaging                20.9
pip                      19.2.3
protobuf                 3.15.1
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycparser                2.20
pyparsing                2.4.7
python-dateutil          2.8.1
pytz                     2021.1
requests                 2.25.1
rsa                      4.7.1
s3transfer               0.3.4
setuptools               41.2.0
six                      1.15.0
urllib3                  1.26.3

I'm able to reproduce this issue for this file. I had downloaded several hundred objects with the same SDK. Not sure why it's failing on this file.

@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/google-resumable-media-python API. label Mar 3, 2021
@cloudryder cloudryder changed the title Python SDK unable to download file due to checksum mismath Python SDK unable to download file due to checksum mismatch Mar 3, 2021
@andrewsg
Contributor

andrewsg commented Mar 3, 2021

Thanks for this report. Since it's reproducible, do you mind performing an experiment to see if the other checksum type also disagrees? You can change the checksum type with checksum="crc32c" or checksum=None. If you select checksum="crc32c" at download time, does that also fail?

@andrewsg andrewsg added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. needs more info This issue needs more information from the customer to proceed. labels Mar 3, 2021
@andrewsg andrewsg self-assigned this Mar 3, 2021
@andrewsg
Contributor

andrewsg commented Mar 4, 2021

Also, could you please download the entire file with checksum=None (not with gsutil) and run md5sum on the command line, and see if you get the lAhl... checksum (server-reported) or the 61Kz... checksum (client-reported).

I'll presumably be unable to reproduce this on my side until we know more about where the error lies, assuming the file in question is not public.

@cloudryder
Author

cloudryder commented Mar 4, 2021

I was able to download it with crc32c. The md5sums of the file downloaded through gsutil and through the Python SDK with the crc32c checksum match. I haven't tried checksum=None. Is that something you still require? And you're right, the object is not publicly available.

$ ./gcs-download-object.py
Blob remedia/mezzanines/Live/2018-06-24/M31_POL-COL_ESFUHD_06_24.mov downloaded to gcs_crc32_M31_POL-COL_ESFUHD_06_24.mov.

$ md5sum gcs_crc32_M31_POL-COL_ESFUHD_06_24.mov
940865b8581313070d243bd349aa767d  gcs_crc32_M31_POL-COL_ESFUHD_06_24.mov

$ md5sum M31_POL-COL_ESFUHD_06_24.mov
940865b8581313070d243bd349aa767d  M31_POL-COL_ESFUHD_06_24.mov

@andrewsg
Contributor

andrewsg commented Mar 5, 2021

Thanks, that's very helpful. The md5sum hash starting with 9408... corresponds to the base64-encoded hash lAhluFgTEwcNJDvTSap2fQ==, which implies the server hash is correct and the computed hash is incorrect. That said, it is still quite a mystery, given that the crc32c checksum strategy worked without a hitch.
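That correspondence is easy to verify: base64-decoding the header value yields the same bytes that md5sum prints in hex.

```python
import base64

server_b64 = "lAhluFgTEwcNJDvTSap2fQ=="
print(base64.b64decode(server_b64).hex())
# prints 940865b8581313070d243bd349aa767d, matching the md5sum output above
```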

How many times have you reproduced this problem with checksum=md5? You have a 100% success rate with gsutil and with python-storage and checksum=crc32c, but a 0% success rate over multiple tries with python-storage and checksum=md5?

@cloudryder
Author

cloudryder commented Mar 5, 2021 via email

@andrewsg
Contributor

andrewsg commented Mar 5, 2021

You received the same erroneous MD5 sum in the error message in both tries? That is quite surprising.

Based on what we know now I can try to look through our code for possible issues. I'll let you know if I need more info. Thanks!

@andrewsg andrewsg removed the needs more info This issue needs more information from the customer to proceed. label Mar 5, 2021
@cloudryder
Author

cloudryder commented Mar 5, 2021

You received the same erroneous MD5 sum in the error message in both tries? That is quite surprising.

Based on what we know now I can try to look through our code for possible issues. I'll let you know if I need more info. Thanks!

Yes, the exact same error, and the md5sum displayed in both error messages is the same too. Another piece of information: this is one of several objects in the bucket that were transferred from S3 to GCS using the GCP Transfer Service. I was downloading a random 1% of objects from both S3 and GCS and comparing their checksums (md5 and sha256) in Python to make sure the data was still intact after the transfer, which is when I ran across this issue. This object is part of that random 1%.

The md5sum of the object in question, downloaded through gsutil, matches the md5sum of the object downloaded from S3, which makes me think it has to be something with the Python SDK.

@andrewsg
Contributor

andrewsg commented Mar 6, 2021

I'm still investigating but haven't been able to reproduce anything similar yet, so I have some further questions. Thanks for all of your patience so far.

Is this file larger than any other file you tested successfully? Or have you tested even larger files with checksum=md5 on the Python SDK and not had issues?

Could you share any non-private info in the object's metadata, as seen with gsutil ls -L? (Please review to ensure there is no sensitive info, or, if there is semiprivate info you are comfortable sharing for debugging purposes but don't want on this bug, please email them to me at gorcester@google.com). I'm not sure what metadata we're looking for yet but anything related to compression or a content-type that is unexpected would be a good clue.

@cloudryder
Author

I sent you the metadata of the object through email. I continued running the checksum script and hit one more file. What kind of details would you like on that file? This file is 2.4TB.

@andrewsg
Contributor

andrewsg commented Mar 8, 2021

Thanks, I got your email re: the metadata. You're saying you've found another large file that trips the md5sum check, just like the first one? Interesting! Do you have any other files over 2TB that work properly, without any checksum issues?

@cloudryder
Author

That's right, one other object that trips the md5sum check. Also, I found one object that is 2.53TB and downloaded successfully.

@andrewsg
Contributor

andrewsg commented Mar 9, 2021

Thanks. Assuming your objects are not available to be shared with Google engineers, we'll have to try to reproduce the issue with some artificially created, similarly-sized test objects. It's unfortunate that you have a 2.53TB object that downloaded without issue, as that suggests that if it's a property of the files themselves, size alone is not enough to cause the issue.

If you have this info, could you please share, for instance via email, a timestamp of the last time this error occurred for you, and the full path to that object including the bucket name? It's a long shot since it seems like a client issue, but I will look at the logs on the API side for potential anomalies.

While we are investigating, I recommend the crc32c checksum solution as a workaround. Given the size of your files, it may also improve your CPU utilization.

@cloudryder
Author

cloudryder commented Mar 10, 2021 via email

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Mar 11, 2021
@andrewsg
Contributor

The investigation is ongoing and I have some outstanding requests to Storage engineering; will update here when we know more. Thanks for all of the info you've provided so far.

@andrewsg
Contributor

andrewsg commented Jul 8, 2021

Are you still experiencing this issue? Despite some significant stress tests on our side, checksum issues seem very rare for us.

@cloudryder
Author

cloudryder commented Jul 12, 2021 via email

@andrewsg andrewsg closed this as completed Aug 5, 2021
@LoicEm

LoicEm commented Oct 17, 2022

I am coming back to you as it seems I am having the same issue with google-resumable-media.

However, I get the error with both crc32c and md5 checksum.

My code:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("REDACTED")
blob = bucket.blob("REDACTED")
data = blob.download_as_bytes()
print(data)

My versions:

poetry show google-resumable-media
 name         : google-resumable-media                                     
 version      : 2.3.3                                                      
 description  : Utilities for Google Media Downloads and Resumable Uploads 

dependencies
 - google-crc32c >=1.0,<2.0dev

required by
 - google-cloud-bigquery >=0.6.0,<3.0dev
 - google-cloud-storage >=1.3.0

Additional information:

gsutil stat <file>
Creation time:          Mon, 17 Oct 2022 07:45:27 GMT
    Update time:            Mon, 17 Oct 2022 07:45:27 GMT
    Storage class:          STANDARD
    Content-Encoding:       br
    Content-Length:         357
    Content-Type:           application/json
    Hash (crc32c):          HCuYPw==
    Hash (md5):             C9Y1+P/begCJMHXWZiLTyA==
    ETag:                   COzZ7MXi5voCEAE=
    Generation:             1665992727669996
    Metageneration:         1

Current workarounds

I have found two ways to go around this:

  • First one is as suggested here, using checksum=None. Note that even though the file is brotli-encoded, the result of blob.download_as_bytes(checksum=None) is the decompressed content of the file.
  • Second one is to use raw_download=True, in which case using checksum="md5" or checksum="crc32c" works fine, and the downloaded content is still brotli-encoded.

@zpz

zpz commented Apr 16, 2023

I want to report a similar issue as of April 2023, with the latest versions of everything, downloading parquet files around 17 MB in size. It's not deterministic; it happens sometimes, and I can't identify the culprit. The error message looks like this:

File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 171, in __len__
    return self.num_rows
  File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 203, in num_rows
    return self.metadata.num_rows
  File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 199, in metadata
    return self.file.metadata
  File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 194, in file
    self._file = self.load_file(self.path, lazy=self.lazy)
  File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 99, in load_file
    data = io.BytesIO(path.read_bytes())
  File "/usr/local/lib/python3.10/dist-packages/upathlib/gcs.py", line 418, in read_bytes
    self._read_into_buffer(buffer)
  File "/usr/local/lib/python3.10/dist-packages/upathlib/gcs.py", line 406, in _read_into_buffer
    self._blob().download_to_file(file_obj, client=self._client())
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 1129, in download_to_file
    client.download_blob_to_file(
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/client.py", line 1091, in download_blob_to_file
    blob_or_uri._do_download(
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 984, in _do_download
    response = download.consume(transport, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/download.py", line 237, in consume
    return _request_helpers.wait_and_retry(
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/_request_helpers.py", line 148, in wait_and_retry
    response = func()
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/download.py", line 233, in retriable_request
    self._write_to_stream(result)
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/download.py", line 141, in _write_to_stream
    raise common.DataCorruption(response, msg)
google.resumable_media.common.DataCorruption: Checksum mismatch while downloading:

  https://storage.googleapis.com/download/storage/v1/b/<bucket-name>/<path>....parquet?alt=media

The X-Goog-Hash header indicated an MD5 checksum of:

  mXaiciG3yLEuayoUHvJltg==

but the actual MD5 checksum of the downloaded contents was:

  KHYwdj0PYK1NUeFQuxzmbQ==

@andrewsg
Contributor

@zpz sorry, I missed this initially as it was on a closed issue. As the previous commenter could workaround if raw_download was set to True, could you also try setting raw_download to True and report if the problem continues? That would help us diagnose a potential issue. Thank you!

@andrewsg andrewsg reopened this May 16, 2023
@zpz

zpz commented May 18, 2023

To me using raw_download=True in blob.download_to_file in upathlib.gcs seems to solve the problem.

@andrewsg
Contributor

Thank you for checking that. That's very useful in diagnosing the issue.

@andrewsg
Contributor

@zpz An additional question: how exactly is your file compressed, assuming it is compressed? And how often does this error occur?

@andrewsg
Contributor

Verified that at least some of the failures discussed in this thread are due to the requests library adding support for "br" encoding when the brotli or brotli-cffi libraries are installed. requests automatically decodes these response bodies, similar to what it does for gzip, but we don't have the special-casing that lets us do checksum comparisons before decoding, as we do for gzip.
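The failure mode can be illustrated with gzip from the standard library (used here purely as a stand-in for brotli): if the HTTP layer decompresses the body before the checksum is computed, the digest can no longer match a hash taken over the stored, compressed bytes.

```python
import base64
import gzip
import hashlib

original = b'{"status": "ok"}'

# What is stored in the bucket when the object has a Content-Encoding:
# the compressed bytes, with the server-side MD5 computed over them.
stored = gzip.compress(original)
server_md5 = base64.b64encode(hashlib.md5(stored).digest()).decode("ascii")

# What the client ends up hashing if the HTTP library transparently
# decodes the body before the checksum check runs.
decoded = gzip.decompress(stored)
client_md5 = base64.b64encode(hashlib.md5(decoded).digest()).decode("ascii")

print(server_md5 == client_md5)  # False: the digests cannot agree
```

This is also consistent with why raw_download=True works around the problem: the client then receives the stored (still-compressed) bytes, so its digest is computed over the same data the server hashed.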

@zpz

zpz commented Jun 14, 2023

I used pyarrow parquet's default compression, which appears to be snappy. The issue happened very often; I don't think it happened to every file, but it did happen every time I ran the program, which is weird. As I said, raw_download=True solved it. I consider that a solution rather than a workaround. I don't understand the auto-detection and auto-decompression features. I'm downloading a binary file, meaning the bytes; if it's compressed, I or the file-reading code can handle that. It's not a downloader's job. Give me the bytes.

@andrewsg andrewsg added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. 🚨 This issue needs some love. labels Jun 20, 2023
@andrewsg
Contributor

We'll resolve this by focusing on the "br" header/encoding issue, then. Thanks for your input.

@frankyn
Contributor

frankyn commented Oct 19, 2023

Hi @andrewsg is there an update on this issue?

marco-c added a commit to mozilla/code-coverage that referenced this issue Oct 23, 2023
gcf-merge-on-green bot pushed a commit that referenced this issue Oct 27, 2023