
Python SDK unable to download file due to checksum mismatch #204

Closed
cloudryder opened this issue Mar 3, 2021 · 26 comments · Fixed by #403
Labels
api: storage Issues related to the googleapis/google-resumable-media-python API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@cloudryder

cloudryder commented Mar 3, 2021

Object download failed, complaining about a checksum mismatch. Downloading the object through gsutil works fine.

./gcs-download-object.py
Traceback (most recent call last):
  File "./gcs-download-object.py", line 29, in <module>
    download_blob('##REDACTED##',
  File "./gcs-download-object.py", line 20, in download_blob
    blob.download_to_filename(destination_file_name)
  File "/usr/local/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1184, in download_to_filename
    client.download_blob_to_file(
  File "/usr/local/lib/python3.8/site-packages/google/cloud/storage/client.py", line 719, in download_blob_to_file
    blob_or_uri._do_download(
  File "/usr/local/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 956, in _do_download
    response = download.consume(transport, timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/google/resumable_media/requests/download.py", line 171, in consume
    self._write_to_stream(result)
  File "/usr/local/lib/python3.8/site-packages/google/resumable_media/requests/download.py", line 120, in _write_to_stream
    raise common.DataCorruption(response, msg)
google.resumable_media.common.DataCorruption: Checksum mismatch while downloading:
  ##REDACTED##
The X-Goog-Hash header indicated an MD5 checksum of:
  lAhluFgTEwcNJDvTSap2fQ==
but the actual MD5 checksum of the downloaded contents was:
  61Kz/FQdqRvwqacGuwuFIA==
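For context, the check that raises this error compares the base64-encoded MD5 from the X-Goog-Hash header against a digest computed over the bytes the client received. A minimal sketch of that comparison (function and variable names here are illustrative, not the library's actual API):

```python
import base64
import hashlib

def md5_matches(data: bytes, expected_b64: str) -> bool:
    # Digest the received bytes and compare with the header value, which
    # GCS reports as a base64-encoded 16-byte MD5 digest.
    digest = hashlib.md5(data).digest()
    return base64.b64encode(digest).decode("ascii") == expected_b64

payload = b"example object contents"
header_value = base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")
print(md5_matches(payload, header_value))         # True: intact download
print(md5_matches(payload + b"x", header_value))  # False: corrupted download
```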

The code itself is pretty straightforward:

#!/usr/bin/env python3.8
from google.cloud import storage
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )
download_blob('##REDACTED##',
              'remedia/mezzanines/Live/2018-06-24/M31_POL-COL_ESFUHD_06_24.mov', 'M31_POL-COL_ESFUHD_06_24.mov')

The file size is 2.3TB if that matters.

Here are the installed package versions:

pip3.8 list
Package                  Version
------------------------ ---------
boto3                    1.17.13
botocore                 1.20.13
cachetools               4.2.1
certifi                  2020.12.5
cffi                     1.14.5
chardet                  4.0.0
google-api-core          1.26.0
google-auth              1.27.0
google-cloud-core        1.6.0
google-cloud-storage     1.36.0
google-crc32c            1.1.2
google-resumable-media   1.2.0
googleapis-common-protos 1.52.0
idna                     2.10
jmespath                 0.10.0
packaging                20.9
pip                      19.2.3
protobuf                 3.15.1
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycparser                2.20
pyparsing                2.4.7
python-dateutil          2.8.1
pytz                     2021.1
requests                 2.25.1
rsa                      4.7.1
s3transfer               0.3.4
setuptools               41.2.0
six                      1.15.0
urllib3                  1.26.3

I'm able to reproduce this issue for this file. I had downloaded several hundred objects with the same SDK. Not sure why it's failing on this file.

@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/google-resumable-media-python API. label Mar 3, 2021
@cloudryder cloudryder changed the title Python SDK unable to download file due to checksum mismath Python SDK unable to download file due to checksum mismatch Mar 3, 2021
@andrewsg
Contributor

andrewsg commented Mar 3, 2021

Thanks for this report. Since it's reproducible, do you mind performing an experiment to see if the other checksum type also disagrees? You can change the checksum type with checksum="crc32c" or checksum=None. If you select checksum="crc32c" at download time, does that also fail?

@andrewsg andrewsg added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. needs more info This issue needs more information from the customer to proceed. labels Mar 3, 2021
@andrewsg andrewsg self-assigned this Mar 3, 2021
@andrewsg
Contributor

andrewsg commented Mar 4, 2021

Also, could you please download the entire file with checksum=None (not with gsutil) and run md5sum on the command line, and see if you get the lAhl... checksum (server-reported) or the 61Kz... checksum (client-reported).

I'll presumably be unable to reproduce this on my side until we know more about where the error lies, assuming the file in question is not public.

@cloudryder
Author

cloudryder commented Mar 4, 2021

I was able to download it with crc32c. The md5sums of the file downloaded through gsutil and through the Python SDK with the crc32c checksum match. I haven't tried checksum=None. Is that something you still require? And you're right, the object is not publicly available.

$ ./gcs-download-object.py
Blob remedia/mezzanines/Live/2018-06-24/M31_POL-COL_ESFUHD_06_24.mov downloaded to gcs_crc32_M31_POL-COL_ESFUHD_06_24.mov.

$ md5sum gcs_crc32_M31_POL-COL_ESFUHD_06_24.mov
940865b8581313070d243bd349aa767d  gcs_crc32_M31_POL-COL_ESFUHD_06_24.mov

$ md5sum M31_POL-COL_ESFUHD_06_24.mov
940865b8581313070d243bd349aa767d  M31_POL-COL_ESFUHD_06_24.mov

@andrewsg
Contributor

andrewsg commented Mar 5, 2021

Thanks, that's very helpful. The md5sum hash starting with 9408... corresponds to the base64-encoded hash lAhluFgTEwcNJDvTSap2fQ==, which implies the server hash is correct and the computed hash is incorrect. That said, it is still quite a mystery, given that the crc32c checksum strategy worked without a hitch.
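That correspondence is easy to verify: base64-decoding the header value yields the same bytes that md5sum prints in hex.

```python
import base64

server_b64 = "lAhluFgTEwcNJDvTSap2fQ=="
print(base64.b64decode(server_b64).hex())
# prints 940865b8581313070d243bd349aa767d, matching the md5sum output above
```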

How many times have you reproduced this problem with checksum=md5? You have a 100% success rate with gsutil and with python-storage and checksum=crc32c, but a 0% success rate over multiple tries with python-storage and checksum=md5?

@cloudryder
Author

cloudryder commented Mar 5, 2021 via email

@andrewsg
Contributor

andrewsg commented Mar 5, 2021

You received the same erroneous MD5 sum in the error message in both tries? That is quite surprising.

Based on what we know now I can try to look through our code for possible issues. I'll let you know if I need more info. Thanks!

@andrewsg andrewsg removed the needs more info This issue needs more information from the customer to proceed. label Mar 5, 2021
@cloudryder
Author

cloudryder commented Mar 5, 2021

You received the same erroneous MD5 sum in the error message in both tries? That is quite surprising.

Based on what we know now I can try to look through our code for possible issues. I'll let you know if I need more info. Thanks!

Yes, the exact same error, and the md5sum displayed in both error messages is the same too. Another piece of information: this is one of several objects in the bucket that were transferred from S3 to GCS using the GCP Transfer Service. I was downloading a random 1% of objects from both S3 and GCS and comparing their checksums (md5 and sha256) in Python to make sure the data was still intact after the transfer, which is when I ran across this issue. This object is part of that random 1%.

The md5sum of the object in question, downloaded through gsutil, matches the md5sum of the object downloaded from S3, which makes me think it has to be something with the Python SDK.

@andrewsg
Contributor

andrewsg commented Mar 6, 2021

I'm still investigating but haven't been able to reproduce anything similar yet, so I have some further questions. Thanks for all of your patience so far.

Is this file larger than any other file you tested successfully? Or have you tested even larger files with checksum=md5 on the Python SDK and not had issues?

Could you share any non-private info in the object's metadata, as seen with gsutil ls -L? (Please review to ensure there is no sensitive info, or, if there is semiprivate info you are comfortable sharing for debugging purposes but don't want on this bug, please email them to me at gorcester@google.com). I'm not sure what metadata we're looking for yet but anything related to compression or a content-type that is unexpected would be a good clue.

@cloudryder
Author

I sent you the metadata of the object through email. I continued running the checksum script and hit one more file. What kind of details would you like on that file? This file is 2.4TB.

@andrewsg
Contributor

andrewsg commented Mar 8, 2021

Thanks, I got your email re: the metadata. You're saying you've found another large file that trips the md5sum check, just like the first one? Interesting! Do you have any other files over 2TB that work properly, without any checksum issues?

@cloudryder
Author

That's right, one other object that trips the md5sum check. Also, I found one object that is 2.53TB and downloaded successfully.

@andrewsg
Contributor

andrewsg commented Mar 9, 2021

Thanks. Assuming your objects are not available to be shared with Google engineers, we'll have to try to reproduce the issue with some artificially created, similarly-sized test objects. It's unfortunate that you have a 2.53TB object that downloaded without issue, as that suggests that if it's a property of the files themselves, size alone is not enough to cause the issue.

If you have this info, could you please share, for instance via email, a timestamp of the last time this error occurred for you, and the full path to that object including the bucket name? It's a long shot since it seems like a client issue, but I will look at the logs on the API side for potential anomalies.

While we are investigating, I recommend the crc32c checksum solution as a workaround. Given the size of your files, it may also improve your CPU utilization.

@cloudryder
Author

cloudryder commented Mar 10, 2021 via email

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Mar 11, 2021
@andrewsg
Contributor

The investigation is ongoing and I have some outstanding requests to Storage engineering; will update here when we know more. Thanks for all of the info you've provided so far.

@andrewsg
Contributor

andrewsg commented Jul 8, 2021

Are you still experiencing this issue? Despite some significant stress tests on our side, checksum issues seem very rare for us.

@cloudryder
Author

cloudryder commented Jul 12, 2021 via email

@andrewsg andrewsg closed this as completed Aug 5, 2021
@LoicEm

LoicEm commented Oct 17, 2022

I am coming back to you as it seems I am having the same issue with google-resumable-media.

However, I get the error with both crc32c and md5 checksum.

My code:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("REDACTED")
blob = bucket.blob("REDACTED")
data = blob.download_as_bytes()
print(data)

My versions:

poetry show google-resumable-media
 name         : google-resumable-media                                     
 version      : 2.3.3                                                      
 description  : Utilities for Google Media Downloads and Resumable Uploads 

dependencies
 - google-crc32c >=1.0,<2.0dev

required by
 - google-cloud-bigquery >=0.6.0,<3.0dev
 - google-cloud-storage >=1.3.0

Additional information:

gsutil stat <file>
Creation time:          Mon, 17 Oct 2022 07:45:27 GMT
    Update time:            Mon, 17 Oct 2022 07:45:27 GMT
    Storage class:          STANDARD
    Content-Encoding:       br
    Content-Length:         357
    Content-Type:           application/json
    Hash (crc32c):          HCuYPw==
    Hash (md5):             C9Y1+P/begCJMHXWZiLTyA==
    ETag:                   COzZ7MXi5voCEAE=
    Generation:             1665992727669996
    Metageneration:         1

Current workarounds

I have found two ways to go around this:

  • First one is as suggested here, using checksum=None. Note that even though the file is brotli-encoded, the result of blob.download_as_bytes(checksum=None) is the decompressed content of the file.
  • Second one is to use raw_download=True, in which case using checksum="md5" or checksum="crc32c" works fine, and the downloaded content is still brotli-encoded.

@zpz

zpz commented Apr 16, 2023

I want to report a similar issue as of April 2023, with the latest versions of everything, downloading parquet files around 17 MB in size. It's not deterministic; it happens sometimes, and I can't identify the culprit. The error message looks like this:

File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 171, in __len__
    return self.num_rows
  File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 203, in num_rows
    return self.metadata.num_rows
  File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 199, in metadata
    return self.file.metadata
  File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 194, in file
    self._file = self.load_file(self.path, lazy=self.lazy)
  File "/usr/local/lib/python3.10/dist-packages/biglist/_parquet.py", line 99, in load_file
    data = io.BytesIO(path.read_bytes())
  File "/usr/local/lib/python3.10/dist-packages/upathlib/gcs.py", line 418, in read_bytes
    self._read_into_buffer(buffer)
  File "/usr/local/lib/python3.10/dist-packages/upathlib/gcs.py", line 406, in _read_into_buffer
    self._blob().download_to_file(file_obj, client=self._client())
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 1129, in download_to_file
    client.download_blob_to_file(
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/client.py", line 1091, in download_blob_to_file
    blob_or_uri._do_download(
  File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/blob.py", line 984, in _do_download
    response = download.consume(transport, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/download.py", line 237, in consume
    return _request_helpers.wait_and_retry(
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/_request_helpers.py", line 148, in wait_and_retry
    response = func()
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/download.py", line 233, in retriable_request
    self._write_to_stream(result)
  File "/usr/local/lib/python3.10/dist-packages/google/resumable_media/requests/download.py", line 141, in _write_to_stream
    raise common.DataCorruption(response, msg)
google.resumable_media.common.DataCorruption: Checksum mismatch while downloading:

  https://storage.googleapis.com/download/storage/v1/b/<bucket-name>/<path>....parquet?alt=media

The X-Goog-Hash header indicated an MD5 checksum of:

  mXaiciG3yLEuayoUHvJltg==

but the actual MD5 checksum of the downloaded contents was:

  KHYwdj0PYK1NUeFQuxzmbQ==

@andrewsg
Contributor

@zpz sorry, I missed this initially as it was on a closed issue. As the previous commenter could workaround if raw_download was set to True, could you also try setting raw_download to True and report if the problem continues? That would help us diagnose a potential issue. Thank you!

@andrewsg andrewsg reopened this May 16, 2023
@zpz

zpz commented May 18, 2023

To me using raw_download=True in blob.download_to_file in upathlib.gcs seems to solve the problem.

@andrewsg
Contributor

Thank you for checking that. That's very useful in diagnosing the issue.

@andrewsg
Contributor

@zpz An additional question: how exactly is your file compressed, assuming it is compressed? And how often does this error occur?

@andrewsg
Contributor

Verified that at least some of the failures discussed in this thread are due to the requests library adding support for "br" encoding when the brotli or brotli-cffi libraries are installed. requests automatically decodes these response bodies, similar to what it does for gzip, but we don't have the special-casing that lets us do checksum comparisons before decoding, as we do for gzip.
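The failure mode can be illustrated with gzip from the standard library (used here purely as a stand-in for brotli): if the HTTP layer decompresses the body before the checksum is computed, the digest can no longer match a hash taken over the stored, compressed bytes.

```python
import base64
import gzip
import hashlib

original = b'{"status": "ok"}'

# What is stored in the bucket when the object has a Content-Encoding:
# the compressed bytes, with the server-side MD5 computed over them.
stored = gzip.compress(original)
server_md5 = base64.b64encode(hashlib.md5(stored).digest()).decode("ascii")

# What the client ends up hashing if the HTTP library transparently
# decodes the body before the checksum check runs.
decoded = gzip.decompress(stored)
client_md5 = base64.b64encode(hashlib.md5(decoded).digest()).decode("ascii")

print(server_md5 == client_md5)  # False: the digests cannot agree
```

This is also consistent with why raw_download=True works around the problem: the client then receives the stored (still-compressed) bytes, so its digest is computed over the same data the server hashed.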

@zpz

zpz commented Jun 14, 2023

I used pyarrow parquet's default compression, which appears to be snappy. The issue happened very often; I don't think it happened to every file, but it did happen every time I ran the program, which is weird. As I said, raw_download=True solved it. I consider that a solution rather than a workaround. I don't understand the auto-detection and auto-decompression features. I'm downloading a binary file, meaning the bytes; if it's compressed, I or the file-reading code can handle that. It's not a downloader's job. Give me the bytes.

@andrewsg andrewsg added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. 🚨 This issue needs some love. labels Jun 20, 2023
@andrewsg
Contributor

We'll resolve this by focusing on the "br" header/encoding issue, then. Thanks for your input.

@frankyn
Contributor

frankyn commented Oct 19, 2023

Hi @andrewsg is there an update on this issue?

marco-c added a commit to mozilla/code-coverage that referenced this issue Oct 23, 2023
gcf-merge-on-green bot pushed a commit that referenced this issue Oct 27, 2023