Python SDK unable to download file due to checksum mismatch #204
Comments
Thanks for this report. Since it's reproducible, do you mind performing an experiment to see if the other checksum type also disagrees? You can change the checksum type with checksum="crc32c" or checksum=None. If you select checksum="crc32c" at download time, does that also fail?
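For reference, the checksum type is a per-call option on python-storage's download methods; a minimal sketch with hypothetical bucket and object names:

```python
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("path/to/large-object")  # hypothetical names

# Verify integrity with CRC32C instead of MD5 during the download.
blob.download_to_filename("large-object.bin", checksum="crc32c")

# Or disable client-side integrity checking entirely.
blob.download_to_filename("large-object.bin", checksum=None)
```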
Also, could you please download the entire file with checksum=None (not with gsutil), run md5sum on the command line, and see whether you get the lAhl... checksum (server-reported) or the 61Kz... checksum (client-reported)? Presumably I'll be unable to reproduce this on my side until we know more about where the error lies, assuming the file in question is not public.
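A sketch of the suggested experiment (hypothetical names); note that GCS reports MD5 hashes base64-encoded, so md5sum's hex output has to be converted before comparing:

```python
import base64
import hashlib

from google.cloud import storage

blob = storage.Client().bucket("my-bucket").blob("path/to/large-object")  # hypothetical

# Download with integrity checking disabled, then hash the local copy.
blob.download_to_filename("large-object.bin", checksum=None)

md5 = hashlib.md5()
with open("large-object.bin", "rb") as f:
    for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
        md5.update(chunk)

blob.reload()  # refresh metadata, including the server-side hash
print("local :", base64.b64encode(md5.digest()).decode())
print("server:", blob.md5_hash)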
I was able to download it with CRC32C. The md5sum of the file downloaded that way starts with 9408...
Thanks, that's very helpful. The md5sum hash starting with 9408... corresponds to the base64-encoded hash lAhluFgTEwcNJDvTSap2fQ==, which implies the server hash is correct and the computed hash is incorrect. That said, it is still quite a mystery, given that the crc32c checksum strategy worked without a hitch. How many times have you reproduced this problem with checksum=md5? You have a 100% success rate with gsutil and with python-storage and checksum=crc32c, but a 0% success rate over multiple tries with python-storage and checksum=md5?
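For anyone comparing the two formats: md5sum prints a hex digest, while GCS metadata stores the same 16 bytes base64-encoded. A quick conversion shows they match:

```python
import base64

# The server-reported hash, base64-encoded as stored in GCS metadata.
server_b64 = "lAhluFgTEwcNJDvTSap2fQ=="

# Decoding to hex yields the digest md5sum would print;
# it begins with 9408..., matching the locally computed result.
print(base64.b64decode(server_b64).hex())
```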
The python-storage download failed twice. I tried CRC32 and gsutil only once and they both succeeded. Would you like me to run any of them more times? It's a 2.3T object, so it takes a while to download each time :)
You received the same erroneous MD5 sum in the error message in both tries? That is quite surprising. Based on what we know now, I can try to look through our code for possible issues. I'll let you know if I need more info. Thanks!
Yes, the exact same error, and the md5sum displayed in both error messages is the same too. Another piece of information: this is one of several objects in the bucket that were transferred from S3 to GCS using the GCP Transfer Service. I was downloading a random 1% of objects from both S3 and GCS and comparing their checksums (md5 and sha256) using Python to make sure that the data was still intact after the transfer, which is when I ran across this issue. This object is one of the objects in that random 1%. The md5sum of the object in question, downloaded through gsutil, matches the md5sum of the object downloaded from S3, which makes me think it has to be something in the Python SDK.
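The exact script isn't shown here, but a sketch of that kind of post-transfer spot check might look like this (hypothetical bucket names; requires boto3 and google-cloud-storage):

```python
import hashlib

import boto3
from google.cloud import storage

def md5_hex(chunks):
    """Stream chunks of bytes into an MD5 digest."""
    h = hashlib.md5()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

key = "path/to/object"  # hypothetical object key present in both buckets

s3_body = boto3.client("s3").get_object(Bucket="source-s3-bucket", Key=key)["Body"]
s3_md5 = md5_hex(s3_body.iter_chunks(8 * 1024 * 1024))

gcs_blob = storage.Client().bucket("dest-gcs-bucket").blob(key)
with gcs_blob.open("rb") as f:
    gcs_md5 = md5_hex(iter(lambda: f.read(8 * 1024 * 1024), b""))

print("match" if s3_md5 == gcs_md5 else "MISMATCH", key)
```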
I'm still investigating but haven't been able to reproduce anything similar yet, so I have some further questions. Thanks for all of your patience so far. Is this file larger than any other file you tested successfully? Or have you tested even larger files with checksum=md5 on the Python SDK and not had issues? Could you share any non-private info in the object's metadata?
I sent you the metadata of the object through email. I continued running the checksum script and hit one more file. What kind of details would you like on that file? This file is 2.4TB.
Thanks, I got your email re: the metadata. You're saying you've found another large file that trips the md5sum check, just like the first one? Interesting! Do you have any other files over 2TB that work properly, without any checksum issues?
That's right, one other object that trips the md5sum check. Also, I found one object that is 2.53T and downloaded successfully.
Thanks. Assuming your objects are not available to be shared with Google engineers, we'll have to try to reproduce the issue with some artificially created similarly-sized test objects. It's unfortunate that you have a 2.53T object that downloaded without issue, as that suggests that if it's a property of the files themselves, then size alone is not enough to cause the issue.

If you have this info, could you please share, for instance via email, a timestamp of the last time this error occurred for you, and the full path to that object including the bucket name? It's a long shot since it seems like a client issue, but I will look at the logs on the API side for potential anomalies.

While we are investigating, I recommend the crc32c checksum solution as a workaround. Given the size of your files, it may also improve your CPU utilization.
Thanks Andrew, I emailed you the object that failed as well as the 2.53T object that succeeded.
The investigation is ongoing and I have some outstanding requests to Storage engineering; will update here when we know more. Thanks for all of the info you've provided so far.
Are you still experiencing this issue? Despite some significant stress tests on our side, checksum issues seem very rare for us.
We're done with the project we were working on and aren't doing mass downloads with the Python API anymore.
I am coming back to you as it seems that I am having the same issue. However, I get the error in both cases.

My code:
My versions:
Additional information:
Current workarounds

I have found two ways to work around this:
I want to report similar issues as of April 2023, with the latest versions of everything, downloading parquet files around 17 MB in size. It's not deterministic; it happens sometimes, and I can't see the culprit. The error message is like this:
@zpz sorry, I missed this initially as it was on a closed issue. Since the previous commenter could work around it by setting raw_download to True, could you also try setting raw_download to True and report whether the problem continues? That would help us diagnose a potential issue. Thank you!
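For reference, raw_download is a parameter on the client's download methods; a minimal sketch with hypothetical names:

```python
from google.cloud import storage

blob = storage.Client().bucket("my-bucket").blob("data/part-0001.parquet")  # hypothetical

# raw_download=True asks for the stored bytes as-is, bypassing the
# decompressive transcoding GCS applies when an object carries a
# Content-Encoding such as gzip or br.
data = blob.download_as_bytes(raw_download=True)
```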
To me, using raw_download=True solved it.
Thank you for checking that. That's very useful in diagnosing the issue.
@zpz An additional question: how exactly is your file compressed, assuming it is compressed? And how often does this error occur?
Verified that at least some of the failures discussed in this thread are due to the "br" Content-Encoding header issue.
I used pyarrow parquet's default compression, which appears to be snappy. The issue happened very often. I don't think it happened to every file, but it did happen every time I ran the program, which is weird. As I said, raw_download=True solved it. I consider that a solution rather than a workaround. I don't understand the auto-detection and auto-decompression features. I'm downloading the binary file, meaning the bytes. If it's compressed, I or the file-reading code can handle that. It's not a downloader's job. Give me the bytes.
We'll resolve this by focusing on the "br" header/encoding issue, then. Thanks for your input.
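One way to check whether an object is affected is to look at its Content-Encoding metadata; a sketch with hypothetical names:

```python
from google.cloud import storage

blob = storage.Client().bucket("my-bucket").get_blob("data/part-0001.parquet")  # hypothetical

# A Content-Encoding such as "gzip" or "br" means GCS may decompress the
# object in transit, so the delivered bytes won't match the stored hashes
# unless raw_download=True is used.
print(blob.content_encoding)
```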
Hi @andrewsg, is there an update on this issue?
Object download failed, complaining about a checksum mismatch. Downloading the object through gsutil works fine.
The code itself is pretty straightforward:
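A minimal sketch of the kind of download call described, with hypothetical bucket and object names:

```python
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("path/to/large-object")  # hypothetical names

# Plain download with default integrity checking; this is the call
# that fails with the md5 checksum mismatch.
blob.download_to_filename("large-object.bin")
```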
The file size is 2.3TB if that matters.
Following are the plugin versions:
I'm able to reproduce this issue for this file. I had downloaded several hundred objects with the same SDK; not sure why it's failing on this one.