push: RAW file considered as text file (bad MD5) #6253
Comments
Potentially related to #4658.
@atekoa Are you using our md5s (which are not really md5s) to validate data outside dvc?
We use the path in the cache folder (which is supposedly the md5) as a header to check the md5 of the uploaded file, so in this case, the path in the cache is But the main problem we see is that DVC treats the file as a text file although it is a binary file, so we cannot run dos2unix to correct the md5 calculation.
@atekoa Could you elaborate on why you have to use our hash to verify uploads yourself? As I understand it, everything works correctly except your custom additional verification that relies on our hash, right? Or do you run into actual problems with our hash collisions? Our hash is in the current format for historical reasons, as described in #4658. We will be switching to a proper sha* in the future, along with some general object format change to provide better deduplication and performance.
I use the DVC calculated hash to avoid having to rewrite the file in our proxy, recalculate the hash and send the correct hash and the file to the gocloud.dev library, which is ultimately responsible for uploading the file and verifying the md5.
@atekoa Right, so you have to compute real md5 yourself if you need one. You are doing something rather advanced and relatively hacky, so there doesn't seem to be much we can do on our side right now 🙁 Also note that we will be switching to sha* in the future, so md5 will no longer be there anyway. Is calculating real md5 by yourself feasible in your scenario?
Not really. Instead of a "drive through" proxy, it will be a "stop and go" :( |
@atekoa Sorry to hear that 🙁, but it looks like there is no other official way to do that. Maybe modifying dvc for your use case to not use Closing for now.
Bug Report
Description
The RAW file has an md5sum of
fd0de1350b92b00d60afd53b015f6aea 214089_JAI.raw
But DVC calculates it as
md5: 0b4d86bc06ee3260e8172b2196805382 size: 63232000 path: 214089_JAI.raw
This happens because DVC identifies the file as a text file and applies the dos2unix replacement before computing the hash:
https://github.com/iterative/dvc/blob/1.11/dvc/utils/__init__.py#L39 -> https://github.com/iterative/dvc/blob/1.11/dvc/istextfile.py#L34
It still happens in version 2.4.3
https://github.com/iterative/dvc/blob/2.4.3/dvc/utils/__init__.py#L37 -> https://github.com/iterative/dvc/blob/2.4.3/dvc/istextfile.py#L22
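To illustrate the effect, here is a minimal Python sketch. It is an approximation written for this report, not DVC's actual code: the helper names (`looks_like_text`, `plain_md5`, `normalized_md5`), the 512-byte head size, and the 30% threshold are assumptions. It shows how a text heuristic on the head of a file, followed by a CRLF to LF replacement, can yield a hash that differs from the real md5 of the same bytes.

```python
import hashlib

# Assumption: an approximation of the text heuristic linked above,
# not DVC's exact implementation.
TEXT_CHARS = bytes(range(32, 127)) + b"\n\r\t\f\b"


def looks_like_text(block: bytes) -> bool:
    """Guess whether a leading block of bytes is text (hypothetical helper)."""
    if not block:
        return True
    if b"\x00" in block:
        return False  # a NUL byte marks the block as binary
    # Drop every "text" byte and see how much is left over.
    non_text = block.translate(None, TEXT_CHARS)
    return len(non_text) / len(block) <= 0.30


def plain_md5(data: bytes) -> str:
    """The real md5 of the bytes, as md5sum would report it."""
    return hashlib.md5(data).hexdigest()


def normalized_md5(data: bytes) -> str:
    """md5 after a dos2unix-style CRLF -> LF replacement,
    applied only when the first 512 bytes look like text."""
    if looks_like_text(data[:512]):
        data = data.replace(b"\r\n", b"\n")
    return hashlib.md5(data).hexdigest()


# Synthetic payload: no NUL byte in the first 512 bytes, CRLF pairs inside.
sample = b"ABC\r\n" * 200 + bytes(range(1, 256))
print(plain_md5(sample) == normalized_md5(sample))  # prints False
```

A binary file whose first bytes happen to contain no NUL and mostly printable values would be affected in the same way as `sample` here, which matches the behavior seen with the RAW file.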
When uploading it through the gocloud.dev library, the upload fails the MD5 check, because the hash calculated by DVC and the real md5 of the file do not match:
https://github.com/google/go-cloud/blob/v0.23.0/blob/blob.go#L328
Reproduce
Expected
The file is expected to upload correctly, but since the md5 of the file and the one sent by DVC do not match, the upload is canceled.
Environment information
Output of dvc doctor:
Additional Information (if any):
https://github.com/atekoa/dvc-rawfile