Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

push: RAW file considered as text file (bad MD5) #6253

Closed
themaikelman opened this issue Jun 30, 2021 · 8 comments
Closed

push: RAW file considered as text file (bad MD5) #6253

themaikelman opened this issue Jun 30, 2021 · 8 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@themaikelman
Copy link

themaikelman commented Jun 30, 2021

Bug Report

Description

The RAW file has a md5sum

fd0de1350b92b00d60afd53b015f6aea 214089_JAI.raw

But DVC calculates it as

md5: 0b4d86bc06ee3260e8172b2196805382 size: 63232000 path: 214089_JAI.raw

This happens because it identifies it as a text file and runs the dos2unix replacement:
https://github.com/iterative/dvc/blob/1.11/dvc/utils/__init__.py#L39 -> https://github.com/iterative/dvc/blob/1.11/dvc/istextfile.py#L34

It still happens in version 2.4.3
https://github.com/iterative/dvc/blob/2.4.3/dvc/utils/__init__.py#L37 -> https://github.com/iterative/dvc/blob/2.4.3/dvc/istextfile.py#L22

When uploading it through the gocloud.dev library, it fails due to the MD5 check, since the one calculated by DVC and the real one of the file is not the same:
https://github.com/google/go-cloud/blob/v0.23.0/blob/blob.go#L328

Reproduce

  1. dvc init
  2. dvc remote modify --local our-proxy password 123123
  3. Copy 214089_JAI.raw to the directory
  4. dvc add 214089_JAI.raw
  5. dvc push

Expected

The file is expected to upload correctly, but since the md5 of the file and the one sent by DVC do not match, the upload is canceled

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 1.11.16 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-65-generic-x86_64-with-glibc2.29
Supports: http, https, ssh
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git

Additional Information (if any):
https://github.com/atekoa/dvc-rawfile

@isidentical
Copy link
Contributor

Potentially relevant with: #4658

@efiop
Copy link
Contributor

efiop commented Jun 30, 2021

@atekoa Are you using our md5s(which are not really md5s) to validate data outside dvc?

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jun 30, 2021
@themaikelman
Copy link
Author

We use the path in the cache folder (which is supposedly the md5) as a header to check the md5 of the uploaded file, so in this case, the path in the cache is
0b/4d86bc06ee3260e8172b2196805382 -> expected md5 0b4d86bc06ee3260e8172b2196805382
-> uploaded md5 fd0de1350b92b00d60afd53b015f6aea

But the main problem we detect is that it is considering the file as a text file and it is a binary file, so we cannot run a dos2unix to correct the md5 calculation.

@efiop
Copy link
Contributor

efiop commented Jul 1, 2021

@atekoa Could you elaborate on why you have to use our hash to verify uploads yourself? As I understand everything works correctly, except your custom additional verification that relies on our hash, right? Or do you run into actual problems with our hash collisions?

Our hash is in the current format for historical reasons, as described in #4658 , we will be switching to a proper sha* in the future along with some general object format change to provide better deduplication and performance.

@themaikelman
Copy link
Author

I use the DVC calculated hash to avoid having to rewrite the file in our proxy, recalculate the hash and send the correct hash and the file to the gocloud.dev library, which is ultimately responsible for uploading the file and verifying the md5.
The gocloud.dev library is the one that requires me to send the md5 of the file to verify that have written the data correctly, but I don't have the correct md5 if I don't generate it myself, right?

@efiop
Copy link
Contributor

efiop commented Jul 1, 2021

@atekoa Right, so you have to compute real md5 yourself if you need one. You are doing something rather advanced and relatively hacky, so there doesn't seem to be much we can do on our side right now 🙁 Also note that we will be switching to sha* in the future, so md5 will no longer be there anyway. Is calculating real md5 by yourself feasible in your scenario?

@themaikelman
Copy link
Author

Not really. Instead of a "drive through" proxy, it will be a "stop and go" :(
If they send me 10GB I will have to write them to calculate the MD5 and then send... It does not seem very interesting

@efiop
Copy link
Contributor

efiop commented Jul 1, 2021

@atekoa Sorry to hear that 🙁 , but looks like there is no other official way to do that. Maybe modifying dvc for your use case to not use dos2unix would be possible, but that will make it incompatible with official dvc. Looks like the proper way to do that is to indeed compute md5 in your proxy on the fly and then use it.

Closing for now.

@efiop efiop closed this as completed Jul 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting response we are waiting for your reply, please respond! :)
Projects
None yet
Development

No branches or pull requests

3 participants