Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repro: acknowledge that text file might come from different OS #6314

Closed
pared opened this issue Jul 14, 2021 Discussed in #6313 · 5 comments
Closed

repro: acknowledge that text file might come from different OS #6314

pared opened this issue Jul 14, 2021 Discussed in #6313 · 5 comments
Labels
bug Did we break something? research

Comments

@pared
Copy link
Contributor

pared commented Jul 14, 2021

Discussed in #6313

Originally posted by SebbanSms July 14, 2021
I have some issues using a s3 bucket to push and pull input data with dvc.

I have a stage where the deps are outs of a prior stage.

When I reproduce the dvc.yaml and push the data on OSX(Mac)
and then pull it on another machine using Windows, I reproduce the dvc.yaml again,
dvc.lock shows the same hash but different file sizes for that stage deps and reruns the stage completely, then also showing different files sizes and hash on the outs

git marks the diff in my IDE in the deps of that stage in the .lock file only for sizes:

image

Any idea how the file size in deps could change if the hash is the same?

What I tried so far:
deleting the files on OSX, pull them again from s3, reproducing dvc.yaml -> no changes detected, dvc.lock stays the same
delteing the files on Windows, pull them again from s3, reproducing dvc.yaml -> changes detected, dvc.lock shows different file sizes for the files in deps of that stage

On both systems, I definitely use the same git commit.

It seems that running repro on different file system will retrigger existing stages just because file size from Unix system is different than one from Windows.
We should probably acknowledge that OS when adding the file (same as we do with calculating hash). The problem is that changing this behaviour now would probably affect all repositories on Windows that use text files.

Maybe we could do some additional check on Windows that would allow to verify that given file is unchanged even if the sizes do not match?

@pared pared added bug Did we break something? research labels Jul 14, 2021
@SebBanDev
Copy link

SebBanDev commented Jul 20, 2021

We had a few more cases where files pushed on Windows showed a change only in size when pulled and run with dvc repro on OSx. As mentioned in the discussion, probably only the line endings changed somehow

@Christoph-1
Copy link

I'm working with @SebbanSms on a project and we have different os. Would be great to find a solution here.

@efiop
Copy link
Contributor

efiop commented Jul 20, 2021

Caused by #4658 The workaround for now is to use core.autocrlf git config option and dos2unix files that you are adding to dvc. As noted by @pared , the proper solution for this will affect compatibility for everyone, so that's why we are putting it off for a so-called 3.0 cache, where we'll likely switch to another proper hash (maybe sha256) that will no longer be affected by this.

@SebBanDev
Copy link

SebBanDev commented Jul 20, 2021

The workaround for now is to use core.autocrlf git config option and dos2unix files that you are adding to dvc.

@efiop Thanks for the reply. Do you have a guide in the dvc documentation how to set this up? I admit, I'm quite unfamiliar with git config.

Currently we use a .gitattributes file with the setting:
* text=auto
(I think this should make sure files in git repo use LF even if they use CRLF locally?)

And my guess is we would need to use dos2unix in some way to make sure .csv files use LF line endings all the time? (local and in s3 bucket)

@skshetry
Copy link
Member

skshetry commented Mar 25, 2024

DVC stopped using dos2unix hashes, so in 3.x, the hashes are different too.

We recommend users to ensure outputs are generated with proper line-endings.
See https://dvc.org/doc/user-guide/upgrade#file-hashing-changes.

Closing.

@skshetry skshetry closed this as not planned Won't fix, can't repro, duplicate, stale Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? research
Projects
None yet
Development

No branches or pull requests

5 participants