
cache: drop dos2unix behavior and move cache to files/md5/ prefix #9538

Merged: 4 commits into iterative:main from 3.x-cache on Jun 9, 2023

Conversation

@pmrowla (Contributor) commented Jun 5, 2023


Will close #4658

Requires iterative/dvc-data#362

Internal changes:

  • Drops dos2unix behavior by default in 3.x
  • 3.x DVC data objects are now stored in <cache_dir>/files/md5
    • 3.x objects are hashed with real md5
    • 3.x objects are likewise stored under the <remote_url>/files/md5 prefix for non-versioned remotes
  • Serialized 3.x outputs now include a hash: md5 field (in both .dvc and dvc.lock).
    • When de-serializing an output from a dvcfile that does not include the hash: field, the output will be treated as a legacy 2.x object and hashed with md5-dos2unix
  • Legacy 2.x DVC data objects are left in place in the existing cache/remote locations
    • Legacy objects will be read from their legacy locations when DVC encounters a .dvc or dvc.lock entry that does not have the 3.x hash: ... field
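
The hashing and deserialization rules above can be sketched as follows (a minimal illustration, not DVC's actual implementation; the helper names are hypothetical, and the dos2unix variant here ignores DVC's text-file detection details):

```python
import hashlib

def md5_dos2unix(data: bytes) -> str:
    # Legacy 2.x behavior: CRLF line endings are normalized to LF
    # before hashing (the "dos2unix" behavior this PR drops).
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()

def md5_plain(data: bytes) -> str:
    # 3.x behavior: hash the raw bytes with real md5, no normalization.
    return hashlib.md5(data).hexdigest()

def hash_name_for_out(out_entry: dict) -> str:
    # A serialized 3.x output includes an explicit "hash: md5" field;
    # an entry without it is treated as a legacy 2.x object and
    # hashed with md5-dos2unix.
    return out_entry.get("hash", "md5-dos2unix")
```

For content without CRLF sequences the two digests coincide, which is why legacy data created on unix platforms often hashes identically under both schemes.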

User-facing changes:

  • All pre-existing DVC-tracked data (from DVC 2.x) can be fetched and/or checked out as needed from its original 2.x cache/remote storage locations
    • "Pre-existing DVC-tracked data" means all data or pipeline stage outs with DVC-committed entries in an existing .dvc or dvc.lock file generated by DVC 2.x. It does not include pipeline stage outs that are listed in a dvc.yaml but have no corresponding dvc.lock entry.
  • Any operation which results in a DVC committed change (dvc add/dvc commit/dvc exp run/dvc repro) will generate a 3.x DVC output
    • All committed data will be written to the 3.x files/md5 cache location
    • Note that this only applies to commit operations which would result in writing data to cache
      • For pipeline runs, skipped stages and unchanged stages which are loaded from run-cache do not result in committing anything to cache (but repro -f/exp run -f will force writing to 3.x cache)
      • dvc commit when there are no changes (i.e. dvc status reports that everything is up-to-date) will not commit anything to cache (but dvc commit -f will force writing to 3.x cache)
  • No file de-duplication is done between 2.x and 3.x data
    • DVC does not attempt to identify whether or not an existing 2.x cache/remote object (hashed with md5-dos2unix) matches a 3.x cache/remote object (hashed with md5)
    • Essentially, legacy 2.x data is kept completely separate from 3.x data

Unchanged behavior:

  • Run-cache remains stored in the existing <cache_dir>/runs location (it is not moved to files/runs/ in 3.x)
  • Cloud-versioned data handling is unchanged from the user perspective

@pmrowla added the breaking-change and A: data-management (Related to dvc add/checkout/commit/move/remove) labels on Jun 5, 2023
@pmrowla pmrowla self-assigned this Jun 5, 2023
@pmrowla pmrowla force-pushed the 3.x-cache branch 2 times, most recently from a9b08aa to 06f398a Compare June 6, 2023 07:53
@pmrowla (Contributor, Author) commented Jun 6, 2023

example 3.x files:

.dvc

outs:
- md5: a2ead3516dd1be4a3c7f45716c0a0eb7
  size: 15
  hash: md5
  path: test.txt

dvc.lock

schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - path: data/data.xml
      hash: md5
      md5: 22a1a2931c8370d3aeedd7183606fd7f
      size: 14445097
    - path: src/prepare.py
      hash: md5
      md5: f09ea0c15980b43010257ccb9f0055e2
      size: 1576
    params:
      params.yaml:
        prepare.seed: 20170428
        prepare.split: 0.2
    outs:
    - path: data/prepared
      hash: md5
      md5: 153aad06d376b6595932470e459ef42a.dir
      size: 8437363
      nfiles: 2

@pmrowla: this comment was marked as outdated.

@pmrowla changed the title from "[WIP] cache: drop dos2unix behavior and move cache to files/ prefix" to "cache: drop dos2unix behavior and move cache to files/ prefix" on Jun 6, 2023
Comment on lines +169 to +176

      local_fs = self.repo.cache.legacy.fs
      parent = local_fs.path.parent(path)
 -    self.repo.cache.local.makedirs(parent)
 +    self.repo.cache.legacy.makedirs(parent)
      tmp = local_fs.path.join(parent, fs.utils.tmp_fname())
      assert os.path.exists(parent)
      assert os.path.isdir(parent)
      dump_yaml(tmp, cache)
 -    self.repo.cache.local.move(tmp, path)
 +    self.repo.cache.legacy.move(tmp, path)
@pmrowla (Contributor, Author):

I think run-cache should really just be using its own ODB instead of what we are doing now (starting from the local/legacy cache and then doing manual fs/transfer operations on top of it), but we can address that at some point in the future

@codecov codecov bot commented Jun 6, 2023

Codecov Report

Patch coverage: 86.31% and project coverage change: -0.05 ⚠️

Comparison is base (528eab0) 90.78% compared to head (c28511a) 90.73%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9538      +/-   ##
==========================================
- Coverage   90.78%   90.73%   -0.05%     
==========================================
  Files         470      470              
  Lines       35925    36102     +177     
  Branches     5172     5207      +35     
==========================================
+ Hits        32613    32757     +144     
- Misses       2727     2749      +22     
- Partials      585      596      +11     
Impacted Files Coverage Ξ”
dvc/info.py 88.77% <ΓΈ> (ΓΈ)
dvc/repo/open_repo.py 82.38% <ΓΈ> (ΓΈ)
dvc/schema.py 100.00% <ΓΈ> (ΓΈ)
tests/func/test_add.py 98.94% <ΓΈ> (ΓΈ)
tests/func/test_move.py 100.00% <ΓΈ> (ΓΈ)
tests/func/test_run.py 100.00% <ΓΈ> (ΓΈ)
dvc/testing/workspace_tests.py 51.98% <25.00%> (ΓΈ)
dvc/data_cloud.py 71.42% <62.50%> (-10.77%) ⬇️
dvc/repo/imports.py 87.69% <90.90%> (-1.97%) ⬇️
dvc/output.py 86.33% <97.77%> (+0.57%) ⬆️
... and 34 more

β˜” View full report in Codecov by Sentry.
πŸ“’ Do you have feedback about the report comment? Let us know in this issue.

@dberenbaum (Collaborator):

> update:
> After discussion with @efiop we decided it's preferable to just go with the original plan and keep all writes in the 3.x cache location. For users with concerns about the duplicated local storage, we can provide a migration script that rehashes and moves 2.x cache files to the 3.x locations and then hardlinks from the 2.x cache -> 3.x cache

Thanks for the update. Is the main issue users are likely to face the scenario you laid out above?

If a user has a large 2.x directory, and then adds or modifies a single file in 3.x, we would end up needing to copy (and duplicate) the entire directory contents into the 3.x cache.

@pmrowla (Contributor, Author) commented Jun 7, 2023

> Thanks for the update. Is the main issue users are likely to face the scenario you laid out above?
>
> If a user has a large 2.x directory, and then adds or modifies a single file in 3.x, we would end up needing to copy (and duplicate) the entire directory contents into 3.x cache.

Right. We can mitigate this locally with a migration script (where we move the data into 3.x cache locations and then just hardlink from the old 2.x location to the new one)

@dberenbaum (Collaborator):

Thanks, that's fine with me. Just wanted to make sure I wasn't missing some new issue caused by the updated plan.

@dberenbaum (Collaborator):

How will garbage collection work? If users do end up with 2 physical copies, do we have a way to dedupe them?

@pmrowla (Contributor, Author) commented Jun 8, 2023

> How will garbage collection work? If users do end up with 2 physical copies, do we have a way to dedupe them?

Garbage collection will work the same way it does now: it will gc the 2.x cache and the 3.x cache independently. It will not do any deduplication

@skshetry (Member) commented Jun 8, 2023

What happened to the discussion of allowing different hashes in the future or even switching to some other algorithms in 3.0? Should we namespace into files/md5/?

@pmrowla (Contributor, Author) commented Jun 8, 2023

> What happened to the discussion of allowing different hashes in the future or even switching to some other algorithms in 3.0? Should we namespace into files/md5/?

This was discussed between @efiop, @dberenbaum, and me; we decided not to address it in 3.0.

If/when we ever decide to support additional hashes in the future we will likely just go with the assumption that files/ with no prefix is where md5 lives, and then any additional hashes would live in files/<hash_name>/

@skshetry (Member) commented Jun 8, 2023

> If/when we ever decide to support additional hashes in the future we will likely just go with the assumption that files/ with no prefix is where md5 lives, and then any additional hashes would live in files/<hash_name>/

Why not do it today and be consistent with files/md5?

@pmrowla (Contributor, Author) commented Jun 8, 2023

> Why not do it today and be consistent with files/md5?

That's fine with me if no one else has any objections.

@pmrowla (Contributor, Author) commented Jun 8, 2023

One other thing to note is that pipeline stage code dependencies will almost always be reported as changed in a cross-platform environment now.

If the user has something like train.py listed in their stage deps, on Windows that file will likely have CRLF line endings while on unix it will have LF line endings, and the DVC-computed MD5 hashes for that file will differ on each platform. So if the user runs the pipeline on Linux and git commits + git pushes everything, and then clones the repo on a Windows machine, DVC will report the pipeline stage as needing to run because the code dep changed.
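
The effect is easy to reproduce with plain hashlib (the sample bytes here are just an illustration; the digests shown are standard md5 values):

```python
import hashlib

# The same logical file checked out with unix (LF) vs. Windows (CRLF)
# line endings contains different raw bytes, so the real-md5 hashes
# used in 3.x differ, whereas 2.x's md5-dos2unix normalized CRLF to LF
# before hashing and produced identical digests on both platforms.
lf_digest = hashlib.md5(b"foo\n").hexdigest()
crlf_digest = hashlib.md5(b"foo\r\n").hexdigest()

print(lf_digest)   # d3b07384d113edec49eaa6238ad5ff00
print(crlf_digest) # 2145971cf82058b108229a3a2e3bff35
```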

@dberenbaum (Collaborator):

> It will not do any deduplication

In the scenario where someone changes 1 file in a large directory, there are likely to be lots of duplicates, right? Is there some migration script we can provide to clean up scenarios like this? It feels like the migration script we have discussed is only useful if you know to use it ahead of time, AFAIU.

> One other thing to note is that pipeline stage code dependencies will almost always be reported as changed in a cross-platform environment now.

Is there any way to configure git so that this isn't the case?

> Why not do it today and be consistent with files/md5?

No objection or opinion from me about that.

@efiop (Contributor) commented Jun 8, 2023

> Why not do it today and be consistent with files/md5?

Honestly, I really doubt we will ever support other hashes officially. md5 is enough for us, and supporting multiple hashes is a hassle that probably no one will ever use. I wouldn't bother with it too much. Just having .dvc/cache/files is what we really need right now, and if multiple hashes ever become a thing we can always do the <hash> subdir later. But both doing and not doing that is fine by me.

@skshetry (Member) commented Jun 9, 2023

> md5 is enough for us

I find this statement to be dishonest. MD5 is effectively broken at this point, even for integrity checks. A lot of users will disagree on this, as you can see in #3069. We may learn more about md5 in the future that leaves dvc completely broken, so we should start migrating away from it soon.

Also, it's hard to define what "enough" is. DVC may end up being used in projects where there are security implications (or you thought it had no implications until you later realize it does). Or, some tiny feature of DVC that is used in a way that needs to be secure.

So, we cannot and should not define security for our users.

> and supporting multiple hashes is a hassle that probably no one will ever use.

I'm against supporting multiple hashes too (except internally). Ideally, we should just support one that is secure and performant enough. Today, that would be blake3 or sha256 (if we want to be FIPS compliant), preferably blake3. :)

I see md5 as a temporary hashing algorithm until we migrate to a better solution. I am not saying that we do it today. :)

> Just having .dvc/cache/files is what we really need right now and if multiple hashes will ever become a thing we can always do the <hash> subdir later. But both doing and not doing that is fine by me.

I'd prefer the architecture/design to be reflected in the namespace for consistency, i.e. so that it supports multiple hashing algorithms internally. Also, we should be consistent about files vs. objects internally, so it should either be cache/files or cache/objects.

@pmrowla (Contributor, Author) commented Jun 9, 2023

I don't think we need to re-litigate #3069 in this PR. I'll go ahead and make the change in this PR to use files/md5/ since it doesn't really cost us anything and does make migrating hashes in the future cleaner if/when we ever decide to do so.

> Also, we should be consistent with files/objects internally, so it should either be cache/files or cache/objects.

In this case, for the current DVC cache/remotes, we do want files. files here specifically means objects which are entire individual files that can be linked into a workspace on checkout.

We have also discussed moving dir objects into their own ODB (i.e. dirs/md5/...) for performance reasons (to allow faster remote lookups) but that won't be done in this PR. And in the same vein, if/when we ever support chunking, packs would probably also go into their own separate ODB
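
As a small illustration of the layout being adopted here (the `object_path` helper is hypothetical; the two-character fan-out matches how DVC shards object files on disk):

```python
import os

def object_path(cache_dir: str, hash_name: str, digest: str) -> str:
    # e.g. .dvc/cache/files/md5/a2/ead3516dd1be4a3c7f45716c0a0eb7
    # A future hash algorithm would simply use a different hash_name
    # segment (e.g. files/sha256/...) without disturbing md5 objects.
    return os.path.join(cache_dir, "files", hash_name, digest[:2], digest[2:])
```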

@pmrowla changed the title from "cache: drop dos2unix behavior and move cache to files/ prefix" to "cache: drop dos2unix behavior and move cache to files/md5/ prefix" on Jun 9, 2023
@pmrowla (Contributor, Author) commented Jun 9, 2023

Remaining remote plugin test failure is due to outdated import stage hash

All of the plugins will need to be updated with a 3.x stage hash which includes the new hash: md5 entry once this PR is merged

(see iterative/dvc-s3#43)

@pmrowla pmrowla marked this pull request as ready for review June 9, 2023 09:22
@efiop efiop merged commit 265fc58 into iterative:main Jun 9, 2023
@pmrowla pmrowla deleted the 3.x-cache branch June 9, 2023 14:10
@skshetry (Member) commented Jun 11, 2023

Is this expected to change?

#! /bin/bash

set -x

# pipx install --suffix 2.58 dvc==2.58.2
pushd "$(mktemp -d)"

git init && dvc init
echo -ne "foo\r\n" > foo
dvc2.58 add foo

git add -A

dvc2.58 data status
dvc data status
+ dvc2.58 data status
No changes in an empty git repo.
(there are changes not tracked by dvc, use "git status" to see)
+ dvc data status
DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        modified: foo
(there are other changes not tracked by dvc, use "git status" to see)
Change(
    typ='modify',
    old=DataIndexEntry(
        key=('foo',),
        meta=Meta(
            isdir=False,
            size=5,
            nfiles=None,
            isexec=False,
            version_id=None,
            etag=None,
            checksum=None,
            md5='d3b07384d113edec49eaa6238ad5ff00',
            inode=None,
            mtime=None,
            remote=None
        ),
        hash_info=HashInfo(name='md5', value='d3b07384d113edec49eaa6238ad5ff00', obj_name=None),
        loaded=None
    ),
    new=DataIndexEntry(
        key=('foo',),
        meta=Meta(
            isdir=False,
            size=5,
            nfiles=None,
            isexec=False,
            version_id=None,
            etag=None,
            checksum=None,
            md5=None,
            inode=2547,
            mtime=1686459999.367395,
            remote=None
        ),
        hash_info=HashInfo(name='md5', value='2145971cf82058b108229a3a2e3bff35', obj_name=None),
        loaded=None
    )
)

For reference,

$ dvc-data hash -n md5 foo
md5: 2145971cf82058b108229a3a2e3bff35
$ dvc-data hash -n md5-dos2unix foo 
md5-dos2unix: d3b07384d113edec49eaa6238ad5ff00

@pmrowla (Contributor, Author) commented Jun 11, 2023

No, it should be reported as unchanged in both dvc status and dvc data status. I'll look into it.
