cache: drop dos2unix behavior and move cache to `files/md5/` prefix

#9538

Conversation
Compare: a9b08aa to 06f398a
example 3.x files:
```diff
 local_fs = self.repo.cache.legacy.fs
 parent = local_fs.path.parent(path)
-self.repo.cache.local.makedirs(parent)
+self.repo.cache.legacy.makedirs(parent)
 tmp = local_fs.path.join(parent, fs.utils.tmp_fname())
 assert os.path.exists(parent)
 assert os.path.isdir(parent)
 dump_yaml(tmp, cache)
-self.repo.cache.local.move(tmp, path)
+self.repo.cache.legacy.move(tmp, path)
```
I think run-cache should really just be using its own ODB instead of what we are doing now (starting with the local/legacy cache and then doing manual fs/transfer operations on top of that), but we can address that at some point in the future.
Codecov Report (patch coverage):

```
@@            Coverage Diff             @@
##             main    #9538      +/-   ##
==========================================
- Coverage   90.78%   90.73%   -0.05%
==========================================
  Files         470      470
  Lines       35925    36102     +177
  Branches     5172     5207      +35
==========================================
+ Hits        32613    32757     +144
- Misses       2727     2749      +22
- Partials      585      596      +11
```

View full report in Codecov by Sentry.
Thanks for the update. Is the main issue users are likely to face the scenario you laid out above?

Right. We can mitigate this locally with a migration script (where we move the data into 3.x cache locations and then just hardlink from the old 2.x location to the new one).
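A rough sketch of what such a per-object migration could look like (function name, paths, and layout here are hypothetical; this is not DVC's actual migration code):

```python
import os
import shutil


def migrate_object(legacy_path: str, new_path: str) -> None:
    """Move one cache object from its 2.x location to a 3.x location,
    then hardlink the old path back to the new one so the 2.x location
    keeps working without taking up a second physical copy.

    Illustrative sketch only; assumes both paths are on one filesystem.
    """
    os.makedirs(os.path.dirname(new_path), exist_ok=True)
    shutil.move(legacy_path, new_path)
    os.link(new_path, legacy_path)  # old location is now a hardlink, not a copy
```

Because the old path becomes a hardlink to the new one, both cache layouts resolve to the same on-disk data.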
Thanks, that's fine with me. Just wanted to make sure I wasn't missing some new issue caused by the updated plan.

How will garbage collection work? If users do end up with 2 physical copies, do we have a way to dedupe them?

Garbage collection will work the same way it does now: it will just gc the 2.x cache and then gc the 3.0 cache independently. It will not do any deduplication.
What happened to the discussion of allowing different hashes in the future or even switching to some other algorithms in 3.0? Should we namespace into

This was discussed between @efiop, @dberenbaum, and myself; it was decided to not address this in 3.0. If/when we ever decide to support additional hashes in the future we will likely just go with the assumption that

Why not do it today and be consistent with

That's fine with me if no one else has any objections.
One other thing to note is that pipeline stage code dependencies will almost always be reported as changed in a cross-platform environment now. If the user has something like
In the scenario where someone changes 1 file in a large directory, there's likely to be lots of duplicates, right? Is there some migration script we can provide to clean up scenarios like this? It feels like the migration script we have discussed is only useful if you know to use it ahead of time AFAIU.
Is there any way to configure git so that this isn't the case?
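For background (this is standard Git behavior, not an answer from this thread): line-ending conversion can be pinned per-repository with a `.gitattributes` rule, for example:

```
* text=auto eol=lf
```

With `eol=lf`, Git checks out text files with LF endings on every platform, so file contents (and therefore their hashes) stay stable across operating systems.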
No objection or opinion from me about that.
Honestly, I really doubt we will ever support other hashes officially. md5 is enough for us, and supporting multiple hashes is a hassle that probably no one will ever use. I wouldn't bother with it too much. Just having
I find this statement to be dishonest. MD5 is effectively broken at this point, even for integrity checks. A lot of users will disagree on this, as you can see in #3069. We may learn more about md5 in the future that leaves dvc completely broken, so we should start migrating away from it soon. Also, it's hard to define what "enough" is. DVC may end up being used in projects where there are security implications (or you thought it had no implications until you later realize it does). Or, some tiny feature of DVC may be used in a way that needs to be secure. So, we cannot and should not define security for our users.
I'm against supporting multiple hashes too (except internally). Ideally, we should just support one that is secure and performant enough. Today, that would be blake3 or sha256 (if we want to be FIPS compliant), preferably blake3. :) I see md5 as a temporary hashing algorithm until we migrate to a better solution. I am not saying that we do it today. :)
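For context, Python's standard `hashlib` already supports sha256 (blake3 requires the third-party `blake3` package); a minimal streaming file-hash helper might look like this (the function name and chunk size are my own, not DVC's API):

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256") -> str:
    """Stream a file through the named hashlib algorithm (illustrative sketch)."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fobj:
        # read in 1 MiB chunks so large files don't need to fit in memory
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()
```

Swapping the algorithm is then a one-argument change, which is why namespacing cache paths by hash name keeps the door open.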
I'd prefer its architecture/design to be reflected in the namespace for consistency, i.e. that it supports multiple hashing algorithms internally instead. Also, we should be consistent with
I don't think we need to re-litigate #3069 here. I'll go ahead and make the change in this PR to use
In this case, for the current DVC cache/remotes, we do want

We have also discussed moving dir objects into their own ODB (i.e.
Remaining remote plugin test failure is due to an outdated import stage hash. All of the plugins will need to be updated with a 3.x stage hash which includes the new (see iterative/dvc-s3#43).
Is this expected to change?

```bash
#!/bin/bash
set -x
# pipx install --suffix 2.58 dvc==2.58.2
pushd "$(mktemp -d)"
git init && dvc init
echo -ne "foo\r\n" > foo
dvc2.58 add foo
git add -A
dvc2.58 data status
dvc data status
```

```
+ dvc2.58 data status
No changes in an empty git repo.
(there are changes not tracked by dvc, use "git status" to see)
+ dvc data status
DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        modified: foo
(there are other changes not tracked by dvc, use "git status" to see)
```

```python
Change(
    typ='modify',
    old=DataIndexEntry(
        key=('foo',),
        meta=Meta(
            isdir=False,
            size=5,
            nfiles=None,
            isexec=False,
            version_id=None,
            etag=None,
            checksum=None,
            md5='d3b07384d113edec49eaa6238ad5ff00',
            inode=None,
            mtime=None,
            remote=None
        ),
        hash_info=HashInfo(name='md5', value='d3b07384d113edec49eaa6238ad5ff00', obj_name=None),
        loaded=None
    ),
    new=DataIndexEntry(
        key=('foo',),
        meta=Meta(
            isdir=False,
            size=5,
            nfiles=None,
            isexec=False,
            version_id=None,
            etag=None,
            checksum=None,
            md5=None,
            inode=2547,
            mtime=1686459999.367395,
            remote=None
        ),
        hash_info=HashInfo(name='md5', value='2145971cf82058b108229a3a2e3bff35', obj_name=None),
        loaded=None
    )
)
```

For reference:

```console
$ dvc-data hash -n md5 foo
md5: 2145971cf82058b108229a3a2e3bff35
$ dvc-data hash -n md5-dos2unix foo
md5-dos2unix: d3b07384d113edec49eaa6238ad5ff00
```
No, it should be reported as unchanged in both
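The two hashes above differ only because of line-ending normalization. A quick way to reproduce both values with plain `hashlib` (the CRLF-to-LF replacement here is my approximation of the 2.x dos2unix behavior, which DVC only applied to files it detected as text):

```python
import hashlib

data = b"foo\r\n"  # same bytes as the file created with `echo -ne "foo\r\n"`

# 3.x "md5": hash the raw bytes as-is
raw = hashlib.md5(data).hexdigest()

# 2.x "md5-dos2unix": CRLF normalized to LF before hashing
normalized = hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()

print(raw)         # 2145971cf82058b108229a3a2e3bff35
print(normalized)  # d3b07384d113edec49eaa6238ad5ff00
```

This is why the same on-disk file resolves to different cache entries in the 2.x and 3.x layouts.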
- [x] I have followed the Contributing to DVC checklist.
- [x] If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible.
Will close #4658
requires iterative/dvc-data#362
Internal changes:

- Drops `dos2unix` behavior by default in 3.x
- Cache is moved to the `<cache_dir>/files/md5` prefix (objects hashed with `md5`)
- Uses the `<remote_url>/files/md5` prefix for non-versioned remotes
- Adds the `hash: md5` field (in both `.dvc` and `dvc.lock`)
- When an output has no `hash:` field, the output will be treated as a legacy 2.x object and hashed with `md5-dos2unix`; this applies to `.dvc` or `dvc.lock` entries that do not have the 3.x `hash: ...` field

User-facing changes:

- "pre-existing DVC-tracked data" means all data or pipeline stage outs with DVC-committed entries in an existing `.dvc` or `dvc.lock` file which was generated in DVC 2.x. This does not include pipeline stage outs that are listed in a `dvc.yaml` but do not have a corresponding `dvc.lock` file entry.
- (`dvc add` / `dvc commit` / `dvc exp run` / `dvc repro`) will generate a 3.x DVC output in the `files/md5` cache location (`repro -f` / `exp run -f` will force writing to 3.x cache)
- `dvc commit` when there are no changes (i.e. `dvc status` reports that everything is up-to-date) will not commit anything to cache (but `dvc commit -f` will force writing to 3.x cache)
- a 2.x object (hashed with `md5-dos2unix`) matches a 3.x cache/remote object (hashed with `md5`)

Unchanged behavior:

- run-cache stays in the `<cache_dir>/runs` location (it is not moved to `files/runs/` in 3.x)
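For illustration, a 3.x `.dvc` entry carrying the new field might look roughly like this (field order and the `size` value are a sketch; the hash value is taken from the `foo` example earlier in the thread):

```yaml
outs:
- md5: 2145971cf82058b108229a3a2e3bff35
  size: 5
  hash: md5
  path: foo
```

Entries without the `hash: md5` line are the ones treated as legacy 2.x objects and hashed with `md5-dos2unix`.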