Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc checkout will leave empty folders #2397

Closed
glema-gpsw opened this issue Aug 14, 2019 · 14 comments
Closed

dvc checkout will leave empty folders #2397

glema-gpsw opened this issue Aug 14, 2019 · 14 comments
Labels
bug Did we break something? p2-medium Medium priority, should be done, but less important

Comments

@glema-gpsw
Copy link

glema-gpsw commented Aug 14, 2019

dvc checkout doesn't restore the exact snapshot as it was in a previous state.
For example if you create a new folder and add files, and want to go back to a previous commit, where this folder and the files didn't exist, it will still show the folder, but empty.

Below you will find a reproducer of the problem.

#!/bin/bash                            
                                       
set -e                                 
set -x                                 
                                       
rm -rf myrepo                          
mkdir myrepo                           
cd myrepo                              
                                       
git init                               
dvc init                               
                                       
mkdir dir                              
dvc run -o dir/foo 'echo foo > dir/foo'
                                       
rm foo.dvc                             
                                       
dvc checkout                           
                                       
tree
@efiop
Copy link
Contributor

efiop commented Aug 14, 2019

@efiop efiop added bug Did we break something? p2-medium Medium priority, should be done, but less important labels Aug 14, 2019
@efiop efiop added p2-medium Medium priority, should be done, but less important p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Sep 9, 2019
@efiop
Copy link
Contributor

efiop commented Sep 10, 2019

We are most likely using utils.remove method, so we probably need to make it(or some wrapper or whatever) remove empty parent directories up to repo root.

@efiop
Copy link
Contributor

efiop commented Sep 10, 2019

Ok, as we've discovered with @iterative/engineering , this

mkdir dir                              
dvc run -o dir/foo 'echo foo > dir/foo'

is actually wrong and dvc repro will break on it, because the command itself should be able to create all the needed subdirectories.

But even if we include mkdir sbudir, when you clone this repo elsewhere and you dvc checkout, dvc will automatically create that directory, so it should also remove it when it checks out elsewhere.

@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Sep 10, 2019
@dmpetrov
Copy link
Member

dmpetrov commented Feb 3, 2020

@efiop you explained that the first case is not correct. Could you please clarify with the second one - does DVC works properly and should we close this issue?

@efiop
Copy link
Contributor

efiop commented Feb 3, 2020

@dmpetrov No longer able to reproduce. Everything works as expected. Closing. Thanks for the heads up!

@efiop efiop closed this as completed Feb 3, 2020
@pokey
Copy link

pokey commented Oct 12, 2020

I am still able to reproduce this issue:

➜ dvc --version
1.8.2

➜ dvc init --no-scm
...

➜ mkdir foo

➜ dvc add foo
WARNING: 'foo' is empty.

➜ ls
foo/  foo.dvc

➜ cat foo.dvc
outs:
- md5: d751713988987e9331980363e24189ce.dir
  path: foo

➜ mkdir foo/bar

➜ dvc checkout --verbose
2020-10-12 14:27:46,792 DEBUG: Check for update is enabled.
2020-10-12 14:27:46,797 DEBUG: fetched: [(3,)]
2020-10-12 14:27:46,809 DEBUG: checking if 'foo'('HashInfo(name='md5', value='d751713988987e9331980363e24189ce.dir', dir_info=None)') has changed.
2020-10-12 14:27:46,809 DEBUG: Assuming '/Users/pokey/src/dvc-empty-test/.dvc/cache/d7/51713988987e9331980363e24189ce.dir' is unchanged since it is read-only
2020-10-12 14:27:46,811 DEBUG: Path '/Users/pokey/src/dvc-empty-test/foo' inode '87720699'
2020-10-12 14:27:46,811 DEBUG: fetched: [('99914b932bd37a50b983c5e7c90ae93b', '0', 'd751713988987e9331980363e24189ce.dir', '1602500101943105024')]
2020-10-12 14:27:46,812 DEBUG: 'foo' hasn't changed.
2020-10-12 14:27:46,813 DEBUG: Data 'foo' didn't change.
2020-10-12 14:27:46,814 DEBUG: fetched: [(4,)]
2020-10-12 14:27:46,817 DEBUG: Analytics is enabled.
2020-10-12 14:27:46,891 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/yb/k48gsy1s3qg1gdd72v9y6rdr0000gn/T/tmplekybbra']'
2020-10-12 14:27:46,893 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/yb/k48gsy1s3qg1gdd72v9y6rdr0000gn/T/tmplekybbra']'

➜ ls foo
bar/

At the end, I'd expect foo/bar to have been removed

@pared
Copy link
Contributor

pared commented Oct 13, 2020

@pokey is right, its still an issue:

def test_checkout_empty_dir(tmp_dir, dvc):
    tmp_dir.dvc_gen({"empty_dir":{}})
    tmp_dir.gen({"empty_dir": {"subdir": {}}})

    dvc.checkout()

    assert not os.path.exists(tmp_dir / "empty_dir" / "subdir")

This test fails. Seems like our check falsely reports that the directory is unchanged. This is probably due to the way we calculate cache for dir: introducing empty directories does not influence dir hash.

Obvious fix: introduce directories (even empty) information to dir hash is probably a no-go - that will make all dir hashes calculated prior to change invalid.

@pared pared reopened this Oct 13, 2020
@efiop
Copy link
Contributor

efiop commented Oct 18, 2020

This is probably due to the way we calculate cache for dir: introducing empty directories does not influence dir hash.

Obvious fix: introduce directories (even empty) information to dir hash is probably a no-go - that will make all dir hashes calculated prior to change invalid.

Great points! Just for the record: that behaviour might be revisited in #829 (comment)

But yeah, for now we could potentially just add some logic to clean up such cases.

@mvonpohle
Copy link

I'll just chime in and say that I'm still experiencing this problem on DVC 2.4.1.

@alanbuxton
Copy link

I'll just chime in and say that I'm still experiencing this problem on DVC 2.4.1.

Same on 2.9.3. Deleted some directories on one machine, pushed to remote, pulled this to my other machine and the directories remained, but were empty.

@efiop
Copy link
Contributor

efiop commented Jan 28, 2022

@alanbuxton Sounds like you are tracking an empty directory, right? If so, that is correct dvc behaviour (it saved that the dir is empty and re-creates it empty on pull/checkout). This ticket is about leaving empty directories when switching from one version to another.

@alanbuxton
Copy link

@efiop yes makes sense. Apols for getting wrong end of the stick.

@narbhar
Copy link

narbhar commented Jul 12, 2022

I still face this issue in version 2.13.0
In my case I created two versions of the dataset. Let's say the dataset folder is called "numbers". This has folders with all odd numbered named folders until 10 with, for e.g, "one", "three", "five"... and so on. I tag this on git as "odd"
Now I make changes to the dataset and remove all the folders inside "numbers" and replace then with a new set of folders whose folder names are all even numbered names now, for e.g, "two", "four", "six"... until 10. I tag this on git as "even"
Note: I perform dvc add <folder> and those necessary steps to push the datasets.

I delete the dataset folder now. If I type dvc checkout it rebuilds the dataset with "even" folders since it is the latest dataset in cache.
Now I wish to revert to "odd" dataset. I do
git checkout odd
dvc checkout

My numbers folder now has odd numbered folders and empty folders from "even" dataset.
Now I want to go the latest head on git for latest version of dataset. I do
git checkout -
dvc checkout
This now has empty folders from "odd" dataset along with populated "even" folders.

@efiop
Copy link
Contributor

efiop commented Dec 8, 2023

Checkout now deletes empty folders. Closing.

@efiop efiop closed this as completed Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

No branches or pull requests

8 participants