Empty subdirectories left in place while checking out previous version of the datasets #4344

Open
atifraza opened this issue Aug 5, 2020 · 6 comments
Labels
bug Did we break something? p2-medium Medium priority, should be done, but less important

Comments


atifraza commented Aug 5, 2020

Bug Report

I am tracking versions of a set of datasets in this repository.
The directory structure is as shown below.

.
|- datasets             // Root directory for all datasets
|  |
|  |- dataset1          // Directories are named after the datasets
|  |  |
|  |  |- TRAIN.tsv
|  |  |- ...
|  |
|  |- ...

Each version of the datasets adds additional subdirectories to the datasets directory.
When checking out an older version (say v1) using git checkout v1 followed by a dvc checkout, DVC leaves empty subdirectories instead of removing them.

Specifically, if dataset1 was present in v1 but dataset2 was added by v2, checking out v1 leaves behind an empty dataset2 directory.

.
|- datasets             // Root directory for all datasets
|  |
|  |- dataset1          // Directories are named after the datasets
|  |  |
|  |  |- TRAIN.tsv      // Train/Test sets are actually from the correct version
|  |  |- ...
|  |
|  |- dataset2          // Empty directory
|  |
|  |- ...

Output of dvc version:

$ dvc version -v

DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-4.9.0-0.bpo.6-amd64-x86_64-with-glibc2.10
Supports: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: hardlink, symlink
Repo: dvc, git
2020-08-06 00:01:05,317 DEBUG: Analytics is enabled.
2020-08-06 00:01:05,410 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmppp9x3ynw']'
2020-08-06 00:01:05,411 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmppp9x3ynw']'
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Aug 5, 2020
Member

shcheklein commented Aug 5, 2020

I was able to reproduce it with this:

mkdir datasets
mkdir datasets/dir1
echo "file1" > datasets/dir1/file1
dvc add datasets
git add .
git commit -a -m "add datasets v1"
mkdir datasets/dir2
echo "file2" > datasets/dir2/file2
dvc add datasets
git add .
git commit -a -m "add datasets v2"
git checkout HEAD^
dvc checkout
tree datasets

outputs:

datasets
├── dir1
│   └── file1
└── dir2

@pared pared added bug Did we break something? p1-important Important, aka current backlog of things to do labels Aug 6, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Aug 6, 2020
@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Aug 6, 2020

lefos99 commented Feb 8, 2022

Do we have any updates on this? I still encounter this issue in version 2.3.0. 🤔

Contributor

karajan1001 commented Feb 15, 2022

@lefos99, I tried the script in #4344 (comment) and still get this result:

tree datasets
datasets
├── dir1
│   └── file1
└── dir2

2 directories, 1 file

Contributor

efiop commented Feb 15, 2022

As noted by @skshetry, this is related to #7374. Both depend on using dvc/data/diff.py and not touching unrelated files, while still detecting and removing empty dirs.
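
To make the requirement concrete, here is a minimal sketch of "remove empty dirs without touching unrelated files". It is not DVC's implementation and does not use dvc/data/diff.py; the function name `prune_empty_parents` and its arguments are hypothetical. It assumes the checkout already knows which files it deleted, and it only walks upward from those files, so directories the checkout never touched are left alone even if they happen to be empty.

```python
from pathlib import Path


def prune_empty_parents(removed_files, root):
    """Remove directories left empty after `removed_files` were deleted.

    Hypothetical helper, not part of DVC. Only ancestors of the removed
    files that lie strictly below `root` are considered, so unrelated
    directories are never touched.
    """
    root = Path(root).resolve()
    pruned = []
    for file_path in removed_files:
        current = Path(file_path).resolve().parent
        # Walk upward until we hit `root` or a directory that still has content.
        while current != root and root in current.parents:
            if not current.is_dir() or any(current.iterdir()):
                break
            current.rmdir()  # rmdir only succeeds on empty directories
            pruned.append(current)
            current = current.parent
    return pruned
```

With the reproduction script above, calling `prune_empty_parents(["datasets/dir2/file2"], "datasets")` after the checkout would be expected to remove `datasets/dir2` and leave `datasets/dir1` untouched.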

@timothylimyl

Hi, I can confirm that this problem still exists. I guess we can safeguard against it in our own code by ignoring empty directories for now, since this does not seem to be a priority.
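
One possible shape for such a safeguard, as a rough sketch only: the helper name `non_empty_subdirs` is made up for illustration and nothing here is a DVC API. It assumes each dataset lives in an immediate subdirectory of the datasets root, as in the layout described in the report, and it treats a subdirectory with no files anywhere beneath it as a leftover to skip.

```python
from pathlib import Path


def non_empty_subdirs(root):
    """Yield immediate subdirectories of `root` that contain at least one file.

    Hypothetical user-side workaround: leftover empty directories (including
    ones that only contain other empty directories) are skipped.
    """
    for child in sorted(Path(root).iterdir()):
        if child.is_dir() and any(p.is_file() for p in child.rglob("*")):
            yield child
```

Code that loads the datasets would then iterate over `non_empty_subdirs("datasets")` instead of listing the directory directly.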

@dberenbaum
Collaborator

Is this a duplicate of #2397?

8 participants