
[WIP] Add tests for datasets #966

Merged: 7 commits into pytorch:master from refactor-download on May 29, 2019

Conversation

@fmassa (Member) commented May 28, 2019

This is a WIP PR that implements one of the approaches discussed in #963.

It also adds functionality for extracting zip / tar / gzip files. For now it is only used in the MNIST-like datasets, but I'll update the PR so that the custom extraction logic elsewhere is removed in favour of this function.

I'm opening the PR now to see how CI behaves for those small datasets.
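For reference, a minimal sketch of what such an extraction helper could look like. The signature and suffix dispatch below are assumptions; only the ValueError message and the remove_finished / os.unlink behaviour are taken verbatim from the diff quoted in the review threads further down.

import gzip
import os
import shutil
import tarfile
import zipfile

def extract_file(from_path, to_path, remove_finished=False):
    # Dispatch on the archive suffix (hypothetical sketch, not the PR's exact code).
    if from_path.endswith((".tar.gz", ".tgz")):
        with tarfile.open(from_path, "r:gz") as tar:
            tar.extractall(path=to_path)
    elif from_path.endswith(".tar"):
        with tarfile.open(from_path, "r") as tar:
            tar.extractall(path=to_path)
    elif from_path.endswith(".gz"):
        # Plain gzip compresses a single file, so decompress to a file
        # named after the archive minus its .gz suffix.
        target = os.path.join(
            to_path, os.path.splitext(os.path.basename(from_path))[0])
        with gzip.open(from_path, "rb") as f_in, open(target, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    elif from_path.endswith(".zip"):
        with zipfile.ZipFile(from_path, "r") as z:
            z.extractall(to_path)
    else:
        raise ValueError("Extraction of {} not supported".format(from_path))

    if remove_finished:
        os.unlink(from_path)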

@codecov-io commented May 28, 2019

Codecov Report

Merging #966 into master will increase coverage by 1.8%.
The diff coverage is 84.44%.


@@            Coverage Diff            @@
##           master     #966     +/-   ##
=========================================
+ Coverage   60.03%   61.83%   +1.8%     
=========================================
  Files          64       64             
  Lines        5054     5055      +1     
  Branches      754      758      +4     
=========================================
+ Hits         3034     3126     +92     
+ Misses       1817     1716    -101     
- Partials      203      213     +10
Impacted Files                          Coverage Δ
torchvision/datasets/mnist.py           81.61%  <100%>    (+50.69%) ⬆️
torchvision/datasets/caltech.py         20.65%  <25%>     (+1.65%)  ⬆️
torchvision/datasets/omniglot.py        32%     <50%>     (+2.37%)  ⬆️
torchvision/datasets/cifar.py           38.2%   <50%>     (+1.24%)  ⬆️
torchvision/datasets/utils.py           60.62%  <93.75%>  (+12.2%)  ⬆️
torchvision/transforms/transforms.py    81.89%  <0%>      (-0.65%)  ⬇️
torchvision/datasets/imagenet.py        21.55%  <0%>      (ø)       ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2b3a1b6...b51e0b3

@fmassa (Member, Author) commented May 28, 2019

Note: adding those tests seems to have increased the test time by about 3 minutes, compared to a PR from yesterday: https://travis-ci.org/pytorch/vision/builds/537777376?utm_source=github_status&utm_medium=notification

The full test run previously took around 121 s; it now takes 320 s.

@fmassa requested a review from soumith on May 28, 2019 at 20:01

Review thread on the newly added imports:
import errno
import tarfile
import zipfile
Collaborator:

Shouldn't we do a lazy import in extract_file()? AFAIK we currently do this to avoid becoming dependent on these packages at import time.

@fmassa (Member, Author):

I believe tarfile, zipfile and gzip are part of the Python standard library, so I think this should be OK.
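For context, the lazy-import pattern the reviewer mentions moves the import into the function body, so merely importing the module does not require the dependency to be installed. A generic illustration (not code from this PR):

def open_image(path):
    # Deferred import: the dependency is only needed when this function
    # actually runs, so the enclosing module imports cleanly without it.
    from PIL import Image
    return Image.open(path)

As noted above, this is unnecessary for tarfile, zipfile and gzip, which ship with the standard library.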

Review thread on the remove_finished handling:

raise ValueError("Extraction of {} not supported".format(from_path))

if remove_finished:
    os.unlink(from_path)
Collaborator:
Pure curiosity: why did you use os.unlink instead of os.remove? I only just became aware that they provide the same functionality. I think os.remove would be clearer, since the flag is also called remove_finished and not unlink_finished.

@fmassa (Member, Author):

No reason other than that it was what MNIST used before, so I decided to do the same here.

Collaborator:
Fair enough.
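For the record, os.remove and os.unlink are documented as semantically identical; unlink is simply the traditional POSIX name for the same operation. A quick illustration:

import os
import tempfile

# Create a throwaway file, then delete it. os.unlink(path), which the PR
# uses, has exactly the same effect as os.remove(path).
fd, path = tempfile.mkstemp()
os.close(fd)
os.unlink(path)
assert not os.path.exists(path)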

@fmassa (Member, Author) commented May 29, 2019

I've compiled a list of sizes for each of the datasets that I considered "small":

Dataset        Size
MNIST          10 MB
KMNIST         20 MB
EMNIST         540 MB
FashionMNIST   30 MB
Cifar10        163 MB
Omniglot       9 MB
Caltech101     140 MB
Caltech256     1.1 GB
STL10          2.5 GB

In the current build we are downloading ~600 MB of datasets, which I think is too much; the majority of it comes from EMNIST.

I think we should look into an alternative approach for testing the datasets. For example, for an issue like #968, in the current state of things we can't test the repr of a dataset without first downloading and extracting 2.5 GB of data.

I'll remove the dataset test for EMNIST from this PR.
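One possible shape for such an alternative is to fabricate a tiny dataset on disk and construct the dataset class with download=False, so things like repr can be exercised without any network access. A sketch, assuming the on-disk layout MNIST used at the time (root/MNIST/processed/ containing training.pt and test.pt, each holding a (data, targets) tuple); the layout may differ across torchvision versions:

import os
import tempfile
import unittest

import torch
from torchvision import datasets

class FakeMNISTTest(unittest.TestCase):
    def test_repr_without_download(self):
        with tempfile.TemporaryDirectory() as root:
            # Write a minimal fake dataset in the layout MNIST expects.
            processed = os.path.join(root, "MNIST", "processed")
            os.makedirs(processed)
            data = torch.zeros(10, 28, 28, dtype=torch.uint8)
            targets = torch.zeros(10, dtype=torch.long)
            for name in ("training.pt", "test.pt"):
                torch.save((data, targets), os.path.join(processed, name))

            ds = datasets.MNIST(root=root, train=True, download=False)
            self.assertEqual(len(ds), 10)
            self.assertIn("Dataset MNIST", repr(ds))

if __name__ == "__main__":
    unittest.main()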

@fmassa merged commit c59f047 into pytorch:master on May 29, 2019
@fmassa deleted the refactor-download branch on May 29, 2019 at 12:50