
[WIP] Add tests for datasets #966

Merged: 7 commits into pytorch:master from refactor-download on May 29, 2019

Conversation

@fmassa (Member) commented May 28, 2019

This is a WIP PR that implements one of the approaches discussed in #963.

It also adds functionality for extracting zip / tar / gzip files. For now it is only used in the MNIST-like datasets, but I'll update the PR so that the custom extraction logic elsewhere is removed in favour of this function.

I'm opening the PR now to see how CI behaves for those small datasets.
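For reference, a minimal sketch of what such an extraction helper could look like. The signature and suffix dispatch below are assumptions; only the ValueError message and the remove_finished / os.unlink behaviour are taken verbatim from the diff quoted in the review threads further down.

import gzip
import os
import shutil
import tarfile
import zipfile

def extract_file(from_path, to_path, remove_finished=False):
    # Dispatch on the archive suffix (hypothetical sketch, not the PR's exact code).
    if from_path.endswith((".tar.gz", ".tgz")):
        with tarfile.open(from_path, "r:gz") as tar:
            tar.extractall(path=to_path)
    elif from_path.endswith(".tar"):
        with tarfile.open(from_path, "r") as tar:
            tar.extractall(path=to_path)
    elif from_path.endswith(".gz"):
        # Plain gzip compresses a single file, so decompress to a file
        # named after the archive minus its .gz suffix.
        target = os.path.join(
            to_path, os.path.splitext(os.path.basename(from_path))[0])
        with gzip.open(from_path, "rb") as f_in, open(target, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    elif from_path.endswith(".zip"):
        with zipfile.ZipFile(from_path, "r") as z:
            z.extractall(to_path)
    else:
        raise ValueError("Extraction of {} not supported".format(from_path))

    if remove_finished:
        os.unlink(from_path)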

@codecov-io commented May 28, 2019

Codecov Report

Merging #966 into master will increase coverage by 1.8%.
The diff coverage is 84.44%.


@@            Coverage Diff            @@
##           master     #966     +/-   ##
=========================================
+ Coverage   60.03%   61.83%   +1.8%     
=========================================
  Files          64       64             
  Lines        5054     5055      +1     
  Branches      754      758      +4     
=========================================
+ Hits         3034     3126     +92     
+ Misses       1817     1716    -101     
- Partials      203      213     +10
Impacted Files                          Coverage Δ
torchvision/datasets/mnist.py           81.61%  <100%>    (+50.69%) ⬆️
torchvision/datasets/caltech.py         20.65%  <25%>     (+1.65%)  ⬆️
torchvision/datasets/omniglot.py        32%     <50%>     (+2.37%)  ⬆️
torchvision/datasets/cifar.py           38.2%   <50%>     (+1.24%)  ⬆️
torchvision/datasets/utils.py           60.62%  <93.75%>  (+12.2%)  ⬆️
torchvision/transforms/transforms.py    81.89%  <0%>      (-0.65%)  ⬇️
torchvision/datasets/imagenet.py        21.55%  <0%>      (ø)       ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2b3a1b6...b51e0b3

@fmassa (Member, Author) commented May 28, 2019

Note: adding those tests seems to have increased the test time by about 3 minutes, compared to a PR from yesterday: https://travis-ci.org/pytorch/vision/builds/537777376?utm_source=github_status&utm_medium=notification

The full test run previously took around 121 s; it now takes 320 s.

@fmassa requested a review from soumith on May 28, 2019 at 20:01

Review thread on the newly added imports:
import errno
import tarfile
import zipfile
Collaborator:

Shouldn't we do a lazy import in extract_file()? AFAIK we currently do this to avoid becoming dependent on these packages at import time.

@fmassa (Member, Author):

I believe tarfile, zipfile and gzip are part of the Python standard library, so I think this should be OK.
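For context, the lazy-import pattern the reviewer mentions moves the import into the function body, so merely importing the module does not require the dependency to be installed. A generic illustration (not code from this PR):

def open_image(path):
    # Deferred import: the dependency is only needed when this function
    # actually runs, so the enclosing module imports cleanly without it.
    from PIL import Image
    return Image.open(path)

As noted above, this is unnecessary for tarfile, zipfile and gzip, which ship with the standard library.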

Review thread on the remove_finished handling:

raise ValueError("Extraction of {} not supported".format(from_path))

if remove_finished:
    os.unlink(from_path)
Collaborator:
Pure curiosity: why did you use os.unlink instead of os.remove? I only just became aware that they provide the same functionality. I think os.remove would be clearer, since the flag is also called remove_finished and not unlink_finished.

@fmassa (Member, Author):

No reason other than that it was what MNIST used before, so I decided to do the same here.

Collaborator:
Fair enough.
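For the record, os.remove and os.unlink are documented as semantically identical; unlink is simply the traditional POSIX name for the same operation. A quick illustration:

import os
import tempfile

# Create a throwaway file, then delete it. os.unlink(path), which the PR
# uses, has exactly the same effect as os.remove(path).
fd, path = tempfile.mkstemp()
os.close(fd)
os.unlink(path)
assert not os.path.exists(path)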

@fmassa (Member, Author) commented May 29, 2019

I've compiled a list of sizes for each of the datasets that I considered "small":

Dataset        Size
MNIST          10 MB
KMNIST         20 MB
EMNIST         540 MB
FashionMNIST   30 MB
Cifar10        163 MB
Omniglot       9 MB
Caltech101     140 MB
Caltech256     1.1 GB
STL10          2.5 GB

In the current build we are downloading ~600 MB of datasets, which I think is too much; the majority of it comes from EMNIST.

I think we should look into an alternative approach for testing the datasets. For example, for an issue like #968, in the current state of things we can't test the repr of a dataset without first downloading and extracting 2.5 GB of data.

I'll remove the dataset test for EMNIST from this PR.
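One possible shape for such an alternative is to fabricate a tiny dataset on disk and construct the dataset class with download=False, so things like repr can be exercised without any network access. A sketch, assuming the on-disk layout MNIST used at the time (root/MNIST/processed/ containing training.pt and test.pt, each holding a (data, targets) tuple); the layout may differ across torchvision versions:

import os
import tempfile
import unittest

import torch
from torchvision import datasets

class FakeMNISTTest(unittest.TestCase):
    def test_repr_without_download(self):
        with tempfile.TemporaryDirectory() as root:
            # Write a minimal fake dataset in the layout MNIST expects.
            processed = os.path.join(root, "MNIST", "processed")
            os.makedirs(processed)
            data = torch.zeros(10, 28, 28, dtype=torch.uint8)
            targets = torch.zeros(10, dtype=torch.long)
            for name in ("training.pt", "test.pt"):
                torch.save((data, targets), os.path.join(processed, name))

            ds = datasets.MNIST(root=root, train=True, download=False)
            self.assertEqual(len(ds), 10)
            self.assertIn("Dataset MNIST", repr(ds))

if __name__ == "__main__":
    unittest.main()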

@fmassa merged commit c59f047 into pytorch:master on May 29, 2019
@fmassa deleted the refactor-download branch on May 29, 2019 at 12:50