[WIP] Add tests for datasets #966
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #966     +/-    ##
===========================================
+ Coverage    60.03%   61.83%    +1.8%
===========================================
  Files           64       64
  Lines         5054     5055       +1
  Branches       754      758       +4
===========================================
+ Hits          3034     3126      +92
+ Misses        1817     1716     -101
- Partials       203      213      +10
Continue to review full report at Codecov.
Note: adding those tests seems to have increased the test time by about 3 minutes (compare against a PR from yesterday: https://travis-ci.org/pytorch/vision/builds/537777376?utm_source=github_status&utm_medium=notification). The full test run used to take around 121s; now it takes 320s.
import errno
import tarfile
import zipfile
Shouldn't we do a lazy import in extract_file()? AFAIK we do this now to prevent becoming dependent on these packages at import.
I believe tarfile, zipfile and gzip are part of the Python standard library, so I think this should be ok.
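To make the two options being discussed concrete, here is a minimal illustration; the function names extract_tar / extract_tar_lazy are mine, not names used in the diff:

# Option A: module-level import, as the diff does. Safe here because
# tarfile, zipfile and gzip all ship with the Python standard library.
import tarfile


def extract_tar(from_path, to_path):
    with tarfile.open(from_path, "r") as tar:
        tar.extractall(path=to_path)


# Option B: lazy import, deferred until extraction is actually requested.
# This pattern mainly matters for optional third-party dependencies, where
# a failing import at module load time would break importing the package.
def extract_tar_lazy(from_path, to_path):
    import tarfile
    with tarfile.open(from_path, "r") as tar:
        tar.extractall(path=to_path)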
    raise ValueError("Extraction of {} not supported".format(from_path))

if remove_finished:
    os.unlink(from_path)
Pure curiosity: why did you use os.unlink instead of os.remove? I only now became aware that they provide the same functionality. I think os.remove would be clearer, since the flag is also called remove_finished and not unlink_finished.
No other reason than that this is what was used in MNIST before, so I decided to do the same here.
Fair enough.
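For reference, a quick runnable check of the point above (my own snippet, not part of the PR); os.remove and os.unlink are documented as semantically identical, so either call works here:

import os
import tempfile

# Both names delete a regular file in exactly the same way.
for delete in (os.remove, os.unlink):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    delete(path)
    assert not os.path.exists(path)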
I've compiled a list of sizes for each of the datasets that I considered "small".
In the current build, we are downloading ~600MB of datasets, which I think is too much; the majority of it comes from EMNIST. I think we should look into an alternative approach for testing the datasets. For example, for an issue like #968, in the current state of things we can't test it.

I'll be removing the dataset test for EMNIST from this PR.
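As an aside, one possible flavor of such an alternative approach (purely an illustration on my part, not something this PR implements) would be to generate a tiny fake archive locally and exercise the extraction and loading code against it, instead of downloading the real files:

import os
import tarfile
import tempfile


def make_fake_archive(num_files=3):
    # Build a small .tar.gz on the fly so dataset code paths can be
    # exercised without downloading hundreds of MB in CI.
    root = tempfile.mkdtemp()
    archive = os.path.join(root, "fake_data.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for i in range(num_files):
            member = os.path.join(root, "sample_{}.bin".format(i))
            with open(member, "wb") as f:
                f.write(os.urandom(16))
            tar.add(member, arcname=os.path.basename(member))
    return archive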
This is a WIP PR that implements one of the approaches discussed in #963.
It also adds functionality for extracting zip / tar / gzip files.
For now it is only used in the MNIST-like datasets, but I'll update the PR to remove the custom extraction logic elsewhere and use this function instead.
I'm opening the PR now to see how CI behaves for those small datasets.
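For context, a rough sketch of what an extraction helper along these lines could look like. The from_path / remove_finished names and the ValueError message come from the diff fragments above; the function name extract_archive, the to_path parameter and the exact dispatch logic are assumptions of mine, not necessarily what the PR implements:

import gzip
import os
import shutil
import tarfile
import zipfile


def extract_archive(from_path, to_path, remove_finished=False):
    # Dispatch on the file extension and extract into to_path.
    if from_path.endswith(".tar.gz") or from_path.endswith(".tar"):
        with tarfile.open(from_path, "r") as tar:
            tar.extractall(path=to_path)
    elif from_path.endswith(".zip"):
        with zipfile.ZipFile(from_path, "r") as z:
            z.extractall(to_path)
    elif from_path.endswith(".gz"):
        # A bare .gz holds a single compressed file; decompress it into to_path.
        out_path = os.path.join(to_path, os.path.splitext(os.path.basename(from_path))[0])
        with gzip.open(from_path, "rb") as f_in, open(out_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    else:
        raise ValueError("Extraction of {} not supported".format(from_path))

    # Optionally delete the archive once extraction succeeded.
    if remove_finished:
        os.unlink(from_path)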