-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate extraction and decompression logic in datasets.utils.extract_archive #3443
Conversation
This reverts commit 7fafebb.
e0ed436
to
a22abbc
Compare
Blocked by #3542 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to unblock, I have a couple of comments
Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
…s.extract_archive (#3443) Summary: * generalize extract_archive * [test] re-enable extraction tests on windows * add tests for detect_file_type * add error messages to detect_file_type * Revert "[test] re-enable extraction tests on windows" This reverts commit 7fafebb. * add utility functions for better mock call checking * add tests for decompress * simplify logic by using pathlib * lint * Apply suggestions from code review * make decompress private * remove unnecessary checks * add error message * fix mocking * add remaining tests * lint Reviewed By: fmassa Differential Revision: D27128004 fbshipit-source-id: 73f7d8a43eca5dbc9c7e63d8b1ff6e0859915d92 Co-authored-by: Francisco Massa <fvsmassa@gmail.com> Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
elif len(suffixes) > 2: | ||
raise RuntimeError( | ||
"Archive type and compression detection only works for 1 or 2 suffixes. " f"Got {len(suffixes)} instead." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This introduces a bug for downloads with a period in the file name. For example, https://landcover.ai/download/landcover.ai.v1.zip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on a PR to fix this, will ping you when it's done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #4099
Currently the implementation of
datasets.utils.extract_archive
has two downsides:It is not extendible. If a new archive type or compression needs to be added (see Remove caching from MNIST and variants #3420 (comment)) one needs to write a new
_is_whatever
method and implement the processing in the largeif
/elif
statement.This can get confusing quite fast especially since the archive type and compression are orthogonal concepts. This leads to structures like
vision/torchvision/datasets/utils.py
Lines 250 to 251 in 67b2528
Imagine we had another archive type that is also compressed with
gzip
.It mixes the extraction and decompression concept. This leads to constructs like this
vision/torchvision/datasets/utils.py
Lines 271 to 274 in 67b2528
Note that
to_path
would need to be fixed for every case where we only decompress the file.This PR overcomes these by separating the extraction and decompression logic:
_extract_whatever
function and add it to_ARCHIVE_EXTRACTORS
with the corresponding extension.open_whatever
function and add it to_COMPRESSED_FILE_OPENERS
with the corresponding extension.This PR is fully BC, but makes it a lot easier to move forward in the future.