Separate extraction and decompression logic in datasets.utils.extract_archive #3443

Merged on Mar 15, 2021 (20 commits)

Conversation

@pmeier (Contributor) commented Feb 23, 2021

Currently the implementation of datasets.utils.extract_archive has two downsides:

  1. It is not extensible. If a new archive type or compression needs to be added (see Remove caching from MNIST and variants #3420 (comment)), one needs to write a new _is_whatever function and add its processing to the large if / elif statement.

    This gets confusing quickly, especially since the archive type and compression are orthogonal concepts. It leads to structures like

    def _is_gzip(filename: str) -> bool:
        return filename.endswith(".gz") and not filename.endswith(".tar.gz")

    Imagine we had another archive type that is also compressed with gzip.

  2. It mixes the concepts of extraction and decompression. This leads to constructs like this

    elif _is_gzip(from_path):
        to_path = os.path.join(to_path, os.path.splitext(os.path.basename(from_path))[0])
        with open(to_path, "wb") as out_f, gzip.GzipFile(from_path) as zip_f:
            out_f.write(zip_f.read())

    Note that to_path would need to be adjusted like this in every branch that only decompresses a file.


This PR addresses both downsides by separating the extraction and decompression logic:

  • To add a new archive type one has to provide an _extract_whatever function and add it to _ARCHIVE_EXTRACTORS with the corresponding extension.
  • To add a compression one has to provide an open_whatever function and add it to _COMPRESSED_FILE_OPENERS with the corresponding extension.
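The two registries described above can be illustrated with a minimal sketch. Only the names _ARCHIVE_EXTRACTORS and _COMPRESSED_FILE_OPENERS come from this PR; their contents, the extractor signature, and the helpers sketch_detect and sketch_extract are assumptions made for illustration and may differ from the merged code.

```python
import gzip
import lzma
import os
import pathlib
import tarfile
import zipfile
from typing import Optional, Tuple


def _extract_tar(from_path: str, to_path: str, compression: Optional[str]) -> None:
    # tarfile handles compression itself, e.g. mode "r:gz" for a ".gz" suffix.
    mode = f"r:{compression[1:]}" if compression else "r"
    with tarfile.open(from_path, mode) as tar:
        tar.extractall(to_path)


def _extract_zip(from_path: str, to_path: str, compression: Optional[str]) -> None:
    # The compression argument is unused; it keeps the signature uniform.
    with zipfile.ZipFile(from_path, "r") as zf:
        zf.extractall(to_path)


# Assumed registry contents: suffix -> handler.
_ARCHIVE_EXTRACTORS = {".tar": _extract_tar, ".zip": _extract_zip}
_COMPRESSED_FILE_OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def sketch_detect(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: return (archive_type, compression) from the suffixes."""
    suffixes = pathlib.Path(path).suffixes
    last = suffixes[-1] if suffixes else None
    if last in _ARCHIVE_EXTRACTORS:
        return last, None
    if last in _COMPRESSED_FILE_OPENERS:
        previous = suffixes[-2] if len(suffixes) > 1 else None
        return (previous if previous in _ARCHIVE_EXTRACTORS else None), last
    raise RuntimeError(f"Unknown archive type or compression: {path!r}")


def sketch_extract(from_path: str, to_dir: str) -> str:
    """Hypothetical helper: extract an archive, or decompress a single file."""
    archive_type, compression = sketch_detect(from_path)
    if archive_type is not None:
        _ARCHIVE_EXTRACTORS[archive_type](from_path, to_dir, compression)
        return to_dir
    # Pure decompression: drop the compression suffix to get the target name.
    to_path = os.path.join(to_dir, pathlib.Path(from_path).stem)
    with _COMPRESSED_FILE_OPENERS[compression](from_path, "rb") as src:
        with open(to_path, "wb") as dst:
            dst.write(src.read())
    return to_path
```

In this sketch, supporting bzip2 would be a single registry entry, `_COMPRESSED_FILE_OPENERS[".bz2"] = bz2.open`, with no change to the dispatch logic, and the target path for decompression-only files is computed in exactly one place.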

This PR is fully backward compatible (BC), but makes it a lot easier to move forward in the future.

@pmeier pmeier added the WIP label Feb 25, 2021
@fmassa fmassa marked this pull request as draft March 1, 2021 10:19
@pmeier (Contributor, Author) commented Mar 10, 2021

Blocked by #3542

@pmeier pmeier marked this pull request as ready for review March 12, 2021 08:42
@pmeier pmeier requested a review from fmassa March 12, 2021 08:42
@fmassa (Member) left a comment

Approving to unblock, I have a couple of comments

test/test_datasets_utils.py (outdated, resolved)
test/test_datasets_utils.py (outdated, resolved)
torchvision/datasets/utils.py (outdated, resolved)
@pmeier pmeier requested a review from fmassa March 15, 2021 07:22
@fmassa fmassa merged commit f8a9957 into pytorch:master Mar 15, 2021
@pmeier pmeier deleted the generalize-decompression branch March 16, 2021 09:18
facebook-github-bot pushed a commit that referenced this pull request Mar 19, 2021
…s.extract_archive (#3443)

Summary:
* generalize extract_archive

* [test] re-enable extraction tests on windows

* add tests for detect_file_type

* add error messages to detect_file_type

* Revert "[test] re-enable extraction tests on windows"

This reverts commit 7fafebb.

* add utility functions for better mock call checking

* add tests for decompress

* simplify logic by using pathlib

* lint

* Apply suggestions from code review

* make decompress private

* remove unnecessary checks

* add error message

* fix mocking

* add remaining tests

* lint

Reviewed By: fmassa

Differential Revision: D27128004

fbshipit-source-id: 73f7d8a43eca5dbc9c7e63d8b1ff6e0859915d92

Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
Comment on lines +293 to +296
elif len(suffixes) > 2:
    raise RuntimeError(
        "Archive type and compression detection only works for 1 or 2 suffixes. " f"Got {len(suffixes)} instead."
    )
Contributor

This introduces a bug for downloads with a period in the file name. For example, https://landcover.ai/download/landcover.ai.v1.zip.
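
To see why such a file name trips the check, here is a small sketch. It assumes the suffixes are obtained with pathlib.Path.suffixes (the commit list mentions a switch to pathlib). The helper trailing_type_suffixes and the two KNOWN_* sets are hypothetical; they show one possible tolerant approach and are not necessarily the fix adopted in #4099.

```python
import pathlib

name = "landcover.ai.v1.zip"
suffixes = pathlib.Path(name).suffixes
print(suffixes)  # ['.ai', '.v1', '.zip'], so len(suffixes) > 2 raises

# Hypothetical sets of recognized suffixes, for illustration only.
KNOWN_ARCHIVES = {".tar", ".zip"}
KNOWN_COMPRESSIONS = {".gz", ".xz", ".bz2"}


def trailing_type_suffixes(filename: str) -> list:
    """Keep only the trailing suffixes that can describe the file type."""
    found = pathlib.Path(filename).suffixes
    if not found:
        return []
    # A compressed archive such as "x.tar.gz" needs the last two suffixes.
    if (
        len(found) > 1
        and found[-1] in KNOWN_COMPRESSIONS
        and found[-2] in KNOWN_ARCHIVES
    ):
        return found[-2:]
    # Otherwise only the last suffix matters; earlier periods belong to the name.
    return found[-1:]
```

With this helper, "landcover.ai.v1.zip" reduces to the single suffix ".zip", while "data.v2.tar.gz" still yields ".tar" and ".gz".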

Contributor

Working on a PR to fix this, will ping you when it's done.

Contributor

See #4099

5 participants