Separate extraction and decompression logic in datasets.utils.extract_archive #3443

pmeier · 2021-02-23T18:44:18Z

Currently the implementation of datasets.utils.extract_archive has two downsides:

It is not extensible. If a new archive type or compression needs to be added (see Remove caching from MNIST and variants #3420 (comment)), one needs to write a new _is_whatever function and implement the processing in the large if / elif statement.

This can get confusing quite fast, especially since the archive type and the compression are orthogonal concepts. This leads to structures like:

vision/torchvision/datasets/utils.py

Lines 250 to 251 in 67b2528

def _is_gzip(filename: str) -> bool:
    return filename.endswith(".gz") and not filename.endswith(".tar.gz")

Imagine we had another archive type that is also compressed with gzip: _is_gzip would need one more exclusion for it.

It mixes the concepts of extraction and decompression. This leads to constructs like this:

Lines 271 to 274 in 67b2528

    
elif _is_gzip(from_path):
    to_path = os.path.join(to_path, os.path.splitext(os.path.basename(from_path))[0])
    with open(to_path, "wb") as out_f, gzip.GzipFile(from_path) as zip_f:
        out_f.write(zip_f.read())

Note that to_path would need to be fixed for every case where we only decompress the file.

This PR overcomes these by separating the extraction and decompression logic:

To add a new archive type one has to provide an _extract_whatever function and add it to _ARCHIVE_EXTRACTORS with the corresponding extension.
To add a compression one has to provide an open_whatever function and add it to _COMPRESSED_FILE_OPENERS with the corresponding extension.

This PR is fully backward compatible (BC), but makes it a lot easier to move forward in the future.
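The registry approach described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: only the names _ARCHIVE_EXTRACTORS, _COMPRESSED_FILE_OPENERS, and the _extract_whatever naming pattern come from the description. The registry contents, the function signatures, and the extract_or_decompress helper are assumptions made for the example.

```python
import gzip
import lzma
import os
import pathlib
import tarfile
import zipfile
from typing import Optional


def _extract_tar(from_path: str, to_path: str, compression: Optional[str]) -> None:
    # tarfile decompresses on its own, e.g. mode "r:gz" for a ".gz" suffix.
    mode = f"r:{compression[1:]}" if compression else "r"
    with tarfile.open(from_path, mode) as tar:
        tar.extractall(to_path)


def _extract_zip(from_path: str, to_path: str, compression: Optional[str]) -> None:
    # The compression argument is ignored in this sketch; a real
    # implementation would have to handle or reject it.
    with zipfile.ZipFile(from_path, "r") as zip_f:
        zip_f.extractall(to_path)


# Assumed registry contents: suffix -> handler. Supporting a new format
# means adding one entry here instead of a new if / elif branch.
_ARCHIVE_EXTRACTORS = {".tar": _extract_tar, ".zip": _extract_zip}
_COMPRESSED_FILE_OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def extract_or_decompress(from_path: str, to_dir: str) -> str:
    """Hypothetical dispatcher: extract archives, decompress plain files."""
    suffixes = pathlib.Path(from_path).suffixes
    compression = None
    if suffixes and suffixes[-1] in _COMPRESSED_FILE_OPENERS:
        compression = suffixes.pop()
    archive_type = None
    if suffixes and suffixes[-1] in _ARCHIVE_EXTRACTORS:
        archive_type = suffixes[-1]

    if archive_type is not None:
        _ARCHIVE_EXTRACTORS[archive_type](from_path, to_dir, compression)
        return to_dir
    if compression is not None:
        # Decompress only: the output name is derived once, generically,
        # by stripping the compression suffix.
        name = os.path.basename(from_path)[: -len(compression)]
        to_path = os.path.join(to_dir, name)
        opener = _COMPRESSED_FILE_OPENERS[compression]
        with opener(from_path, "rb") as src, open(to_path, "wb") as dst:
            dst.write(src.read())
        return to_path
    raise RuntimeError(f"Unsupported file type: {from_path}")
```

With this layout the archive type and the compression are looked up independently, so a gzip-compressed file of a new archive type needs no extra exclusion logic.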


pmeier · 2021-03-10T15:30:49Z

Blocked by #3542

fmassa

Approving to unblock, I have a couple of comments

test/test_datasets_utils.py

torchvision/datasets/utils.py

Co-authored-by: Francisco Massa <fvsmassa@gmail.com>

…s.extract_archive (#3443)

Summary:
* generalize extract_archive
* [test] re-enable extraction tests on windows
* add tests for detect_file_type
* add error messages to detect_file_type
* Revert "[test] re-enable extraction tests on windows" This reverts commit 7fafebb.
* add utility functions for better mock call checking
* add tests for decompress
* simplify logic by using pathlib
* lint
* Apply suggestions from code review
* make decompress private
* remove unnecessary checks
* add error message
* fix mocking
* add remaining tests
* lint

Reviewed By: fmassa
Differential Revision: D27128004
fbshipit-source-id: 73f7d8a43eca5dbc9c7e63d8b1ff6e0859915d92
Co-authored-by: Francisco Massa <fvsmassa@gmail.com>

adamjstewart · 2021-06-22T19:18:40Z

torchvision/datasets/utils.py

+    elif len(suffixes) > 2:
+        raise RuntimeError(
+            "Archive type and compression detection only works for 1 or 2 suffixes. " f"Got {len(suffixes)} instead."
+        )


This introduces a bug for downloads with a period in the file name. For example, https://landcover.ai/download/landcover.ai.v1.zip.

Working on a PR to fix this, will ping you when it's done.
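To illustrate the report above: pathlib treats every dot-separated piece after the first as a suffix, so landcover.ai.v1.zip has three suffixes and trips the length check. The sketch below reproduces that behavior and shows one possible lenient alternative that only inspects the trailing suffixes. The function names and the KNOWN_ARCHIVES / KNOWN_COMPRESSIONS sets are hypothetical, and this is not necessarily how #4099 resolves the issue.

```python
import pathlib
from typing import Optional, Tuple

# Assumed sets of recognized suffixes, for illustration only.
KNOWN_ARCHIVES = {".tar", ".zip"}
KNOWN_COMPRESSIONS = {".gz", ".xz"}


def detect_lenient(filename: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (archive_type, compression) based on the trailing suffixes only."""
    suffixes = pathlib.Path(filename).suffixes
    compression = None
    if suffixes and suffixes[-1] in KNOWN_COMPRESSIONS:
        compression = suffixes.pop()
    archive_type = None
    if suffixes and suffixes[-1] in KNOWN_ARCHIVES:
        archive_type = suffixes[-1]
    if archive_type is None and compression is None:
        raise RuntimeError(f"Unknown archive type or compression for '{filename}'.")
    return archive_type, compression


def detect_strict(filename: str) -> Tuple[Optional[str], Optional[str]]:
    """Same detection, but with the suffix-count check quoted in the diff."""
    suffixes = pathlib.Path(filename).suffixes
    if len(suffixes) > 2:
        raise RuntimeError(
            "Archive type and compression detection only works for 1 or 2 suffixes. "
            f"Got {len(suffixes)} instead."
        )
    return detect_lenient(filename)


print(pathlib.Path("landcover.ai.v1.zip").suffixes)  # ['.ai', '.v1', '.zip']
print(detect_lenient("landcover.ai.v1.zip"))         # ('.zip', None)
```

Calling detect_strict("landcover.ai.v1.zip") raises the RuntimeError from the diff, while the lenient variant ignores the leading ".ai" and ".v1" pieces because they are not recognized suffixes.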

generalize extract_archive

563cfa2

facebook-github-bot added the cla signed label Feb 23, 2021

Merge branch 'master' into generalize-decompression

9c58f2d

pmeier mentioned this pull request Feb 23, 2021

Remove caching from MNIST and variants #3420

Merged

pmeier added 3 commits February 24, 2021 07:14

[test] re-enable extraction tests on windows

7fafebb

add tests for detect_file_type

17f9c83

add error messages to detect_file_type

f783bcd

pmeier added the WIP label Feb 25, 2021

fmassa marked this pull request as draft March 1, 2021 10:19

Revert "[test] re-enable extraction tests on windows"

a22abbc

This reverts commit 7fafebb.

pmeier force-pushed the generalize-decompression branch from e0ed436 to a22abbc March 10, 2021 15:29

pmeier added 5 commits March 12, 2021 08:14

Merge branch 'master' into generalize-decompression

aed3793

add utility functions for better mock call checking

ff29639

add tests for decompress

1ac42a5

simplify logic by using pathlib

b10036c

lint

e9510df

pmeier marked this pull request as ready for review March 12, 2021 08:42

pmeier requested a review from fmassa March 12, 2021 08:42

fmassa approved these changes Mar 12, 2021


test/test_datasets_utils.py (outdated)

test/test_datasets_utils.py (outdated)

torchvision/datasets/utils.py (outdated)

pmeier mentioned this pull request Mar 12, 2021

ValueError: Extraction of MNIST\raw\train-images-idx3-ubyte not supported #3554

Closed

pmeier and others added 8 commits March 15, 2021 08:01

Apply suggestions from code review

8bf6630

Co-authored-by: Francisco Massa <fvsmassa@gmail.com>

make decompress private

dc799a2

remove unnecessary checks

26e4f83

add error message

2c9d0c6

fix mocking

56b5770

add remaining tests

15c559d

lint

f4cf171

Merge branch 'master' into generalize-decompression

a1d9d65

pmeier requested a review from fmassa March 15, 2021 07:22

Merge branch 'master' into generalize-decompression

6c750a5

fmassa merged commit f8a9957 into pytorch:master Mar 15, 2021

fmassa added the enhancement, improvement, and module: datasets labels and removed the WIP label Mar 15, 2021

pmeier deleted the generalize-decompression branch March 16, 2021 09:18

datumbox removed the improvement label Jun 1, 2021

adamjstewart reviewed Jun 22, 2021


adamjstewart mentioned this pull request Jun 22, 2021

Add support for files with periods in name #4099

Merged
