iter_archive for zip files #3347

Closed
Mehdi2402 wants to merge 18 commits

Conversation

Mehdi2402 (Contributor)

Comments:

  • There is no .isreg() equivalent in the zipfile library to check whether a member is a regular file, so I used .is_dir() instead to skip directories (see the sketch below).
  • For now I have streaming_download_manager.py working for local zip files, but not for URLs. I get the following error when I test it on an archive hosted on Google Drive, so I am still working on it: BlockSizeError: Got more bytes so far (>2112) than requested (22)
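
A minimal sketch of the zipfile-based iteration described above, for illustration only (the helper name iter_zip_archive is not the PR's actual code):

```python
import zipfile

def iter_zip_archive(path):
    """Yield (filename, file object) pairs for every regular file in a zip archive."""
    with zipfile.ZipFile(path) as zf:
        for member in zf.infolist():
            # zipfile has no .isreg(); skip directory entries with .is_dir() instead.
            if member.is_dir():
                continue
            with zf.open(member) as file_obj:
                yield member.filename, file_obj
```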

Tasks:

  • download_manager.py
  • streaming_download_manager.py

lhoestq (Member) left a comment

Hi! Thanks for diving into this :)

Your implementation for ZIP looks all good! I think we can just improve the compression type check:

for tarinfo in stream:
    file_path = tarinfo.name
    if not tarinfo.isreg():

extension = Path(path).suffix

I think cached zip or tar archives don't have the extension at the end of their filenames. Also, for streaming we don't always know the filename.

You can take a look at the _get_extraction_protocol function in utils/streaming_download_manager.py. It first checks the extension and then falls back to using the magic number of the file to guess the compression type.
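
A rough sketch of that extension-then-magic-number pattern, assuming a simplified two-entry magic-number table (the real _get_extraction_protocol in utils/streaming_download_manager.py covers more formats and differs in detail):

```python
from pathlib import Path

# Illustrative subset of archive magic numbers (first bytes of the file).
MAGIC_NUMBERS = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}

def guess_compression(urlpath, file_obj):
    # 1) Try the filename extension first.
    extension = Path(urlpath).suffix.lower()
    if extension == ".zip":
        return "zip"
    if extension in (".gz", ".tgz"):
        return "gzip"
    # 2) Fall back to the magic number at the start of the stream.
    header = file_obj.read(4)
    file_obj.seek(0)
    for magic_number, protocol in MAGIC_NUMBERS.items():
        if header.startswith(magic_number):
            return protocol
    return None
```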

lhoestq (Member) commented Dec 1, 2021

Also, don't rely on streaming from Google Drive - it can have issues because of how Google Drive works (quotas, restrictions, etc.), and it can indeed cause the BlockSizeError you got.

Feel free to host your test data elsewhere, such as in a dataset repository on https://huggingface.co (see here for a tutorial on how to upload files).
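
For instance, uploading a test archive with the huggingface_hub client could look roughly like this (the repo id and file name are placeholders, and the API shown is today's huggingface_hub, which may differ from what was available at the time of this PR):

```python
from huggingface_hub import HfApi

api = HfApi()
# Placeholder repo id and file name, for illustration only.
api.create_repo("username/zip-test-data", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="test_archive.zip",
    path_in_repo="test_archive.zip",
    repo_id="username/zip-test-data",
    repo_type="dataset",
)
```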

albertvillanova and others added 12 commits December 1, 2021 16:29
* Add The Pile dataset and PubMed Central subset

* Fix style

* Fix style

* Add README

* Make streamable the all config

* Add dummy data

* Add more info to README

* Fix dummy data
* Add The Pile Free Law subset

* Update dataset card
* add array_xd docs

* add feedback from review
* Support regex in tagset_validator

* Validate source_datasets using tagset_validator

* Force CI re-run
* Fix typo in other-structured-to-text task tag

* Fix dataset cards

* Fix dataset cards
* add bl books genre dataset

* add missing doc string

* update language code format

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* wrap bibxtext in bibtex block

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* update copyright date

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* add description of label field

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* replace eval with ast.literal_eval

* regenerate infos

* format code

* add dummy data

* add languages to datasheet contents

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* add placeholder for languages information

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/blbooksgenre/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Add The Pile USPTO subset

* Update dataset card
* add map and torch training loop for streming dataset

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update stream.rst

* fic docs

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>