iter_archive for zip files #3347

Closed
Mehdi2402 wants to merge 18 commits

Conversation

Mehdi2402 (Contributor)

Comments:

  • There is no .isreg() equivalent in the zipfile library to check whether a member is a regular file, so I used .is_dir() instead to skip directories (see the sketch below).
  • For now I have streaming_download_manager.py working for local zip files, but not for URLs. I get the following error when I test it on an archive hosted on Google Drive, so I am still working on it: BlockSizeError: Got more bytes so far (>2112) than requested (22)
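
A minimal sketch of the zipfile-based iteration described above, for illustration only (the helper name iter_zip_archive is not the PR's actual code):

```python
import zipfile

def iter_zip_archive(path):
    """Yield (filename, file object) pairs for every regular file in a zip archive."""
    with zipfile.ZipFile(path) as zf:
        for member in zf.infolist():
            # zipfile has no .isreg(); skip directory entries with .is_dir() instead.
            if member.is_dir():
                continue
            with zf.open(member) as file_obj:
                yield member.filename, file_obj
```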

Tasks:

  • download_manager.py
  • streaming_download_manager.py

lhoestq (Member) left a comment

Hi! Thanks for diving into this :)

Your implementation for ZIP looks all good! I think we can just improve the compression type check:

for tarinfo in stream:
    file_path = tarinfo.name
    if not tarinfo.isreg():

extension = Path(path).suffix

I think cached zip or tar archives don't have the extension at the end of their filenames. Also, for streaming we don't always know the filename.

You can take a look at the _get_extraction_protocol function in utils/streaming_download_manager.py. It first checks the extension and then falls back to using the magic number of the file to guess the compression type.
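
A rough sketch of that extension-then-magic-number pattern, assuming a simplified two-entry magic-number table (the real _get_extraction_protocol in utils/streaming_download_manager.py covers more formats and differs in detail):

```python
from pathlib import Path

# Illustrative subset of archive magic numbers (first bytes of the file).
MAGIC_NUMBERS = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}

def guess_compression(urlpath, file_obj):
    # 1) Try the filename extension first.
    extension = Path(urlpath).suffix.lower()
    if extension == ".zip":
        return "zip"
    if extension in (".gz", ".tgz"):
        return "gzip"
    # 2) Fall back to the magic number at the start of the stream.
    header = file_obj.read(4)
    file_obj.seek(0)
    for magic_number, protocol in MAGIC_NUMBERS.items():
        if header.startswith(magic_number):
            return protocol
    return None
```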

lhoestq (Member) commented Dec 1, 2021

Also, don't rely on streaming from Google Drive - it can have issues because of how Google Drive works (quotas, restrictions, etc.), and it can indeed cause the BlockSizeError you got.

Feel free to host your test data elsewhere, such as in a dataset repository on https://huggingface.co (see here for a tutorial on how to upload files).
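
For instance, uploading a test archive with the huggingface_hub client could look roughly like this (the repo id and file name are placeholders, and the API shown is today's huggingface_hub, which may differ from what was available at the time of this PR):

```python
from huggingface_hub import HfApi

api = HfApi()
# Placeholder repo id and file name, for illustration only.
api.create_repo("username/zip-test-data", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="test_archive.zip",
    path_in_repo="test_archive.zip",
    repo_id="username/zip-test-data",
    repo_type="dataset",
)
```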

albertvillanova and others added 12 commits December 1, 2021 16:29
* Add The Pile dataset and PubMed Central subset

* Fix style

* Fix style

* Add README

* Make streamable the all config

* Add dummy data

* Add more info to README

* Fix dummy data
* Add The Pile Free Law subset

* Update dataset card
* add array_xd docs

* add feedback from review
* Support regex in tagset_validator

* Validate source_datasets using tagset_validator

* Force CI re-run
* Fix typo in other-structured-to-text task tag

* Fix dataset cards

* Fix dataset cards
* add bl books genre dataset

* add missing doc string

* update language code format

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* wrap bibxtext in bibtex block

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* update copyright date

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* add description of label field

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* replace eval with ast.literal_eval

* regenerate infos

* format code

* add dummy data

* add languages to datasheet contents

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* add placeholder for languages information

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update datasets/blbooksgenre/README.md

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Add The Pile USPTO subset

* Update dataset card
* add map and torch training loop for streming dataset

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update stream.rst

* fic docs

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>