iter_archive for zip files #3347
Conversation
Hi! Thanks for diving into this :)
Your implementation for ZIP looks all good! I think we can just improve the compression type check:
```python
for tarinfo in stream:
    file_path = tarinfo.name
    if not tarinfo.isreg():
        ...
extension = Path(path).suffix
```
I think cached zip or tar archives don't have the extension at the end of their filenames, and for streaming we don't always know the filename either.
You can take a look at the `_get_extraction_protocol` function in `utils/streaming_download_manager.py`: it first checks the extension and then falls back to using the magic number of the file to guess the compression type.
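For illustration, here is a minimal sketch of that two-step check. The `guess_compression` helper name and the small magic-number table are assumptions for this example; the real logic lives in `_get_extraction_protocol`:

```python
from pathlib import Path
from typing import Optional

# First bytes (magic numbers) of common compression formats.
# This mapping is an illustrative subset, not the library's actual table.
MAGIC_NUMBER_TO_PROTOCOL = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
}


def guess_compression(path: str) -> Optional[str]:
    """Guess the compression protocol from the extension, falling back
    to the file's magic number when the extension is missing or unknown."""
    extension = Path(path).suffix.lstrip(".")
    if extension in {"zip", "gz", "bz2", "xz"}:
        return {"gz": "gzip"}.get(extension, extension)
    # Cached or streamed archives may have no meaningful extension,
    # so read the first few bytes and compare against known signatures.
    with open(path, "rb") as f:
        header = f.read(8)
    for magic, protocol in MAGIC_NUMBER_TO_PROTOCOL.items():
        if header.startswith(magic):
            return protocol
    return None
```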
Also, don't always try streaming with Google Drive - it can have issues because of how Google Drive works (quotas, restrictions, etc.). Feel free to host your test data elsewhere, such as in a dataset repository on https://huggingface.co (see here for a tutorial on how to upload files).
`iter_archive` is implemented for `download_manager.py` only; `streaming_download_manager.py` is still in progress.

Comments:
- There is no `.isreg()` equivalent in the zipfile library to check whether a file is regular, so I used `.is_dir()` instead to skip directories (see the sketch below).
- `streaming_download_manager.py` is working for local zip files, but not for URLs. I get the following error when I test it on an archive in Google Drive, so I'm still working on it: `BlockSizeError: Got more bytes so far (>2112) than requested (22)`

Tasks:
- [x] download_manager.py
- [ ] streaming_download_manager.py
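For reference, a minimal sketch of the zipfile-side iteration described in the comments above, using `ZipInfo.is_dir()` to skip directories since zipfile has no `.isreg()` equivalent. The function name is illustrative, not the PR's exact code:

```python
import zipfile


def iter_zip_archive(path):
    """Yield (file_path, file_obj) pairs for every regular file in a zip.

    tarfile's TarInfo has .isreg() to test for regular files; zipfile's
    ZipInfo does not, so directories are skipped with .is_dir() instead.
    """
    with zipfile.ZipFile(path) as zf:
        for member in zf.infolist():
            if member.is_dir():
                continue
            with zf.open(member) as file_obj:
                yield member.filename, file_obj
```

A loop like `for file_path, f in iter_zip_archive("data.zip"): ...` then mirrors the `for tarinfo in stream` loop quoted in the review.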