Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support streaming zipped dataset repo by passing only repo name #3375

Merged
merged 47 commits into from
Dec 16, 2021

Conversation

albertvillanova
Copy link
Member

@albertvillanova albertvillanova commented Dec 3, 2021

Proposed solution:

  • I have added the method iter_files to DownloadManager and StreamingDownloadManager
  • I use this in modules: "csv", "json", "text"
  • I test for CSV/JSONL/TXT zipped (and non-zipped) files, both in streaming and non-streaming modes

Fix #3373.

@lhoestq
Copy link
Member

lhoestq commented Dec 8, 2021

I just tested and I think this only opens one file ? If there are several files in the ZIP, only the first one is opened. To open several files from a ZIP, one has to call open several times.

What about updating the CSV loader to make it download_and_extract zip files, and open each extracted file ?

@albertvillanova
Copy link
Member Author

I have implemented the glob of ZIP files in the packaged modules:

  • csv
  • json
  • text

@albertvillanova
Copy link
Member Author

Also for streaming and non-streaming.

@albertvillanova
Copy link
Member Author

albertvillanova commented Dec 9, 2021

In c10275f, there were 3 failing tests, only on Linux:

=========================== short test summary info ============================
FAILED tests/test_streaming_download_manager.py::test_streaming_dl_manager_get_extraction_protocol[https://drive.google.com/uc?export=download&id=1k92sUfpHxKq8PXWRr7Y5aNHXwOCNUmqh-zip]
FAILED tests/test_streaming_download_manager.py::test_streaming_gg_drive - Fi...
FAILED tests/test_streaming_download_manager.py::test_streaming_gg_drive_zipped
= 3 failed, 3553 passed, 2950 skipped, 2 xfailed, 1 xpassed, 125 warnings in 192.79s (0:03:12) =

After re-running the CI in 57bfe1f, there was only 1 failing test:

  • On Linux:
=========================== short test summary info ============================
FAILED tests/test_streaming_download_manager.py::test_streaming_gg_drive_zipped
= 1 failed, 3555 passed, 2950 skipped, 2 xfailed, 1 xpassed, 125 warnings in 199.76s (0:03:19) =
  • On Windows:
=========================== short test summary info ===========================
FAILED tests/test_load.py::test_load_dataset_builder_for_community_dataset_without_script
= 1 failed, 3551 passed, 2954 skipped, 2 xfailed, 1 xpassed, 121 warnings in 478.58s (0:07:58) =

The test tests/test_streaming_download_manager.py::test_streaming_gg_drive_zipped passes locally.

I guess the issue is caused by those tests and has nothing to do with this PR.

@albertvillanova
Copy link
Member Author

@lhoestq my final proposed solution:

  • I have added the method iter_files to DownloadManager and StreamingDownloadManager
  • I use this in modules: "csv", "json", "text"
  • I test for CSV/JSONL/TXT zipped (and non-zipped) files, both in streaming and non-streaming modes

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thank you ! Good job on this :)

Note that at one point we might consider switching to using iter_archive for ZIP files in the json/text/csv loaders since it should be faster.

@@ -126,7 +126,7 @@ def download_and_extract(self, data_url, *args):
return self.create_dummy_data_single(dummy_file, data_url)

# this function has to be in the manager under this name so that testing works
def download(self, data_url, *args):
def download(self, data_url, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove the kwargs ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it because *args was already present. But I'm removing it if you think it is OK to have *args but not **kwargs.... :P

@@ -109,7 +109,7 @@ def manual_dir(self):
return "/".join(self.dummy_file.replace(os.sep, "/").split("/")[:-1])

# this function has to be in the manager under this name so that testing works
def download_and_extract(self, data_url, *args):
def download_and_extract(self, data_url, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here as well

@albertvillanova
Copy link
Member Author

Note that at one point we might consider switching to using iter_archive for ZIP files in the json/text/csv loaders since it should be faster.

As far as the functionality is kept... ;)

@albertvillanova albertvillanova merged commit 28644cc into master Dec 16, 2021
@albertvillanova albertvillanova deleted the fix-3373 branch December 16, 2021 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support streaming zipped CSV dataset repo by passing only repo name
2 participants