Support streaming zipped dataset repo by passing only repo name #3375

albertvillanova · 2021-12-03T10:43:05Z

Proposed solution:

I have added the method iter_files to DownloadManager and StreamingDownloadManager
I use this in modules: "csv", "json", "text"
I test for CSV/JSONL/TXT zipped (and non-zipped) files, both in streaming and non-streaming modes

lhoestq · 2021-12-08T14:03:47Z

I just tested, and I think this only opens one file? If the ZIP contains several files, only the first one is opened. To open several files from a ZIP, one has to call open several times.

What about updating the CSV loader so that it calls download_and_extract on ZIP files and then opens each extracted file?
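
A minimal sketch of the extract-then-open-each-file idea suggested above, using only the standard library rather than the datasets download manager. The helper name read_csv_rows_from_zip and the member file names are hypothetical and only serve the illustration.

```python
import csv
import tempfile
import zipfile
from pathlib import Path


def read_csv_rows_from_zip(zip_path, extract_dir):
    """Extract every member of a ZIP archive, then read each extracted CSV file."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
    rows = []
    # Open each extracted file in turn; a single open() would only cover one member.
    for csv_path in sorted(Path(extract_dir).rglob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    return rows


# Build a toy archive with two CSV members (hypothetical file names).
workdir = Path(tempfile.mkdtemp())
zip_path = workdir / "data.zip"
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("train.csv", "id,text\n0,foo\n1,bar\n")
    archive.writestr("test.csv", "id,text\n2,baz\n")

rows = read_csv_rows_from_zip(zip_path, workdir / "extracted")
print(len(rows))  # counts rows from both members, not just the first file
```

Extracting first keeps the reading code identical for zipped and non-zipped inputs, at the cost of writing the members to disk.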

albertvillanova · 2021-12-09T08:28:44Z

I have implemented the glob of ZIP files in the packaged modules:

csv
json
text

albertvillanova · 2021-12-09T09:28:56Z

This works in both streaming and non-streaming modes.

albertvillanova · 2021-12-09T18:39:37Z

In c10275f, there were 3 failing tests, only on Linux:

=========================== short test summary info ============================
FAILED tests/test_streaming_download_manager.py::test_streaming_dl_manager_get_extraction_protocol[https://drive.google.com/uc?export=download&id=1k92sUfpHxKq8PXWRr7Y5aNHXwOCNUmqh-zip]
FAILED tests/test_streaming_download_manager.py::test_streaming_gg_drive - Fi...
FAILED tests/test_streaming_download_manager.py::test_streaming_gg_drive_zipped
= 3 failed, 3553 passed, 2950 skipped, 2 xfailed, 1 xpassed, 125 warnings in 192.79s (0:03:12) =

After re-running the CI in 57bfe1f, there was only 1 failing test:

On Linux:

=========================== short test summary info ============================
FAILED tests/test_streaming_download_manager.py::test_streaming_gg_drive_zipped
= 1 failed, 3555 passed, 2950 skipped, 2 xfailed, 1 xpassed, 125 warnings in 199.76s (0:03:19) =

On Windows:

=========================== short test summary info ===========================
FAILED tests/test_load.py::test_load_dataset_builder_for_community_dataset_without_script
= 1 failed, 3551 passed, 2954 skipped, 2 xfailed, 1 xpassed, 121 warnings in 478.58s (0:07:58) =

The test tests/test_streaming_download_manager.py::test_streaming_gg_drive_zipped passes locally.

I guess the failures are caused by the tests themselves and have nothing to do with this PR.

albertvillanova · 2021-12-14T16:45:27Z

@lhoestq my final proposed solution:

I have added the method iter_files to DownloadManager and StreamingDownloadManager
I use this in modules: "csv", "json", "text"
I test for CSV/JSONL/TXT zipped (and non-zipped) files, both in streaming and non-streaming modes
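
To illustrate what a method like iter_files could do, here is a rough standalone sketch. It assumes the method receives one path or a list of paths, yields plain files as they are, and walks directories (for example, an extracted ZIP) recursively. This is an assumption for illustration, not the code merged in this PR.

```python
import os
import tempfile


def iter_files(paths):
    """Yield file paths: files are yielded as-is, directories are walked recursively."""
    if isinstance(paths, str):
        paths = [paths]
    for path in paths:
        if os.path.isfile(path):
            yield path
        else:
            for dirpath, dirnames, filenames in os.walk(path):
                dirnames.sort()  # sort in place so the traversal order is deterministic
                for filename in sorted(filenames):
                    yield os.path.join(dirpath, filename)


# Toy layout: one "extracted archive" directory and one plain file (hypothetical names).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "extracted", "nested"))
for relative in ("extracted/a.csv", "extracted/nested/b.csv", "plain.csv"):
    with open(os.path.join(root, *relative.split("/")), "w", encoding="utf-8") as f:
        f.write("id\n0\n")

inputs = [os.path.join(root, "extracted"), os.path.join(root, "plain.csv")]
found = [os.path.relpath(p, root).replace(os.sep, "/") for p in iter_files(inputs)]
print(found)
```

With a generator like this, a loader can treat a zipped input (extracted to a directory) and a non-zipped input (a single file) through the same loop.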

lhoestq

Cool, thank you! Good job on this :)

Note that at one point we might consider switching to using iter_archive for ZIP files in the json/text/csv loaders since it should be faster.
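
As a sketch of the alternative mentioned here, the snippet below iterates over ZIP members directly, without extracting them to disk first. It assumes an iterator that yields (path, file object) pairs; whether iter_archive behaves exactly like this for ZIP files is not confirmed in this thread, and iter_zip_members is a hypothetical name.

```python
import io
import zipfile


def iter_zip_members(zip_path_or_file):
    """Yield (member name, open binary file object) pairs without extracting to disk."""
    with zipfile.ZipFile(zip_path_or_file) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, only files carry data
            with archive.open(info) as member:
                # The file object is only valid until the next iteration.
                yield info.filename, member


# Build a small in-memory archive (hypothetical member names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "first\n")
    archive.writestr("b.txt", "second\n")
buffer.seek(0)

contents = {
    name: member.read().decode("utf-8") for name, member in iter_zip_members(buffer)
}
print(contents)
```

One plausible reason this could be faster is that it avoids writing every member to disk before reading it, though the thread itself does not give measurements.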

lhoestq · 2021-12-16T15:01:08Z

src/datasets/utils/mock_download_manager.py

@@ -126,7 +126,7 @@ def download_and_extract(self, data_url, *args):
            return self.create_dummy_data_single(dummy_file, data_url)

    # this function has to be in the manager under this name so that testing works
-    def download(self, data_url, *args):
+    def download(self, data_url, *args, **kwargs):


I think you can remove the kwargs?

albertvillanova · 2021-12-16

I left it because *args was already present, but I'll remove it if you think it is OK to have *args without **kwargs... :P

lhoestq · 2021-12-16T15:01:14Z

src/datasets/utils/mock_download_manager.py

@@ -109,7 +109,7 @@ def manual_dir(self):
        return "/".join(self.dummy_file.replace(os.sep, "/").split("/")[:-1])

    # this function has to be in the manager under this name so that testing works
-    def download_and_extract(self, data_url, *args):
+    def download_and_extract(self, data_url, *args, **kwargs):


here as well

albertvillanova · 2021-12-16T17:41:39Z

"Note that at one point we might consider switching to using iter_archive for ZIP files in the json/text/csv loaders since it should be faster."

As long as the functionality is kept... ;)

albertvillanova added 2 commits December 3, 2021 11:39

Implement function to add glob to URL

8131b00

Use function to add glob to URL before opening it

0f130c6

albertvillanova added 8 commits December 9, 2021 09:05

Fix add glob to make it recursive

5bc0dc9

Remove function to add glob

4825cd8

Implement StreamingDownloadManager.glob

695312d

Use glob after extracting

35c0351

Pass glob_archives also to DownloadManager

16fcda4

Force glob archives from json module

26a3f42

Force glob archives from csv module

12cf07b

Force glob archives from text module

04953b9

albertvillanova added 4 commits December 9, 2021 09:31

Implement NestedDataStructure.map

5f26b6c

Use map to detect ZIP archives in DownloadManager.download

c57a619

Implement NestedDataStructure.glob

adac2b6

Use glob in DownloadManager.extract

8fa2673

albertvillanova added 5 commits December 9, 2021 10:38

Fix style

532af67

Fix MockDownloadManager

57c939c

Pass use_auth_token from StreamingDownloadManager.glob

66c1124

Merge remote-tracking branch 'upstream/master' into fix-3373

c10275f

Force CI re-run

57bfe1f

Force CI re-run

39f32f2

This was referenced Dec 10, 2021

Non-deterministic tests: CI tests randomly fail #3415

Closed

Create dataset Habibi bigscience-workshop/data_tooling#291

Closed

Create dataset royal_society_corpus bigscience-workshop/data_tooling#205

Closed

albertvillanova requested review from lhoestq and mariosasko December 13, 2021 16:41

albertvillanova mentioned this pull request Dec 13, 2021

Create dataset british_library_hertiage_made_digital_newspapers bigscience-workshop/data_tooling#232

Open

albertvillanova added 19 commits December 14, 2021 15:34

Use DownloadManager.iter_files in csv module

0c3de93

Fix style

9534e98

Remove unused import

111508c

Pass use_auth_token to xwalk

f48ab97

Pass use_auth_token to xwalk from iter_files

5a32cf7

Remove unnecessary condition in csv module

88b1c5d

Remove unused import

b30dce8

Implement MockDownloadManager.iter_files

0873176

Pass iter_files generator

8beb12f

Add test fixtures with zip jsonl

353e28f

Test load_dataset with zipped JSONL

37311a2

Do not pass glob_archives in json module

8bfdc95

Use DownloadManager.iter_files in json module

bd103ab

Refactor csv module

beab3ec

Add test fixtures with zip txt

e94fcf3

Test load_dataset with zipped text

8326d89

Do not pass glob_archives in text module

3859c08

Use DownloadManager.iter_files in json module

69aedcf

Remove glob functionality

2a19548

Remove glob functionality

053d5d4

This was referenced Dec 15, 2021

Create dataset from Project Gutenberg bigscience-workshop/data_tooling#55

Closed

Create dataset vicon_visim400 bigscience-workshop/data_tooling#126

Open

lhoestq approved these changes Dec 16, 2021

lhoestq reviewed Dec 16, 2021

Address requested changes

a6419bf

albertvillanova merged commit 28644cc into master Dec 16, 2021

albertvillanova deleted the fix-3373 branch December 16, 2021 18:03
