
iter_archive on zipfiles with better compression type check #3379

Merged
merged 11 commits on Jan 24, 2023

Conversation

Mehdi2402
Contributor

Hello @lhoestq, thank you for your detailed answer on the previous PR!
I made this new PR because I misused git on the previous one, #3347.
Related issue #3272.

Comments:

  • For the extension check, I used the _get_extraction_protocol function in download_manager.py with a slight change and called it _get_extraction_protocol_local:

I removed this part:

   elif path.endswith(".tar.gz") or path.endswith(".tgz"):
       raise NotImplementedError(
           f"Extraction protocol for TAR archives like '{urlpath}' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead."
       )

And also changed:

- extension = path.split(".")[-1]
+ extension = "tar" if path.endswith(".tar.gz") else path.split(".")[-1]

The reason for this is that an archive like .tar.gz would otherwise be classified as a .gz and handled with zipfile, though a .tar.gz can only be opened using tarfile.
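The resulting extension check can be sketched as a standalone helper (a simplified illustration of the idea, with a hypothetical name, not the exact datasets implementation):

```python
import os

def get_extraction_protocol(path: str) -> str:
    """Map a file path to an extraction protocol based on its extension.

    A plain `path.split(".")[-1]` would classify "archive.tar.gz" as "gz",
    so double extensions like ".tar.gz" (and ".tgz") are special-cased
    to "tar" first.
    """
    name = os.path.basename(path).lower()
    return "tar" if name.endswith((".tar.gz", ".tgz")) else name.split(".")[-1]
```

With this, `archive.tar.gz` resolves to `tar` (so it is opened with tarfile), while a plain `archive.gz` still resolves to `gz`.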

Please tell me if there's anything to change.

Tasks:

  • download_manager.py
  • streaming_download_manager.py

Member

@lhoestq lhoestq left a comment


Awesome, thank you!

It looks all good to me, I just added a comment to simplify _get_extraction_protocol_local a bit since it only has to support local files. There is an error message that also needs to be moved.

To make sure that everything works as expected, it would also be nice to have some tests for iter_archive in test_download_manager.py and test_streaming_download_manager.py. I can help you with that if you want.

src/datasets/utils/download_manager.py (review comments resolved)
src/datasets/utils/streaming_download_manager.py (review comments resolved)
Mehdi2402 and others added 2 commits December 6, 2021 21:10
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@Mehdi2402
Contributor Author

Hello @lhoestq, thank you for your answer.

I don't use pytest a lot, so I think I might need some help with it :) but I tried some tests for streaming_download_manager.py only. I don't know how to test download_manager.py since we need to use local files.

Comments :

  • In download_manager.py I removed some unnecessary imports after the simplification of _get_extraction_protocol_local.
  • In streaming_download_manager I moved the raised Error as suggested.

I also started some tests on StreamingDownloadManager():

  • Used an existing zip file URL and added a new one that contains a folder and several files:
TEST_GG_DRIVE_ZIPPED_URL = "https://drive.google.com/uc?export=download&id=1k92sUfpHxKq8PXWRr7Y5aNHXwOCNUmqh"
TEST_GG_DRIVE2_ZIPPED_URL = "https://drive.google.com/uc?export=download&id=1X4jyUBBbShyCRfD-vCO1ZvfqFXP3NEeU"
  • For now, the tests cover:

    • The return type of the function: should be a tuple
    • The file names
    • The file contents
  • Added an xfail test for the gzip file, because I get a zipfile.BadZipFile exception.

  • And lastly, changed the test for _get_extraction_protocol_throws since the check was moved to _extract:

@pytest.mark.xfail(raises=NotImplementedError)
def test_streaming_dl_manager_get_extraction_protocol_throws(urlpath):
-    _get_extraction_protocol(urlpath)
+    StreamingDownloadManager()._extract(urlpath)
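As an aside, the zipfile.BadZipFile failure mentioned for the gzip file can be reproduced with the standard library alone (a minimal standalone repro, independent of datasets):

```python
import gzip
import os
import tempfile
import zipfile

# Write a small gzip file, then try to open it as a zip archive: zipfile
# rejects it because a gzip stream has no ZIP end-of-central-directory record.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "data.txt.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(b"hello")

try:
    zipfile.ZipFile(gz_path)
except zipfile.BadZipFile as err:
    print("zipfile.BadZipFile:", err)
```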

Member

@lhoestq lhoestq left a comment


Cool, thanks! I think we can use another host than Google Drive for the test file you added.

For the local test you can use the zip_csv_path pytest fixture (just add zip_csv_path as an argument to the test function). It's a path to a temporary zip file that contains two CSV files, dataset.csv and dataset2.csv. If you're not familiar with pytest, I can take care of this.
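A self-contained sketch of what such a local test could check; `make_zip_csv` emulates the zip_csv_path fixture described above and `iter_zip_members` stands in for iter_archive on a local zip (both names and the CSV contents are made up for illustration):

```python
import io
import os
import tempfile
import zipfile

def make_zip_csv(tmp_dir: str) -> str:
    # Emulates the zip_csv_path fixture: a temporary zip containing
    # dataset.csv and dataset2.csv.
    path = os.path.join(tmp_dir, "csv.zip")
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("dataset.csv", "col_1,col_2\n1,2\n")
        zf.writestr("dataset2.csv", "col_1,col_2\n3,4\n")
    return path

def iter_zip_members(path: str):
    # Stand-in for iter_archive on a local zip file: yield
    # (member_name, file_object) pairs for each regular file.
    with zipfile.ZipFile(path) as zf:
        for member in zf.infolist():
            if not member.is_dir():
                yield member.filename, io.BufferedReader(zf.open(member, "r"))

zip_path = make_zip_csv(tempfile.mkdtemp())
names = sorted(name for name, _ in iter_zip_members(zip_path))
assert names == ["dataset.csv", "dataset2.csv"]
```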

continue
file_obj = stream.extractfile(tarinfo)
file_obj = io.BufferedReader(zipf.open(member, "r"))
yield (file_path, file_obj)
Member


Let's raise an error if it's not TAR or ZIP, just in case:

Suggested change
yield (file_path, file_obj)
yield (file_path, file_obj)
else:
raise NotImplementedError(f"DownloadManager.iter_archive isn't implemented for '{extension}' file at {path}")

file_obj = stream.extractfile(tarinfo)
yield (file_path, file_obj)
stream.members = []
del stream
Member


Suggested change
del stream
del stream
else:
raise NotImplementedError(f"StreamingDownloadManager.iter_archive isn't implemented for '{extension}' file at {path}")

Comment on lines 43 to 45
TEST_GG_DRIVE2_ZIPPED_URL = "https://drive.google.com/uc?export=download&id=1X4jyUBBbShyCRfD-vCO1ZvfqFXP3NEeU"
TEST_GG_DRIVE2_FILENAME = ["atoz/test1.txt", "atoz/test2.txt", "atoz/test3/test3.txt"]
TEST_GG_DRIVE2_CONTENT = ["Erwin is the best character", "this is test 2", "foo"]
Member


For such tests I would use another host than Google Drive, which is often unreliable.
In this file you can see some tests that use files hosted on Google Drive, but that is just to make sure that streaming from Google Drive works (it required special logic). For the other tests, let's use better hosts.

I downloaded your zip and put it here if it can be useful:
https://huggingface.co/datasets/lhoestq/test_zip_txt/resolve/main/atoz.zip

"urlpath",
[(TEST_GG_DRIVE_GZIPPED_URL)],
)
@pytest.mark.xfail()
Member


It's ok if iter_archive on a gzip file raises zipfile.BadZipFile as specified in your test, so I would remove the xfail.

Suggested change
@pytest.mark.xfail()

Comment on lines 40 to 44
from .streaming_download_manager import (
BASE_KNOWN_EXTENSIONS,
COMPRESSION_EXTENSION_TO_PROTOCOL,
_get_extraction_protocol_with_magic_number,
)
Collaborator


Would move this to src/datasets/utils/file_utils.py to avoid importing from src/datasets/utils/streaming_download_manager.py to src/datasets/utils/download_manager.py

Contributor Author


I am not sure I understood well: do you want me to redefine the function and dictionaries in file_utils.py, or just move the imports to file_utils.py?
I went with the second option in commit 10cf700 until your response.

Contributor Author


Would move this to src/datasets/utils/file_utils.py to avoid importing from src/datasets/utils/streaming_download_manager.py to src/datasets/utils/download_manager.py

Hello @mariosasko, moving COMPRESSION_EXTENSION_TO_PROTOCOL to file_utils.py creates a cyclic import error, as it uses COMPRESSION_FILESYSTEMS, which is imported from filesystems.py, and filesystems.py itself imports from file_utils.py.

How should I arrange the imports for the best clarity and consistency?

Collaborator


Maybe let's move COMPRESSION_EXTENSION_TO_PROTOCOL to filesystems.py then?

@@ -64,6 +71,20 @@ class GenerateMode(enum.Enum):
FORCE_REDOWNLOAD = "force_redownload"


def _get_extraction_protocol_local(path: str) -> Optional[str]:
Collaborator


A nit for consistency with streaming_download_manager.py:

Suggested change
def _get_extraction_protocol_local(path: str) -> Optional[str]:
def _get_extraction_protocol(path: str) -> Optional[str]:

Contributor Author


Hello @mariosasko @lhoestq, thanks for the answers. I haven't been active lately; I will get back to complete this pull request later this week.

Member


Thanks @Mehdi2402!

@Mehdi2402
Contributor Author

Hello,
This commit takes into account all the comments except those for test_download_manager.py.
I will work on that in the next commit.

Sorry again for being inactive lately in this PR.

@lhoestq
Member

lhoestq commented Feb 4, 2022

Thanks a lot! The CI seems to have import errors now though?

@Mehdi2402
Contributor Author

Thanks a lot! The CI seems to have import errors now though?

Yes, sorry about that: it's due to a cyclic import I didn't pay attention to.

Will fix that in the next commit, along with adding the tests to download_manager.

@albertvillanova
Member

Hello @Mehdi2402, are you still interested in working on this further?

@Mehdi2402
Contributor Author

Hello @Mehdi2402, are you still interested in working on this further?

Hello @albertvillanova, yes I would like to resume work on this.

@albertvillanova
Member

albertvillanova commented Sep 23, 2022

Great, we would like to have this feature.

First, you should resolve the conflicts with the main branch, by merging main into your feature branch and then fixing the conflicts by hand. Let us know if you would need some help on this: we can resolve the conflicts for you, so that you can continue your contribution afterwards.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jan 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@mariosasko
Collaborator

I refactored the code to make this PR ready for the final review.

Member

@lhoestq lhoestq left a comment


Nice!

src/datasets/download/download_manager.py (review comment resolved)
@mariosasko mariosasko merged commit 697b6d6 into huggingface:main Jan 24, 2023
@github-actions

Show benchmarks

Benchmark results were posted for PyArrow==6.0.0 and PyArrow==latest, covering benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json (full new/old metric tables omitted).
