
iter_archive on zipfiles with better compression type check #3379

Merged
merged 11 commits on Jan 24, 2023

Conversation

Mehdi2402
Contributor

Hello @lhoestq, thank you for your detailed answer on the previous PR!
I made this new PR because I misused git on the previous one, #3347.
Related issue #3272.

Comments:

  • For the extension check, I used the _get_extraction_protocol function in download_manager.py with a slight change and called it _get_extraction_protocol_local:

I removed this part:

   elif path.endswith(".tar.gz") or path.endswith(".tgz"):
       raise NotImplementedError(
           f"Extraction protocol for TAR archives like '{urlpath}' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead."
       )

And also changed:

- extension = path.split(".")[-1]
+ extension = "tar" if path.endswith(".tar.gz") else path.split(".")[-1]

The reason for this is that an archive like .tar.gz would otherwise be classified as a .gz and handled with zipfile, though a .tar.gz can only be opened using tarfile.
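The resulting extension check can be sketched as a standalone helper (a simplified illustration of the idea, with a hypothetical name, not the exact datasets implementation):

```python
import os

def get_extraction_protocol(path: str) -> str:
    """Map a file path to an extraction protocol based on its extension.

    A plain `path.split(".")[-1]` would classify "archive.tar.gz" as "gz",
    so double extensions like ".tar.gz" (and ".tgz") are special-cased
    to "tar" first.
    """
    name = os.path.basename(path).lower()
    return "tar" if name.endswith((".tar.gz", ".tgz")) else name.split(".")[-1]
```

With this, `archive.tar.gz` resolves to `tar` (so it is opened with tarfile), while a plain `archive.gz` still resolves to `gz`.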

Please tell me if there's anything to change.

Tasks:

  • download_manager.py
  • streaming_download_manager.py

Member

@lhoestq lhoestq left a comment


Awesome, thank you!

It looks all good to me, I just added a comment to simplify _get_extraction_protocol_local a bit since it only has to support local files. There is an error message that also needs to be moved.

To make sure that everything works as expected, it would also be nice to have some tests for iter_archive in test_download_manager.py and test_streaming_download_manager.py. I can help you with that if you want.

src/datasets/utils/download_manager.py (review comments resolved)
src/datasets/utils/streaming_download_manager.py (review comments resolved)
Mehdi2402 and others added 2 commits December 6, 2021 21:10
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@Mehdi2402
Contributor Author

Hello @lhoestq, thank you for your answer.

I don't use pytest a lot, so I think I might need some help with it :) but I tried some tests for streaming_download_manager.py only. I don't know how to test download_manager.py since we need to use local files.

Comments :

  • In download_manager.py I removed some unnecessary imports after the simplification of _get_extraction_protocol_local.
  • In streaming_download_manager I moved the raised Error as suggested.

I also started some tests on StreamingDownloadManager():

  • Used an existing zip file URL and added a new one that contains a folder and several files:
TEST_GG_DRIVE_ZIPPED_URL = "https://drive.google.com/uc?export=download&id=1k92sUfpHxKq8PXWRr7Y5aNHXwOCNUmqh"
TEST_GG_DRIVE2_ZIPPED_URL = "https://drive.google.com/uc?export=download&id=1X4jyUBBbShyCRfD-vCO1ZvfqFXP3NEeU"
  • For now, the tests cover:

    • The return type of the function: should be a tuple
    • The file names
    • The file contents
  • Added an xfail test for the gzip file, because I get a zipfile.BadZipFile exception.

  • And lastly, changed the test for _get_extraction_protocol_throws since the check was moved to _extract:

@pytest.mark.xfail(raises=NotImplementedError)
def test_streaming_dl_manager_get_extraction_protocol_throws(urlpath):
-    _get_extraction_protocol(urlpath)
+    StreamingDownloadManager()._extract(urlpath)
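As an aside, the zipfile.BadZipFile failure mentioned for the gzip file can be reproduced with the standard library alone (a minimal standalone repro, independent of datasets):

```python
import gzip
import os
import tempfile
import zipfile

# Write a small gzip file, then try to open it as a zip archive: zipfile
# rejects it because a gzip stream has no ZIP end-of-central-directory record.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "data.txt.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(b"hello")

try:
    zipfile.ZipFile(gz_path)
except zipfile.BadZipFile as err:
    print("zipfile.BadZipFile:", err)
```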

Member

@lhoestq lhoestq left a comment


Cool, thanks! I think we can use another host than Google Drive for the test file you added.

For the local test you can use the zip_csv_path pytest fixture (just add zip_csv_path as an argument to the test function). It's a path to a temporary zip file that contains two CSV files, dataset.csv and dataset2.csv. If you're not familiar with pytest, I can take care of this.
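A self-contained sketch of what such a local test could check; `make_zip_csv` emulates the zip_csv_path fixture described above and `iter_zip_members` stands in for iter_archive on a local zip (both names and the CSV contents are made up for illustration):

```python
import io
import os
import tempfile
import zipfile

def make_zip_csv(tmp_dir: str) -> str:
    # Emulates the zip_csv_path fixture: a temporary zip containing
    # dataset.csv and dataset2.csv.
    path = os.path.join(tmp_dir, "csv.zip")
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("dataset.csv", "col_1,col_2\n1,2\n")
        zf.writestr("dataset2.csv", "col_1,col_2\n3,4\n")
    return path

def iter_zip_members(path: str):
    # Stand-in for iter_archive on a local zip file: yield
    # (member_name, file_object) pairs for each regular file.
    with zipfile.ZipFile(path) as zf:
        for member in zf.infolist():
            if not member.is_dir():
                yield member.filename, io.BufferedReader(zf.open(member, "r"))

zip_path = make_zip_csv(tempfile.mkdtemp())
names = sorted(name for name, _ in iter_zip_members(zip_path))
assert names == ["dataset.csv", "dataset2.csv"]
```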

continue
file_obj = stream.extractfile(tarinfo)
file_obj = io.BufferedReader(zipf.open(member, "r"))
yield (file_path, file_obj)
Member


Let's raise an error if it's not TAR or ZIP, just in case:

Suggested change
yield (file_path, file_obj)
yield (file_path, file_obj)
else:
raise NotImplementedError(f"DownloadManager.iter_archive isn't implemented for '{extension}' file at {path}")

file_obj = stream.extractfile(tarinfo)
yield (file_path, file_obj)
stream.members = []
del stream
Member


Suggested change
del stream
del stream
else:
raise NotImplementedError(f"StreamingDownloadManager.iter_archive isn't implemented for '{extension}' file at {path}")

Comment on lines 43 to 45
TEST_GG_DRIVE2_ZIPPED_URL = "https://drive.google.com/uc?export=download&id=1X4jyUBBbShyCRfD-vCO1ZvfqFXP3NEeU"
TEST_GG_DRIVE2_FILENAME = ["atoz/test1.txt", "atoz/test2.txt", "atoz/test3/test3.txt"]
TEST_GG_DRIVE2_CONTENT = ["Erwin is the best character", "this is test 2", "foo"]
Member


For such tests I would use another host than Google Drive, which is often unreliable.
In this file you can see some tests that use files hosted on Google Drive, but that is just to make sure that streaming from Google Drive works (it required special logic). For the other tests, let's use better hosts.

I downloaded your zip and put it here if it can be useful:
https://huggingface.co/datasets/lhoestq/test_zip_txt/resolve/main/atoz.zip

"urlpath",
[(TEST_GG_DRIVE_GZIPPED_URL)],
)
@pytest.mark.xfail()
Member


It's ok if iter_archive on a gzip file raises zipfile.BadZipFile as specified in your test, so I would remove the xfail.

Suggested change
@pytest.mark.xfail()

Comment on lines 40 to 44
from .streaming_download_manager import (
BASE_KNOWN_EXTENSIONS,
COMPRESSION_EXTENSION_TO_PROTOCOL,
_get_extraction_protocol_with_magic_number,
)
Collaborator


Would move this to src/datasets/utils/file_utils.py to avoid importing from src/datasets/utils/streaming_download_manager.py to src/datasets/utils/download_manager.py

Contributor Author


I am not sure I understood well: do you want me to redefine the function and dictionaries in file_utils.py, or just move the imports to file_utils.py?
I went with the second option in commit 10cf700 until your response.

Contributor Author


Would move this to src/datasets/utils/file_utils.py to avoid importing from src/datasets/utils/streaming_download_manager.py to src/datasets/utils/download_manager.py

Hello @mariosasko, moving COMPRESSION_EXTENSION_TO_PROTOCOL to file_utils.py creates a cyclic import error, as it uses COMPRESSION_FILESYSTEMS, which is imported from filesystems.py, and filesystems.py itself imports from file_utils.py.

How should I arrange the imports for the best clarity and consistency?

Collaborator


Maybe let's move COMPRESSION_EXTENSION_TO_PROTOCOL to filesystems.py then?

@@ -64,6 +71,20 @@ class GenerateMode(enum.Enum):
FORCE_REDOWNLOAD = "force_redownload"


def _get_extraction_protocol_local(path: str) -> Optional[str]:
Collaborator


A nit for consistency with streaming_download_manager.py:

Suggested change
def _get_extraction_protocol_local(path: str) -> Optional[str]:
def _get_extraction_protocol(path: str) -> Optional[str]:

Contributor Author


Hello @mariosasko @lhoestq, thanks for the answers. I haven't been active lately; I will get back to complete this pull request later this week.

Member


Thanks @Mehdi2402!

@Mehdi2402
Contributor Author

Hello,
This commit takes into account all the comments except those for test_download_manager.py.
I will work on that in the next commit.

Sorry again for being inactive lately in this PR.

@lhoestq
Member

lhoestq commented Feb 4, 2022

Thanks a lot! The CI seems to have import errors now though?

@Mehdi2402
Contributor Author

Thanks a lot! The CI seems to have import errors now though?

Yes, sorry about that: it's due to a cyclic import I didn't pay attention to.

Will fix that in the next commit, along with adding the tests to download_manager.

@albertvillanova
Member

Hello @Mehdi2402, are you still interested in working on this further?

@Mehdi2402
Contributor Author

Hello @Mehdi2402, are you still interested in working on this further?

Hello @albertvillanova, yes I would like to resume work on this.

@albertvillanova
Member

albertvillanova commented Sep 23, 2022

Great, we would like to have this feature.

First, you should resolve the conflicts with the main branch, by merging main into your feature branch and then fixing the conflicts by hand. Let us know if you would need some help on this: we can resolve the conflicts for you, so that you can continue your contribution afterwards.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jan 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@mariosasko
Collaborator

I refactored the code to make this PR ready for the final review.

Member

@lhoestq lhoestq left a comment


Nice!

src/datasets/download/download_manager.py (review comment resolved)
@mariosasko mariosasko merged commit 697b6d6 into huggingface:main Jan 24, 2023
@github-actions

Show benchmarks

Benchmark results were posted for PyArrow==6.0.0 and PyArrow==latest, covering benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json (full new/old metric tables omitted).
