-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
iter_archive on zipfiles with better compression type check #3379
iter_archive on zipfiles with better compression type check #3379
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome thank you !
It looks all good to me, I just added a comment to simplify _get_extraction_protocol_local
a bit since it only has to support local files. There is an error message that also needs to be moved.
To make sure that everything works es expected it would also be nice to have some tests for iter_archive
in test_download_manager.py
and test_streaming_download_manager.py
. I can help you with that if you want
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Hello @lhoestq, thank you for your answer. I don't use pytest a lot so I think I might need some help on it :) but I tried some tests for Comments :
I also started some tests on
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool Thanks ! I think we can use another host that Google Drive for the test file you added.
For the local test you can use the zip_csv_path
pytest fixture (just add zip_csv_path
as an argument to the test function). It's a path to a temporary zip file that contains two CSV files dataset.csv
and dataset2.csv
. If you're not familiar with pytest I can take care of this
continue | ||
file_obj = stream.extractfile(tarinfo) | ||
file_obj = io.BufferedReader(zipf.open(member, "r")) | ||
yield (file_path, file_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's raise an error if it's not TAR or ZIP, just in case:
yield (file_path, file_obj) | |
yield (file_path, file_obj) | |
else: | |
raise NotImplementedError(f"DownloadManager.iter_archive isn't implemented for '{extension}' file at {path}") |
file_obj = stream.extractfile(tarinfo) | ||
yield (file_path, file_obj) | ||
stream.members = [] | ||
del stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
del stream | |
del stream | |
else: | |
raise NotImplementedError(f"StreamingDownloadManager.iter_archive isn't implemented for '{extension}' file at {path}") |
TEST_GG_DRIVE2_ZIPPED_URL = "https://drive.google.com/uc?export=download&id=1X4jyUBBbShyCRfD-vCO1ZvfqFXP3NEeU" | ||
TEST_GG_DRIVE2_FILENAME = ["atoz/test1.txt", "atoz/test2.txt", "atoz/test3/test3.txt"] | ||
TEST_GG_DRIVE2_CONTENT = ["Erwin is the best character", "this is test 2", "foo"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For such tests I would use another host than Google Drive, which is often unreliable.
In this file you can see some tests that are using files hosted on Google Drive, but this is just to make sure that streaming from Google Drive works (it required a special logic). For the others tests let's use better hosts.
I downloaded your zip and put it here if it can be useful:
https://huggingface.co/datasets/lhoestq/test_zip_txt/resolve/main/atoz.zip
"urlpath", | ||
[(TEST_GG_DRIVE_GZIPPED_URL)], | ||
) | ||
@pytest.mark.xfail() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok if iter_archive
on a gzip file raises zipfile.BadZipFile
as specified in your test, so I would remove xfail
@pytest.mark.xfail() |
from .streaming_download_manager import ( | ||
BASE_KNOWN_EXTENSIONS, | ||
COMPRESSION_EXTENSION_TO_PROTOCOL, | ||
_get_extraction_protocol_with_magic_number, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would move this to src/datasets/utils/file_utils.py
to avoid importing from src/datasets/utils/streaming_download_manager.py
to src/datasets/utils/download_manager.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if I understood well, do you want me to redefine the function and dictionnaries in file_utils.py
or just move the imports to file_utils.py
.
I used the second one in the commit 10cf700 Until your response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would move this to
src/datasets/utils/file_utils.py
to avoid importing fromsrc/datasets/utils/streaming_download_manager.py
tosrc/datasets/utils/download_manager.py
Hello @mariosasko, moving COMPRESSION_EXTENSION_TO_PROTOCOL
to file_utils.py
creates a cyclic import error as it uses COMPRESSION_FILESYSTEMS
which is imported from filesystems.py
, and inside filesystems.py
there are imports from file_utils.py
.
How do I arrange import which is best for clarity and consistency ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe let's move COMPRESSION_EXTENSION_TO_PROTOCOL
to filesystems.py
then?
@@ -64,6 +71,20 @@ class GenerateMode(enum.Enum): | |||
FORCE_REDOWNLOAD = "force_redownload" | |||
|
|||
|
|||
def _get_extraction_protocol_local(path: str) -> Optional[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit for consistency with streaming_download_manager.py
:
def _get_extraction_protocol_local(path: str) -> Optional[str]: | |
def _get_extraction_protocol(path: str) -> Optional[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @mariosasko @lhoestq Thanks for the answers, I wasn't active lately I will get back to complete this pull request later this week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @Mehdi2402 !
Hello, Sorry again for being inactive lately in this PR. |
thanks a lot ! This CI seems to have import errors now though ? |
Yes sorry about that, it's due to a cyclic import I didn't pay attention to. Will fix that in the next Commit along with adding the tests to download_manager. |
Hello @Mehdi2402, are you still interested in working on this further? |
Hello @albertvillanova, yes I would like to resume work on this. |
Great, we would like to have this feature. First, you should resolve the conflicts with the main branch, by merging main into your feature branch and then fixing the conflicts by hand. Let us know if you would need some help on this: we can resolve the conflicts for you, so that you can continue your contribution afterwards. |
The documentation is not available anymore as the PR was closed or merged. |
I refactored the code to make this PR ready for the final review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice !
Show benchmarksPyArrow==6.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Hello @lhoestq , thank you for your detailed answer on previous PR !
I made this new PR because I misused git on the previous one #3347.
Related issue #3272.
Comments :
_get_extraction_protocol
function in download_manager.py with a slight change and called it_get_extraction_protocol_local
:I removed this part :
And also changed :
The reason for this is a compression like .tar.gz will be considered a .gz which is handled with zipfile, though tar.gz can only be opened using tarfile.
Please tell me if there's anything to change.
Tasks :