Support streaming zipped dataset repo by passing only repo name #3375
I just tested and I think this only opens one file? If there are several files in the ZIP, only the first one is opened. To open several files from a ZIP, one has to call
What about updating the CSV loader to make it
I have implemented the glob of ZIP files in the packaged modules:
Also for streaming and non-streaming.
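To illustrate what globbing the files inside a ZIP means, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the function name `glob_zip_members` and the sample file names are hypothetical, chosen only for the example.

```python
import fnmatch
import io
import zipfile


def glob_zip_members(zip_source, pattern="*"):
    """Return the sorted names of all regular files in a ZIP that match `pattern`."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(
            info.filename
            for info in archive.infolist()
            # Skip directory entries; keep every file whose name matches.
            if not info.is_dir() and fnmatch.fnmatch(info.filename, pattern)
        )


# Build a small in-memory ZIP with two CSV files and one text file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/train.csv", "a,b\n1,2\n")
    archive.writestr("data/test.csv", "a,b\n3,4\n")
    archive.writestr("README.txt", "hello\n")

# Every CSV member is found, not just the first one.
csv_members = glob_zip_members(buffer, "*.csv")
```

A loader built on this idea would open each returned member in turn instead of stopping after the first file.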
In c10275f, there were 3 failing tests, only on Linux:
After re-running the CI in 57bfe1f, there was only 1 failing test:
The test I guess the issue is caused by those tests and has nothing to do with this PR.
@lhoestq my final proposed solution:
Cool, thank you! Good job on this :)
Note that at some point we might consider switching to iter_archive for ZIP files in the json/text/csv loaders, since it should be faster.
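The general idea of iterating over an archive member by member, without extracting it to disk first, can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the library's `iter_archive` implementation: the name `iter_zip_archive` is hypothetical, and yielding (name, file object) pairs is an assumption made for the example.

```python
import io
import zipfile


def iter_zip_archive(zip_source):
    """Yield (member_name, binary_file_object) pairs, one member at a time."""
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # The file object is only valid until the next iteration step.
            with archive.open(info) as member:
                yield info.filename, member


# Build a small in-memory ZIP to iterate over.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "first\n")
    archive.writestr("b.txt", "second\n")

# Read each member while its file object is still open.
contents = {name: f.read().decode("utf-8") for name, f in iter_zip_archive(buffer)}
```

Because each member is read directly from the archive, nothing is written to a temporary extraction directory.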
@@ -126,7 +126,7 @@ def download_and_extract(self, data_url, *args):
         return self.create_dummy_data_single(dummy_file, data_url)

     # this function has to be in the manager under this name so that testing works
-    def download(self, data_url, *args):
+    def download(self, data_url, *args, **kwargs):
I think you can remove the kwargs ?
I left it because *args was already present. But I'll remove it if you think it is OK to have *args but not **kwargs... :P
@@ -109,7 +109,7 @@ def manual_dir(self):
         return "/".join(self.dummy_file.replace(os.sep, "/").split("/")[:-1])

     # this function has to be in the manager under this name so that testing works
-    def download_and_extract(self, data_url, *args):
+    def download_and_extract(self, data_url, *args, **kwargs):
here as well
As long as the functionality is kept... ;)
Proposed solution: add iter_files to DownloadManager and StreamingDownloadManager.
Fix #3373.
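As a rough illustration of what an `iter_files`-style helper does, the sketch below accepts a single path or a list of paths and yields every file it finds, walking into directories such as an extracted ZIP. It uses only the standard library and is not the code added by this PR: the name `iter_files_sketch`, the sorted traversal order, and the sample file names are assumptions made for the example.

```python
import os
import tempfile


def iter_files_sketch(paths):
    """Yield file paths from a path or a list of paths, walking directories."""
    if isinstance(paths, (str, os.PathLike)):
        paths = [paths]
    for path in paths:
        if os.path.isfile(path):
            yield os.fspath(path)
        else:
            for dirpath, dirnames, filenames in os.walk(path):
                dirnames.sort()  # make the traversal order deterministic
                for filename in sorted(filenames):
                    yield os.path.join(dirpath, filename)


# Simulate an extracted archive containing several CSV files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for relative in ("b.csv", "a.csv", os.path.join("sub", "c.csv")):
        with open(os.path.join(root, relative), "w") as f:
            f.write("x\n")
    found = [os.path.relpath(p, root) for p in iter_files_sketch(root)]
```

A loader that iterates over such a helper receives every file from the extracted archive instead of only the first one.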