-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refac module factory + avoid etag requests for hub datasets #2986
Conversation
FYI, various other projects currently use it, thus clearly a major version would be required: https://github.com/search?q=org%3Ahuggingface+prepare_module&type=code |
Yea so I kept |
I think I'm done with the tests :) I'll do the comments/docs and then we just wait for huggingface/huggingface_hub#373 to get merged |
When there's a new release of The PR is ready for review. Let me know if I need to clarify some parts |
One additional change I did: the tests won't affect the number of downloads on the website anymore. And users can choose to not update the number of downloads with |
CI failures are simply due to RAM issues with circleci workers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this contribution! This was so necessary.
This PR makes a clear distinction between all the different use cases, making use of the factory pattern instead of the previous heavily nested if conditions.
Some comments/questions below.
src/datasets/load.py
Outdated
logger.warning( | ||
f"Couldn't find a directory or a dataset named '{self.name}'. " | ||
f"It was picked from the master branch on github instead at {file_path}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't log this warning from within the download_dataset_script_from_master()
method because:
- This implies this method is always dependent on some other method previously called. I would move this warning one level up, from within its caller (as the caller is responsible of calling the other method and this one).
- The warning is already present in the caller
get_module()
(see line 465)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that once you remove the warning, both methods have the same implementation with different revision
parameter values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged both methods, and only kept the warning one level up.
src/datasets/load.py
Outdated
def download_dataset_script(self) -> str: | ||
file_path = hf_github_url(path=self.name, name=self.name + ".py", revision=self.revision) | ||
return cached_path(file_path, download_config=self.download_config) | ||
|
||
def download_dataset_script_from_master(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following with the philosophy of the factory design pattern (a common interface with different implementations), I would call these 2 methods:
download_script
ordownload_loading_script
download_script_from_master
ordownload_loading_script_from_master
analogously to the methodget_module
(we don't call itget_dataset_module
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it to download_loading_script
, and removed the method "from_master" (see previous comment about merging the two methods)
class Url(str): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it has no further implementation, maybe a type alias instead?
class Url(str): | |
pass | |
Url = str |
This way any str
passed qualifies as Url
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it useful for the tests to be able to do isinstance(url, Url)
and check that it's not just a str
type
src/datasets/load.py
Outdated
dynamic_modules_path = self.dynamic_modules_path if self.dynamic_modules_path else init_dynamic_modules() | ||
importable_directory_path = os.path.join(dynamic_modules_path, "datasets", self.name) | ||
Path(importable_directory_path).mkdir(parents=True, exist_ok=True) | ||
(Path(importable_directory_path).parent / "__init__.py").touch(exist_ok=True) | ||
hash = files_to_hash([local_path] + [loc[1] for loc in local_imports]) | ||
importable_local_file = _copy_script_and_other_resouces_in_importable_dir( | ||
name=self.name, | ||
importable_directory_path=importable_directory_path, | ||
subdirrectory_name=hash, | ||
original_local_path=local_path, | ||
local_imports=local_imports, | ||
additional_files=additional_files, | ||
download_mode=self.download_mode, | ||
) | ||
logger.debug(f"Created importable dataset file at {importable_local_file}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A method with this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put all this into a function, common to all the ones that new to create an importable script
Refactor the module factory
When trying to extend the
data_files
logic to avoid doing unnecessary ETag requests, I noticed that the module preparation mechanism needed a refactor:The module preparation mechanism is in charge of taking the name of a dataset or a metric given by the user (ex: "squad", "accuracy", "lhoestq/test", "path/to/my/script.py", "path/to/my/data/directory", "json", "csv") and return a module (possibly downloaded from the Hub) that contains the dataset builder or the metric class to use.
Implementation details
I decided to separate all these use cases into different dataset/metric module factories.
First, the metric module factories:
Then, the dataset module factories:
And finally, additional factories when users have no internet:
Breaking changes
One thing is that I still don't know at what extent we want to keep backward compatibility for
prepare_module
. For now I just kept it (except I removed two parameters) just in case, but it's not used anywhere anymore.Avoid etag requests for hub datasets
To do this I added a class
DataFilesDict
that can be hashed to define the cache directory of the dataset.It contains the usual data files formatted as
{"train": ["train.txt"]}
for example.But each list of file is a
DataFilesList
that also has aorigin_metadata
attribute that contains metadata about the origin of each file:*
,*.csv
, "train.txt", etc.) and the commit sha of the repository (so there're no ETag requests !)This way if any file changes, the hash of the
DataFilesDict
changes too !You can instantiate a
DataFilesDict
by using patterns for local/remote files or files in a HF repository:DataFilesDict.from_local_or_remote(patterns)
DataFilesDict.from_hf_repo(patterns, dataset_info)
Fix #2859
TODO
Fix the latest test:
Add some more tests:
Other: