Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refac module factory + avoid etag requests for hub datasets #2986

Merged
merged 32 commits into from
Oct 11, 2021

Conversation

lhoestq
Copy link
Member

@lhoestq lhoestq commented Sep 29, 2021

Refactor the module factory

When trying to extend the data_files logic to avoid doing unnecessary ETag requests, I noticed that the module preparation mechanism needed a refactor:

  • the function was 600 lines long
  • it was not readable
  • it contained many different cases that made it complex to maintain
  • it was hard to properly test it
  • it was hard to extend without breaking anything

The module preparation mechanism is in charge of taking the name of a dataset or a metric given by the user (ex: "squad", "accuracy", "lhoestq/test", "path/to/my/script.py", "path/to/my/data/directory", "json", "csv") and return a module (possibly downloaded from the Hub) that contains the dataset builder or the metric class to use.

Implementation details

I decided to separate all these use cases into different dataset/metric module factories.

First, the metric module factories:

  • CanonicalMetricModuleFactory: "accuracy", "rouge", ...
  • LocalMetricModuleFactory: "path/to/my/metric.py"

Then, the dataset module factories:

  • CanonicalDatasetModuleFactory: "squad", "glue", ...
  • CommunityDatasetModuleFactoryWithScript: "lhoestq/test"
  • CommunityDatasetModuleFactoryWithoutScript: "lhoestq/demo1"
  • PackagedDatasetModuleFactory: "json", "csv", ...
  • LocalDatasetModuleFactoryWithScript: "path/to/my/script.py"
  • LocalDatasetModuleFactoryWithoutScript: "path/to/my/data/directory"

And finally, additional factories when users have no internet:

  • CachedDatasetModuleFactory
  • CachedMetricModuleFactory

Breaking changes

One thing is that I still don't know at what extent we want to keep backward compatibility for prepare_module. For now I just kept it (except I removed two parameters) just in case, but it's not used anywhere anymore.

Avoid etag requests for hub datasets

To do this I added a class DataFilesDict that can be hashed to define the cache directory of the dataset.
It contains the usual data files formatted as {"train": ["train.txt"]} for example.
But each list of file is a DataFilesList that also has a origin_metadata attribute that contains metadata about the origin of each file:

  • for URLs: it stores the ETags of the files
  • for local files: it stores the last modification data
  • for files from a Hugging Face repository on the Hub: it stores the pattern (*, *.csv, "train.txt", etc.) and the commit sha of the repository (so there're no ETag requests !)

This way if any file changes, the hash of the DataFilesDict changes too !

You can instantiate a DataFilesDict by using patterns for local/remote files or files in a HF repository:

  • for local/remote files: DataFilesDict.from_local_or_remote(patterns)
  • for files in a HF repository: DataFilesDict.from_hf_repo(patterns, dataset_info)

Fix #2859

TODO

Fix the latest test:

Add some more tests:

  • test all the factories
  • test the new data files logic

Other:

  • docstrings
  • comments

@severo
Copy link
Collaborator

severo commented Sep 29, 2021

One thing is that I still don't know at what extent we want to keep backward compatibility for prepare_module. For now I just kept it (except I removed two parameters) just in case, but it's not used anywhere anymore.

FYI, various other projects currently use it, thus clearly a major version would be required:

https://github.com/search?q=org%3Ahuggingface+prepare_module&type=code

@lhoestq
Copy link
Member Author

lhoestq commented Sep 29, 2021

Yea so I kept prepare_module and changed it to use all the factories I added, so all the use cases in the link you shared are still working. The only two parameters I removed are minor IMO and were a bit hacky anyway (return_resolved_file_path and return_associated_base_path). I think they were only used internally in datasets but let me know if you're aware of a use case I didn't think of.

@lhoestq
Copy link
Member Author

lhoestq commented Sep 30, 2021

I think I'm done with the tests :) I'll do the comments/docs and then we just wait for huggingface/huggingface_hub#373 to get merged

@lhoestq lhoestq marked this pull request as ready for review October 1, 2021 14:39
@lhoestq
Copy link
Member Author

lhoestq commented Oct 1, 2021

When there's a new release of huggingface_hub (probably on monday), it will fix the CI.

The PR is ready for review. Let me know if I need to clarify some parts

@lhoestq
Copy link
Member Author

lhoestq commented Oct 1, 2021

One additional change I did: the tests won't affect the number of downloads on the website anymore. And users can choose to not update the number of downloads with HF_UPDATE_DOWNLOAD_COUNTS=0

@lhoestq
Copy link
Member Author

lhoestq commented Oct 5, 2021

CI failures are simply due to RAM issues with circleci workers.
And on windows there is an issue with installing ruamel.yaml from the bump of huggingface_hub (fixed on master)

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this contribution! This was so necessary.

This PR makes a clear distinction between all the different use cases, making use of the factory pattern instead of the previous heavily nested if conditions.

Some comments/questions below.

src/datasets/load.py Outdated Show resolved Hide resolved
src/datasets/load.py Outdated Show resolved Hide resolved
Comment on lines 438 to 441
logger.warning(
f"Couldn't find a directory or a dataset named '{self.name}'. "
f"It was picked from the master branch on github instead at {file_path}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't log this warning from within the download_dataset_script_from_master() method because:

  • This implies this method is always dependent on some other method previously called. I would move this warning one level up, from within its caller (as the caller is responsible of calling the other method and this one).
  • The warning is already present in the caller get_module() (see line 465)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that once you remove the warning, both methods have the same implementation with different revision parameter values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch !

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged both methods, and only kept the warning one level up.

Comment on lines 431 to 435
def download_dataset_script(self) -> str:
file_path = hf_github_url(path=self.name, name=self.name + ".py", revision=self.revision)
return cached_path(file_path, download_config=self.download_config)

def download_dataset_script_from_master(self) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following with the philosophy of the factory design pattern (a common interface with different implementations), I would call these 2 methods:

  • download_script or download_loading_script
  • download_script_from_master or download_loading_script_from_master
    analogously to the method get_module (we don't call it get_dataset_module).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to download_loading_script, and removed the method "from_master" (see previous comment about merging the two methods)

src/datasets/load.py Outdated Show resolved Hide resolved
src/datasets/load.py Outdated Show resolved Hide resolved
src/datasets/load.py Outdated Show resolved Hide resolved
Comment on lines +22 to +23
class Url(str):
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it has no further implementation, maybe a type alias instead?

Suggested change
class Url(str):
pass
Url = str

This way any str passed qualifies as Url.

Copy link
Member Author

@lhoestq lhoestq Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it useful for the tests to be able to do isinstance(url, Url) and check that it's not just a str type

src/datasets/data_files.py Outdated Show resolved Hide resolved
src/datasets/data_files.py Outdated Show resolved Hide resolved
Comment on lines 479 to 493
dynamic_modules_path = self.dynamic_modules_path if self.dynamic_modules_path else init_dynamic_modules()
importable_directory_path = os.path.join(dynamic_modules_path, "datasets", self.name)
Path(importable_directory_path).mkdir(parents=True, exist_ok=True)
(Path(importable_directory_path).parent / "__init__.py").touch(exist_ok=True)
hash = files_to_hash([local_path] + [loc[1] for loc in local_imports])
importable_local_file = _copy_script_and_other_resouces_in_importable_dir(
name=self.name,
importable_directory_path=importable_directory_path,
subdirrectory_name=hash,
original_local_path=local_path,
local_imports=local_imports,
additional_files=additional_files,
download_mode=self.download_mode,
)
logger.debug(f"Created importable dataset file at {importable_local_file}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A method with this code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put all this into a function, common to all the ones that new to create an importable script

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading allenai/c4 in streaming mode does too many HEAD requests
3 participants