Refac module factory + avoid etag requests for hub datasets #2986

lhoestq · 2021-09-29T10:42:00Z

Refactor the module factory

When trying to extend the data_files logic to avoid doing unnecessary ETag requests, I noticed that the module preparation mechanism needed a refactor:

the function was 600 lines long
it was not readable
it contained many different cases that made it complex to maintain
it was hard to properly test it
it was hard to extend without breaking anything

The module preparation mechanism is in charge of taking the name of a dataset or a metric given by the user (e.g. "squad", "accuracy", "lhoestq/test", "path/to/my/script.py", "path/to/my/data/directory", "json", "csv") and returning a module (possibly downloaded from the Hub) that contains the dataset builder or the metric class to use.

Implementation details

I decided to separate all these use cases into different dataset/metric module factories.

First, the metric module factories:

CanonicalMetricModuleFactory: "accuracy", "rouge", ...
LocalMetricModuleFactory: "path/to/my/metric.py"

Then, the dataset module factories:

CanonicalDatasetModuleFactory: "squad", "glue", ...
CommunityDatasetModuleFactoryWithScript: "lhoestq/test"
CommunityDatasetModuleFactoryWithoutScript: "lhoestq/demo1"
PackagedDatasetModuleFactory: "json", "csv", ...
LocalDatasetModuleFactoryWithScript: "path/to/my/script.py"
LocalDatasetModuleFactoryWithoutScript: "path/to/my/data/directory"

And finally, additional factories when users have no internet:

CachedDatasetModuleFactory
CachedMetricModuleFactory
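
As a rough illustration of how one user-supplied name could be routed to one of these factories, here is a minimal sketch. The class names come from the lists above, except `CommunityDatasetModuleFactory`, which is a hypothetical merge of the two community factories (telling them apart needs a Hub request). `ModuleFactory`, `pick_dataset_factory`, `PACKAGED_MODULES`, and the resolution order are all assumptions for illustration, not the code in this PR.

```python
import os
from dataclasses import dataclass


@dataclass
class ModuleFactory:
    """Hypothetical common interface shared by every factory."""

    name: str

    def get_module(self) -> str:
        # A real factory would prepare an importable module; this sketch
        # only reports which factory handled the name.
        return f"{type(self).__name__}({self.name!r})"


class PackagedDatasetModuleFactory(ModuleFactory):
    pass


class LocalDatasetModuleFactoryWithScript(ModuleFactory):
    pass


class LocalDatasetModuleFactoryWithoutScript(ModuleFactory):
    pass


class CanonicalDatasetModuleFactory(ModuleFactory):
    pass


class CommunityDatasetModuleFactory(ModuleFactory):
    # Hypothetical: stands in for both the WithScript and WithoutScript variants.
    pass


PACKAGED_MODULES = {"json", "csv", "text"}  # illustrative subset, an assumption


def pick_dataset_factory(name: str) -> ModuleFactory:
    """Pick a factory from the shape of the name (assumed resolution order)."""
    if name in PACKAGED_MODULES:
        return PackagedDatasetModuleFactory(name)
    if name.endswith(".py") and os.path.isfile(name):
        return LocalDatasetModuleFactoryWithScript(name)
    if os.path.isdir(name):
        return LocalDatasetModuleFactoryWithoutScript(name)
    if name.count("/") == 1:
        return CommunityDatasetModuleFactory(name)
    if "/" not in name:
        return CanonicalDatasetModuleFactory(name)
    raise FileNotFoundError(f"Couldn't find a dataset for {name!r}")
```

Each branch is a small class that can be tested on its own, which is the maintainability gain the refactor is after.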

Breaking changes

I still don't know to what extent we want to keep backward compatibility for prepare_module. For now I kept it just in case (except that I removed two parameters), but it's not used anywhere anymore.

Avoid etag requests for hub datasets

To do this I added a class DataFilesDict that can be hashed to define the cache directory of the dataset.
It contains the usual data files, formatted as {"train": ["train.txt"]} for example.
But each list of files is a DataFilesList that also has an origin_metadata attribute containing metadata about the origin of each file:

for URLs: it stores the ETags of the files
for local files: it stores the last modification date
for files from a Hugging Face repository on the Hub: it stores the pattern (*, *.csv, "train.txt", etc.) and the commit sha of the repository (so there are no ETag requests!)

This way, if any file changes, the hash of the DataFilesDict changes too!

You can instantiate a DataFilesDict by using patterns for local/remote files or files in a HF repository:

for local/remote files: DataFilesDict.from_local_or_remote(patterns)
for files in a HF repository: DataFilesDict.from_hf_repo(patterns, dataset_info)
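
The idea of hashing the data files together with their origin metadata can be sketched as follows. This is a simplified illustration under stated assumptions, not the classes added in this PR: the constructor signature, the `fingerprint` method name, and the use of SHA-256 over JSON are all hypothetical.

```python
import hashlib
import json


class DataFilesList(list):
    """A list of data files that also remembers where each file came from."""

    def __init__(self, data_files, origin_metadata):
        super().__init__(data_files)
        # One tuple per file, e.g. (etag,), (mtime,) or (pattern, commit_sha).
        self.origin_metadata = list(origin_metadata)


class DataFilesDict(dict):
    """Maps split names to DataFilesList objects and yields a stable hash."""

    def fingerprint(self) -> str:
        # Hypothetical method: serialize files and origin metadata, then hash.
        payload = {
            split: {
                "files": list(files),
                "origin": [list(meta) for meta in files.origin_metadata],
            }
            for split, files in self.items()
        }
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()


# Same file name, but the repository moved to a new commit: the hash changes.
before = DataFilesDict(train=DataFilesList(["train.txt"], [("*.txt", "commit-sha-1")]))
after = DataFilesDict(train=DataFilesList(["train.txt"], [("*.txt", "commit-sha-2")]))
assert before.fingerprint() != after.fingerprint()
```

Because the commit sha of a Hub repository is part of the hashed payload, one repository-level lookup is enough to detect changes, with no per-file ETag request.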

Fix #2859

TODO

Fix the latest test:

fix the call to dataset_info in offline mode (related to huggingface_hub#372, "Add timeout parameter to HfApi.dataset_info to work on JZ")

Add some more tests:

test all the factories
test the new data files logic

Other:

docstrings
comments

severo · 2021-09-29T11:45:15Z

I still don't know to what extent we want to keep backward compatibility for prepare_module. For now I kept it just in case (except that I removed two parameters), but it's not used anywhere anymore.

FYI, various other projects currently use it, so removing it would clearly require a major version:

https://github.com/search?q=org%3Ahuggingface+prepare_module&type=code

lhoestq · 2021-09-29T12:47:17Z

Yea so I kept prepare_module and changed it to use all the factories I added, so all the use cases in the link you shared are still working. The only two parameters I removed are minor IMO and were a bit hacky anyway (return_resolved_file_path and return_associated_base_path). I think they were only used internally in datasets but let me know if you're aware of a use case I didn't think of.

lhoestq · 2021-09-30T16:46:55Z

I think I'm done with the tests :) I'll do the comments/docs and then we just wait for huggingface/huggingface_hub#373 to get merged

lhoestq · 2021-10-01T14:42:12Z

The next release of huggingface_hub (probably on Monday) will fix the CI.

The PR is ready for review. Let me know if I need to clarify some parts.

lhoestq · 2021-10-01T14:43:25Z

One additional change I made: the tests no longer affect the number of downloads shown on the website. Users can also choose not to update the number of downloads by setting HF_UPDATE_DOWNLOAD_COUNTS=0.
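
For example, a script could opt out as in the sketch below. It assumes the library reads the variable when it is first imported, so the variable is set beforehand; that timing is an assumption, not something stated in this thread.

```python
import os

# Opt out of download-count updates. Set this before importing `datasets`,
# on the assumption that the variable is read at import time.
os.environ["HF_UPDATE_DOWNLOAD_COUNTS"] = "0"
```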

lhoestq · 2021-10-05T10:40:08Z

The CI failures are simply due to RAM issues with the CircleCI workers.
On Windows there is also an issue with installing ruamel.yaml caused by the huggingface_hub bump (fixed on master).

albertvillanova

Thanks a lot for this contribution! This was so necessary.

This PR makes a clear distinction between all the different use cases, making use of the factory pattern instead of the previous heavily nested if conditions.

Some comments/questions below.

albertvillanova · 2021-10-07T08:52:20Z

src/datasets/load.py

+        logger.warning(
+            f"Couldn't find a directory or a dataset named '{self.name}'. "
+            f"It was picked from the master branch on github instead at {file_path}"
+        )


I wouldn't log this warning from within the download_dataset_script_from_master() method because:

This implies that the method always depends on some other method having been called first. I would move this warning one level up, into its caller (the caller is responsible for calling both the other method and this one).

The warning is already present in the caller get_module() (see line 465)

Also note that once you remove the warning, both methods have the same implementation with different revision parameter values.

lhoestq

Good catch!

I merged both methods and kept the warning only one level up, in the caller.
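
The outcome of this thread (a single download method parameterized by revision, with the fallback warning emitted by the caller) could look roughly like the sketch below. `ScriptFactorySketch`, the injected `fetch` callable, `fake_fetch`, and the URL layout are hypothetical stand-ins for hf_github_url and cached_path; this is not the merged code.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class ScriptFactorySketch:
    def __init__(self, name: str, revision: Optional[str], fetch: Callable[[str], str]):
        self.name = name
        self.revision = revision
        self._fetch = fetch  # hypothetical stand-in for URL building + caching

    def download_loading_script(self, revision: Optional[str]) -> str:
        # One method: only the revision differed between the two originals.
        url = f"{revision}/{self.name}/{self.name}.py"
        return self._fetch(url)

    def get_module(self) -> str:
        try:
            return self.download_loading_script(self.revision)
        except FileNotFoundError:
            local_path = self.download_loading_script("master")
            # The caller decides to fall back, so the caller logs the warning.
            logger.warning(
                f"Couldn't find a directory or a dataset named '{self.name}'. "
                f"It was picked from the master branch on github instead at {local_path}"
            )
            return local_path


def fake_fetch(url: str) -> str:
    """Test double: only the master branch has the script."""
    if not url.startswith("master/"):
        raise FileNotFoundError(url)
    return "/cache/" + url


factory = ScriptFactorySketch("squad", "1.0.0", fake_fetch)
assert factory.get_module() == "/cache/master/squad/squad.py"
```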

albertvillanova · 2021-10-07T08:58:56Z

src/datasets/load.py

+    def download_dataset_script(self) -> str:
+        file_path = hf_github_url(path=self.name, name=self.name + ".py", revision=self.revision)
+        return cached_path(file_path, download_config=self.download_config)
+
+    def download_dataset_script_from_master(self) -> str:


Following the philosophy of the factory design pattern (a common interface with different implementations), I would name these two methods:

download_script or download_loading_script
download_script_from_master or download_loading_script_from_master

This is analogous to the method get_module (we don't call it get_dataset_module).

lhoestq

I renamed it to download_loading_script and removed the "from_master" method (see the previous comment about merging the two methods).

albertvillanova · 2021-10-07T11:24:07Z

src/datasets/data_files.py

+class Url(str):
+    pass


As it has no further implementation, maybe a type alias instead?

Suggested change

-class Url(str):
-    pass
+Url = str

This way any str passed qualifies as Url.

lhoestq

I found it useful for the tests to be able to do isinstance(url, Url) and check that it's not just a plain str.
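
The trade-off discussed here is easy to see in a few lines. In this sketch `UrlAlias` is a hypothetical name used only to show the alias variant next to the subclass:

```python
class Url(str):
    """Marker subclass: behaves like str but is distinguishable at runtime."""


UrlAlias = str  # the suggested alternative: just another name for str

plain = "data/train.csv"
remote = Url("https://example.com/train.csv")

# The subclass is still a str everywhere a str is expected...
assert isinstance(remote, str)
# ...but tests can tell a tagged URL from an ordinary string.
assert isinstance(remote, Url)
assert not isinstance(plain, Url)
# With the alias, every str qualifies, so the check carries no information.
assert isinstance(plain, UrlAlias)
```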

albertvillanova · 2021-10-07T15:35:42Z

src/datasets/load.py

+        dynamic_modules_path = self.dynamic_modules_path if self.dynamic_modules_path else init_dynamic_modules()
+        importable_directory_path = os.path.join(dynamic_modules_path, "datasets", self.name)
+        Path(importable_directory_path).mkdir(parents=True, exist_ok=True)
+        (Path(importable_directory_path).parent / "__init__.py").touch(exist_ok=True)
+        hash = files_to_hash([local_path] + [loc[1] for loc in local_imports])
+        importable_local_file = _copy_script_and_other_resouces_in_importable_dir(
+            name=self.name,
+            importable_directory_path=importable_directory_path,
+            subdirrectory_name=hash,
+            original_local_path=local_path,
+            local_imports=local_imports,
+            additional_files=additional_files,
+            download_mode=self.download_mode,
+        )
+        logger.debug(f"Created importable dataset file at {importable_local_file}")


Maybe extract this code into a method?

lhoestq

I put all this into a function, shared by all the factories that need to create an importable script.

lhoestq added 2 commits September 28, 2021 23:32

refac module factory + avoid etag requests for hub datasets

b670060

fix tests

14559bc

lhoestq mentioned this pull request Sep 29, 2021

Take namespace into account in caching #2938

Merged

lhoestq added 3 commits September 29, 2021 15:42

typing

ebd95ea

Merge branch 'master' into refac-dataset-builder-preparation

99789ee

fixes

a08ac4a

lhoestq mentioned this pull request Sep 29, 2021

Add timeout to dataset_info huggingface/huggingface_hub#373

Merged

lhoestq added 7 commits September 29, 2021 17:20

prepare timeout

0bf47d1

fix offline simulator with hugginggace_hub

dfa64b0

add module factory tests (1/N)

c51f6ac

add module factory test (2/N)

31ce562

add data files tests (1/N)

08cc3b4

add data fiels tests (2/N)

d29f2e6

add data files tests (3/N)

6bb3430

lhoestq added 3 commits September 30, 2021 18:47

style

f20dcc4

docstrings

d60d6f9

don't update counts when running tests

9d03ac2

lhoestq marked this pull request as ready for review October 1, 2021 14:39

lhoestq mentioned this pull request Oct 4, 2021

feat: 🎸 add a function to get a dataset config's split names #2906

Merged

2 tasks

lhoestq added 5 commits October 5, 2021 10:39

Merge branch 'master' into refac-dataset-builder-preparation

e16aa37

nump huggingface_hub

a4c0504

add timeouts for offline mode

adda857

minor

2779164

minor bis

1589711

lhoestq added 2 commits October 5, 2021 12:43

install ruamel-yaml properly in the CI for windows

3a3296c

Merge branch 'master' into refac-dataset-builder-preparation

8e04027

lhoestq mentioned this pull request Oct 5, 2021

Properly install ruamel-yaml for windows CI #3028

Merged

lhoestq added 7 commits October 5, 2021 19:55

Merge branch 'master' into refac-dataset-builder-preparation

14b88ce

Merge branch 'master' into refac-dataset-builder-preparation

feb449c

fix windows test

6f34511

style

e259cc9

fix comet intensive calls patcher

dfcaa0c

warning message when loading from the master branch

5e016df

style

87e2ab7

lhoestq mentioned this pull request Oct 6, 2021

Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041

Merged

albertvillanova reviewed Oct 7, 2021

lhoestq added 3 commits October 8, 2021 15:07

albert's comments

243ac56

remove unnecessary check

3f32b80

don't use master if HF_SCRIPTS_VERSION is specified

9fc18b9

lhoestq requested a review from albertvillanova October 8, 2021 13:38

lhoestq merged commit d86c7fb into master Oct 11, 2021

lhoestq deleted the refac-dataset-builder-preparation branch October 11, 2021 11:05

This was referenced Oct 12, 2021

Fix test command after refac #3065

Merged

Minor refactor prepare_module #2314

Closed

albertvillanova mentioned this pull request Oct 14, 2021

Fix loading a metric with internal import #3077

Merged

mariosasko mentioned this pull request May 23, 2022

inspect functions can't fetch dataset script from the Hub #4348

Closed

albertvillanova mentioned this pull request May 21, 2024

Remove dead code for non-dict data_files from packaged modules #6911

Merged
