Support for multiple configs in packaged modules via metadata yaml info #5331
Merged: polinaeterna merged 223 commits into huggingface:main from polinaeterna:arbitrary-config-parameters-in-meta-yaml on Jul 13, 2023
Conversation
The documentation is not available anymore as the PR was closed or merged.
…e pass glob pattern
…:polinaeterna/datasets into arbitrary-config-parameters-in-meta-yaml
save it until better times
because otherwise the error says that there are no data files in the repository, which is misleading
This reverts commit b3a963b.
lhoestq
approved these changes
Jul 13, 2023
Approving again :)
Good job!
Benchmarks (PyArrow==8.0.0), updated: benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json
This was referenced Jul 21, 2023
Will solve #5209 and #5151 and many others...
Config parameters for packaged builders are parsed from the “builder_config” field in the README.md file (a separate first-level field, not part of “dataset_info”), example. I tried to align packaged builders with custom configs parsed from metadata with script dataset builders as much as possible. Their builders are created dynamically (see configure_builder_class() in load.py) and have the BUILDER_CONFIGS attribute filled with BuilderConfig objects, in the same way as for datasets with a script.

load_dataset
If a single config_name is provided, it becomes the default one (loaded when no config_name is specified), example. Thanks to @lhoestq and this change, it's possible to add a "default" field in the yaml and set it to True to make that config the default one (loaded when no config is specified): then load_dataset("ds") == load_dataset("ds", "v1").
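As an illustration of the metadata shape described above, a README.md yaml block could look roughly like this. The config names, splits, and paths are made up, and the exact field schema is the one this PR defines, so treat it as a sketch rather than a spec:

```yaml
builder_config:
- config_name: v1
  default: true        # loaded when no config name is passed to load_dataset
  data_files:
  - split: train
    path: v1/train/*
- config_name: v2
  data_files:
  - split: train
    path: v2/train/*
```

With metadata like this, load_dataset("ds") would resolve to the "v1" config, i.e. load_dataset("ds") == load_dataset("ds", "v1").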
dataset_name and cache
I decided that it's reasonable to add a dataset_name attribute to the DatasetBuilder class, which would be equal to name for script datasets but reflect the real dataset name for packaged builders (the last part of the path/name from the Hub). This is mostly to reorganize the cache structure (I believe we can do this in the major release?) because otherwise, with custom configs for packaged builders all stored in the same directory, it was becoming a mess. And in general it makes much more sense like this, from the datasets server perspective too, though it's a breaking change.

So the cache dir has the following structure: "{namespace__}<dataset_name>/<config_name>/<version>/<hash>/", and arrow/parquet filenames are "<dataset_name>-<split>.arrow". For example: polinaeterna___audiofolder_two_configs_in_metadata/v1-5532fac9443ea252/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc/ for the polinaeterna/audiofolder_two_configs_in_metadata Hub dataset; the train arrow file is audiofolder_two_configs_in_metadata-train.arrow. For script datasets it remains unchanged.
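A minimal sketch of the cache layout above. The helper name is hypothetical (not part of the library); it just assembles the path components described:

```python
from typing import Optional

def cache_dir_path(dataset_name: str, config_name: str, version: str,
                   config_hash: str, namespace: Optional[str] = None) -> str:
    """Build a cache directory path of the form
    {namespace___}<dataset_name>/<config_name>/<version>/<hash>/"""
    # The Hub namespace is joined to the dataset name with a triple
    # underscore, matching the example path in the description.
    prefix = f"{namespace}___" if namespace else ""
    return f"{prefix}{dataset_name}/{config_name}/{version}/{config_hash}/"

print(cache_dir_path(
    "audiofolder_two_configs_in_metadata",
    "v1-5532fac9443ea252",
    "0.0.0",
    "6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc",
    namespace="polinaeterna",
))
```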
push_to_hub
To support custom configs with push_to_hub, the data is put under a directory named either <config_name>, if config_name is not "default", or "data", if config_name is omitted or equals "default" (for backward compatibility). A "builder_config" field is added to README.md, with config_name (optional) and data_files fields. For "data_files", a "pattern" parameter is introduced to resolve data files correctly, see polinaeterna#1.

ds.push_to_hub("ds") --> one config ("default"), put under the "data" directory, example
ds.push_to_hub("ds", "custom") --> put under the "custom" directory, example

Thanks to @lhoestq and polinaeterna#1, when pushing to datasets created before this change, README.md is updated accordingly (a config for the old data is added along with the one being pushed). The "dataset_info" yaml field is also updated (new configs are added). This shouldn't break anything!
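The directory-naming rule above can be sketched as follows (a hypothetical helper for illustration, not the library's actual API):

```python
from typing import Optional

def push_dir_for_config(config_name: Optional[str] = None) -> str:
    """Pick the repo directory that push_to_hub writes data files into."""
    # An omitted or "default" config name maps to the legacy "data"
    # directory for backward compatibility with repos pushed before this
    # change; any other config name gets its own directory.
    if config_name is None or config_name == "default":
        return "data"
    return config_name

print(push_dir_for_config())          # ds.push_to_hub("ds")           -> data
print(push_dir_for_config("custom"))  # ds.push_to_hub("ds", "custom") -> custom
```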
TODO in separate PRs:
- builder_config in readme