Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make inspect.get_dataset_config_names always return a non-empty list #3159

Merged
merged 17 commits into from
Oct 28, 2021

Conversation

albertvillanova
Copy link
Member

Make all named configs cases, so that no special unnamed config case needs to be handled differently.

Fix #3135.

@albertvillanova albertvillanova marked this pull request as ready for review October 26, 2021 13:50
@albertvillanova
Copy link
Member Author

albertvillanova commented Oct 26, 2021

This PR is already working (although not very beautiful; see below): the idea was to have the DatasetModule.builder_kwargs accessible from the builder_cls, so that this can generate the default builder config (at the class level, without requiring the builder to be instantiated).

I have a plan for a follow-up refactoring (same functionality, better implementation, much nicer), but I think we could already merge this, so that @severo can test it in the datasets previewer and report any potential issues.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure that storing the kwargs in a class attribute is a good idea.
Indeed the kwargs depend on how the builder is loaded, and therefore you could overwrite another builder's kwargs:

from datasets import *                          

b1 = load_dataset_builder("text", data_files="setup.py")                                        
print(b1.kwargs["data_files"])
# {'train': [PosixPath('/Users/quentinlhoest/Desktop/hf/datasets/setup.py')]}

b2 = load_dataset_builder("text", data_files="setup.cfg")                                       
print(b2.kwargs["data_files"])
# {'train': [PosixPath('/Users/quentinlhoest/Desktop/hf/datasets/setup.cfg')]}

print(b1.kwargs["data_files"])
# {'train': [PosixPath('/Users/quentinlhoest/Desktop/hf/datasets/setup.cfg')]}

The issue can also happen for community datasets without scripts, and so it can break the dataset viewer

Though you mentioned that you have a better implementation in mind so that's great :)

Here are my thoughts if it can help:

The default config name doesn't depend on the builder class alone - but also how it is imported. For example the name of the configuration of lhoestq/demo1 is lhoestq___demo1, and the builder class is the Csv builder.

Therefore I would let the responsibility of the default config name to the load.py file, and not add this complexity to the builder class logic.

In particular I think you can actually get the default config name with dataset_module.builder_kwargs.get("name", "default") if builder_cls.builder_configs is empty.

@albertvillanova
Copy link
Member Author

albertvillanova commented Oct 26, 2021

Yes @lhoestq you are completely right. Indeed I was exclusively using builder_cls.kwargs to get the community dataset name (nothing else): "lhoestq___demo1"

See et: https://github.com/huggingface/datasets/pull/3159/files#diff-f933ce41f71c6c0d1ce658e27de62cbe0b45d777e9e68056dd012ac3eb9324f7R413-R415

In your example, the name I was getting from builder_cls.kwargs was:

{"name": "lhoestq___demo1",...}

I'm going to refactor all the approach... as I only need the name for this specific case ;)

@albertvillanova
Copy link
Member Author

I think this makes more sense now, @lhoestq @severo 😅

@albertvillanova albertvillanova changed the title Standardize access to dataset configs Make inspect.get_dataset_config_names always return a non-empty list Oct 28, 2021
Copy link
Member Author

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, there is an incident on the CircleCI side: VMs failing to be created.

@albertvillanova albertvillanova merged commit dfdd2f9 into master Oct 28, 2021
@albertvillanova albertvillanova deleted the fix-3135 branch October 28, 2021 05:44
@severo
Copy link
Collaborator

severo commented Oct 29, 2021

It works well, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make inspect.get_dataset_config_names always return a non-empty list of configs
3 participants