Fix URLs for WikiAuto Manual, jeopardy and definite_pronoun_resolution #3266

Merged
merged 7 commits into from
Dec 6, 2021

Contributor

@LashaO LashaO commented Nov 13, 2021

@LashaO
Contributor Author

LashaO commented Nov 14, 2021

There seem to be problems with the datasets' metadata, which I don't have access to. I think one of the datasets is from Reddit. Can anyone help?

@slyviacassell

Hello @LashaO, I think the errors were caused by _DATA_FILES in definite_pronoun_resolution.py. Here are the details of the test error:

self = BuilderConfig(name='plain_text', version=1.0.0, data_dir=None, data_files={'train': 'train.c.txt', 'test': 'test.c.txt'}, description='Plain text import of the Definite Pronoun Resolution Dataset.')

    def __post_init__(self):
        # The config name is used to name the cache directory.
        invalid_windows_characters = r"<>:/\|?*"
        for invalid_char in invalid_windows_characters:
            if invalid_char in self.name:
                raise InvalidConfigName(
                    f"Bad characters from black list '{invalid_windows_characters}' found in '{self.name}'. "
                    f"They could create issues when creating a directory for this config on Windows filesystem."
                )
        if self.data_files is not None and not isinstance(self.data_files, DataFilesDict):
>           raise ValueError(f"Expected a DataFilesDict in data_files but got {self.data_files}")
E           ValueError: Expected a DataFilesDict in data_files but got {'train': 'train.c.txt', 'test': 'test.c.txt'}
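
As a rough illustration of why this check fails, the sketch below reproduces the same `isinstance` pattern with stand-in classes. `DataFilesDictStandIn`, `ConfigStandIn`, and `builds_ok` are hypothetical names made up for this example; they are not the library's actual classes, and only the shape of the validation is taken from the traceback above.

```python
from dataclasses import dataclass
from typing import Optional


class DataFilesDictStandIn(dict):
    """Hypothetical stand-in for the library's DataFilesDict type."""


@dataclass
class ConfigStandIn:
    """Hypothetical stand-in for BuilderConfig, keeping only the data_files check."""

    name: str
    data_files: Optional[dict] = None

    def __post_init__(self):
        # Same shape as the check in the traceback: a plain dict is rejected,
        # even if its keys and values look reasonable.
        if self.data_files is not None and not isinstance(self.data_files, DataFilesDictStandIn):
            raise ValueError(f"Expected a DataFilesDict in data_files but got {self.data_files}")


def builds_ok(data_files):
    """Return True if a config can be built with the given data_files value."""
    try:
        ConfigStandIn(name="plain_text", data_files=data_files)
        return True
    except ValueError:
        return False
```

In this sketch, passing the plain dict `{'train': 'train.c.txt', 'test': 'test.c.txt'}` is rejected, while `None` or an instance of the stand-in type is accepted, which matches the behavior reported in the test error.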

@lhoestq
Member

lhoestq commented Nov 29, 2021

Hi ! Thanks for the fixes :)

Instead of uploading the definite_pronoun_resolution data files in this PR, maybe we can just update the URL ?
The old url was http://www.hlt.utdallas.edu/~vince/data/emnlp12/train.c.txt, but now it's https://www.hlt.utdallas.edu/~vince/data/emnlp12/train.c.txt (https instead of http)
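
A minimal sketch of the scheme change described here, using only the standard library. The helper name `to_https` is an assumption for illustration and is not part of the dataset script; the sketch only rewrites the URL string and does not check that the server is reachable or that its certificate is valid.

```python
from urllib.parse import urlsplit


def to_https(url: str) -> str:
    """Return the URL with its scheme replaced by https (hypothetical helper)."""
    parts = urlsplit(url)
    # SplitResult is a named tuple, so _replace swaps the scheme and keeps
    # the host, path, query, and fragment untouched.
    return parts._replace(scheme="https").geturl()


old_url = "http://www.hlt.utdallas.edu/~vince/data/emnlp12/train.c.txt"
new_url = to_https(old_url)
```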

@lhoestq
Member

lhoestq commented Nov 29, 2021

Actually the bad certificate creates an issue with the download

import datasets
datasets.DownloadManager().download("https://www.hlt.utdallas.edu/~vince/data/emnlp12/train.c.txt")
# raises: ConnectionError: Couldn't reach https://www.hlt.utdallas.edu/~vince/data/emnlp12/train.c.txt

Let me see if I can fix that

@lhoestq
Member

lhoestq commented Nov 29, 2021

I uploaded them to these URLs, feel free to use them instead of having the text files here in the PR :)
https://s3.amazonaws.com/datasets.huggingface.co/definite_pronoun_resolution/train.c.txt
https://s3.amazonaws.com/datasets.huggingface.co/definite_pronoun_resolution/test.c.txt

@LashaO
Contributor Author

LashaO commented Dec 2, 2021

Thank you for the tips! I'm having a busy week, so anyone willing to commit the suggestions is welcome. Otherwise, I will try to get back to this in a while.

@mariosasko
Collaborator

@LashaO Thanks for working on this. Yes, I'll take over as we already have a request to fix the URL of the Jeopardy! dataset in a separate issue.

@mariosasko
Collaborator

mariosasko commented Dec 3, 2021

Still have to fix the error in the dummy data test of the WikiAuto dataset (so please don't merge). Done! Ready for merging.

@mariosasko mariosasko linked an issue Dec 3, 2021 that may be closed by this pull request
@LashaO
Contributor Author

LashaO commented Dec 4, 2021

Thank you, Mario!

Member

@lhoestq lhoestq left a comment

Thanks for the updates !

@lhoestq
Member

lhoestq commented Dec 6, 2021

The CI failure is only related to missing tags in the dataset cards, merging :)

@lhoestq lhoestq changed the title fix-3264-change-download-urls Fix URLs for WikiAuto Manual, jeopardy and definite_pronoun_resolution Dec 6, 2021
@lhoestq lhoestq merged commit 127746c into huggingface:master Dec 6, 2021
Development

Successfully merging this pull request may close these issues.

Jeopardy _URL access denied