Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update link in wiki_bio dataset #3651

Merged
merged 3 commits into from
Jan 31, 2022
Merged

Conversation

jxmorris12
Copy link
Contributor

Fixes #3580 and makes the wiki_bio dataset work again. I changed the link and some documentation, and all the tests pass. Thanks @lhoestq for uploading the dataset to the HuggingFace data bucket.

@lhoestq -- all the tests pass, but I'm still not able to import the dataset, as the old Google Drive link is cached somewhere:

>>> from datasets import load_dataset
load_dataset("wiki_bio>>> load_dataset("wiki_bio")
Using custom data configuration default
Downloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.1.0/5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9...
Traceback (most recent call last):
  ...
  File "/home/jxm3/random/datasets/src/datasets/utils/file_utils.py", line 612, in get_from_cache
    raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://drive.google.com/uc?export=download&id=1L7aoUXzHPzyzQ0ns4ApBbYepsjFOtXil

what do I have to do to invalidate the cache and actually import the dataset? It's clearly set up correctly, since the data is downloaded and processed by the tests.

As an aside, this caching-loading-scripts behavior makes for a really bad developer experience. I just wasted an hour trying to figure out where the caching was happening and how to disable it, and I don't know. All I wanted to do was update the link and submit a pull request! I recommend that you all either change this behavior (i.e. updating the link to a dataset should "just work") or document it, since I couldn't find any information about this in the contributing.md or readme or anywhere else! Thanks!

@lhoestq
Copy link
Member

lhoestq commented Jan 31, 2022

all the tests pass, but I'm still not able to import the dataset

Since it's not merged on master yet, you have to provide the path to your local wiki_bio.py to use it.
Indeed the library downloads the dataset files from master if you have a dev installation of the library.

I agree it would be nice to change that, and use the local dataset scripts from the datasets directory - it feels definitely more natural.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, thanks a lot for fixing the dataset !

@lhoestq lhoestq merged commit ffc35f4 into huggingface:master Jan 31, 2022
@jxmorris12
Copy link
Contributor Author

Cool, thanks for your help and I agree!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in wiki bio load
2 participants