Update link in wiki_bio dataset #3651

jxmorris12 · 2022-01-30T16:28:54Z

Fixes #3580 and makes the wiki_bio dataset work again. I changed the link and some documentation, and all the tests pass. Thanks @lhoestq for uploading the dataset to the HuggingFace data bucket.

@lhoestq -- all the tests pass, but I'm still not able to import the dataset, as the old Google Drive link is cached somewhere:

>>> from datasets import load_dataset
>>> load_dataset("wiki_bio")
Using custom data configuration default
Downloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to /home/jxm3/.cache/huggingface/datasets/wiki_bio/default/1.1.0/5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf87e911c9...
Traceback (most recent call last):
  ...
  File "/home/jxm3/random/datasets/src/datasets/utils/file_utils.py", line 612, in get_from_cache
    raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://drive.google.com/uc?export=download&id=1L7aoUXzHPzyzQ0ns4ApBbYepsjFOtXil

What do I have to do to invalidate the cache and actually import the dataset? It's clearly set up correctly, since the data is downloaded and processed by the tests.

As an aside, this caching-loading-scripts behavior makes for a really bad developer experience. I just wasted an hour trying to figure out where the caching was happening and how to disable it, and I don't know. All I wanted to do was update the link and submit a pull request! I recommend that you all either change this behavior (i.e. updating the link to a dataset should "just work") or document it, since I couldn't find any information about this in the contributing.md or readme or anywhere else! Thanks!

lhoestq · 2022-01-31T08:36:04Z

> all the tests pass, but I'm still not able to import the dataset

Since it's not merged on master yet, you have to provide the path to your local wiki_bio.py to use it.
Indeed, the library downloads the dataset files from master if you have a dev installation of the library.

I agree it would be nice to change that and use the local dataset scripts from the datasets directory; it definitely feels more natural.
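For reference, the local-path workaround described above might look roughly like the sketch below. It assumes the clone keeps the script at `datasets/wiki_bio/wiki_bio.py` relative to the repository root; the helper names `local_script_path` and `load_from_local_script` are hypothetical and exist only for illustration. The only library call used is `load_dataset`, which accepts a path to a local loading script.

```python
from pathlib import Path


def local_script_path(repo_root, dataset_name):
    """Build the path to a dataset loading script inside a local clone.

    Assumes the layout <repo_root>/datasets/<name>/<name>.py (an assumption
    about the clone, not something this helper verifies).
    """
    return Path(repo_root) / "datasets" / dataset_name / f"{dataset_name}.py"


def load_from_local_script(repo_root, dataset_name, **kwargs):
    """Load a dataset using the script from a local clone instead of master."""
    script = local_script_path(repo_root, dataset_name)
    if not script.is_file():
        raise FileNotFoundError(f"No loading script found at {script}")

    # Imported lazily so the path helper works without the library installed.
    from datasets import load_dataset

    # Passing the script path makes the library use this local file.
    return load_dataset(str(script), **kwargs)
```

With the clone location shown in the traceback, the call would be something like `load_from_local_script("/home/jxm3/random/datasets", "wiki_bio")`.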


Anyway, thanks a lot for fixing the dataset!

jxmorris12 · 2022-01-31T14:50:47Z

Cool, thanks for your help and I agree!

jxmorris12 added 3 commits January 30, 2022 11:23

update link in wiki_bio dataset

24cfb8e

run linter and update dummy data

0c96845

fix markdown so that test passes (even though I didnt break it)

663f3b7

lhoestq approved these changes Jan 31, 2022


lhoestq merged commit ffc35f4 into huggingface:master Jan 31, 2022
