Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add long answer candidates to natural questions dataset #4368

Merged
merged 14 commits into from
Jul 26, 2022

Conversation

seirasto
Copy link
Contributor

This is a modification of the Natural Questions dataset to include missing information specifically related to long answer candidates. (See here: https://github.com/google-research-datasets/natural-questions#long-answer-candidates). This information is important to ensure consistent comparison with prior work. It does not disturb the rest of the format . @lhoestq @albertvillanova

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 24, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @seirasto for your contribution. Definitely, the addition of this field will be useful for all community members using this dataset.

Just some comments below.

In relation with the non-passing tests, please note that once the changes are validated, we should pre-process all this dataset and upload it to our HuggingFace cloud.

datasets/natural_questions/README.md Outdated Show resolved Hide resolved
add extra space to long_answer_candidates features
@albertvillanova
Copy link
Member

Once we have added long_answer_candidates maybe it would be worth to also add the missing candidate_index (inside long_answer). What do you think, @seirasto ?

@albertvillanova
Copy link
Member

Also note the "Data Fields" section in the README is missing the long_answer field.

Moreover, there is no instance example in "Data Instances" section.

@albertvillanova
Copy link
Member

We could either make these fixes in this PR or in a subsequent PR.

@seirasto
Copy link
Contributor Author

seirasto commented Jun 1, 2022

@albertvillanova I've added the missing fields and updated the README to include a data instance and some other things.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

I think the script is already OK, so that we can start pre-processing the entire dataset: I'm addressing this.

In relation with the dataset card (README file), I think there is an issue: the fields in both "Data Instances" and "Data Fields" section should be aligned (naming and nesting) with the fields we return in the script:

{
"id": id_,
"document": {
"title": ex_json["document_title"],
"url": ex_json["document_url"],
"html": ex_json["document_html"],
"tokens": [
{"token": t["token"], "is_html": t["html_token"], "start_byte": t["start_byte"], "end_byte": t["end_byte"]} for t in ex_json["document_tokens"]
],
},
"question": {"text": ex_json["question_text"], "tokens": ex_json["question_tokens"]},
"long_answer_candidates": [lac_json for lac_json in ex_json["long_answer_candidates"]],
"annotations": [_parse_annotation(an_json) for an_json in ex_json["annotations"]],
},

@seirasto
Copy link
Contributor Author

seirasto commented Jun 1, 2022

Great! I've made the updates to align the README. Please let me know if I missed anything.

Fix:
- Rename "example_id" to "id"
- Set value of "id" as str
- Move "question" below "document"
- Move "document">"title" above "document">"url"
- Set value of "document">"title" as str
- Rename "document">"document_tokens" to "document">"tokens"
- Rename "Token" to "token"
- Add missing closing `}` to value of "document"
- Add missing "id" inside each "annotations"
- Add missing "text" inside each "short_answer"
- Set value of "yes_no_answer" to corresponding int
@albertvillanova
Copy link
Member

As there were many minor little fixes, I thought it would be easier to fix them directly.

@albertvillanova
Copy link
Member

I think the loading script is OK now. If it is also validated by another datasets maintainer, I could run the generation of the pre-processed data and then merge this PR into master (once all the tests are green).

CC: @lhoestq

@lhoestq
Copy link
Member

lhoestq commented Jun 9, 2022

It looks good to me, thanks @seirasto !

@albertvillanova
Copy link
Member

I have merged the master branch, so that we include all the fixes on Apache Beam + Google Dataflow.

@albertvillanova
Copy link
Member

Pre-processing is running!

Already finished for "dev" config:

In [2]: ds = load_dataset("datasets/natural_questions", "dev")

In [3]: ds
Out[3]: 
DatasetDict({
    validation: Dataset({
        features: ['id', 'document', 'question', 'long_answer_candidates', 'annotations'],
        num_rows: 7830
    })
})

@albertvillanova
Copy link
Member

albertvillanova commented Jun 20, 2022

There is an issue while running the preprocessing for the "default" (train+dev) config. Train data files are larger than than dev ones and workers run out of memory.

I'm opening a separate issue to handle this problem: #4525

@albertvillanova
Copy link
Member

@seirasto is proposing uploading their preprocessed data files to our Datasets bucket.

I think @lhoestq can give a more informed answer about authentication requirements.

@lhoestq
Copy link
Member

lhoestq commented Jul 26, 2022

Now that the data fiels are uploaded, can you merge the main branch into yours to re-trigger the CI @seirasto please ? :) Then I think we can merge if it's good for you @albertvillanova

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, once CI is green! Finally we can merge this PR... 😅

@seirasto
Copy link
Contributor Author

Merge is done! I think someone needs to approve the CI to run :)

@lhoestq
Copy link
Member

lhoestq commented Jul 26, 2022

Can you run make style to fix the code formatting required by the CI please ?

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After running make style, these are the style fixes (see below).

datasets/natural_questions/natural_questions.py Outdated Show resolved Hide resolved
datasets/natural_questions/natural_questions.py Outdated Show resolved Hide resolved
datasets/natural_questions/natural_questions.py Outdated Show resolved Hide resolved
datasets/natural_questions/natural_questions.py Outdated Show resolved Hide resolved
seirasto and others added 4 commits July 26, 2022 12:14
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
@seirasto
Copy link
Contributor Author

Thanks @albertvillanova! I've committed all your suggestions.

@albertvillanova
Copy link
Member

The CI is green. I'm merging this PR.

@albertvillanova albertvillanova merged commit f5847a3 into huggingface:main Jul 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants