Add long answer candidates to natural questions dataset #4368

seirasto · 2022-05-18T14:35:42Z

This is a modification of the Natural Questions dataset to include missing information specifically related to long answer candidates (see https://github.com/google-research-datasets/natural-questions#long-answer-candidates). This information is important to ensure consistent comparison with prior work. It does not disturb the rest of the format. @lhoestq @albertvillanova

HuggingFaceDocBuilderDev · 2022-05-24T08:24:22Z

The documentation is not available anymore as the PR was closed or merged.

albertvillanova

Thanks a lot @seirasto for your contribution. Definitely, the addition of this field will be useful for all community members using this dataset.

Just some comments below.

Regarding the non-passing tests, please note that once the changes are validated, we should pre-process the whole dataset and upload it to our HuggingFace cloud.

datasets/natural_questions/README.md

add extra space to long_answer_candidates features

albertvillanova · 2022-05-31T13:33:31Z

Once we have added long_answer_candidates, it may be worth also adding the missing candidate_index (inside long_answer). What do you think, @seirasto?
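To illustrate why candidate_index is useful next to long_answer_candidates, here is a minimal sketch. It assumes the upstream Natural Questions convention that candidate_index is a position in the candidate list and is negative when there is no long answer. The helper `resolve_long_answer` and the toy records are hypothetical and are not part of the loading script.

```python
from typing import Optional


def resolve_long_answer(long_answer: dict, candidates: list) -> Optional[dict]:
    """Return the candidate that long_answer points at, or None if there is none.

    Assumption: candidate_index indexes into the candidate list and is
    negative (e.g. -1) when the annotator gave no long answer.
    """
    idx = long_answer.get("candidate_index", -1)
    if idx < 0 or idx >= len(candidates):
        return None
    return candidates[idx]


# Toy data, made up for illustration only.
candidates = [
    {"start_token": 0, "end_token": 50, "top_level": True},
    {"start_token": 10, "end_token": 30, "top_level": False},
]
long_answer = {"start_token": 10, "end_token": 30, "candidate_index": 1}

picked = resolve_long_answer(long_answer, candidates)
no_answer = resolve_long_answer({"candidate_index": -1}, candidates)
```

With the index available, a predicted candidate can be compared with the annotated one by position instead of by matching token spans.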

albertvillanova · 2022-05-31T13:35:12Z

Also note the "Data Fields" section in the README is missing the long_answer field.

Moreover, there is no instance example in the "Data Instances" section.

albertvillanova · 2022-05-31T13:35:47Z

We could either make these fixes in this PR or in a subsequent PR.

seirasto · 2022-06-01T14:50:58Z

@albertvillanova I've added the missing fields and updated the README to include a data instance and some other things.

albertvillanova

Thank you.

I think the script is already OK, so we can start pre-processing the entire dataset: I'm addressing this.

Regarding the dataset card (README file), I think there is an issue: the fields in both the "Data Instances" and "Data Fields" sections should be aligned (naming and nesting) with the fields we return in the script:

datasets/datasets/natural_questions/natural_questions.py

Lines 187 to 200 in 7129c41

    
           { 
        
               "id": id_, 
        
               "document": { 
        
                   "title": ex_json["document_title"], 
        
                   "url": ex_json["document_url"], 
        
                   "html": ex_json["document_html"], 
        
                   "tokens": [ 
        
                       {"token": t["token"], "is_html": t["html_token"], "start_byte": t["start_byte"], "end_byte": t["end_byte"]} for t in ex_json["document_tokens"] 
        
                   ], 
        
               }, 
        
               "question": {"text": ex_json["question_text"], "tokens": ex_json["question_tokens"]}, 
        
               "long_answer_candidates": [lac_json for lac_json in ex_json["long_answer_candidates"]], 
        
               "annotations": [_parse_annotation(an_json) for an_json in ex_json["annotations"]], 
        
           },
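To make the naming and nesting concrete, here is a small sketch that applies the same mapping to a toy raw record. The wrapper `build_example`, the stand-in `parse_annotation_stub`, and the toy values are hypothetical. The real `_parse_annotation` is not reproduced in this thread, so the stub only copies its input.

```python
def parse_annotation_stub(an_json):
    # Hypothetical stand-in: the script's real _parse_annotation is not shown here.
    return dict(an_json)


def build_example(id_, ex_json):
    """Map a raw NQ-style record to the nested layout quoted above."""
    return {
        "id": id_,
        "document": {
            "title": ex_json["document_title"],
            "url": ex_json["document_url"],
            "html": ex_json["document_html"],
            "tokens": [
                {
                    "token": t["token"],
                    "is_html": t["html_token"],  # renamed from "html_token"
                    "start_byte": t["start_byte"],
                    "end_byte": t["end_byte"],
                }
                for t in ex_json["document_tokens"]
            ],
        },
        "question": {"text": ex_json["question_text"], "tokens": ex_json["question_tokens"]},
        "long_answer_candidates": list(ex_json["long_answer_candidates"]),
        "annotations": [parse_annotation_stub(a) for a in ex_json["annotations"]],
    }


# Toy raw record, made up for illustration only.
toy = {
    "document_title": "T",
    "document_url": "https://example.org/t",
    "document_html": "<p>hi</p>",
    "document_tokens": [
        {"token": "<p>", "html_token": True, "start_byte": 0, "end_byte": 3},
        {"token": "hi", "html_token": False, "start_byte": 3, "end_byte": 5},
    ],
    "question_text": "what is t",
    "question_tokens": ["what", "is", "t"],
    "long_answer_candidates": [
        {"start_byte": 0, "end_byte": 9, "start_token": 0, "end_token": 3, "top_level": True}
    ],
    "annotations": [],
}

example = build_example("1", toy)
```

The README's "Data Instances" entry should show exactly these top-level keys, in this order, with the same nested names.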

seirasto · 2022-06-01T15:59:42Z

Great! I've made the updates to align the README. Please let me know if I missed anything.

Fix:
- Rename "example_id" to "id"
- Set value of "id" as str
- Move "question" below "document"
- Move "document">"title" above "document">"url"
- Set value of "document">"title" as str
- Rename "document">"document_tokens" to "document">"tokens"
- Rename "Token" to "token"
- Add missing closing `}` to value of "document"
- Add missing "id" inside each "annotations"
- Add missing "text" inside each "short_answer"
- Set value of "yes_no_answer" to corresponding int

albertvillanova · 2022-06-01T16:57:25Z

As there were many small fixes, I thought it would be easier to make them directly.
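Mismatches of this kind (renamed or missing keys at some nesting level) can also be detected mechanically. The following is only a sketch of one possible check, not something the repository provides: the helper `key_paths` and both toy dictionaries are hypothetical, and the check covers key names and nesting only, not value types or key order.

```python
def key_paths(value, prefix=""):
    """Collect dotted key paths of a nested structure; list items are marked with []."""
    paths = set()
    if isinstance(value, dict):
        for key, sub in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(sub, path)
    elif isinstance(value, list) and value:
        # Only the first item is inspected; items are assumed to share one shape.
        paths |= key_paths(value[0], prefix + "[]")
    return paths


# Toy stand-ins for a README instance and a script output, made up for illustration.
readme_instance = {
    "example_id": 1,
    "document": {"url": "u", "document_tokens": [{"Token": "a"}]},
}
script_output = {
    "id": "1",
    "document": {"title": "t", "url": "u", "tokens": [{"token": "a"}]},
}

missing_in_readme = sorted(key_paths(script_output) - key_paths(readme_instance))
extra_in_readme = sorted(key_paths(readme_instance) - key_paths(script_output))
```

Both lists are empty once the dataset card and the script agree on naming and nesting.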

albertvillanova · 2022-06-08T17:22:01Z

I think the loading script is OK now. If it is also validated by another datasets maintainer, I could run the generation of the pre-processed data and then merge this PR into master (once all the tests are green).

CC: @lhoestq

lhoestq · 2022-06-09T09:27:35Z

It looks good to me, thanks @seirasto !

albertvillanova · 2022-06-17T06:38:42Z

I have merged the master branch, so that we include all the fixes on Apache Beam + Google Dataflow.

albertvillanova · 2022-06-17T11:25:42Z

Pre-processing is running!

Already finished for "dev" config:

In [2]: ds = load_dataset("datasets/natural_questions", "dev")

In [3]: ds
Out[3]: 
DatasetDict({
    validation: Dataset({
        features: ['id', 'document', 'question', 'long_answer_candidates', 'annotations'],
        num_rows: 7830
    })
})
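When reading the new field from a loaded row, note that `datasets` commonly returns a sequence of dicts in columnar form, that is, one dict of parallel lists rather than a list of dicts. The sketch below assumes long_answer_candidates comes back that way. The helper `rows_from_columns` and the toy values are hypothetical, and no dataset is downloaded here.

```python
def rows_from_columns(columns: dict) -> list:
    """Turn a dict of parallel lists into a list of per-item dicts."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*(columns[k] for k in keys))]


# Toy stand-in for ds["validation"][0]["long_answer_candidates"] (assumed columnar form).
toy_candidates = {
    "start_token": [0, 10],
    "end_token": [50, 30],
    "top_level": [True, False],
}

rows = rows_from_columns(toy_candidates)
top_level_rows = [r for r in rows if r["top_level"]]
```

If the field turns out to be a plain list of dicts, the conversion is unnecessary and the rows can be used directly.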

albertvillanova · 2022-06-20T07:16:49Z

There is an issue while running the preprocessing for the "default" (train+dev) config. Train data files are larger than the dev ones, and workers run out of memory.

I'm opening a separate issue to handle this problem: #4525

albertvillanova · 2022-06-28T14:38:58Z

@seirasto is proposing uploading their preprocessed data files to our Datasets bucket.

I think @lhoestq can give a more informed answer about authentication requirements.

lhoestq · 2022-07-26T14:37:32Z

Now that the data files are uploaded, can you merge the main branch into yours to re-trigger the CI, @seirasto, please? :) Then I think we can merge if it's good for you, @albertvillanova.

albertvillanova

Good, once CI is green! Finally we can merge this PR... 😅

seirasto · 2022-07-26T14:51:20Z

Merge is done! I think someone needs to approve the CI to run :)

lhoestq · 2022-07-26T15:10:35Z

Can you run make style to fix the code formatting required by the CI please ?

albertvillanova

After running make style, these are the style fixes (see below).

datasets/natural_questions/natural_questions.py

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

seirasto · 2022-07-26T16:16:16Z

Thanks @albertvillanova! I've committed all your suggestions.

albertvillanova · 2022-07-26T20:18:31Z

The CI is green. I'm merging this PR.

seirasto added 3 commits May 18, 2022 09:58

add long answer candidates to natural questions

91c4a31

formatting add long answer candidates to natural questions

b7de947

update nq readme and json

fdc87c4

albertvillanova requested changes May 24, 2022

datasets/natural_questions/README.md

Update README.md

4984a52

add extra space to long_answer_candidates features

seirasto added 2 commits June 1, 2022 10:37

added additional missing fields

7129c41

Merge branch 'add_la_candidates_to_nq' of github.com:seirasto/dataset…

a9a6d88

…s into nq_long_answer_candidates

albertvillanova requested changes Jun 1, 2022

fixes to field an instance

79df5df

Merge remote-tracking branch 'upstream/master' into add_la_candidates…

9962dae

…_to_nq

jdpsen mentioned this pull request Jun 17, 2022

push NQ fixes to HF primeqa/primeqa#118

Closed

albertvillanova mentioned this pull request Jun 20, 2022

Out of memory error on workers while running Beam+Dataflow #4525

Closed

albertvillanova approved these changes Jul 26, 2022

Merge branch 'huggingface:main' into add_la_candidates_to_nq

e8745a8

albertvillanova reviewed Jul 26, 2022

seirasto and others added 4 commits July 26, 2022 12:14

Update datasets/natural_questions/natural_questions.py

38f7d6b

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

Update datasets/natural_questions/natural_questions.py

5598f60

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

Update datasets/natural_questions/natural_questions.py

7b63f74

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

Update datasets/natural_questions/natural_questions.py

ec18262

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

albertvillanova merged commit f5847a3 into huggingface:main Jul 26, 2022
