Add long answer candidates to natural questions dataset #4368
Conversation
Thanks a lot @seirasto for your contribution. Definitely, the addition of this field will be useful for all community members using this dataset.
Just some comments below.
Regarding the failing tests, please note that once the changes are validated, we will need to pre-process the whole dataset and upload it to our HuggingFace cloud.
add extra space to long_answer_candidates features
Once we have added …
Also note that the "Data Fields" section in the README is missing some fields. Moreover, there is no instance example in the "Data Instances" section.
We could either make these fixes in this PR or in a subsequent PR.
…s into nq_long_answer_candidates
@albertvillanova I've added the missing fields and updated the README to include a data instance and some other things.
Thank you.
I think the script is OK now, so we can start pre-processing the entire dataset: I'm addressing this.
Regarding the dataset card (README file), I think there is an issue: the fields in both the "Data Instances" and "Data Fields" sections should be aligned (naming and nesting) with the fields the script returns:
datasets/datasets/natural_questions/natural_questions.py
Lines 187 to 200 in 7129c41
```python
{
    "id": id_,
    "document": {
        "title": ex_json["document_title"],
        "url": ex_json["document_url"],
        "html": ex_json["document_html"],
        "tokens": [
            {"token": t["token"], "is_html": t["html_token"], "start_byte": t["start_byte"], "end_byte": t["end_byte"]}
            for t in ex_json["document_tokens"]
        ],
    },
    "question": {"text": ex_json["question_text"], "tokens": ex_json["question_tokens"]},
    "long_answer_candidates": [lac_json for lac_json in ex_json["long_answer_candidates"]],
    "annotations": [_parse_annotation(an_json) for an_json in ex_json["annotations"]],
},
```
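For reference, here is a minimal sketch of a `datasets.Features` definition consistent with the dict above. This is an illustration, not the exact contents of the loading script; the `long_answer_candidates` sub-fields are assumed from the format documented in the google-research-datasets/natural-questions README, and the `annotations` field is omitted for brevity.

```python
import datasets

# Illustrative only: names and nesting mirror the example yielded above.
features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "document": {
            "title": datasets.Value("string"),
            "url": datasets.Value("string"),
            "html": datasets.Value("string"),
            "tokens": datasets.Sequence(
                {
                    "token": datasets.Value("string"),
                    "is_html": datasets.Value("bool"),
                    "start_byte": datasets.Value("int64"),
                    "end_byte": datasets.Value("int64"),
                }
            ),
        },
        "question": {
            "text": datasets.Value("string"),
            "tokens": datasets.Sequence(datasets.Value("string")),
        },
        # Sub-fields assumed from the NQ data format: byte/token offsets plus a top-level flag.
        "long_answer_candidates": datasets.Sequence(
            {
                "start_byte": datasets.Value("int64"),
                "end_byte": datasets.Value("int64"),
                "start_token": datasets.Value("int64"),
                "end_token": datasets.Value("int64"),
                "top_level": datasets.Value("bool"),
            }
        ),
        # "annotations": ... (omitted for brevity)
    }
)
```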
Great! I've made the updates to align the README. Please let me know if I missed anything.
Fix:
- Rename "example_id" to "id"
- Set value of "id" as str
- Move "question" below "document"
- Move "document">"title" above "document">"url"
- Set value of "document">"title" as str
- Rename "document">"document_tokens" to "document">"tokens"
- Rename "Token" to "token"
- Add missing closing `}` to value of "document"
- Add missing "id" inside each "annotations"
- Add missing "text" inside each "short_answer"
- Set value of "yes_no_answer" to corresponding int
As there were many minor fixes, I thought it would be easier to make them directly.
I think the loading script is OK now. If another datasets maintainer also validates it, I can run the generation of the pre-processed data and then merge this PR into master (once all the tests are green). CC: @lhoestq
It looks good to me, thanks @seirasto!
I have merged the master branch, so that we include all the fixes on Apache Beam + Google Dataflow. |
Pre-processing is running! Already finished for the "dev" config:

```python
In [2]: ds = load_dataset("datasets/natural_questions", "dev")

In [3]: ds
Out[3]:
DatasetDict({
    validation: Dataset({
        features: ['id', 'document', 'question', 'long_answer_candidates', 'annotations'],
        num_rows: 7830
    })
})
```
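Once the pre-processed data is available, a quick sanity check could look like this (a hypothetical snippet; the load path and config mirror the session above, and the exact layout of the returned field depends on the feature definition):

```python
from datasets import load_dataset

# Load the "dev" config as in the session above (path and config assumed).
ds = load_dataset("datasets/natural_questions", "dev")

# Inspect the new field on the first validation example.
print(ds["validation"][0]["long_answer_candidates"])
```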
There is an issue while running the pre-processing for the "default" (train+dev) config: the train data files are larger than the dev ones, and workers run out of memory. I'm opening a separate issue to handle this problem: #4525
Now that the data files are uploaded, can you merge the …
Good, once CI is green! Finally we can merge this PR... 😅
Merge is done! I think someone needs to approve the CI to run :)
Can you run `make style`?
After running `make style`, these are the style fixes (see below).
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Thanks @albertvillanova! I've committed all your suggestions.
The CI is green. I'm merging this PR. |
This is a modification of the Natural Questions dataset to include missing information specifically related to long answer candidates (see here: https://github.com/google-research-datasets/natural-questions#long-answer-candidates). This information is important to ensure consistent comparison with prior work. It does not disturb the rest of the format. @lhoestq @albertvillanova
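For context, each long answer candidate marks a span of the document that could serve as a long answer. Below is a hypothetical instance with illustrative values, following the field names described in the linked google-research-datasets/natural-questions README:

```python
# Illustrative values only; field names follow the NQ long-answer-candidate format.
candidate = {
    "start_byte": 512,    # byte offset where the candidate span starts in the document HTML
    "end_byte": 1024,     # byte offset where the span ends
    "start_token": 80,    # index of the span's first token in document_tokens
    "end_token": 160,     # index of the token ending the span
    "top_level": True,    # True if the candidate is not nested inside another candidate
}
```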