Include entity positions as feature in ReCoRD #4479
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks @richarddwang! Sorry for getting back to you after such a long delay.
Anyway, it looks good to me! Can you just remove the remaining jsonl files that are outside of the dummy data zip files? They are not needed.
Also, can you update the dataset_infos.json file with the new column? You can regenerate it with:
`datasets-cli test ./datasets/super_glue --name record --save_infos`
Thanks for the reply @lhoestq! I have succeeded on ...
Thanks!
Actually I just noticed that this is a breaking change: the length of the "entities" field changes (the current implementation uses a set to dedupe them). What about keeping "entities" deduplicated, and adding an "entities_spans" field consisting of a list of {"text": ..., "start": ..., "end": ...}? This way it won't break users' code, e.g.
- https://github.com/google-research/text-to-text-transfer-transformer/blob/3c58859b8fe72c2dbca6a43bc775aa510ba7e706/t5/data/preprocessors.py#L925
- equal operation to perform unbatch for huggingface datasets #2767 (comment)
What do you think?
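For illustration, a rough sketch of what that schema could look like using the `datasets` feature types. The surrounding field names ("passage", "query", "answers") are assumptions based on the ReCoRD format, not the exact loading-script code:

```python
from datasets import Features, Sequence, Value

# Hypothetical sketch: "entities" stays a deduplicated list of strings,
# while "entities_spans" carries the surface form plus character offsets,
# so existing users of "entities" are unaffected.
features = Features(
    {
        "passage": Value("string"),
        "query": Value("string"),
        "entities": Sequence(Value("string")),
        "entities_spans": Sequence(
            {
                "text": Value("string"),
                "start": Value("int32"),
                "end": Value("int32"),
            }
        ),
        "answers": Sequence(Value("string")),
    }
)
```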
BTW the CI failure is unrelated to this PR: some tags are missing from the dataset cards ('annotations_creators', 'language_creators', 'license', 'multilinguality', 'size_categories', 'source_datasets', 'task_categories', and 'task_ids'), so the dataset is not indexed properly on the Hugging Face website. This can be fixed in another PR.
That would be neat! Let me implement it.
Thank you! LGTM
Merging since the CI errors are unrelated to this PR.
https://huggingface.co/datasets/super_glue/viewer/record/validation
TL;DR: We need to record entity positions, which are included in the source data but excluded by the loading script, to enable efficient and effective training on ReCoRD.
Currently, the loading script ignores the entity positions ("entity_start", "entity_end") and only records the entity text. This is probably because the official baseline's training method turns one data point into n training instances by replacing "@placeholder" in the query with each of the n entities individually.
But that multiplies the already heavy computation several times over. DeBERTa instead uses a method that gathers entity embeddings by their positions in the passage, and thus builds a single training instance per data point. It is far more efficient and proved effective on the ReCoRD task; a rough sketch of that lookup is shown below.
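For illustration only, a minimal sketch of the position-based lookup, assuming character offsets like those in the source data and a fast tokenizer. The model name, the exclusive-end convention, and the mean pooling are all assumptions for the sketch, not part of this PR or of DeBERTa's exact recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any fast tokenizer
model = AutoModel.from_pretrained("bert-base-uncased")

passage = "Paris is the capital of France."
# Character spans as they could be stored in "entities_spans"; "end" is
# assumed exclusive (Python-slice style) -- adjust if the dataset stores
# inclusive end offsets instead.
spans = [{"text": "Paris", "start": 0, "end": 5}]

enc = tokenizer(passage, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)

entity_embeddings = []
for span in spans:
    # Map character offsets to token indices (fast tokenizers only).
    tok_start = enc.char_to_token(span["start"])
    tok_end = enc.char_to_token(span["end"] - 1)
    # Mean-pool the subword embeddings covering the entity.
    entity_embeddings.append(hidden[tok_start : tok_end + 1].mean(dim=0))
```

Each entity then yields one embedding from a single forward pass over the passage, instead of one forward pass per entity.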
Can anybody help me with the dataset card rendering error? Maybe @lhoestq?