
Include entity positions as feature in ReCoRD #4479

Merged · 4 commits · Aug 19, 2022
Conversation

@richarddwang (Contributor) commented Jun 12, 2022

https://huggingface.co/datasets/super_glue/viewer/record/validation

TLDR: We need to record entity positions, which are included in the source data but excluded by the loading script, to enable efficient and effective training for ReCoRD.

Currently, the loading script ignores the entity positions ("entity_start", "entity_end") and records only the entity text. This is probably because the official baseline's training method builds n training instances from each data point by replacing "@placeholder" in the query with each entity individually.

But that multiplies the already heavy computation several-fold. DeBERTa instead uses a method that takes entity embeddings by their positions in the passage, and thus makes one training instance from one data point. It is far more efficient and has proven effective on the ReCoRD task.
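For illustration only, here is a rough sketch (not the DeBERTa implementation) of the kind of span pooling that character-level entity positions enable; it assumes PyTorch, a Hugging Face fast tokenizer with `return_offsets_mapping`, and a placeholder encoder, and all names are illustrative:

```python
# Rough sketch only: pool encoder hidden states over each entity span using
# character offsets, so one passage needs a single forward pass instead of
# one per candidate entity. Model name and helper are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any fast tokenizer
encoder = AutoModel.from_pretrained("bert-base-uncased")        # placeholder encoder

def entity_span_embeddings(passage, spans):
    """spans: list of (start, end) character offsets of entities in `passage`."""
    enc = tokenizer(passage, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]            # (seq_len, 2) char offsets
    hidden = encoder(**enc).last_hidden_state[0]      # (seq_len, hidden_size)
    pooled = []
    for start, end in spans:
        # tokens fully inside the entity span; (0, 0) offsets are special tokens
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > 0)
        pooled.append(hidden[mask].mean(dim=0))       # mean-pool the span tokens
    return torch.stack(pooled)                        # (num_entities, hidden_size)
```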

Can anybody help me with the dataset card rendering error? Maybe @lhoestq ?

@HuggingFaceDocBuilderDev commented Jun 12, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq (Member) left a comment


Thanks @richarddwang ! Sorry for getting back to you after such a long delay.

Anyway it looks good to me ! Can you just remove the remaining jsonl files that are outside of the dummy data zip files ? They are not needed.

Also can you update the dataset_infos.json file with the new column ?

datasets-cli test ./datasets/super_glue --name record --save_infos

@richarddwang (Contributor, Author)

Thanks for the reply @lhoestq !

I have succeeded in running datasets-cli test ./datasets/super_glue --name record --save_infos,
but as you can see, the check ran into FAILED tests/test_dataset_cards.py::test_changed_dataset_card[super_glue] - V....
How can we solve it?

@lhoestq (Member) left a comment


Thanks !

Actually I just noticed that this is a breaking change: the length of the "entities" field changes (the current implementation uses a set to deduplicate them). What about keeping "entities" deduplicated, and adding an "entities_spans" field consisting of a list of {"text": ..., "start": ..., "end": ...}? This way it won't break users' code; e.g. the feature definition could look something like the sketch below.
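A hypothetical sketch of how such a schema could be declared in the ReCoRD config of the loading script (field names follow the proposal above; this is not the final merged code):

```python
# Hypothetical sketch of the proposed schema, not the merged loading-script code.
import datasets

record_features = datasets.Features(
    {
        "passage": datasets.Value("string"),
        "query": datasets.Value("string"),
        # unchanged: deduplicated entity strings, so existing user code keeps working
        "entities": datasets.Sequence(datasets.Value("string")),
        # new: every entity mention with its character offsets in the passage
        "entities_spans": datasets.Sequence(
            {
                "text": datasets.Value("string"),
                "start": datasets.Value("int32"),
                "end": datasets.Value("int32"),
            }
        ),
        "answers": datasets.Sequence(datasets.Value("string")),
        "idx": {
            "passage": datasets.Value("int32"),
            "query": datasets.Value("int32"),
        },
    }
)
```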

What do you think ?

BTW the CI failure is unrelated to this PR - it appears that some tags are missing in the dataset cards ('annotations_creators', 'language_creators', 'license', 'multilinguality', 'size_categories', 'source_datasets', 'task_categories', and 'task_ids'), and so the dataset is not indexed properly on the Hugging Face website. This can be fixed in another PR

@richarddwang (Contributor, Author)

That would be neat! Let me implement it.

@lhoestq (Member) left a comment


Thank you ! LGTM

Merging since the CI errors are unrelated to this PR

@lhoestq lhoestq merged commit 97e0e21 into huggingface:main Aug 19, 2022
@richarddwang richarddwang deleted the fix_record branch August 19, 2022 23:23