Include entity positions as feature in ReCoRD #4479
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks @richarddwang! Sorry for getting back to you after such a long delay.
Anyway, it looks good to me! Can you just remove the remaining jsonl files that are outside of the dummy data zip files? They are not needed.
Also, can you update the dataset_infos.json file with the new column? You can regenerate it with:
`datasets-cli test ./datasets/super_glue --name record --save_infos`
Thanks for the reply @lhoestq! I have succeeded on ...
Thanks!
Actually I just noticed that this is a breaking change: the length of the "entities" field changes (the current implementation uses a set to dedupe them). What about keeping "entities" deduplicated, and adding an "entities_spans" field consisting of a list of {"text": ..., "start": ..., "end": ...}? This way it won't break users' code, e.g.
- https://github.com/google-research/text-to-text-transfer-transformer/blob/3c58859b8fe72c2dbca6a43bc775aa510ba7e706/t5/data/preprocessors.py#L925
- equal operation to perform unbatch for huggingface datasets #2767 (comment)
What do you think?
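For illustration, a rough sketch of what that schema could look like using the `datasets` feature types. The surrounding field names ("passage", "query", "answers") are assumptions based on the ReCoRD format, not the exact loading-script code:

```python
from datasets import Features, Sequence, Value

# Hypothetical sketch: "entities" stays a deduplicated list of strings,
# while "entities_spans" carries the surface form plus character offsets,
# so existing users of "entities" are unaffected.
features = Features(
    {
        "passage": Value("string"),
        "query": Value("string"),
        "entities": Sequence(Value("string")),
        "entities_spans": Sequence(
            {
                "text": Value("string"),
                "start": Value("int32"),
                "end": Value("int32"),
            }
        ),
        "answers": Sequence(Value("string")),
    }
)
```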
BTW the CI failure is unrelated to this PR: some tags are missing from the dataset cards ('annotations_creators', 'language_creators', 'license', 'multilinguality', 'size_categories', 'source_datasets', 'task_categories', and 'task_ids'), so the dataset is not indexed properly on the Hugging Face website. This can be fixed in another PR.
That would be neat! Let me implement it.
Thank you! LGTM
Merging since the CI errors are unrelated to this PR.
https://huggingface.co/datasets/super_glue/viewer/record/validation
TL;DR: We need to record entity positions, which are included in the source data but excluded by the loading script, to enable efficient and effective training on ReCoRD.
Currently, the loading script ignores the entity positions ("entity_start", "entity_end") and only records the entity text. This is probably because the official baseline's training method turns one data point into n training instances by replacing "@placeholder" in the query with each of the n entities individually.
But that multiplies the already heavy computation several times over. DeBERTa instead uses a method that gathers entity embeddings by their positions in the passage, and thus builds a single training instance per data point. It is far more efficient and proved effective on the ReCoRD task; a rough sketch of that lookup is shown below.
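For illustration only, a minimal sketch of the position-based lookup, assuming character offsets like those in the source data and a fast tokenizer. The model name, the exclusive-end convention, and the mean pooling are all assumptions for the sketch, not part of this PR or of DeBERTa's exact recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any fast tokenizer
model = AutoModel.from_pretrained("bert-base-uncased")

passage = "Paris is the capital of France."
# Character spans as they could be stored in "entities_spans"; "end" is
# assumed exclusive (Python-slice style) -- adjust if the dataset stores
# inclusive end offsets instead.
spans = [{"text": "Paris", "start": 0, "end": 5}]

enc = tokenizer(passage, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)

entity_embeddings = []
for span in spans:
    # Map character offsets to token indices (fast tokenizers only).
    tok_start = enc.char_to_token(span["start"])
    tok_end = enc.char_to_token(span["end"] - 1)
    # Mean-pool the subword embeddings covering the entity.
    entity_embeddings.append(hidden[tok_start : tok_end + 1].mean(dim=0))
```

Each entity then yields one embedding from a single forward pass over the passage, instead of one forward pass per entity.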
Can anybody help me with the dataset card rendering error? Maybe @lhoestq?