Add article_id and process test set template for semeval 2020 task 11… #1979

Merged

hemildesai
Contributor

… dataset

  • article_id is needed to create the submission file for the task at https://propaganda.qcri.org/semeval2020-task11/
  • The technique classification task provides the span indices in a template for the test set that is necessary to complete the task. This PR implements processing of that template for the dataset.
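
As a rough illustration of the second point, a template of this kind could be parsed as shown here. This is a minimal sketch, not the code from this PR: it assumes each template line is tab-separated as article id, a placeholder label, start offset, and end offset, and the function name `parse_tc_template` is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_tc_template(lines: List[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Group (start, end) span offsets by article id.

    Assumed line layout (not confirmed by this PR):
    article_id <TAB> placeholder <TAB> start <TAB> end
    """
    spans: Dict[str, List[Tuple[int, int]]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        article_id, _label, start, end = line.split("\t")
        spans.setdefault(article_id, []).append((int(start), int(end)))
    return spans


# Made-up rows, for illustration only.
example = ["111111111\t?\t0\t10\n", "111111111\t?\t25\t40\n", "222222222\t?\t5\t9\n"]
print(parse_tc_template(example))
```

Keeping the article id as the key is what makes it possible to write predictions back out per article when building a submission file.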

@hemildesai force-pushed the add_articleid_to_semeval2020_11 branch from 3a798fd to 795554a on March 7, 2021 08:04
Member

@lhoestq lhoestq left a comment

Cool thank you :)

datasets/sem_eval_2020_task_11/sem_eval_2020_task_11.py (outdated)
@hemildesai force-pushed the add_articleid_to_semeval2020_11 branch from 795554a to c5933b5 on March 9, 2021 06:46
Comment on lines 133 to 136
if os.path.isfile(tc_labels_template):
    tc_test_template = self._process_tc_labels_template(tc_labels_template)
else:
    tc_test_template = None
Member

Out of curiosity, why do you need to test whether the file exists?

Contributor Author

This is a classification task on a span in a given article. So this template just provides the offsets for the spans to be classified in an article.
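
To make the role of the offsets concrete, here is a small sketch of how a span could be recovered from an article. It assumes the offsets are character positions with an exclusive end, which is not stated in this thread; `extract_spans` and the sample text are hypothetical.

```python
from typing import List, Tuple


def extract_spans(article_text: str, offsets: List[Tuple[int, int]]) -> List[str]:
    """Return the text fragment for each (start, end) character offset pair."""
    return [article_text[start:end] for start, end in offsets]


# Made-up article text and offsets, for illustration only.
text = "Our glorious leader never lies."
print(extract_spans(text, [(4, 19)]))
```

Each fragment returned this way is what a technique classifier would label.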

Member

I see.
And in these lines you're checking whether the template file exists.
Is this check necessary? Isn't the file always present in the ptc-corpus.tgz archive downloaded for this dataset?

Contributor Author

The check is just to be on the safe side: this dataset loads from a manual data dir, so I'm assuming different people's copies might have discrepancies.

Member

For reproducibility, we try to keep dataset generation deterministic.
Do you think we can simply assume that this file always exists?
Otherwise we can keep it this way.

Contributor Author

Yeah, the official dataset contains this file, so I guess we can assume it is there.

Contributor Author

Ok, changed it to make it deterministic.
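
One way to read "deterministic" here is that the loader always processes the template and fails loudly when it is missing, instead of silently producing a different dataset. The sketch below illustrates that idea only; the actual change in this PR may differ, and `load_tc_template` and its error message are hypothetical.

```python
import os


def load_tc_template(path: str) -> list:
    """Read the template unconditionally; a missing file is an error, not a fallback."""
    if not os.path.isfile(path):
        # Hypothetical message: surface the problem rather than returning None.
        raise FileNotFoundError(f"Expected labels template at {path}")
    with open(path, encoding="utf-8") as f:
        # One list of tab-separated fields per non-empty line.
        return [line.rstrip("\n").split("\t") for line in f if line.strip()]
```

With this shape, two users with the same data dir either get the same rows or the same error, never a silently smaller test split.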

@hemildesai force-pushed the add_articleid_to_semeval2020_11 branch from f3c1daf to 406ed9d on March 11, 2021 17:18
@lhoestq
Member

lhoestq commented Mar 12, 2021

Thanks!
To fix the CI, the only thing left is to add a dummy test-task-tc-template.out file inside the dummy_data.zip at ./datasets/sem_eval_2020_task_11/dummy/1.1.0
It must contain the labels template for each dummy test-set article included in dummy_data.zip.

After that we should be good to merge this one :)
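
For anyone repeating this step, a dummy template could be added to the archive along these lines. This is only a sketch using the standard `zipfile` module: the helper name `add_dummy_template`, the path inside the archive, and the row layout are assumptions, not details taken from this PR.

```python
import zipfile


def add_dummy_template(zip_path: str, arcname: str, rows: list) -> None:
    """Write tab-separated template rows into a zip archive.

    Mode "a" appends to an existing archive, or creates it if missing.
    """
    content = "".join("\t".join(str(v) for v in row) + "\n" for row in rows)
    with zipfile.ZipFile(zip_path, "a") as zf:
        zf.writestr(arcname, content)
```

A call might look like `add_dummy_template("dummy_data.zip", "dummy_data/test-task-tc-template.out", [("123", "?", 0, 5)])`, where the archive member path and the row values are placeholders to be matched against the dummy articles actually in the zip.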

@hemildesai force-pushed the add_articleid_to_semeval2020_11 branch from 406ed9d to 07283f6 on March 12, 2021 12:41
@hemildesai
Contributor Author

@lhoestq Made the changes! The failure now seems to be unrelated to the changes. Any idea what's going on?

@lhoestq
Member

lhoestq commented Mar 12, 2021

This is a bug on master that we're investigating. You can ignore it.

Member

@lhoestq lhoestq left a comment

All looks good now! Thanks again :)

Merging.

@lhoestq lhoestq merged commit 162c0ee into huggingface:master Mar 12, 2021