CANNOT is a dataset that focuses on negated textual pairs. It currently contains 77,376 samples, of which roughly half are negated pairs of sentences and the other half are not (they are paraphrased versions of each other).
The most frequent negation that appears in the dataset is verbal negation (e.g., will → won't), although it also contains pairs with antonyms (cold → hot).
The dataset is given as a `.tsv` file with the following structure:
premise | hypothesis | label |
---|---|---|
A sentence. | An equivalent, non-negated sentence (paraphrased). | 0 |
A sentence. | The sentence negated. | 1 |
The dataset can be easily loaded into a Pandas DataFrame by running:
import pandas as pd
dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')
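As a quick sanity check after loading, the label balance can be inspected. The sketch below is only illustrative: instead of the real file it reads a tiny in-memory sample of made-up sentences with the same three columns, and `label_share` is a hypothetical helper name, not part of the dataset's tooling.

```python
import io

import pandas as pd

# Made-up rows in the same tab-separated layout as the dataset
# (label 0 = paraphrase, label 1 = negated pair).
SAMPLE_TSV = (
    "premise\thypothesis\tlabel\n"
    "It will rain today.\tIt won't rain today.\t1\n"
    "It will rain today.\tRain is expected today.\t0\n"
    "The soup is cold.\tThe soup is hot.\t1\n"
    "The soup is cold.\tThe soup has gone cold.\t0\n"
)


def label_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of rows per label, ordered by label value."""
    return df["label"].value_counts(normalize=True).sort_index()


sample = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")
shares = label_share(sample)
print(shares)
```

Applied to the real DataFrame, the same helper should report two shares of roughly one half each, in line with the description above.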
The dataset has been created by cleaning up and merging the following datasets:
- Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation (see `datasets/nan-nli`).
- GLUE Diagnostic Dataset (see `datasets/glue-diagnostic`).
- Automated Fact-Checking of Claims from Wikipedia (see `datasets/wikifactcheck-english`).
- From Group to Individual Labels Using Deep Features (see `datasets/sentiment-labelled-sentences`). In this case, the negated sentences were obtained by using the Python module `negate`.
- It Is Not Easy To Detect Paraphrases: Analysing Semantic Similarity With Antonyms and Negation Using the New SemAntoNeg Benchmark (see `datasets/antonym-substitution`).
Once processed, the number of samples remaining in each of the datasets above is:
Dataset | Samples |
---|---|
Not another Negation Benchmark | 118 |
GLUE Diagnostic Dataset | 154 |
Automated Fact-Checking of Claims from Wikipedia | 14,970 |
From Group to Individual Labels Using Deep Features | 2,110 |
It Is Not Easy To Detect Paraphrases | 8,597 |
Total | 25,949 |
Additionally, for each of the negated samples, another pair of non-negated sentences has been added by paraphrasing them with the pre-trained model 🤗 `tuner007/pegasus_paraphrase`.
Finally, the swapped version of each pair (premise ⇋ hypothesis) has also been included, and any duplicates have been removed.
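The swap-and-deduplicate step can be sketched with pandas as follows. This is an assumption about how such a step could be written, not the script actually used to build CANNOT: `add_swapped_pairs` is a hypothetical name, the sentences are made up, and duplicates are taken to mean rows with an identical premise, hypothesis, and label.

```python
import pandas as pd


def add_swapped_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Append premise/hypothesis-swapped copies and drop exact duplicate rows."""
    # Swap the two text columns; the label is unchanged by swapping.
    swapped = df.rename(columns={"premise": "hypothesis", "hypothesis": "premise"})
    # Restore the original column order before stacking both versions.
    combined = pd.concat([df, swapped[df.columns]], ignore_index=True)
    return combined.drop_duplicates(ignore_index=True)


pairs = pd.DataFrame(
    {
        "premise": ["It will rain.", "It won't rain.", "He is tall."],
        "hypothesis": ["It won't rain.", "It will rain.", "He is short."],
        "label": [1, 1, 1],
    }
)
augmented = add_swapped_pairs(pairs)
print(augmented)
```

In this toy input the first two rows are already swapped versions of each other, so swapping them only produces duplicates that are removed again; only the third row contributes a new pair.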
With this, the number of premises/hypotheses in the CANNOT dataset that also appear in the original datasets is:
Dataset | Sentences |
---|---|
Not another Negation Benchmark | 552 (0.36 %) |
GLUE Diagnostic Dataset | 586 (0.38 %) |
Automated Fact-Checking of Claims from Wikipedia | 89,728 (57.98 %) |
From Group to Individual Labels Using Deep Features | 12,626 (8.16 %) |
It Is Not Easy To Detect Paraphrases | 17,198 (11.11 %) |
Total | 120,690 (77.99 %) |
The percentages above are relative to the total number of premises and hypotheses in the CANNOT dataset. The remaining 22.01 % (34,062 sentences) are the novel premises/hypotheses added through paraphrasing and rule-based negation.
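The arithmetic behind these figures can be reproduced in a few lines. The sketch assumes that every sample contributes exactly two sentences (one premise and one hypothesis), which is consistent with the totals reported here; the constant and function names are illustrative only.

```python
# Figures taken from the tables and text above.
TOTAL_SAMPLES = 77_376
# Assumption: each sample counts as two sentences (premise + hypothesis).
TOTAL_SENTENCES = 2 * TOTAL_SAMPLES

SENTENCES_FROM_SOURCES = {
    "Not another Negation Benchmark": 552,
    "GLUE Diagnostic Dataset": 586,
    "Automated Fact-Checking of Claims from Wikipedia": 89_728,
    "From Group to Individual Labels Using Deep Features": 12_626,
    "It Is Not Easy To Detect Paraphrases": 17_198,
}


def share(count: int, total: int = TOTAL_SENTENCES) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


from_sources = sum(SENTENCES_FROM_SOURCES.values())
novel = TOTAL_SENTENCES - from_sources

for name, count in SENTENCES_FROM_SOURCES.items():
    print(f"{name}: {count:,} ({share(count)} %)")
print(f"From original datasets: {from_sources:,} ({share(from_sources)} %)")
print(f"Novel sentences: {novel:,} ({share(novel)} %)")
```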
Questions? Bugs...? Then feel free to open a new issue.
We thank all the previous authors who have made this dataset possible:
Thinh Hung Truong, Yulia Otmakhova, Timothy Baldwin, Trevor Cohn, Jey Han Lau, Karin Verspoor, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, Aalok Sathe, Salar Ather, Tuan Manh Le, Nathan Perry, Joonsuk Park, Dimitrios Kotzias, Misha Denil, Nando De Freitas, Padhraic Smyth, Teemu Vahtola, Mathias Creutz, and Jörg Tiedemann.
The CANNOT dataset is released under CC BY-SA 4.0.
@misc{anschütz2023correct,
title={This is not correct! Negation-aware Evaluation of Language Generation Systems},
author={Miriam Anschütz and Diego Miguel Lozano and Georg Groh},
year={2023},
eprint={2307.13989},
archivePrefix={arXiv},
primaryClass={cs.CL}
}