This repository contains a ground truth corpus for semantic frame disambiguation, acquired with crowdsourcing and processed with CrowdTruth metrics that capture ambiguity in annotations by measuring inter-annotator disagreement.
The dataset contains annotations for over 9000 sentence-word pairs from the FrameNet corpus v.1.7, with each sentence-word pair annotated for frame disambiguation by 15 workers. The crowdsourced data was collected from Amazon Mechanical Turk.
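As a rough illustration of how disagreement among the 15 workers can become a graded score rather than a single label, here is a minimal sketch. It is a simplified, unweighted cosine score and not the implementation in the CrowdTruth package, whose metrics also weight workers and annotations by quality. The function name `frame_sentence_scores` and the vote counts are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def frame_sentence_scores(worker_choices):
    """Score each frame for one sentence-word pair.

    worker_choices: list of sets, one set of chosen frames per worker.
    The score of a frame is the cosine similarity between the vector of
    per-frame vote counts and the unit vector for that frame, which
    reduces to count / L2-norm of all counts.
    Simplified sketch: no worker or annotation quality weighting.
    """
    counts = Counter(frame for choice in worker_choices for frame in set(choice))
    norm = sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        # No worker picked any frame for this pair.
        return {}
    return {frame: c / norm for frame, c in counts.items()}


# Invented example: 15 workers split between two FrameNet frames.
example = [{"Motion"}] * 9 + [{"Self_motion"}] * 6
scores = frame_sentence_scores(example)
for frame, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{frame}: {score:.3f}")
```

When workers split between frames, as in this example, both frames keep a sizeable score, so the ambiguity remains visible instead of being collapsed by a majority vote.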
The corpus has been referenced in the following papers:
- Anca Dumitrache, Lora Aroyo and Chris Welty: A Crowdsourced Frame Disambiguation Corpus with Ambiguity. NAACL 2019.
- Anca Dumitrache, Lora Aroyo and Chris Welty: Capturing and Interpreting Ambiguity in Crowdsourcing Frame Disambiguation. HCOMP 2018.
To replicate the data processing from the paper, use the Jupyter Notebook file CrowdTruth metrics.ipynb. It requires the installation of the CrowdTruth metrics Python package (v >= 2.0).
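Before opening the notebook, it may help to check that the installed package meets the version requirement. The sketch below uses only the Python standard library. The distribution name "CrowdTruth" is an assumption about how the package is registered in your environment, and `installed_major_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major_version(dist_name):
    """Return the major version of an installed distribution as an int.

    Returns None if the distribution is not installed or if its version
    string does not start with a number.
    """
    try:
        raw = version(dist_name)
    except PackageNotFoundError:
        return None
    head = raw.split(".")[0]
    return int(head) if head.isdigit() else None


# Assumption: the package is installed under the distribution name "CrowdTruth".
major = installed_major_version("CrowdTruth")
if major is None:
    print("CrowdTruth package not found (or version not parseable).")
elif major < 2:
    print("Installed CrowdTruth is older than 2.0; the notebook needs v >= 2.0.")
else:
    print("CrowdTruth version requirement satisfied.")
```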
The data aggregated with CrowdTruth metrics is available in folder data/output/.
The raw crowdsourcing data is available in folder data/input/.
If you find this data useful in your research, please consider citing:
@inproceedings{dumitrache2018frames,
Author = {Anca Dumitrache and Lora Aroyo and Chris Welty},
Title = {A Crowdsourced Frame Disambiguation Corpus with Ambiguity},
Booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
Year = {2019}
}