Training Data

Benjamin Meyers edited this page May 18, 2017 · 2 revisions

This classifier was trained using the human-annotated Szeged Uncertainty Corpus, which is composed of three sub-corpora:

  • BioScope 2.0[1]
  • FactBank 2.0[2]
  • WikiWeasel 2.0[3]

The original corpus is distributed as XML; we reformatted it into JSON for readability.
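As a rough illustration of an XML-to-JSON conversion of this kind, the sketch below walks an XML tree with Python's standard library and emits a nested JSON structure. It is not the script used for this project: the tag and attribute names in the sample (`Sentence`, `cue`, `id`, `type`) and the output keys (`tag`, `attributes`, `text`, `tail`, `children`) are assumptions chosen for the example, not the corpus's actual schema.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an XML element into a plain dict.

    Text that follows a child element (its "tail") is kept, because
    inline annotations sit in the middle of the sentence text.
    """
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attributes"] = dict(elem.attrib)
    # Keep text verbatim, but skip whitespace-only strings from pretty-printing.
    if elem.text and elem.text.strip():
        node["text"] = elem.text
    if elem.tail and elem.tail.strip():
        node["tail"] = elem.tail
    children = [element_to_dict(child) for child in elem]
    if children:
        node["children"] = children
    return node


def xml_to_json(xml_string):
    """Parse an XML string and return a pretty-printed JSON string."""
    root = ET.fromstring(xml_string)
    return json.dumps(element_to_dict(root), indent=2, ensure_ascii=False)


# Hypothetical sample: the element names here are illustrative only.
SAMPLE_XML = (
    '<Sentence id="S1">This <cue type="speculation">may</cue>'
    " indicate binding.</Sentence>"
)

if __name__ == "__main__":
    print(xml_to_json(SAMPLE_XML))
```

Keeping each child's `tail` means the original sentence can be rebuilt by concatenating `text`, each child's `text`, and each child's `tail` in order.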

A secondary corpus is included with the source code from the experiments for the CoNLL-2010 Shared Task. It contains all of the pre-generated features used to train the original classifier. The unedited features are available here, and the updated features (with multiclass labels) are available here.


📃 [1] Vincze, V., Szarvas, G., Farkas, R., Móra, G., & Csirik, J. (2008). The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics, 9(11), S9.

📃 [2] Saurí, R., & Pustejovsky, J. (2009). FactBank: a corpus annotated with event factuality. Language Resources and Evaluation, 43(3), 227.

📃 [3] Farkas, R., Vincze, V., Móra, G., Csirik, J., & Szarvas, G. (2010, July). The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning: Shared Task (pp. 1-12). Association for Computational Linguistics.
