This is an implementation of the multi-pass sieve for coreference resolution described by Raghunathan et al. (2010). It applies a sequence of sieves to clusters of referential expressions in order to link coreferential noun phrases.
This program takes files from the Ontonotes corpus in the conll format as input. Find out more here.
Written with Python 3.8.5.
See requirements.txt.
Download nltk's stopword corpus by running:
>>> import nltk
>>> nltk.download('stopwords')
Move your directory containing files from the Ontonotes corpus in the conll-format into this directory.
Run:
main.py [-h] [--config [CONFIG]] [--ext [EXT]] [--lang [LANG]] in_dir out_dir
in_dir: The directory containing the conll files from the Ontonotes corpus. Subdirectories will also be searched.
out_dir: A name for the directory where output files should be stored.
--config: The name of the config file where sieves are specified. This defaults to config.txt.
--ext: The extension files should have. This defaults to conll. If you only want to extract gold-annotated files, set this to gold_conll.
--lang: A subdirectory in in_dir from which files should be extracted. Set this to english to only extract English files from a nested Ontonotes corpus. By default, all subdirectories will be searched.
Examples:
main.py corpus output
main.py --ext gold_conll corpus output
main.py --ext gold_conll --lang english nested_corpus output
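The way --ext and --lang narrow down the input files can be pictured with the following sketch. It is not taken from main.py: the function name find_conll_files and the rule that --lang matches any directory component below in_dir are assumptions made for illustration only.

```python
from pathlib import Path


def find_conll_files(in_dir, ext="conll", lang=None):
    """Collect files below in_dir that end in .<ext> (hypothetical helper).

    If lang is given, keep only files that have a directory named
    lang somewhere between in_dir and the file itself.
    """
    root = Path(in_dir)
    files = []
    for path in sorted(root.rglob(f"*.{ext}")):
        if not path.is_file():
            continue
        # Directory components between in_dir and the file name.
        subdirs = path.relative_to(root).parts[:-1]
        if lang is None or lang in subdirs:
            files.append(path)
    return files
```

Under these assumptions, find_conll_files("nested_corpus", ext="gold_conll", lang="english") corresponds to the third example call above.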
The sieves are specified in config.txt. The value assigned to each sieve determines its position in the order of application.
To exclude a sieve, set its value to -1.
Example:
[Sieves]
Exact_Match_Sieve = 1
Precise_Constructs_Sieve = 2
Strict_Head_Match_Sieve = 3
Strict_Head_Relax_Modifiers = 4
Strict_Head_Relax_Inclusion = -1
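As a rough illustration of how such a file can be turned into an ordered list of sieves, here is a minimal sketch using Python's standard configparser module. The function name ordered_sieves is hypothetical and the program's own loading code may differ; the sketch only shows the ordering and the -1 exclusion rule described above.

```python
import configparser

CONFIG_TEXT = """
[Sieves]
Exact_Match_Sieve = 1
Precise_Constructs_Sieve = 2
Strict_Head_Match_Sieve = 3
Strict_Head_Relax_Modifiers = 4
Strict_Head_Relax_Inclusion = -1
"""


def ordered_sieves(config_text):
    """Return sieve names sorted by their value, skipping those set to -1."""
    parser = configparser.ConfigParser()
    # By default configparser lowercases keys; keep the names as written.
    parser.optionxform = str
    parser.read_string(config_text)
    ranked = [(int(value), name) for name, value in parser["Sieves"].items()]
    return [name for rank, name in sorted(ranked) if rank != -1]


print(ordered_sieves(CONFIG_TEXT))
# ['Exact_Match_Sieve', 'Precise_Constructs_Sieve',
#  'Strict_Head_Match_Sieve', 'Strict_Head_Relax_Modifiers']
```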
For each document from which coreference information was extracted, there is one file in the output folder. The first line contains the path to the original file.
After that follow clusters of coreferential mentions. Clusters are separated by -;-.
Each line in a cluster represents a mention. The first column is a 3-tuple: the first element is the index of the sentence in which the mention appears, the second is the start index of the mention in that sentence, and the third is its end index (exclusive).
For example, assume the following sentence is at index 3:
The0 dog1 is2 happy3 about4 his5 new6 toy7
Then the mention "the dog" would have the 3-tuple (3,0,2), the mention "his new toy" would be (3,5,8) and the mention "his" would be (3,5,6).
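The indices in this example behave like Python slice bounds: the start index is included and the end index is not. The sketch below reproduces the three mentions from the example sentence. The helper mention_text and the sentences dictionary are illustrative names, not part of the program.

```python
# Tokenized sentences keyed by sentence index (only sentence 3 is shown).
sentences = {3: "The dog is happy about his new toy".split()}


def mention_text(sentences, span):
    """Return the tokens covered by a (sentence, start, end) tuple as a string."""
    sent_idx, start, end = span
    return " ".join(sentences[sent_idx][start:end])


print(mention_text(sentences, (3, 0, 2)))  # The dog
print(mention_text(sentences, (3, 5, 8)))  # his new toy
print(mention_text(sentences, (3, 5, 6)))  # his
```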
The second column is the string of the mention.
Example:
path/to/example/file.conll
(1,2,3);man
(4,5,6);he
-;-
(7,8,9);dog
(10,11,12);it
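To work with these files afterwards, a small reader along the following lines may be enough. This is a sketch based only on the format described above; parse_output is a hypothetical helper, and it assumes that the 3-tuple itself never contains a semicolon.

```python
EXAMPLE = """path/to/example/file.conll
(1,2,3);man
(4,5,6);he
-;-
(7,8,9);dog
(10,11,12);it
"""


def parse_output(text):
    """Split an output file into the source path and a list of clusters.

    Each cluster is a list of ((sentence, start, end), mention) pairs.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    source_path, body = lines[0], lines[1:]
    clusters, current = [], []
    for line in body:
        if line == "-;-":  # cluster separator
            if current:
                clusters.append(current)
            current = []
            continue
        # Split on the first semicolon only, so the mention string stays whole.
        span, mention = line.split(";", 1)
        sent, start, end = (int(x) for x in span.strip("()").split(","))
        current.append(((sent, start, end), mention))
    if current:
        clusters.append(current)
    return source_path, clusters


path, clusters = parse_output(EXAMPLE)
print(path)           # path/to/example/file.conll
print(len(clusters))  # 2
print(clusters[0])    # [((1, 2, 3), 'man'), ((4, 5, 6), 'he')]
```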
Additionally, the output directory contains a file `_summary.csv` that lists the evaluation metrics precision, recall and f1 for each file.
Author: Katja Konermann (katja.konermann@uni-potsdam.de)
Course: Programmierung II
Summer semester 2021