Skip to content

Ein Projekt für den Kurs Programmierung II im Sommersemester 2021

Notifications You must be signed in to change notification settings

katjakon/Coreference-Resolution

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

70 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Coreference-Resolution

Description

This is an implementation of the multi-pass sieve for coreference resolution described by Raghunathan et al. (2010). It applies multiple sieves to clusters of referential expressions in order to link coreferential NP.
This program takes files from the Ontonotes corpus in the conll format as input. Find out more here.

Requirements

Written with Python 3.8.5.
See requirements.txt.
Download nltk's stopword corpus by running:

>>> import nltk
>>> nltk.download('stopwords')

How to use

Move your directory containing files from the Ontonotes corpus in the conll-format into this directory.
Run:
main.py [-h] [--config [CONFIG]] [--ext [EXT]] [--lang [LANG]] in_dir out_dir

  • in_dir: The directory containing the conll files from the Ontonotes corpus. Subdirectories will also be searched.
  • out_dir: A name for the directory where output files should be stored.
  • --config: The name of the config file where sieves are specified. This defaults to config.txt.
  • --ext: The extension files should have. This defaults to conll. If you only want to extract gold annotated file, set this to gold_conll.
  • --lang: A subdirectory in in_dir from which files should be extracted. Set this to english to only extract english files from nested Ontonotes corpus. Per default all subdirectories will be searched.

Examples:
main.py corpus output
main.py --ext gold_conll corpus output
main.py --ext gold_conll --lang english nested_corpus output

Adjust Sieves

The sieves and their order are specified in config.txt. The values specify the order in which the sieves are applied. To exclude a sieve, set its value to -1.
Example:

[Sieves]
Exact_Match_Sieve = 1
Precise_Constructs_Sieve = 2
Strict_Head_Match_Sieve = 3
Strict_Head_Relax_Modifiers = 4
Strict_Head_Relax_Inclusion = -1

Interpret Output

For each document from which coreference information was extracted, there is one file in the output folder. The first line contains the path to the original file. After that follow clusters of coreferential mentions. Clusters are seperated by -;-.
Each line in a cluster represents a mention. The first column is a 3-tuple where the first element is the index of the sentence in which the mention appears. The second element is the start index, the third the end index of the mention in the respective sentence.
For example, assume the following sentence is at index 3:
The0 dog1 is2 happy3 about4 his5 new6 toy7
Then the mention "the dog" would have the 3-tuple (3,0,2), the mention "his new toy" would be (3,5,8) and the mention "his" would be (3,5,6).
The second column is the string of the mention.
Example:

path/to/example/file.conll
(1,2,3);man
(4,5,6);he
-;-
(7,8,9);dog
(10,11,12);it

Additionally, the output directory contains a file `_summary.csv` that lists the evaluation metrics precision, recall and f1 for each file.

About

Author: Katja Konermann (katja.konermann@uni-potsdam.de)
Course: Programmierung II
Summer semester 2021

About

Ein Projekt für den Kurs Programmierung II im Sommersemester 2021

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages