Word-Level Coreference Resolution

This is a repository with the code to reproduce the experiments described in the paper of the same name, which was accepted to EMNLP 2021. The paper is available at https://aclanthology.org/2021.emnlp-main.605. It uses a slightly modified version of https://github.com/polm/spacy-coref-scorer for scoring.

The repo can be used in almost the same way as the instructions below describe. One additional step is needed to convert the jsonlines files to a DocBin:

    python convert_to_spacy.py

After data preparation, the model can be trained with:

    python run.py train roberta

Here the roberta argument doesn't change anything yet; the config system simply hasn't been modified. Evaluation is done with:

    python run.py eval roberta --data-split test --coref-weights data/coref.pt --span-weights data/span.pt

The weights are saved in the data folder. To evaluate with the latest weights (from the last epoch), you have to select them manually, because training may be stopped early based on the LEA score.
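
For example, to list the checkpoints currently saved in the data folder, newest first (the exact filenames depend on your training run):

    ls -t data/*.pt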

Table of contents

  1. Preparation
  2. Training
  3. Evaluation
  4. Prediction
  5. Citation

Preparation

The following instructions have been tested with Python 3.7 on an Ubuntu 20.04 machine.

You will need:

  • OntoNotes 5.0 corpus (LDC catalog no. LDC2013T19; registration needed)
  • Python 2.7 to run conll-2012 scripts
  • Java runtime to run Stanford Parser
  • Python 3.7+ to run the model
  • Perl to run conll-2012 evaluation scripts
  • CUDA-enabled machine (48 GB of GPU memory to train, 4 GB to evaluate)
  1. Extract the OntoNotes 5.0 archive. If it is in the repo's root directory:

     tar -xzvf ontonotes-release-5.0_LDC2013T19.tgz
    
  2. Switch to a Python 2.7 environment (where python runs the 2.7 version). This is necessary for the conll scripts to run correctly. To do it with conda:

     conda create -y --name py27 python=2.7 && conda activate py27
    
  3. Run the conll data preparation scripts (~30min):

     sh get_conll_data.sh ontonotes-release-5.0 data
    
  4. Download conll scorers and Stanford Parser:

     sh get_third_party.sh
    
  5. Prepare your environment. To do it with conda:

     conda create -y --name wl-coref python=3.7 openjdk perl
     conda activate wl-coref
     python -m pip install -r requirements.txt
    
  6. Build the corpus in jsonlines format (~20 min):

     python convert_to_jsonlines.py data/conll-2012/ --out-dir data
     python convert_to_heads.py
    

You're all set!

Training

If you have completed all the steps in the previous section, then just run:

python run.py train roberta

Use the -h flag to see more parameters and the CUDA_VISIBLE_DEVICES environment variable to limit the CUDA devices visible to the script. Refer to config.toml to modify existing model configurations or create your own.
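
For example, to make only the first GPU visible during training:

    CUDA_VISIBLE_DEVICES=0 python run.py train roberta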

Evaluation

Make sure that you have successfully completed all steps of the Preparation section.

  1. Download and save the pretrained model to the data directory.

     https://www.dropbox.com/s/vf7zadyksgj40zu/roberta_%28e20_2021.05.02_01.16%29_release.pt?dl=0
    
  2. Generate the conll-formatted output:

     python run.py eval roberta --data-split test
    
  3. Run the conll-2012 scripts to obtain the metrics (the last argument is the epoch of the weights being evaluated; here 20 matches the e20 in the pretrained model's filename):

     python calculate_conll.py roberta test 20
    

Prediction

To predict coreference relations on an arbitrary text, you will need to prepare the data in the jsonlines format (one json-formatted document per line). The following fields are required:

    {
            "document_id": "tc_mydoc_001",
            "cased_words": ["Hi", "!", "Bye", "."],
            "sent_id": [0, 0, 1, 1]
    }

You can optionally provide the speaker data:

    {
            "speaker": ["Tom", "Tom", "#2", "#2"]
    }

document_id can be any string that starts with a two-letter genre identifier. The genres recognized are the following:

  • bc: broadcast conversation
  • bn: broadcast news
  • mz: magazine genre (Sinorama magazine)
  • nw: newswire genre
  • pt: pivot text (The Bible)
  • tc: telephone conversation (CallHome corpus)
  • wb: web data

You can check a sample input file for reference.
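
If you are building the input programmatically, a minimal sketch might look like this (assuming one pre-tokenized sentence per string and naive whitespace tokenization; the document id and speaker label are made up):

    import json

    # Hypothetical raw input: one pre-tokenized sentence per string.
    sentences = ["Hi !", "Bye ."]

    doc = {
        "document_id": "tc_mydoc_002",  # must start with a two-letter genre id
        "cased_words": [],
        "sent_id": [],
        "speaker": [],  # optional
    }
    for sent_id, sentence in enumerate(sentences):
        for word in sentence.split():  # naive whitespace tokenization
            doc["cased_words"].append(word)
            doc["sent_id"].append(sent_id)
            doc["speaker"].append("#1")

    # One json-formatted document per line.
    with open("input.jsonlines", "w", encoding="utf-8") as f:
        f.write(json.dumps(doc) + "\n")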

Then run:

    python predict.py roberta input.jsonlines output.jsonlines

This will utilize the latest weights available in the data directory for the chosen configuration. To load other weights, use the --weights argument.
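
For example, to run prediction with the pretrained checkpoint from the Evaluation section (adjust the path to whatever you actually saved):

    python predict.py roberta input.jsonlines output.jsonlines --weights "data/roberta_(e20_2021.05.02_01.16)_release.pt"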

Citation

@inproceedings{dobrovolskii-2021-word,
    title = "Word-Level Coreference Resolution",
    author = "Dobrovolskii, Vladimir",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.605",
    pages = "7670--7675"
}
