CASENT

Calibrated Seq2Seq Models for Efficient and Generalizable Ultra-fine Entity Typing
Yanlin Feng, Adithya Pratapa, David R Mortensen
EMNLP Findings 2023

CASENT is a lightweight multi-label entity classification model designed for extremely large label spaces (e.g., UFET and WikiData). It can also be used for entity extraction and tagging when combined with a span detector.

CASENT offers several advantages over previous methods: 1) standard maximum likelihood training; 2) efficient inference through a single autoregressive decoding pass; 3) calibrated confidence scores; 4) strong generalization to unseen domains and types.

Installation

1. Install dependencies

conda create -n casent python=3.10
conda activate casent
git clone https://github.com/yanlinf/CASENT.git
cd CASENT
pip install -r requirements.txt

Note: If the commands above fail, install PyTorch manually first and then rerun them.

2. Install CASENT

pip install -e .
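
To confirm that the two installation steps succeeded, a quick import check can help. The sketch below is only a sanity check under assumptions: the top-level module names `torch` and `casent` are inferred from the note above and from the import statements used later in this README, and `missing_packages` is a hypothetical helper, not part of CASENT.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module names; an empty list means both can be imported.
print(missing_packages(["torch", "casent"]))
```

If `torch` shows up in the printed list, install PyTorch manually as described in the note and repeat the steps.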

Quick start

Pretrained models are available on HuggingFace for running inference.

⚠️ Note that underscores are removed from all UFET type names by default.

Usage 1: Predict UFET types

from casent.entity_typing_t5 import T5ForEntityTypingPredictor

predictor = T5ForEntityTypingPredictor.from_pretrained('yanlinf/casent-large')
predictor.predict_raw(
    ['A court in Jerusalem sentenced <M> a Palestinian </M> to 16 life terms for forcing a bus off a cliff July 6 , killing 16 people']
)

[EntityTypingOutput(types=['person', 'criminal', 'male'], scores=[0.975004041450473, 0.6304963191225533, 0.5362320213818272])]
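
The input above marks the target mention with `<M>` and `</M>`, and the output pairs each predicted type with a score. The following sketch shows one way to build such an input from character offsets and to keep only the confident types. It makes several assumptions: `mark_mention`, `keep_confident`, and the `TypedOutput` stand-in are hypothetical helpers written for illustration (they are not part of the CASENT API), the space-separated tag layout is copied from the example input, and the 0.6 threshold is arbitrary.

```python
from dataclasses import dataclass
from typing import List


def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in <M> ... </M>, mirroring the example input above."""
    if not 0 <= start < end <= len(text):
        raise ValueError("mention span must be non-empty and lie inside the text")
    before, mention, after = text[:start], text[start:end], text[end:]
    # Normalise whitespace so the tags are separated from the text by single spaces.
    return f"{before.rstrip()} <M> {mention.strip()} </M> {after.lstrip()}".strip()


@dataclass
class TypedOutput:
    """Hypothetical stand-in with the two fields shown in the printed output."""
    types: List[str]
    scores: List[float]


def keep_confident(output: TypedOutput, threshold: float = 0.6) -> List[str]:
    """Return the types whose score is at least `threshold`."""
    return [t for t, s in zip(output.types, output.scores) if s >= threshold]


sentence = "A court in Jerusalem sentenced a Palestinian to 16 life terms"
start = sentence.index("a Palestinian")
marked = mark_mention(sentence, start, start + len("a Palestinian"))
print(marked)

output = TypedOutput(types=["person", "criminal", "male"],
                     scores=[0.975, 0.630, 0.536])
print(keep_confident(output, threshold=0.6))
```

A string built this way can be passed in the list given to `predict_raw`. Because the scores are calibrated, a fixed threshold is a reasonable first choice, but the right value depends on the application.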

Usage 2: Predict WikiData types

We also offer a model that predicts WikiData types along with their WikiData Qnode IDs. Note that the UFET type vocabulary is mapped to WikiData using automatic methods, so the mapping may not be entirely correct. The mapping file is available here.

from casent.entity_typing_t5 import T5ForWikidataEntityTypingPredictor

predictor = T5ForWikidataEntityTypingPredictor.from_pretrained(
    'yanlinf/casent-large',
    ontology_path='ontology_data/ontology_expanded.json'
)
predictor.predict_raw(
    ['A court in Jerusalem sentenced <M> a Palestinian </M> to 16 life terms for forcing a bus off a cliff July 6 , killing 16 people']
)

[WikidataEntityTypingOutput(wd_types=[Concept(Q215627, person), Concept(Q2159907, criminal), Concept(Q6581097, male)], scores=[0.975004041450473, 0.6304963191225533, 0.5362320213818272])]
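
Each predicted concept carries a Qnode ID, which can be turned into a link to the corresponding WikiData page. The sketch below assumes the predictions have already been unpacked into plain `(qid, label)` tuples, because the attribute names of `Concept` are not shown here; `wikidata_url` is a hypothetical helper.

```python
from typing import List, Tuple

WIKIDATA_PAGE = "https://www.wikidata.org/wiki/{qid}"


def wikidata_url(qid: str) -> str:
    """Build the WikiData page URL for a Qnode ID such as 'Q215627'."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Qnode ID: {qid!r}")
    return WIKIDATA_PAGE.format(qid=qid)


# (qid, label) pairs and scores copied from the example output above.
predicted: List[Tuple[str, str]] = [
    ("Q215627", "person"),
    ("Q2159907", "criminal"),
    ("Q6581097", "male"),
]
scores = [0.975, 0.630, 0.536]

rows = [(label, wikidata_url(qid), score)
        for (qid, label), score in zip(predicted, scores)]
for label, url, score in rows:
    print(f"{label}\t{url}\t{score}")
```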

Usage 3: Entity extraction / tagging

CASENT can also extract entities of a specific type from text when used together with a span detector. We provide a simple API that uses a constituency parser and treats every noun phrase as a potential entity mention, which also makes it possible to extract non-named entities.

from casent.entity_typing_t5 import T5ForEntityTypingPredictor, extract_entities_by_type
import stanza

extract_entities_by_type(
    T5ForEntityTypingPredictor.from_pretrained('yanlinf/casent-large'),
    stanza.Pipeline(lang="en", processors="tokenize,pos,constituency", use_gpu=False),
    text='The Tenerife airport disaster occurred on March 27, 1977, when two Boeing 747 passenger jets collided on the runway at Los Rodeos Airport (now Tenerife North Airport) on the Spanish island of Tenerife. The collision occurred when KLM Flight 4805 initiated its takeoff run during dense fog while Pan Am Flight 1736 was still on the runway.', 
    target_ufet_type='aircraft'
)

[Mention("two Boeing 747 passenger jets", 0.82), Mention("KLM Flight 4805", 0.77), Mention("Pan Am Flight 1736", 0.72)]
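
To illustrate the idea of treating every noun phrase as a candidate mention, here is a minimal sketch that collects all `NP` constituents from a parse tree. It does not reproduce the internals of `extract_entities_by_type`: the nested-tuple tree format, the hand-written parse, and the `leaves` and `noun_phrases` helpers are assumptions made so the example runs without a parser.

```python
from typing import List, Tuple, Union

# A tree is either a leaf token (str) or a tuple (label, child, child, ...).
Tree = Union[str, Tuple]


def leaves(tree: Tree) -> List[str]:
    """Return the tokens under `tree`, left to right."""
    if isinstance(tree, str):
        return [tree]
    tokens: List[str] = []
    for child in tree[1:]:
        tokens.extend(leaves(child))
    return tokens


def noun_phrases(tree: Tree) -> List[str]:
    """Collect the text of every NP constituent, including nested ones."""
    if isinstance(tree, str):
        return []
    label, children = tree[0], tree[1:]
    found = [" ".join(leaves(tree))] if label == "NP" else []
    for child in children:
        found.extend(noun_phrases(child))
    return found


# Hand-written parse of "KLM Flight 4805 initiated its takeoff run".
parse = ("S",
         ("NP", ("NNP", "KLM"), ("NNP", "Flight"), ("CD", "4805")),
         ("VP", ("VBD", "initiated"),
                ("NP", ("PRP$", "its"), ("NN", "takeoff"), ("NN", "run"))))

print(noun_phrases(parse))  # ['KLM Flight 4805', 'its takeoff run']
```

In a pipeline like the one above, each candidate span would then be typed by the model, and only spans predicted to have the target type would be kept.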

Training CASENT

1. Download data

bash scripts/download_ufet.sh

2. Training

python train_t5.py -m t5-large --save_dir checkpoints/exp0/

3. Inference

python predict_t5.py --model checkpoints/exp0/

Predictions will be saved to dev_pred.json and test_pred.json under the model checkpoint directory.

4. Evaluation

python evaluate_ufet_prediction.py --input_path checkpoints/exp0/test_pred.json
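
This README does not state which metrics the evaluation script reports. UFET results are commonly given as macro-averaged precision, recall, and F1, so the sketch below shows one common formulation of those metrics. It is an illustration only, not the script's implementation: `macro_prf` is a hypothetical function, and the gold and predicted type sets are made up rather than read from `test_pred.json`.

```python
from typing import List, Set, Tuple


def macro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 over examples.

    Precision is averaged over examples with at least one predicted type,
    recall over examples with at least one gold type.
    """
    p_sum = r_sum = 0.0
    p_count = r_count = 0
    for g, p in zip(gold, pred):
        overlap = len(g & p)
        if p:
            p_sum += overlap / len(p)
            p_count += 1
        if g:
            r_sum += overlap / len(g)
            r_count += 1
    precision = p_sum / p_count if p_count else 0.0
    recall = r_sum / r_count if r_count else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up gold and predicted type sets for two mentions.
gold = [{"person", "criminal", "male"}, {"aircraft"}]
pred = [{"person", "male"}, {"aircraft", "vehicle"}]

precision, recall, f1 = macro_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```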
