Modeling Fine-Grained Entity Types with Box Embeddings
Yasumasa Onoe, Michael Boratko, Andrew McCallum, Greg Durrett
ACL 2021
@inproceedings{onoe2021boxet,
title={Modeling Fine-Grained Entity Types with Box Embeddings},
author={Yasumasa Onoe and Michael Boratko and Andrew McCallum and Greg Durrett},
booktitle={ACL},
year={2021}
}
$ git clone https://github.com/yasumasaonoe/Box4Types.git
This code has been tested with Python 3.7 and the following dependencies:
torch==1.7.1 (Please install the right version of PyTorch depending on your CUDA version.)
transformers==4.9.2
wandb==0.12.1
If you're using a conda environment, please use the following commands:
$ conda create -n box4et python=3.7
$ conda activate box4et
$ pip install [package name]
- box4et/main.py: Main script for training and evaluating models, and writing predictions to an output file.
- box4et/models.py: Defines a Transformer-based entity typing model.
- box4et/data_utils.py: Contains data loader and utility functions.
- box4et/constant.py: Defines paths etc.
- box4et/scorer.py: Compute precision, recall, and F1 given an output file.
- box4et/train_*.sh: Sample training command.
- box4et/eval_*.sh: Sample evaluation command.
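For readers unfamiliar with how precision, recall, and F1 apply to multi-label type sets, here is a minimal sketch of an example-averaged (macro) computation. It follows one common convention in entity typing and is not necessarily what box4et/scorer.py does; the function name `macro_prf` and its set-based inputs are assumptions made for illustration, not the script's actual interface or output file format.

```python
from typing import Sequence, Set, Tuple


def macro_prf(
    gold: Sequence[Set[str]], pred: Sequence[Set[str]]
) -> Tuple[float, float, float]:
    """Example-averaged precision, recall, and F1 over sets of type labels.

    Hypothetical helper: gold[i] and pred[i] are the gold and predicted
    type sets for example i.
    """
    p_sum, p_count = 0.0, 0
    r_sum, r_count = 0.0, 0
    for g, p in zip(gold, pred):
        overlap = len(g & p)
        if p:  # precision is only defined when something was predicted
            p_sum += overlap / len(p)
            p_count += 1
        if g:  # recall is only defined when there are gold types
            r_sum += overlap / len(g)
            r_count += 1
    precision = p_sum / p_count if p_count else 0.0
    recall = r_sum / r_count if r_count else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Under this convention, examples with no predicted types are left out of the precision average rather than counted as zero.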
This code assumes 3 directories listed below. Paths to these directories are specified in box4et/constant.py.
- ./data: This directory contains train/dev data files.
- ./data/ontology: This directory contains type vocab files.
- ./model: Trained models will be saved in this directory. When you run main.py with the test mode, the trained model is loaded from here.
- Download model checkpoints (box and vector models for 4 datasets) from here (NOTE: total size is around 30GB).
- UFET: We do not include the augmented UFET training set since it is derived from English Gigaword, which belongs to the LDC. If you have an LDC membership and want to use the augmented data, please contact yasumasa@utexas.edu.
Run this to download these folders.
$ bash download_data.sh
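After the download, a quick sanity check of the directory layout can catch path problems before training starts. The sketch below assumes the three directories sit under the current working directory; the real locations are whatever box4et/constant.py specifies, and `EXPECTED_DIRS` and `missing_dirs` are hypothetical names that are not part of the repository.

```python
from pathlib import Path
from typing import List

# Assumed locations, relative to a root directory. The real paths are
# whatever box4et/constant.py specifies.
EXPECTED_DIRS = ("data", "data/ontology", "model")


def missing_dirs(root: str = ".") -> List[str]:
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

Calling `missing_dirs()` from the repository root returns an empty list when all three directories are present.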
The data files are formatted as jsonlines. Here is an example from UFET:
{
"ex_id": "dev_190",
"right_context": ["."],
"left_context": ["For", "this", "handpicked", "group", "of", "jewelry", "savvy", "Etsy", "artisans", ",", "their", "passion", "is", "The", "Hunger", "Games", ",", "the", "first", "of", "3", "best", "selling", "young", "adult", "books", "by"],
"right_context_text": ".",
"left_context_text": "For this handpicked group of jewelry savvy Etsy artisans , their passion is The Hunger Games , the first of 3 best selling young adult books by",
"y_category": ["name", "person", "writer", "author"],
"word": "Suzanne Collins",
"mention_as_list": ["Suzanne", "Collins"]
}
Field | Description |
---|---|
ex_id | Unique example ID. |
right_context | Tokenized right context of a mention. |
left_context | Tokenized left context of a mention. |
word | A mention. |
right_context_text | Right context of a mention. |
left_context_text | Left context of a mention. |
y_category | The gold entity types derived from Wikipedia categories. |
y_title | Wikipedia title of the gold Wiki entity. |
mention_as_list | A tokenized mention. |
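To inspect the data outside the training pipeline, a small reader along these lines may help. This is a sketch rather than the loader in box4et/data_utils.py: it assumes each example occupies a single line in the actual files (the example above is pretty-printed for readability), and `read_jsonl` and `mark_mention` are hypothetical helper names.

```python
import json
from typing import Dict, Iterator, List


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one example (a dict) per non-empty line of a jsonlines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def mark_mention(example: Dict, open_tag: str = "[", close_tag: str = "]") -> str:
    """Rebuild the sentence from the tokenized fields, bracketing the mention."""
    tokens: List[str] = (
        example["left_context"]
        + [open_tag] + example["mention_as_list"] + [close_tag]
        + example["right_context"]
    )
    return " ".join(tokens)
```

For the UFET example above, `mark_mention` would produce a string ending in `books by [ Suzanne Collins ] .`, which makes it easy to see the mention alongside its `y_category` labels.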
main.py is the primary script for training and evaluating models. See box4et/train_*.sh.
$ cd box4et
$ bash train_box.sh
If you would like to evaluate the trained model on another dataset, simply set --mode to test and point to the test data using --eval_data. Make sure to pass -load so that the trained model will be loaded. See box4et/eval_*.sh.
$ cd box4et
$ bash eval_box.sh