
Pyramid

Introduction

Pyramid is a novel layered model for Nested Named Entity Recognition (nested NER). This code is based on the paper Pyramid: A Layered Model for Nested Named Entity Recognition by Jue Wang et al.

./images/pyramid-example.jpg

Note that this code is based on my own understanding of the paper. Nevertheless, the authors released the official code of the paper at https://github.com/LorrinWWW/Pyramid/.

This repository also contains a step-by-step walkthrough in the notebooks located in the folder notebooks.

Set up

Clone this repository, create default folders and install dependencies:

git clone https://github.com/DimasDMM/pyramid.git
cd pyramid
mkdir data
mkdir artifacts
pip install -r requirements.txt

Download GloVe embeddings:

cd data
wget http://nlp.stanford.edu/data/glove.6B.zip --no-check-certificate
unzip glove.6B.zip
cd ..
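
For reference, glove.6B.200d.txt (the file used in the training command below) stores one word per line followed by its space-separated float components. A minimal loading sketch, assuming that format; load_glove is a hypothetical helper, not part of this repository:

import numpy as np

def load_glove(path, dim=200):
    # Load GloVe vectors from a plain-text file into a {word: vector} dict.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:  # skip malformed lines, if any
                continue
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

# Example: embeddings = load_glove("./data/glove.6B.200d.txt", dim=200)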

You also need to download the tokenizer and pretrained LM* beforehand:

python run_download_lm.py --lm_name dmis-lab/biobert-v1.1

*Feel free to use any pretrained model from HuggingFace: https://huggingface.co/models
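
The download step is roughly equivalent to fetching the tokenizer and model with the transformers library and saving them locally. A minimal sketch, assuming the standard transformers API; the exact target path inside ./artifacts is an assumption, not necessarily what run_download_lm.py does:

from transformers import AutoModel, AutoTokenizer

lm_name = "dmis-lab/biobert-v1.1"

# Download the tokenizer and model weights from the Hugging Face Hub...
tokenizer = AutoTokenizer.from_pretrained(lm_name)
model = AutoModel.from_pretrained(lm_name)

# ...and save them locally so training can run without network access
# (the target folder is an assumption; adjust to wherever the scripts expect it).
tokenizer.save_pretrained(f"./artifacts/{lm_name}")
model.save_pretrained(f"./artifacts/{lm_name}")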

Dataset

I have tested this repository with the GENIA dataset. You can download and prepare it with these commands:

cd data
wget http://www.nactem.ac.uk/GENIA/current/GENIA-corpus/Term/GENIAcorpus3.02.tgz --no-check-certificate
mkdir GENIA
tar -xvf GENIAcorpus3.02.tgz -C GENIA
cd ..
python run_preprocess.py \
    --dataset genia \
    --raw_filepath "./data/GENIA/GENIA_term_3.02/GENIAcorpus3.02.xml" \
    --lm_name dmis-lab/biobert-v1.1 \
    --cased 0

If you want to use a different dataset, it must be a JSON file with the following structure:

{
  "tokens": ["token0", "token1", "token2"],
  "entities": [
    {
      "entity_type": "PER",
      "span": [0, 1]
    },
    {
      "entity_type": "ORG",
      "span": [2, 3]
    }
  ]
}
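
A minimal sanity check for a custom dataset file. Both the end-exclusive span convention and the top-level list layout are assumptions (compare against the preprocessed GENIA files); check_sentence and train.mydataset.json are hypothetical names:

import json

def check_sentence(sentence):
    # Basic structural checks for one sentence in the custom dataset format.
    tokens = sentence["tokens"]
    for entity in sentence["entities"]:
        start, end = entity["span"]
        # Assumption: spans are end-exclusive token indices, i.e. 0 <= start < end <= len(tokens).
        assert 0 <= start < end <= len(tokens), f"Bad span: {entity['span']}"
        assert isinstance(entity["entity_type"], str)

# Assumption: the dataset file stores a list of such sentence objects.
with open("./data/train.mydataset.json", encoding="utf-8") as f:
    for sentence in json.load(f):
        check_sentence(sentence)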

Commands

Fine-tune model:

python run_training.py \
    --model_ckpt ./artifacts/genia/ \
    --wv_file ./data/glove.6B.200d.txt \
    --use_label_embeddings 0 \
    --use_char_encoder 1 \
    --dataset genia \
    --max_epoches 500 \
    --max_steps 1e9 \
    --total_layers 16 \
    --batch_size 64 \
    --token_emb_dim 200 \
    --char_emb_dim 100 \
    --cased_lm 0 \
    --cased_word 0 \
    --cased_char 0 \
    --hidden_dim 100 \
    --dropout 0.45 \
    --lm_name dmis-lab/biobert-large-cased-v1.1 \
    --lm_emb_dim 1024 \
    --device cuda \
    --continue_training 0 \
    --log_to_file logger_genia.txt

Once the model is fine-tuned, run the evaluation script:

python run_evaluator.py \
    --model_ckpt ./artifacts/genia/ \
    --dataset genia \
    --device cuda

Parameters

The available parameters are the following:

  • device: Device to use: cpu or cuda.
  • model_ckpt: Path to store the model.
  • wv_file: (Optional, default=None) Path to the file with word embeddings. If not provided, the Word Encoder described in the paper is not used.
  • use_label_embeddings: (Optional, default=0) Uses a label embedding layer on top of the model.
  • use_char_encoder: (Optional, default=1) Uses the Char Encoder described in the paper.
  • dataset: Name of the dataset to use. The dataset files must be located in the folder ./data with the names train.<dataset>.json, valid.<dataset>.json and test.<dataset>.json for the train, validation and test datasets respectively.
  • max_epoches: (Optional, default=500) Maximum number of epochs for training.
  • max_steps: (Optional, default=1e9) Maximum number of steps for training.
  • total_layers: (Optional, default=16) Number of layers in the pyramid.
  • batch_size: (Optional, default=64) Batch size for training.
  • token_emb_dim: (Optional, default=100) Dimension of token embeddings.
  • char_emb_dim: (Optional, default=100) Dimension of char embeddings.
  • cased_lm: (Optional, default=1) Use cased LM Encoder.
  • cased_word: (Optional, default=1) Use cased Word Encoder.
  • cased_char: (Optional, default=1) Use cased Char Encoder.
  • hidden_dim: (Optional, default=100) Hidden dimension of the LSTM layers in the pyramid. Since the LSTM layers are bidirectional, the actual hidden dimension is twice this value (see the sketch after this list).
  • dropout: (Optional, default=0.45) Dropout rate.
  • lm_name: (Optional, default=dmis-lab/biobert-large-cased-v1.1) Pretrained language model from Hugging Face. The model must already be downloaded in the folder ./artifacts (use the script run_download_lm.py to download and store it).
  • lm_emb_dim: (Optional, default=1024) Hidden dimension of the language model.
  • continue_training: (Optional, default=0) To avoid overwriting a trained model, set this flag to 1 if you want to continue training from an existing checkpoint. If the model already exists and the flag is 0, the script throws an error.
  • log_to_file: (Optional, default=None) File to store the standard output.
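
To illustrate the note on hidden_dim: in a bidirectional LSTM the forward and backward outputs are concatenated, so the output feature size is twice hidden_dim. A standalone PyTorch sketch, not taken from this repository:

import torch
import torch.nn as nn

hidden_dim = 100
lstm = nn.LSTM(input_size=200, hidden_size=hidden_dim,
               batch_first=True, bidirectional=True)

x = torch.randn(4, 30, 200)  # (batch, sequence length, input features)
output, _ = lstm(x)
print(output.shape)  # torch.Size([4, 30, 200]) -> 2 * hidden_dim output features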

Additional comments

This repository includes sbatch files to run the scripts with Slurm. See: https://slurm.schedmd.com/.

To do

  • Add option to load pretrained label embeddings for training.

Have fun! ᕙ (° ~ ° ~)

./images/pyramid-pie.jpg
