Multi-Cell Compositional LSTM for NER Domain Adaptation: code for our ACL 2020 paper.
Cross-domain NER is a challenging yet practical problem. Entity mentions can differ greatly across domains, but the correlations between entity types are relatively more stable. We investigate a multi-cell compositional LSTM structure for multi-task learning, modeling each entity type with a separate cell state. With the help of entity-typed units, cross-domain knowledge transfer can be made at the entity-type level. Theoretically, the resulting distinct feature distributions for each entity type make the model more powerful for cross-domain transfer. Empirically, experiments on four few-shot and zero-shot datasets show that our method significantly outperforms a series of multi-task learning methods and achieves the best results.
For more details, please refer to our paper:
Multi-Cell Compositional LSTM for NER Domain Adaptation
The entity-typed cells (ET cells) correspond to the source- and target-domain entity types (including O, which is the outside tag in NER).
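To make the idea concrete, below is a minimal, simplified sketch of a single multi-cell step in PyTorch. It is not the exact formulation from the paper or this repository: the gate layout, the per-type candidate cells, and the attention-based composition are illustrative assumptions used only to show "one cell state per entity type, composed back into a single hidden state".

```python
# Hedged sketch only: one cell state per entity type (ET cell), composed by
# attention into a single hidden state. Gate/composition details are
# simplifying assumptions, not the paper's exact equations.
import torch
import torch.nn as nn

class MultiCellLSTMStep(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_entity_types):
        super().__init__()
        self.num_types = num_entity_types
        # LSTM-style input/forget/output gates, shared across ET cells.
        self.gates = nn.Linear(input_dim + hidden_dim, 3 * hidden_dim)
        # One candidate-cell transformation per entity type.
        self.cell_proj = nn.ModuleList(
            nn.Linear(input_dim + hidden_dim, hidden_dim)
            for _ in range(num_entity_types)
        )
        # Attention scorer used to compose the ET cells into one state.
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, x_t, h_prev, c_prev):
        # x_t: (batch, input_dim); h_prev: (batch, hidden_dim)
        # c_prev: (batch, num_types, hidden_dim), one cell state per entity type
        z = torch.cat([x_t, h_prev], dim=-1)
        i, f, o = torch.sigmoid(self.gates(z)).chunk(3, dim=-1)
        # Update every entity-typed cell state in parallel.
        c_new = torch.stack(
            [f * c_prev[:, k] + i * torch.tanh(self.cell_proj[k](z))
             for k in range(self.num_types)],
            dim=1)                                           # (batch, num_types, hidden)
        # Compose the typed cells into a single cell state via attention;
        # alpha can also be supervised with entity-type labels.
        alpha = torch.softmax(self.attn(torch.tanh(c_new)).squeeze(-1), dim=-1)
        c_comp = (alpha.unsqueeze(-1) * c_new).sum(dim=1)
        h_t = o * torch.tanh(c_comp)
        return h_t, c_new, alpha
```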
Python 3
PyTorch 1.0+
allennlp 0.8.2 (Optional)
pytorch-pretrained-bert 0.6.1 (Optional)
GloVe 100-dimensional word vectors (download from Here with key ifyk)
PubMed 200-dimensional word vectors (refer to Here; download from Here with key dr9k)
ELMo weights (download from Here with key a9h6)
BERT-base weights (download from Here with key gbn1)
BioBERT-base weights (download from Here with key zsep)
CoNLL-2003 English NER data (in SDA/data/conll03_En)
Broad Twitter corpus (in SDA/data/broad_twitter_corpus, or download from Here with key 0yko)
BioNLP'13PC and BioNLP'13CG datasets
Twitter corpus (refer to Here; download from Here with key bn75)
CoNLL-2003 English NER data (in SDA/data/conll03_En).
CBS SciTech News (test set) (in UDA/data/tech/tech.test).
SciTech news domain raw data (download with key 4834); put it in UDA/data/tech.
The named entity dictionary was collected by Peng et al. and is located in UDA/data/tech/conll2003_dict.
Both SDA and UDA can be trained with the following command:
python main.py --config train.config
The file train.config contains the dataset paths and model hyperparameters.
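The configuration keys themselves are defined by the train.config shipped with the repo. Purely as an illustration, a generic "key=value" config can be read with a few lines of Python; the key names used below (train_dir, word_emb_dir) are hypothetical placeholders, not the repository's actual option names.

```python
# Illustrative only: a generic parser for a simple "key=value" config file.
# The real options and format are those in the repo's train.config.
def load_config(path):
    config = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments and blank lines
            if not line or "=" not in line:
                continue
            key, value = (part.strip() for part in line.split("=", 1))
            config[key] = value
    return config

if __name__ == "__main__":
    cfg = load_config("train.config")
    print(cfg.get("train_dir"), cfg.get("word_emb_dir"))  # hypothetical keys
```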
Both SDA and UDA can be decoded with the following command:
python main.py --config decode.config
The file decode.config contains the dataset paths and the paths of the trained models.
For example, you can download our trained models with key matp, unzip the two files .dset and .model, and put them into SDA/saved_models. Then you can use the above command to reproduce our reported results on the Broad Twitter corpus. UDA models (download with key 2s6n) are decoded in the same way.
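For orientation only, the sketch below shows what the two checkpoint files are usually expected to hold; it assumes the .dset file is a pickled data-settings object and the .model file a PyTorch checkpoint, which may not match this repo exactly. The actual loading is handled by `python main.py --config decode.config`, and the file names used here are hypothetical.

```python
# Hedged sketch: inspect the two saved files. The pickle/torch.load assumption
# and the file names below are illustrative, not the repository's loading code.
import pickle
import torch

with open("SDA/saved_models/example.dset", "rb") as f:   # hypothetical filename
    data_settings = pickle.load(f)                        # e.g. alphabets, label maps

checkpoint = torch.load("SDA/saved_models/example.model",
                        map_location="cpu")               # learned parameters
print(type(data_settings), type(checkpoint))
```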
If you use our code, please cite our paper as follows:
@inproceedings{jia-zhang-2020-multi,
title = "Multi-Cell Compositional {LSTM} for {NER} Domain Adaptation",
author = "Jia, Chen and Zhang, Yue",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.524",
pages = "5906--5917"
}