Source code for paper "CoLAKE: Contextualized Language and Knowledge Embedding". If you have any problem about reproducing the experiments, please feel free to contact us or propose an issue.
We recommend to create a new environment.
conda create --name colake python=3.7
source activate colake
CoLAKE is implemented based on fastNLP and huggingface's transformers, and uses fitlog to record the experiments.
git clone https://github.com/fastnlp/fastNLP.git
cd fastNLP/ & python setup.py install
git clone https://github.com/fastnlp/fitlog.git
cd fitlog/ & python setup.py install
pip install transformers==2.11
pip install sklearn
To re-train CoLAKE, you may need mixed CPU-GPU training to handle the large number of entities. Our implementation is based on KVStore provided by DGL. In addition, to reproduce the experiments on link prediction, you may also need DGL-KE.
pip install dgl==0.4.3
pip install dglke
Download the pre-trained CoLAKE model and embeddings for more than 3M entities. To reproduce the experiments on LAMA and LAMA-UHN, you only need to download the model. You can use the download_gdrive.py
in this repo to directly download files from Google Drive to your server:
mkdir model
python download_gdrive.py 1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b ./model/model.bin
python download_gdrive.py 1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI ./model/entities.npy
Alternatively, you can use gdown
:
pip install gdown
gdown https://drive.google.com/uc?id=1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b
gdown https://drive.google.com/uc?id=1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI
Download the datasets for the experiments in the paper: Google Drive.
python download_gdrive.py 1UNXICdkB5JbRyS5WTq6QNX4ndpMlNob6 ./data.tar.gz
tar -xzvf data.tar.gz
cd finetune/
python run_re.py --debug --gpu 0
python run_typing.py --debug --gpu 0
cd ../lama/
python eval_lama.py
Download the latest wiki dump (XML format):
wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Download the knowledge graph (Wikidata5M):
wget -c https://www.dropbox.com/s/6sbhm0rwo4l73jq/wikidata5m_transductive.tar.gz?dl=1
tar -xzvf wikidata5m_transductive.tar.gz
Download the Wikidata5M entity & relation aliases:
wget -c https://www.dropbox.com/s/lnbhc8yuhit4wm5/wikidata5m_alias.tar.gz?dl=1
tar -xzvf wikidata5m_alias.tar.gz
Preprocess wiki dump:
mkdir pretrain_data
# process xml-format wiki dump
python preprocess/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
# Modify anchors
python preprocess/extract.py 4
python preprocess/gen_data.py 4
# Count entity & relation frequency and generate vocabs
python statistic.py
Initialize entity and relation embeddings with the average of RoBERTa BPE embedding of entity and relation aliases:
cd pretrain/
python init_ent_rel.py
Train CoLAKE with mixed CPU-GPU:
./run_pretrain.sh
If you use the code and model, please cite this paper:
@inproceedings{sun2020colake,
author = {Tianxiang Sun and Yunfan Shao and Xipeng Qiu and Qipeng Guo and Yaru Hu and Xuanjing Huang and Zheng Zhang},
title = {CoLAKE: Contextualized Language and Knowledge Embedding},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics, {COLING}},
year = {2020}
}