Code for "Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation" (ACL 2021)
- Most of the code in this work comes from Fairseq and Transformers.
- `poattention` (modified from Fairseq): trains the Position-Aware Embedding Generator for seq2seq models.
- `use_poattention` (modified from Fairseq): generates embeddings for unseen tokens and fine-tunes the seq2seq model with a vocabulary for the downstream data on the downstream task.
- `bert_poattention` (modified from Transformers): trains the Position-Aware Embedding Generator for bert-like models.
- `bert_use_poattention` (modified from Fairseq): generates embeddings for unseen tokens, converts the parameters of the bert-like model to a seq2seq one, and fine-tunes the seq2seq model with the newly generated vocabulary on the downstream task.
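For intuition only, here is a minimal, hypothetical sketch of the idea behind a position-aware embedding generator: an unseen downstream token is split into upstream subwords, and their pretrained embeddings are combined with position-aware attention. The module and parameter names below are illustrative assumptions, not this repo's actual implementation.

```python
# Hypothetical sketch (not the repo's code): combine the pretrained embeddings
# of the upstream subwords that an unseen downstream token splits into,
# using a learned, position-aware attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAwareGenerator(nn.Module):
    def __init__(self, embed_dim, max_pieces=16):
        super().__init__()
        self.pos_embed = nn.Embedding(max_pieces, embed_dim)  # position of each subword piece
        self.query = nn.Parameter(torch.randn(embed_dim))     # learned attention query
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, piece_embeds):
        # piece_embeds: (num_pieces, embed_dim) pretrained upstream subword embeddings
        positions = torch.arange(piece_embeds.size(0), device=piece_embeds.device)
        h = piece_embeds + self.pos_embed(positions)   # inject position information
        weights = F.softmax(h @ self.query, dim=0)     # position-aware attention weights
        return self.out(weights @ h)                   # (embed_dim,) embedding for the unseen token

# Example: an unseen token that splits into 3 upstream subwords.
gen = PositionAwareGenerator(embed_dim=512)
pieces = torch.randn(3, 512)   # stand-in for the 3 pretrained subword embeddings
new_embedding = gen(pieces)    # (512,)
```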
For seq2seq pretrained model
- Preprocess the upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in `data-bin`.
- Move the seq2seq pretrained model (generated by Fairseq) to `./checkpoints` and rename it as `checkpoint_last.pt`: `cp path_to_pretrained_model ./checkpoints/checkpoint_last.pt`
- Train the embedding generator: `pip install .; bash train.sh`
- Stop training when the model tends to converge.
- Preprocess the upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in `data-bin`.
- Get the mapping between the upstream and downstream vocabularies: `python get_map_index.py` (a conceptual sketch of this mapping step is shown after this list). Note: please change the data name in `get_map_index.py`.
- Move the well-trained embedding generator checkpoint (generated by `poattention`) to `./checkpoints` and rename it as `checkpoint_last.pt`: `cp path_to_embedding_generator ./checkpoints/checkpoint_last.pt`
- Generate embeddings for unseen tokens and fine-tune the downstream model with the downstream vocabulary: `pip install .; bash train.sh`
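The mapping step above aligns the two vocabularies so that tokens shared with the pretraining vocabulary keep their pretrained embeddings, while the rest are treated as unseen. Conceptually it does something like the following sketch; the `dict.txt` paths, the handling of Fairseq's special symbols, and the output format are assumptions for illustration, not the exact behavior of `get_map_index.py`.

```python
# Hedged sketch of the vocabulary-mapping idea: for every downstream token,
# record its index in the upstream (pretraining) vocabulary, or -1 if it is
# unseen and needs a generated embedding.

def load_fairseq_dict(path):
    """Read a Fairseq dict.txt (one '<token> <count>' pair per line)."""
    tokens = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if parts and parts[0]:
                tokens.append(parts[0])
    return tokens

upstream = load_fairseq_dict("data-bin/upstream/dict.txt")      # path is an assumption
downstream = load_fairseq_dict("data-bin/downstream/dict.txt")  # path is an assumption

up_index = {tok: i for i, tok in enumerate(upstream)}
# -1 marks a downstream token that does not exist in the upstream vocabulary.
mapping = [up_index.get(tok, -1) for tok in downstream]

print(f"{sum(i < 0 for i in mapping)} of {len(mapping)} downstream tokens are unseen")
```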
For bert-like pretrained model
- Prepare the upstream data (plain text) at `./examples/language-modeling/data`.
- Train the embedding generator:

  ```
  pip install .
  cd ./examples/language-modeling
  bash train_mlm.sh
  ```
- Preprocess the upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in `data-bin`. Note: sentences should be cut by WordPiece (see the tokenization sketch at the end of this section); we suggest bert-vocab-builder for building the vocabulary of the downstream data.
- Get the mapping between the upstream and downstream vocabularies: `python get_map_index.py`. Note: please change the data name in `get_map_index.py`.
- Generate embeddings for unseen tokens and fine-tune the downstream model with the downstream vocabulary:

  ```
  pip install path_to_bert_poattention
  pip install .; bash train.sh
  ```
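As a hedged illustration of the WordPiece note above, downstream text can be pre-tokenized with a BERT-style tokenizer before running Fairseq preprocessing, so the binarized data matches the WordPiece vocabulary. The checkpoint name and file paths below are placeholders, not values required by this repo.

```python
# Pre-tokenize raw downstream text into WordPiece pieces (one sentence per line).
# "bert-base-uncased" and the file names are placeholders for illustration.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("train.raw", encoding="utf-8") as fin, \
     open("train.wordpiece", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = tokenizer.tokenize(line.strip())  # e.g. ['token', '##ize']
        fout.write(" ".join(pieces) + "\n")
```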