Code for "Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation" (ACL 2021)
- Most of the code in this work comes from Fairseq and Transformers.
- `poattention` (modified from Fairseq): trains the Position-Aware Embedding Generator for seq2seq models.
- `use_poattention` (modified from Fairseq): generates embeddings for unseen tokens and fine-tunes the seq2seq model with a vocabulary for the downstream data on the downstream task.
- `bert_poattention` (modified from Transformers): trains the Position-Aware Embedding Generator for bert-like models.
- `bert_use_poattention` (modified from Fairseq): generates embeddings for unseen tokens, converts the parameters of the bert-like model to a seq2seq one, and fine-tunes the seq2seq model with the newly generated vocabulary on the downstream task.
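For intuition only, here is a minimal, hypothetical sketch of the idea behind a position-aware embedding generator: an unseen downstream token is split into upstream subwords, and their pretrained embeddings are combined with position-aware attention. The module and parameter names below are illustrative assumptions, not this repo's actual implementation.

```python
# Hypothetical sketch (not the repo's code): combine the pretrained embeddings
# of the upstream subwords that an unseen downstream token splits into,
# using a learned, position-aware attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAwareGenerator(nn.Module):
    def __init__(self, embed_dim, max_pieces=16):
        super().__init__()
        self.pos_embed = nn.Embedding(max_pieces, embed_dim)  # position of each subword piece
        self.query = nn.Parameter(torch.randn(embed_dim))     # learned attention query
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, piece_embeds):
        # piece_embeds: (num_pieces, embed_dim) pretrained upstream subword embeddings
        positions = torch.arange(piece_embeds.size(0), device=piece_embeds.device)
        h = piece_embeds + self.pos_embed(positions)   # inject position information
        weights = F.softmax(h @ self.query, dim=0)     # position-aware attention weights
        return self.out(weights @ h)                   # (embed_dim,) embedding for the unseen token

# Example: an unseen token that splits into 3 upstream subwords.
gen = PositionAwareGenerator(embed_dim=512)
pieces = torch.randn(3, 512)   # stand-in for the 3 pretrained subword embeddings
new_embedding = gen(pieces)    # (512,)
```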
For seq2seq pretrained model
- Preprocess the upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in `data-bin`.
- Move the seq2seq pretrained model (generated by Fairseq) to `./checkpoints` and rename it as `checkpoint_last.pt`: `cp path_to_pretrained_model ./checkpoints/checkpoint_last.pt`
- Train the embedding generator: `pip install .; bash train.sh`
- Stop training when the model tends to converge.
- Preprocess the upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in `data-bin`.
- Get the mapping between the upstream and downstream vocabularies: `python get_map_index.py` (a conceptual sketch of this mapping step is shown after this list). Note: please change the data name in `get_map_index.py`.
- Move the well-trained embedding generator checkpoint (generated by `poattention`) to `./checkpoints` and rename it as `checkpoint_last.pt`: `cp path_to_embedding_generator ./checkpoints/checkpoint_last.pt`
- Generate embeddings for unseen tokens and fine-tune the downstream model with the downstream vocabulary: `pip install .; bash train.sh`
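The mapping step above aligns the two vocabularies so that tokens shared with the pretraining vocabulary keep their pretrained embeddings, while the rest are treated as unseen. Conceptually it does something like the following sketch; the `dict.txt` paths, the handling of Fairseq's special symbols, and the output format are assumptions for illustration, not the exact behavior of `get_map_index.py`.

```python
# Hedged sketch of the vocabulary-mapping idea: for every downstream token,
# record its index in the upstream (pretraining) vocabulary, or -1 if it is
# unseen and needs a generated embedding.

def load_fairseq_dict(path):
    """Read a Fairseq dict.txt (one '<token> <count>' pair per line)."""
    tokens = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if parts and parts[0]:
                tokens.append(parts[0])
    return tokens

upstream = load_fairseq_dict("data-bin/upstream/dict.txt")      # path is an assumption
downstream = load_fairseq_dict("data-bin/downstream/dict.txt")  # path is an assumption

up_index = {tok: i for i, tok in enumerate(upstream)}
# -1 marks a downstream token that does not exist in the upstream vocabulary.
mapping = [up_index.get(tok, -1) for tok in downstream]

print(f"{sum(i < 0 for i in mapping)} of {len(mapping)} downstream tokens are unseen")
```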
For bert-like pretrained model
- Prepare the upstream data (plain text) at `./examples/language-modeling/data`.
- Train the embedding generator:

  ```
  pip install .
  cd ./examples/language-modeling
  bash train_mlm.sh
  ```
- Preprocess the upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in `data-bin`. Note: sentences should be cut by WordPiece (see the tokenization sketch at the end of this section); we suggest bert-vocab-builder for building the vocabulary of the downstream data.
- Get the mapping between the upstream and downstream vocabularies: `python get_map_index.py`. Note: please change the data name in `get_map_index.py`.
- Generate embeddings for unseen tokens and fine-tune the downstream model with the downstream vocabulary:

  ```
  pip install path_to_bert_poattention
  pip install .; bash train.sh
  ```
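As a hedged illustration of the WordPiece note above, downstream text can be pre-tokenized with a BERT-style tokenizer before running Fairseq preprocessing, so the binarized data matches the WordPiece vocabulary. The checkpoint name and file paths below are placeholders, not values required by this repo.

```python
# Pre-tokenize raw downstream text into WordPiece pieces (one sentence per line).
# "bert-base-uncased" and the file names are placeholders for illustration.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("train.raw", encoding="utf-8") as fin, \
     open("train.wordpiece", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = tokenizer.tokenize(line.strip())  # e.g. ['token', '##ize']
        fout.write(" ".join(pieces) + "\n")
```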