
# Lightweight Adapter Tuning for Multilingual Speech Translation

This is the codebase for the paper Lightweight Adapter Tuning for Multilingual Speech Translation (ACL-IJCNLP 2021).

## Data Preparation

All our experiments were performed on MuST-C. Please follow the data preparation steps for the MuST-C dataset provided by fairseq S2T (see the MuST-C example under `examples/speech_to_text` in the fairseq repository).
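For reference, the commands below sketch that preprocessing as described in the fairseq S2T MuST-C example (exact script options may differ across fairseq versions); they extract the speech features, build the vocabularies, and generate the TSV manifests and `config_st.yaml` used by the training commands below:

```bash
# Download and unpack MuST-C v1.0 into ${MUSTC_ROOT} first.
# ASR manifests/vocabulary (used to pretrain the encoder):
python examples/speech_to_text/prep_mustc_data.py \
  --data-root ${MUSTC_ROOT} --task asr \
  --vocab-type unigram --vocab-size 5000
# ST manifests/vocabulary (used for the translation experiments below):
python examples/speech_to_text/prep_mustc_data.py \
  --data-root ${MUSTC_ROOT} --task st \
  --vocab-type unigram --vocab-size 8000
# The multilingual setup additionally relies on joint manifests/vocabulary
# covering all eight language pairs; see the fairseq S2T MuST-C example.
```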

## Training

### Multilingual training

To train a multilingual ST backbone model, please run the following command:

```bash
fairseq-train ${MUSTC_ROOT} \
  --config-yaml config_st.yaml \
  --train-subset train_de_st,train_nl_st,train_es_st,train_fr_st,train_it_st,train_pt_st,train_ro_st,train_ru_st \
  --valid-subset dev_de_st,dev_nl_st,dev_es_st,dev_fr_st,dev_it_st,dev_pt_st,dev_ro_st,dev_ru_st \
  --save-dir ${MULTILINGUAL_BACKBONE} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --report-accuracy \
  --arch s2t_transformer_m --ignore-prefix-size 1 --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 \
  --load-pretrained-encoder-from ${PRETRAINED_ASR}
```

where:

- `${MUSTC_ROOT}` is the path to the MuST-C data.
- `${MULTILINGUAL_BACKBONE}` is the directory where the experiment outputs (checkpoints, logs) are saved.
- `${PRETRAINED_ASR}` is the path to the pretrained ASR model used to initialize the ST encoder.
### Adapter-based fine-tuning

To perform multilingual fine-tuning using adapters, please run the following command:

```bash
fairseq-train ${MUSTC_ROOT} \
  --config-yaml config_st.yaml \
  --train-subset train_de_st,train_nl_st,train_es_st,train_fr_st,train_it_st,train_pt_st,train_ro_st,train_ru_st \
  --valid-subset dev_de_st,dev_nl_st,dev_es_st,dev_fr_st,dev_it_st,dev_pt_st,dev_ro_st,dev_ru_st \
  --lang-pairs en-de,en-es,en-fr,en-it,en-nl,en-pt,en-ro,en-ru \
  --save-dir ${EXP_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --report-accuracy \
  --arch s2t_transformer_m --ignore-prefix-size 1 --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 \
  --find-unused-parameters \
  --homogeneous-batch \
  --adapter-enc-dim 256 \
  --adapter-enc-type 'per_lang' \
  --adapter-dec-dim 256 \
  --adapter-dec-type 'per_lang' \
  --finetune-enc-modules adapter \
  --finetune-dec-modules adapter \
  --load-pretrained-encoder-from ${MULTILINGUAL_BACKBONE}/${CHECKPOINT_FILENAME} \
  --load-pretrained-decoder-from ${MULTILINGUAL_BACKBONE}/${CHECKPOINT_FILENAME}
```
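With `--finetune-enc-modules adapter` and `--finetune-dec-modules adapter`, only the adapter parameters are updated while the rest of the backbone stays frozen. `${EXP_DIR}` is the output directory of this experiment and `${CHECKPOINT_FILENAME}` is the backbone checkpoint to start from, for example (values are illustrative):

```bash
# Illustrative values only.
EXP_DIR=/path/to/adapter_exp
# Any backbone checkpoint can be used here, e.g. checkpoint_best.pt or the
# averaged checkpoint produced in the "Inference and Evaluation" section below.
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
```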

### Full/Partial fine-tuning

For full fine-tuning on a specific language pair, for example, en-de:

```bash
fairseq-train ${MUSTC_ROOT} \
  --config-yaml config_st.yaml \
  --train-subset train_de_st \
  --valid-subset dev_de_st \
  --save-dir ${EXP_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --report-accuracy \
  --arch s2t_transformer_m --ignore-prefix-size 1 --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 \
  --finetune-from-model ${MULTILINGUAL_BACKBONE}/${CHECKPOINT_FILENAME}
```

For decoder-only fine-tuning:

```bash
fairseq-train ${MUSTC_ROOT} \
  --config-yaml config_st.yaml \
  --train-subset train_de_st \
  --valid-subset dev_de_st \
  --save-dir ${EXP_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --report-accuracy \
  --arch s2t_transformer_m --ignore-prefix-size 1 --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 \
  --finetune-dec-modules dropout_module,embed_tokens,embed_positions,layers,layer_norm,output_projection \
  --load-pretrained-encoder-from ${MULTILINGUAL_BACKBONE}/${CHECKPOINT_FILENAME} \
  --load-pretrained-decoder-from ${MULTILINGUAL_BACKBONE}/${CHECKPOINT_FILENAME}
```

## Inference and Evaluation

To decode with a trained model, first average the last 10 checkpoints and then run `fairseq-generate` for each language:

```bash
# Average last 10 checkpoints
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py \
  --inputs ${EXP_DIR} --num-epoch-checkpoints 10 \
  --output "${EXP_DIR}/${CHECKPOINT_FILENAME}"

for LANG in de nl es fr it pt ro ru; do
  fairseq-generate ${MUSTC_ROOT} \
    --config-yaml config_st.yaml --gen-subset tst-COMMON_${LANG}_st --task speech_to_text \
    --prefix-size 1 --path ${EXP_DIR}/${CHECKPOINT_FILENAME} \
    --max-tokens 50000 --beam 5 --scoring sacrebleu
done
```
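`fairseq-generate` prints a corpus-level sacreBLEU score at the end of its output. A minimal sketch for collecting the per-language scores, assuming each iteration of the loop above is redirected to an illustrative log file such as `${EXP_DIR}/generate-${LANG}.log`:

```bash
# Print the final BLEU line of each decoding log.
# Assumes the loop above was run with `... --scoring sacrebleu > ${EXP_DIR}/generate-${LANG}.log`.
for LANG in de nl es fr it pt ro ru; do
  echo -n "en-${LANG}: "
  grep "BLEU" "${EXP_DIR}/generate-${LANG}.log" | tail -n 1
done
```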

## Results

The table below reports BLEU on the MuST-C tst-COMMON test set; parameter counts are given in millions.

| Model | Params (trainable/total) | en-de | en-es | en-fr | en-it | en-nl | en-pt | en-ro | en-ru |
|---|---|---|---|---|---|---|---|---|---|
| Multilingual baseline | 76.3/76.3 | 24.18 | 28.28 | 34.98 | 24.62 | 28.80 | 31.13 | 23.22 | 15.88 |
| Best adapting | 8 x 4.8/76.3 | 24.63 | 28.73 | 34.75 | 24.96 | 28.80 | 30.96 | 23.70 | 16.36 |
| Best fine-tuning | 8 x 4.8/76.3 | 24.50 | 28.67 | 34.89 | 24.82 | 28.38 | 30.73 | 23.78 | 16.23 |

## Citation

Please cite as:

```bibtex
@inproceedings{le2021lightweight,
  author    = {Le, Hang and Pino, Juan and Wang, Changhan and Gu, Jiatao and Schwab, Didier and Besacier, Laurent},
  title     = {Lightweight Adapter Tuning for Multilingual Speech Translation},
  booktitle = {{ACL}},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}
```