Code and data for the paper "Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation"
For real-life online applications of speech-to-text translation, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. Our findings on five different language pairs show that a simple fixed-window audio segmentation can outperform a state-of-the-art automatic segmentation approach. This repository provides our model outputs and code to reproduce our experiments.
Our model outputs can be found in the model_outputs directory (files ending in *.slt), together with all files needed to rerun SLTev (files ending in *.ost and *.ostt).
Note that we made some modifications to SLTev in order to evaluate delay on translations produced with different audio segmentation methods. To reproduce our results, please use our fork of SLTev.
After installing SLTev, you can then run the following command to reproduce our results, selecting the EXPERIMENT (model type and segmentation type) and LANGPAIR (en-de, es-en, fr-en, it-en, pt-en) of interest:
SLTeval -i EXPERIMENT.LANGPAIR.slt LANGPAIR.ost LANGPAIR.ostt -f slt ref ostt
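For example, to score a fixed-window system output on English-German, the call might look like the following (here "fixed" is a placeholder experiment name; the actual EXPERIMENT names correspond to the file names in model_outputs):
SLTeval -i fixed.en-de.slt en-de.ost en-de.ostt -f slt ref ostt  # "fixed" is a placeholder experiment name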
We provide the code for preparing the training data for fine-tuning, as well as the scripts for fine-tuning the models and for translating.
First, download the train and test data for MuST-C (note that we used version 1.0) or mTEDx (note that we used mtedx_iwslt2021.tgz for testing).
You can then use our scripts to resegment the training data for fine-tuning on prefixes, prefixes + context or windows, as described in our paper. The only files needed are the yaml file with the segmentation information, the file with the source text (transcription) and the file with the target text (translation).
For prefixes:
python finetune_scripts/resegment_prefixes.py -y YAML -s SRC -t TRG
For context:
python finetune_scripts/resegment_context.py -y YAML -s SRC -t TRG
For windows:
python finetune_scripts/resegment_windows.py -y YAML -s SRC -t TRG
The yaml, source and target outputs will be saved under the same file names with the added endings .prefix, .context and .window, respectively.
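As a hypothetical example for the MuST-C en-de training data (assuming the usual train.yaml, train.en and train.de file names from the MuST-C download), resegmenting into prefixes would be:
python finetune_scripts/resegment_prefixes.py -y train.yaml -s train.en -t train.de  # train.* names assumed from the MuST-C download
Following the naming convention above, this should produce train.yaml.prefix, train.en.prefix and train.de.prefix.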
Please follow the steps in the respective fairseq docs to prepare the training data for MuST-C and mTEDx.
Download the pretrained checkpoints for the en-de MuST-C model and the multilingual mTEDx model, as well as all other necessary files linked in the fairseq docs.
You can then add the necessary paths and use our training scripts as follows:
For en-de:
bash finetune_scripts/finetune_must-c.sh
For multilingual (es-en, fr-en, it-en and pt-en):
bash finetune_scripts/finetune_mtedx.sh
First, adapt the yaml file of the testset of your choice to use the segmentation method of your choice.
For the gold segmentation, you don't need to change anything except for copying the file to the path mentioned below.
For SHAS, please follow the steps described in the SHAS repo. Make sure to use the pSTREAM algorithm if you want to simulate an online setting.
For fixed window segmentation, you can also use the code in the SHAS repo for length-based segmentation described under "Segmentation with other methods".
For the merging windows approach, the segmentation will happen automatically in the translation script, so here you also don't need to do anything.
Save the resulting yaml files in the location of the original file as FILE.gold.yaml, FILE.shas.yaml or FILE.fixed.yaml. Do not remove the original files, as these will be used by the window-merging scripts.
For the original, SHAS and fixed segmentation, you need to set the paths to the data directory, the fairseq repo and the model directory (with the checkpoint and other files).
If you want to run the translations with biased beam search enabled, you have to uncomment the options in the script and use this fairseq fork branch instead.
Then you can call the following script for en-de:
bash translate_scripts/translate_must-c.sh SRCLANG TRGLANG MODELTYPE SEGTYPE PORT
and for es-en, fr-en, it-en and pt-en:
bash translate_scripts/translate_mtedx.sh SRCLANG TRGLANG MODELTYPE SEGTYPE PORT
where MODELTYPE is either "original", "prefix", "context" or "window" and SEGTYPE is either "gold", "shas" or "fixed". PORT is the port where the translation server should be running.
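As an illustration, translating the es-en test set with the prefix-fine-tuned model and SHAS segmentation, with the translation server on an arbitrarily chosen port 1234 (and assuming the path variables in the script are set), would be:
bash translate_scripts/translate_mtedx.sh es en prefix shas 1234  # port 1234 chosen arbitrarily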
Finally, before you can evaluate the output with SLTev (see above), you need to create a specific input format with the following command:
python translate_scripts/postprocess_gold_shas_fixed.py -i OUTFILE.TRGLANG -a OUTFILE.annotation -o FILE.slt
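Continuing the hypothetical es-en example, and assuming the translation script writes its outputs under a base name like prefix.shas.es-en (the actual OUTFILE naming is determined by the script), this could look like:
python translate_scripts/postprocess_gold_shas_fixed.py -i prefix.shas.es-en.en -a prefix.shas.es-en.annotation -o prefix.shas.es-en.slt  # all file names hypothetical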
For the merging windows approach, you also need to set the path variables in the scripts.
Then you can call the following script for en-de:
bash translate_scripts/merge_windows_must-c.sh SRCLANG TRGLANG MODELTYPE
and for es-en, fr-en, it-en and pt-en:
bash translate_scripts/merge_windows_mtedx.sh SRCLANG TRGLANG MODELTYPE
where MODELTYPE is either "original", "prefix", "context" or "window".
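For example, to run the merging-windows translation for es-en with the window-fine-tuned model (again assuming the path variables in the script are set):
bash translate_scripts/merge_windows_mtedx.sh es en window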
Finally, before you can evaluate the output with SLTev (see above), you need to create a specific input format with the following command:
python translate_scripts/postprocess_merged.py -i OUTFILE.log -o OUTFILE.slt
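Again with a hypothetical name for the log file produced by the previous step, this could look like:
python translate_scripts/postprocess_merged.py -i window.es-en.log -o window.es-en.slt  # file names hypothetical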
If you use this code or data, please cite our paper:
@inproceedings{amrhein-haddow-2022-dont,
title = "Don{'}t Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation",
author = "Amrhein, Chantal and
Haddow, Barry",
booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.wmt-1.13",
pages = "203--219",
}