This repository contains the code for the paper *Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models*, published at IJCAI 2024.
Our approach leverages contextual information to enhance speech recognition, particularly when dealing with diverse accents, as demonstrated on the SQuAD-SRC dataset.
Key features include:
- Utilizing pre-trained speech models to generate diverse transcription candidates.
- Exploiting contextual information for on-the-fly in-domain adaptation.
- Employing large language models to refine transcriptions with rich linguistic knowledge.
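As a rough illustration of how these pieces fit together, the second-pass step combines the N-best hypotheses from the speech model with the in-domain context into a single LLM prompt. The sketch below is hypothetical (the function name and prompt wording are mine, not the repository's actual implementation):

```python
def build_prompt(context: str, candidates: list[str]) -> str:
    """Combine an in-domain context passage with N-best ASR hypotheses
    into one zero-shot prompt asking the LLM for a corrected transcript.
    (Illustrative sketch only; not the repository's prompt template.)"""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        "Context (in-domain passage):\n"
        f"{context}\n\n"
        "Candidate transcriptions from the speech recognizer:\n"
        f"{numbered}\n\n"
        "Using the context, write the single most plausible transcription:"
    )

prompt = build_prompt(
    "The Amazon rainforest covers much of the Amazon basin.",
    ["the amazon rainforest covers much of the amazon bason",
     "the amazon rain forest covers much of the amazon basin"],
)
```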
To run zero-shot prompting:

```shell
python run_regenerate.py
```
This method achieves a 13.6% performance improvement without tuning the pre-trained speech and language models.
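Improvements like this are conventionally reported in terms of word error rate (WER). For reference, WER is the word-level Levenshtein distance between hypothesis and reference, normalized by reference length; a standard self-contained computation (my own helper, not part of this repository) looks like:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / len(r)
```

For example, `wer("the cat sat", "the hat sat")` gives one substitution over three reference words.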
To perform LoRA tuning:
- Tune the model: `python peft.py`
- Run the re-generation using the finetuned model: `python run_regenerate.py`
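Conceptually, LoRA freezes the pre-trained weight matrix W and trains only a low-rank update, so the effective weight becomes W + (α/r)·BA. A toy plain-Python illustration of that update (not the repository's `peft.py`, which operates on the actual language model):

```python
def matmul(a, b):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha):
    """W stays frozen; only the low-rank factors A (r x d_in) and
    B (d_out x r) are trained. Effective weight = W + (alpha / r) * B @ A."""
    r = len(A)  # LoRA rank
    BA = matmul(B, A)
    return [[W[i][j] + (alpha / r) * BA[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# Toy example: 2x2 frozen weight, rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]   # r=1, d_in=2
B = [[0.5], [0.0]] # d_out=2, r=1
W_eff = lora_effective_weight(W, A, B, alpha=1.0)  # -> [[1.5, 1.0], [0.0, 1.0]]
```

Because only A and B are trained, the number of tunable parameters is a small fraction of the full model, which is why tuning with as few as 100 examples is practical.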
Results show consistent performance gains as the number of training examples increases:
- Tuning with just 100 examples results in a 19.8% improvement with Whisper Tiny and a 12% improvement with Whisper Medium.
- Whisper Tiny tuned with 500 examples can outperform Whisper Medium, despite having about 20x fewer parameters.