This repository helps you build language models (LMs) for automatic speech recognition (ASR) systems like Kaldi.
By default it uses a given Kaldi-ASR nnet3 chain model (e.g. from Zamia-Speech.org) and a custom text corpus (list of normalized sentences) to build a new 4-gram/5-gram custom LM. JSGF grammar language models are supported as well.
The whole process is fully automated, you simply have to create your sentences file or JSGF grammar, run the 'adapt' script and collect the new model :-). Out-Of-Vocabulary (OOV) words are created via pre-trained G2P models (default: Phonetisaurus, currently for 'en' and 'de') and stored next to the custom dictionary for fine-tuning.
This whole repository is optimized to be very lightweight (~300MB including Kaldi binaries, ASR model and text corpus) and if you use a small text corpus the adaptation process should finish in a few minutes, even on a Raspberry Pi 4 :-). Here are the steps to get started:
- Make sure you have 'git' available:
  ```
  sudo apt-get install git
  ```
- Clone the repository:
  ```
  git clone --single-branch https://github.com/fquirin/kaldi-adapt-lm.git
  ```
- Enter the directory:
  ```
  cd kaldi-adapt-lm
  ```
- Install requirements (e.g. pre-built Kaldi, KenLM, etc.):
  - Default (no G2P tool for OOV words):
    ```
    bash 1-download-requirements.sh
    ```
  - With G2P:
    ```
    bash 1-download-requirements.sh phonetisaurus
    ```
    or
    ```
    bash 1-download-requirements.sh sequitur
    ```
- Download the base model to adapt (currently included models: 'en', 'de'):
  ```
  bash 2-download-model.sh en
  ```
- Edit the text corpus inside the `lm_corpus` folder or create a new one. Check the `convert_corpus` folder if you need help converting raw data.
- (optional) Edit the dictionary for your language inside `lm_dictionary` to add unknown words, e.g. `my_dict_en.txt`
- Start the adaptation process:
  - Default (use the same language code as in the previous step; '-d' is recommended for very small models):
    ```
    bash 3-adapt.sh -l en -d
    ```
  - Automatically generate OOV words:
    ```
    bash 3-adapt.sh -l en -g
    ```
  - Use a JSGF grammar:
    ```
    bash 3-adapt.sh -l en -g -f jsgf
    ```
  - Recommended (for a corpus > ~1000 lines):
    ```
    bash 3-adapt.sh -l en -f ngram -n 5 -u -g -p phonetisaurus
    ```
  - See all options:
    ```
    bash 3-adapt.sh -h
    ```
- Wait for a few minutes (around 15min on RPi4, small language model)
- Optional: repackage the model for use with Vosk-ASR:
  ```
  bash 4a-build-vosk-model.sh
  ```
- Clean up:
  ```
  bash 5-clean-up.sh
  ```
  and copy the new model to your STT server.
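The optional dictionary edit above just appends plain-text entries, one word plus its phoneme sequence per line. A minimal sketch of adding an entry (the word 'sepia' and its phonemes are illustrative only, not a verified transcription in the model's phoneme set):

```shell
# Append a hypothetical entry to the custom dictionary.
# Format per line: <word> <phoneme> <phoneme> ...
mkdir -p lm_dictionary
echo "sepia s i: p i @" >> lm_dictionary/my_dict_en.txt
```

Make sure the phonemes match the symbol set of your base model's lexicon, otherwise the adaptation will fail.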
This is a more detailed description of the adaptation step (see script `3-adapt.sh`). If you haven't done so already, please follow the quick-start steps up to this point.
The whole purpose of adaptation is to optimize the ASR model for your own use-case and increase recognition accuracy for your domain. To make this happen you first need a list of sentences that represents the domain. Simply open a new file and write down your sentences; just make sure everything is lower-case and don't use special characters like question marks, commas etc. (note: different models might actually support upper-case ... in theory).
You can start with one of the files inside 'lm_corpus':
```
mkdir adapt
cp lm_corpus/sentences_en.txt adapt/lm.txt
```
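If your raw text still contains punctuation or upper-case characters, a simple normalization pass can be sketched like this (`raw_sentences.txt` is a placeholder for your own input; real-world corpora usually need more careful cleaning, see the `convert_corpus` folder):

```shell
# Placeholder raw input; replace with your own file
printf 'Hello, World!\nHow are you?\n' > raw_sentences.txt
mkdir -p adapt
# Lower-case everything and strip punctuation, one sentence per line
tr '[:upper:]' '[:lower:]' < raw_sentences.txt | tr -d '[:punct:]' > adapt/lm.txt
```

Note that blunt punctuation stripping can mangle words like "don't", so review the result before adapting.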
It makes sense to limit the language model to the vocabulary the ASR model supports, so let's extract the vocabulary next:
```
MODEL="$(realpath model)"
cut -f 1 -d ' ' ${MODEL}/data/local/dict/lexicon.txt > vocab.txt
```
Please note: this assumes your model actually has the data available at `${MODEL}/data/local/dict/`.
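The `cut` call above keeps only the first space-separated column of the lexicon, i.e. the word itself without its phonemes. A tiny self-contained illustration with a fake two-line lexicon (entries are made up):

```shell
# Toy lexicon with made-up entries: <word> <phonemes...>
printf 'hello h @ l oU\nworld w 3r l d\n' > lexicon_demo.txt
# Keep only the first space-separated column (the word itself)
cut -f 1 -d ' ' lexicon_demo.txt > vocab_demo.txt
```

After this, `vocab_demo.txt` contains just the words `hello` and `world`, one per line.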
With those files in place you can now build the new language model using KenLM:
```
KENLM_DIR="$(realpath kenlm)"
export PATH="$KENLM_DIR:$PATH"
cd adapt
lmplz -S 50% --text lm.txt --limit_vocab_file vocab.txt --arpa lm.arpa --order 4 --prune 0 0 1 2 --discount_fallback
```
Notes:
- `--limit_vocab_file vocab.txt` seems to fail on ARM32 systems at the moment and creates empty ARPA models! (see the adapt script for a manual check)
- You might be able to skip `--discount_fallback` if your corpus is big enough (and if so, you should!).
- `--order 4` generates a 4-gram model. You can experiment with 3-gram or 5-gram as well.
- If your corpus is very small, consider skipping pruning (`--prune 0`) or reducing the thresholds further.
- `-S 50%` reduces memory usage; higher values might work for your setup.
- Check out `lmplz --help` for more info and options to optimize your LM.
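Given the ARM32 issue above, it's worth verifying that the ARPA file actually contains n-gram counts before adapting. A minimal sanity check, demonstrated here on a hand-written fake header (a real `lm.arpa` comes from `lmplz`):

```shell
# Minimal fake ARPA header for demonstration (a real one comes from lmplz)
printf '\\data\\\nngram 1=42\nngram 2=100\n' > lm_demo.arpa
# A valid ARPA file starts with a \data\ section listing n-gram counts
if grep -q '^\\data\\' lm_demo.arpa && grep -q '^ngram 1=' lm_demo.arpa; then
    echo "ARPA header looks valid"
else
    echo "ARPA file seems empty or malformed" >&2
fi
```

Run the same two `grep` checks against your own `adapt/lm.arpa` before continuing.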
After you've created your LM you can start the Kaldi model adaptation process:
```
KALDI_DIR="$(realpath kaldi)"
KENLM_DIR="$(realpath kenlm)"
export PATH="$KALDI_DIR:$KENLM_DIR:$PATH"
MODEL="$(realpath model)"
ARPA_LM="$(realpath adapt)/lm.arpa"
MODEL_OUT="sepia-custom"
python3 -m adapt -f -k ${KALDI_DIR} ${MODEL} ${ARPA_LM} ${MODEL_OUT}
```
You should find a tar-file of the resulting model inside the auto-generated `work` folder.
If you're planning to use the model with Vosk-ASR (e.g. via the SEPIA STT server) you can use `bash 4a-build-vosk-model.sh` to repackage it. The result can be found inside 'adapted_model'.
When you're done you can use `bash 5-clean-up.sh` to zip the content of 'adapted_model' into 'adapted_model.zip' and delete all working folders.
If you see strange errors during the adaptation process it might be that you ran out of memory. I've tested the scripts on a Raspberry Pi 4 2GB using small language models and it worked fine, but memory requirements grow quickly with the size of your model.
If you create a dictionary based language model (as compared to an end-to-end system with subword tokens), you will have to add new words and their pronunciations (phonemes) to your dictionary (lexicon.txt) from time to time.
To make this procedure as simple as possible you can use a G2P (grapheme-to-phoneme) tool like Phonetisaurus G2P (BSD license) or Sequitur G2P (GPLv2 license).
Support for both tools is implemented, and pre-trained models based on the Zamia Kaldi lexicon.txt files are included for 'de' and 'en'. To train new ones check the `lm_dictionary` folder.
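Before any G2P tool can generate pronunciations, you need the list of OOV words, i.e. corpus words missing from the lexicon. One way to compute that list, shown here with toy data (in practice the inputs would be the word lists derived from your `lm.txt` and `vocab.txt`):

```shell
# Toy data: words seen in the corpus vs. words already in the lexicon
printf 'hello\nsepia\nworld\n' | sort > corpus_words.txt
printf 'hello\nworld\n'        | sort > known_vocab.txt
# comm -23 keeps lines that appear only in the first (sorted) file
comm -23 corpus_words.txt known_vocab.txt > oov_words.txt
```

Here `oov_words.txt` ends up containing only `sepia`, the one word not covered by the known vocabulary. Note that `comm` requires both inputs to be sorted.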
- Add pre-trained models for more languages
- The `model.conf` file is kind of random. Understand and improve it if necessary.
- Optimize `kaldi_adapt_lm.py` and the `templates` to build new models more efficiently if possible
- SEPIA STT Server
- Kaldi ASR
- KenLM
- Zamia Speech
- Phonetisaurus G2P (optional)
- Sequitur G2P (optional)
- Gruut-IPA (watchlist)
Everything is included inside `1-download-requirements.sh`, but here are the basics:
- Python (recommended: 3.9)
- Kaldi ASR
- KenLM
- Optional: Sequitur or Phonetisaurus G2P for generation of OOV words
- Code in general: Apache-2.0 licensed unless otherwise noted in the script’s copyright headers.
- Phonetisaurus G2P: BSD 3-Clause License (https://github.com/AdolfVonKleist/Phonetisaurus/blob/master/LICENSE)
- If you use Sequitur G2P: GPL-2.0 license (https://github.com/sequitur-g2p/sequitur-g2p/blob/master/LICENSE)
Original by Guenter Bartsch
Modified by Florian Quirin for https://github.com/SEPIA-Framework
Pre-built Kaldi, KenLM and Phonetisaurus by Michael Hansen