DeepSpeech-based forced alignment tool
It is recommended to use this tool from within a virtual environment.
After cloning the project and changing to its root directory,
run the provided script to create one with all requirements in the git-ignored directory venv:
$ bin/
$ ls venv
bin include lib lib64 pyvenv.cfg share
The aligner will automatically use this virtual environment.
Internally, DSAlign uses the DeepSpeech STT engine. To function, the engine requires a couple of files that are specific to the language of the speech data you want to align. For English, a helper script downloads and prepares all required data:
$ bin/
$ ls models/en/
alphabet.txt lm.binary output_graph.pb output_graph.pbmm output_graph.tflite trie
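As a quick sanity check after the download, the listing above can be compared against the model directory. The following is a minimal sketch, not part of DSAlign: the function name `missing_model_files` is hypothetical, and treating every listed file as required is an assumption based only on the listing shown.

```python
from pathlib import Path

# File names taken from the directory listing above. Whether the aligner
# strictly needs every one of them is an assumption of this sketch.
EXPECTED_MODEL_FILES = [
    "alphabet.txt",
    "lm.binary",
    "output_graph.pb",
    "output_graph.pbmm",
    "output_graph.tflite",
    "trie",
]


def missing_model_files(model_dir, expected=EXPECTED_MODEL_FILES):
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("models/en")
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```

Run from the project root, this reports any file from the listing that is absent from models/en.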
A typical application of the aligner has three phases:
- Preparing the data. Although most of this has to be done individually, there are some tools for data preparation, statistics and maintenance. All involved file formats are described here.
- Aligning the data using the alignment tool and its algorithm.
- Exporting aligned data using the data-set exporter.
There is a script for downloading and preparing some public domain speech and transcript data.
It requires ffmpeg for some sample conversion.
$ bin/
$ ls data
test1 test2
Now the aligner can be called either "manually" (specifying all involved files directly):
$ bin/ --audio data/test1/audio.wav --script data/test1/transcript.txt --aligned data/test1/aligned.json --tlog data/test1/transcript.log
Or "automatically" by specifying a so-called catalog file that bundles all involved paths:
$ bin/ --catalog data/test1.catalog
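When several sample directories follow the same layout as data/test1, the flags of the manual call can be assembled programmatically. The sketch below is illustrative only: `manual_align_args` is a hypothetical helper, and it assumes every sample directory uses the file names audio.wav, transcript.txt, aligned.json and transcript.log seen in the manual example. It does not model the catalog file format.

```python
from pathlib import PurePosixPath


def manual_align_args(sample_dir):
    """Build the flag list for one sample directory.

    Mirrors the four flags of the manual call shown above. The fixed file
    names inside the directory are an assumption of this sketch.
    """
    d = PurePosixPath(sample_dir)
    return [
        "--audio", str(d / "audio.wav"),
        "--script", str(d / "transcript.txt"),
        "--aligned", str(d / "aligned.json"),
        "--tlog", str(d / "transcript.log"),
    ]


if __name__ == "__main__":
    for sample in ("data/test1", "data/test2"):
        # Print the arguments; prepend the path of the alignment script
        # and pass the list to subprocess.run to actually execute it.
        print(" ".join(manual_align_args(sample)))
```

Keeping the arguments as a list rather than a single string avoids shell quoting issues if a path contains spaces.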