Well within 24 hours, transcribe 40 hours of recorded speech in a surprise language.
Build an ASR for a surprise language L from a pre-trained acoustic model, an L pronunciation dictionary, and an L language model. This approach converts phones directly into L words. It is less noisy than using multiple cross-trained ASRs to produce English words, from which phone strings are extracted, merged by PTgen, and reconstituted into L words.
A full description with performance measurements is on arXiv, and in:
M. Hasegawa-Johnson, L. Rolston, C. Goudeseune, G.-A. Levow, and K. Kirchhoff,
"Grapheme-to-Phoneme Transduction for Cross-Language ASR,"
Stat. Lang. Speech Proc., pp. 3–19, 2020.
- Install software:
  * Kaldi
  * brno-phnrec
  * This repo
  * Extension of ASpIRE
  * CVTE Mandarin
- For each language L, build an ASR:
  * Get raw text.
  * Get a G2P.
  * Build an ASR.
- Transcribe speech:
  * Get recordings.
  * Typical results.
If you don't already have a version of Kaldi newer than 2016 Sep 30, get and build it following the instructions in its INSTALL files.
git clone https://github.com/kaldi-asr/kaldi
cd kaldi/tools; make -j $(nproc)
cd ../src; ./configure --shared && make depend -j $(nproc) && make -j $(nproc)
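An optional sanity check, not part of Kaldi's INSTALL instructions: a freshly built Kaldi binary prints a usage message when run without arguments. From kaldi/src, try the decoder used later in this recipe:
online2bin/online2-wav-nnet3-latgen-faster 2>&1 | head -n 3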
Put Brno U. of Technology's phoneme recognizer next to the usual s5 directory.
sudo apt-get install libopenblas-dev libopenblas-base
cd kaldi/egs/aspire
git clone https://github.com/uiuc-sst/brno-phnrec.git
cd brno-phnrec/PhnRec
make
Put this next to the usual s5 directory. (The package nodejs is for ./sampa2ipa.js.)
sudo apt-get install nodejs
cd kaldi/egs/aspire
git clone https://github.com/uiuc-sst/asr24.git
cd asr24
- Get the ASpIRE chain model, extended by Krisztián Varga.
cd kaldi/egs/aspire/asr24
wget -qO- http://dl.kaldi-asr.org/models/0001_aspire_chain_model.tar.gz | tar xz
steps/online/nnet3/prepare_online_decoding.sh \
--mfcc-config conf/mfcc_hires.conf \
data/lang_chain exp/nnet3/extractor \
exp/chain/tdnn_7b exp/tdnn_7b_chain_online
utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp
In exp/tdnn_7b_chain_online this builds the files phones.txt, tree, final.mdl, conf/, etc. It also builds the subdirectories data and exp. The last command, mkgraph.sh, can take 45 minutes (30 for CVTE Mandarin) and use a lot of memory, because it calls fstdeterminizestar on a large language model, as Dan Povey explains.
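An optional sanity check, not part of the recipe: confirm that these files now exist.
for f in exp/tdnn_7b_chain_online/{phones.txt,tree,final.mdl} \
         exp/tdnn_7b_chain_online/graph_pp/HCLG.fst
do [ -e "$f" ] && echo "ok: $f" || echo "MISSING: $f"; done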
- Verify that it can transcribe English, in mono 16-bit 8 kHz .wav format.
Either use the provided 8khz.wav, or convert your own recording with `sox MySpeech.wav -r 8000 8khz.wav` or `ffmpeg -i MySpeech.wav -acodec pcm_s16le -ac 1 -ar 8000 8khz.wav`.
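To verify the conversion (soxi ships with sox), the output should report one channel, 8000 Hz, 16-bit precision:
soxi 8khz.wav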
(The scripts cmd.sh and path.sh say where to find kaldi/src/online2bin/online2-wav-nnet3-latgen-faster.)
. cmd.sh && . path.sh
online2-wav-nnet3-latgen-faster \
--online=false --do-endpointing=false \
--frame-subsampling-factor=3 \
--config=exp/tdnn_7b_chain_online/conf/online.conf \
--max-active=7000 \
--beam=15.0 --lattice-beam=6.0 --acoustic-scale=1.0 \
--word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \
exp/tdnn_7b_chain_online/final.mdl \
exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
'ark:echo utterance-id1 utterance-id1|' \
'scp:echo utterance-id1 8khz.wav|' \
'ark:/dev/null'
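The transcript appears among the decoder's log lines on stderr. To extract it explicitly instead, one sketch: rerun with ark:lat.ark (an arbitrary filename) in place of ark:/dev/null, then read out the lattice's best path:
lattice-best-path ark:lat.ark ark,t:- | \
  utils/int2sym.pl -f 2- exp/tdnn_7b_chain_online/graph_pp/words.txt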
- Get the Mandarin chain model (3.4 GB, about 10 minutes). This makes a subdir cvte/s5, containing a words.txt, HCLG.fst, and final.mdl.
wget -qO- http://kaldi-asr.org/models/0002_cvte_chain_model.tar.gz | tar xz
steps/online/nnet3/prepare_online_decoding.sh \
--mfcc-config conf/mfcc_hires.conf \
data/lang_chain exp/nnet3/extractor \
exp/chain/tdnn_7b cvte/s5/exp/chain/tdnn
utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
cvte/s5/exp/chain/tdnn cvte/s5/exp/chain/tdnn/graph_pp
- Into $L/train_all/text put word strings in L (scraped from wherever), roughly 10 words per line, at most 500k lines. These may be quite noisy, because they will be cleaned up later; see the sketch below.
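For example, a minimal sketch that flattens scraped text into 10-word lines, capped at 500k; raw.txt is a hypothetical input file:
# Emit 10 whitespace-delimited words per line.
awk '{for (i=1; i<=NF; ++i) printf "%s%s", $i, (++n%10 ? " " : "\n")}' raw.txt \
  | head -n 500000 > $L/train_all/text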
- Into $L/train_all/g2aspire.txt put a G2P: a few hundred lines, each containing grapheme(s), whitespace, and space-delimited Aspire-style phones. If the file has CR line terminators, convert them to standard ones in vi with `%s/^M/\r/g`, typing control-V before the ^M. If it starts with a BOM, remove it: `vi -b g2aspire.txt`, and just x that character away. (Non-interactive equivalents are sketched below.)
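Non-interactive equivalents of those vi edits, as a sketch (GNU sed assumed):
sed -i 's/\r$//' $L/train_all/g2aspire.txt             # strip CR line terminators
sed -i '1s/^\xEF\xBB\xBF//' $L/train_all/g2aspire.txt  # strip a leading UTF-8 BOM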
- If you need to build the G2P:
./g2ipa2asr.py $L_wikipedia_symboltable.txt aspire2ipa.txt phoibletable.csv > $L/train_all/g2aspire.txt
- Then ./run.sh $L makes an L-customized HCLG.fst.
- To instead use a prebuilt LM, run ./run_from_wordlist.sh $L. See that script for usage.
On ifp-serv-03.ifp.illinois.edu, get LDC speech and convert it to a flat dir of 8 kHz .wav files. First cd into one of these corpora:
cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Russian/LDC2016E111/RUS_20160930
cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Tamil/TAM_EVAL_20170601/TAM_EVAL_20170601
cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Uzbek/LDC2016E66/UZB_20160711
mkdir /tmp/8k
for f in */AUDIO/*.flac; do sox "$f" -r 8000 -c 1 /tmp/8k/$(basename ${f%.*}.wav); done
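# Alternative sketch, one sox job per core, if GNU parallel is installed:
#   ls */AUDIO/*.flac | parallel 'sox {} -r 8000 -c 1 /tmp/8k/{/.}.wav'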
tar cf /workspace/ifp-53_1-data/eval/8k.tar -C /tmp 8k
rm -rf /tmp/8k
For BABEL .sph files:
cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Assamese/LDC2016E02/conversational/training/audio
tar cf /tmp/foo.tar BABEL*.sph
scp /tmp/foo.tar ifp-53:/tmp
On ifp-53,
mkdir ~/kaldi/egs/aspire/asr24/$L-8khz
cd myTmpSphDir
tar xf /tmp/foo.tar
for f in *.sph; do ~/kaldi/tools/sph2pipe_v2.5/sph2pipe -p -f rif "$f" /tmp/a.wav; \
sox /tmp/a.wav -r 8000 -c 1 ~/kaldi/egs/aspire/asr24/$L-8khz/$(basename ${f%.*}.wav); done
On the host that will run the transcribing, e.g. ifp-53:
cd kaldi/egs/aspire/asr24
wget -qO- http://www.ifp.illinois.edu/~camilleg/e/8k.tar | tar xf -
mv 8k $L-8khz
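Optionally, estimate the workload first (soxi -D prints each file's duration in seconds):
ls $L-8khz/*.wav | wc -l
soxi -D $L-8khz/*.wav | awk '{s += $1} END {printf "%.1f hours\n", s/3600}'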
./mkscp.rb $L-8khz $(nproc) $L
This splits the ASR tasks into one job per CPU core, each job with roughly the same audio duration. It reads $L-8khz, the dir of 8 kHz speech files, and makes $L-submit.sh. Then ./$L-submit.sh launches these jobs in parallel.
- After those jobs complete, collect the transcriptions with
grep -h -e '^TAM_EVAL' $L/lat/*.log | sort > $L-scrips.txt
(or ^RUS_, ^BABEL_, etc.).
- To sftp transcriptions to Jon May as elisa.tam-eng.eval-asr-uiuc.y3r1.v8.xml.gz, with timestamp June 11 and version 8,
grep -h -e '^TAM_EVAL' tamil/lat/*.log | sort | sed -e 's/ /\t/' | ./hyp2jonmay.rb /tmp/jon-tam tam 20180611 8
(If UTF-8 errors occur, simplify letters by appending to the sed command args such as `-e 's/Ñ/N/g'`.)
- Collect each .wav file's n best transcriptions with:
cat $L/lat/*.ascii | sort > $L-nbest.txt
If your transcriptions used nonsense English words, convert them to phones and then, via a trie or longest common substring, into L-words:
./trie-$L.rb < trie1-scrips.txt > $L-trie-scrips.txt
or
make multicore-$L; wait; grep ... > $L-lcs-scrips.txt
- RUS_20160930 was transcribed in 67 minutes: 13 MB/min, 12x real time.
- A 3.1 GB subset of Assamese LDC2016E02 was transcribed in 440 minutes: 7 MB/min, 6.5x real time. (This may have been slower because it exhausted ifp-53's memory.)
- Arabic/NEMLAR_speech/NMBCN7AR, 2.2 GB (40 hours), was transcribed in 147 minutes: 14 MB/min, 16x real time. (This may have been faster because it was a few long, half-hour files instead of many brief ones.)
- TAM_EVAL_20170601 was transcribed in 45 minutes: 21 MB/min, 19x real time.
Generating lattices $L/lat/* took 1.04x longer for Russian, 0.93x longer(!) for Arabic, and 1.7x longer for Tamil.