A simple rust program for calculating the Word Error Rate. This is part of my learning process for getting to know Rust.
Also, I wanted to see how much faster Rust can be when compared to interpreter languages such as Python. The file
python-equivalent/wer.py
has the exact same algorithm written in Python.
Word Error Rate (WER) is a way to evaluate the performance of Speech-To-Text systems. It takes into account how many words are needed to be inserted/deleted or substituted between the predicted text (the output of the ASR system) and the ground truth (manually transcribed text). In my implementation I am returning the average WER from every separate sentence.
clap = "2.33.3"
for the command line parsing.cute = "0.3.0"
for easier loops.
- Build the project by running
cargo build
inside the directory (orcargo build --release
in order to avoid recompiling the code when runningcargo run ...
). - If you used the
--release
flag then run the program bycargo run --release $FILE1 $FILE2
where$FILE1
is the file of the predicted transcriptions and$FILE2
is the file containing the true transcriptions. The format of these files is explained below. - You can also run the code by doing
./target/release/rust-wer $FILE1 $FILE2
. - In order to time the execution time, I used the
time
command. For example,time ./target/release/rust-wer $FILE1 $FILE2
. - The output WER will be simply printed in the console.
The input files must be simple text files that contain one sentence per line. The line numbers must be the same between
the predicted and the true texts. So, if the predicted text was empty for a certain sentence then you need to leave
an empty line there. See the data
directory for examples.
Also, the texts must be already preprocessed, i.e. remove punctuations and convert to lower case.
To time the rust program do the following:
time ./target/release/rust-wer ./data/mytranscripts.txt ./data/truth.txt
For me, this took 0.003
seconds.
To time the python script do the following:
time python main.py ./data/mytranscripts.txt ./data/truth.txt
While this took 0.149
seconds. So, even with such a small example the execution-time difference is huge.
I also wanted to see how julia performs on the same task and so I created the juliawer
project. By running the
juliawer.jl
with julia
, it takes more than a second which is way worse than the above two executions and I am
not sure why.
I also tried using PackageCompiler
in order to compile the julia code but the result did not change.
- Change input file format (json?)
- Maybe handle punctuations/lowercase