Home
Table of Contents
- Introduction
- Frequently Asked Questions
- Where do I get pre-trained models?
- How can I train using my own data?
- How can I import trained weights to do inference?
- How can I train on Amazon AWS/Google CloudML/my favorite platform?
- I get an error about native_client/libctc_decoder_with_kenlm.so: undefined symbol during training
- I get an error about lm::FormatLoadException during training
- I get an error about Create kernel failed: Invalid argument: NodeDef mentions attr ‘identical_element_shapes’ when running inference
- I get an error about Error: Alphabet size does not match loaded model: alphabet has size ... when running inference
- Is it possible to use AMD or Intel GPUs with DeepSpeech?
- Are we using Deep Speech 2 or Deep Speech 1 paper implementation?
- What is the accuracy of this speech recogniser compared to others using the pretrained models?
- Why can't I speak directly to DeepSpeech instead of first making an audio recording?
- I would like to use this to send voice commands to my Linux Desktop to run commands, open programs and transcribe emails for example.
- What is the process of making "custom trained DeepSpeech engines" available to the end user on web, Android apps and iOS apps?
- In what form (java lib, C lib, javascript, objectiveC, ...) is the "recognition engine" integrated into the mobile apps?
- Why does the pretrained model always return an empty string? Am I doing it wrong?
- Add your own question/answer
Welcome to the DeepSpeech wiki. This space is meant to hold answers to questions that are not related to the code or the project's goals, so they shouldn't be filed as issues, and that are common enough to warrant a dedicated place for documentation. Some examples of good topics are how to deploy our code on your favorite cloud provider, how to train on your own custom dataset, or how to use the native client on your favorite platform or framework. We don't currently have answers to all of those questions, so contributions are welcome!
Where do I get pre-trained models?
DeepSpeech cannot do speech-to-text without a trained model file. You can create your own (see below), or use the pre-trained model files available on the releases page.
How can I train using my own data?
The easiest way to train on a custom dataset is to write your own importer that knows the structure of your audio and text files. All you have to do is generate CSV files for your splits with three columns, wav_filename, wav_filesize and transcript, that specify the path to the WAV file, its size, and the corresponding transcript text for each of your train, validation and test splits.
To start writing your own importer, run bin/run-ldc93s1.sh, then look at the CSV file in data/ldc93s1 that's generated by bin/import_ldc93s1.sh, and also at the other, more complex bin/import_* scripts for inspiration. There's no requirement to use Python for the importer, as long as the generated CSV conforms to the format specified above.
DeepSpeech's requirements for the data are that the transcripts match the [a-z ]+ regex, and that the audio is stored as WAV (PCM) files.
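As a rough illustration, here is a minimal importer sketch in Python. It assumes a hypothetical layout where each clip sits next to a .txt file holding its transcript (e.g. clips/foo.wav and clips/foo.txt); the directory name, the split ratios and the output file names are made up for the example, and only the three CSV columns are what DeepSpeech actually requires.

```python
# Minimal importer sketch: writes the train/dev/test CSVs described above.
# The clips/ layout is an assumption; adapt the globbing to your dataset.
import csv
import glob
import os
import re

VALID_TRANSCRIPT = re.compile(r"^[a-z ]+$")  # DeepSpeech expects transcripts matching [a-z ]+

def write_csv(rows, csv_path):
    with open(csv_path, "w") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["wav_filename", "wav_filesize", "transcript"])
        writer.writeheader()
        writer.writerows(rows)

rows = []
for wav_filename in sorted(glob.glob("clips/*.wav")):
    # Read the transcript stored next to the WAV file.
    with open(os.path.splitext(wav_filename)[0] + ".txt") as transcript_file:
        transcript = transcript_file.read().strip().lower()
    if not VALID_TRANSCRIPT.match(transcript):
        continue  # skip transcripts that don't match the required alphabet
    rows.append({
        "wav_filename": os.path.abspath(wav_filename),
        "wav_filesize": os.path.getsize(wav_filename),
        "transcript": transcript,
    })

# Naive 80/10/10 split into the CSVs you would point DeepSpeech.py at.
n = len(rows)
write_csv(rows[: int(0.8 * n)], "my_data_train.csv")
write_csv(rows[int(0.8 * n): int(0.9 * n)], "my_data_dev.csv")
write_csv(rows[int(0.9 * n):], "my_data_test.csv")
```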
How can I import trained weights to do inference?
We save checkpoints (documentation) in the folder you specified with the --checkpoint_dir argument when running DeepSpeech.py. You can import them with the standard TensorFlow tools and run inference. A simpler inference graph is created in the export function in DeepSpeech.py; you can copy and paste that and restore the weights from a checkpoint to run experiments. Alternatively, you can also use the model exported by export directly with TensorFlow Serving.
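For illustration only, here is a minimal sketch of restoring such a checkpoint with stock TensorFlow 1.x APIs. The checkpoint path and the tensor names are placeholders; the real input/output node names depend on the graph built by DeepSpeech.py and its export function.

```python
import tensorflow as tf

# Hypothetical path -- use the --checkpoint_dir you passed to DeepSpeech.py.
checkpoint_dir = "/path/to/checkpoint_dir"
checkpoint_path = tf.train.latest_checkpoint(checkpoint_dir)

with tf.Session() as session:
    # Rebuild the graph saved next to the checkpoint, then restore the weights.
    saver = tf.train.import_meta_graph(checkpoint_path + ".meta")
    saver.restore(session, checkpoint_path)

    graph = tf.get_default_graph()
    # The tensor names below are placeholders: inspect the graph (or the export
    # function in DeepSpeech.py) to find the actual node names before running anything.
    # input_tensor = graph.get_tensor_by_name("input_node:0")
    # logits = graph.get_tensor_by_name("logits:0")
    # session.run(logits, feed_dict={input_tensor: features})
```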
How can I train on Amazon AWS/Google CloudML/my favorite platform?
Currently we train on our own hardware with NVIDIA Titan X's, so we don't have answers for those questions. Contributions are welcome!
I get an error about native_client/libctc_decoder_with_kenlm.so: undefined symbol during training
You are using a libctc_decoder_with_kenlm.so that is not compatible with your installed version of TensorFlow. Please check that the one you downloaded with util/taskcluster.py comes from the same branch as the TensorFlow version documented in requirements.txt.
I get an error about lm::FormatLoadException during training
If you get an error that looks like this:
Loading the LM will be faster if you build a binary file.
Reading data/lm/lm.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
terminate called after throwing an instance of 'lm::FormatLoadException'
what(): native_client/kenlm/lm/read_arpa.cc:65 in void lm::ReadARPACounts(util::FilePiece&, std::vector<long unsigned int>&) threw FormatLoadException.
first non-empty line was "version https://git-lfs.github.com/spec/v1" not \data\. Byte: 43
Aborted (core dumped)
Then you forgot to install Git LFS before cloning the repository. Make sure you follow the instructions on https://git-lfs.github.com/, including running git lfs install once before you clone the repo.
I get an error about E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Invalid argument: NodeDef mentions attr ‘identical_element_shapes’ when running inference
This is because you are trying to run inference on a model that was trained with a version of TensorFlow that added identical_element_shapes. This happened in TensorFlow r1.5, and the latest v0.1.1 binaries published are built using TensorFlow r1.4. You can either re-train with an older version of TensorFlow, or use newer (but potentially unstable) binaries:
- https://pypi.org/project/deepspeech/#history
- https://www.npmjs.com/package/deepspeech and the "version" tab
- https://tools.taskcluster.net/index/artifacts/project.deepspeech.deepspeech.native_client.v0.2.0-alpha.8
I get an error about Error: Alphabet size does not match loaded model: alphabet has size ... when running inference
Starting with binaries after v0.1.1, we changed the ordering of the command line arguments to make it more consistent. It is likely you followed the steps described in README.md from master or another tag than v0.1.1, and what is happening is that you are passing the audio file instead of the alphabet file. Please verify your version and the usage information with the --help argument. This should give you the proper ordering.
Is it possible to use AMD or Intel GPUs with DeepSpeech?
This is not yet possible. It depends on TensorFlow's OpenCL support through SYCL and Codeplay's ComputeCpp library, which is not yet mature enough. We are hacking on that, and we know it works at least partially on Intel GPUs using the in-development Neo driver. Unfortunately, the currently stable driver, Beignet, does not work, and it will not, because Intel is putting its efforts into the new one.
Please chime in on Discourse or on Matrix if you are interested in hacking on or helping with that topic.
Are we using Deep Speech 2 or Deep Speech 1 paper implementation?
The current codebase's implementation is a variation of the architecture described in the Deep Speech 1 paper. There are differences in terms of the recurrent layers, where we use LSTMs, and also in the hyperparameters.
What is the accuracy of this speech recogniser compared to others using the pretrained models?
As documented on https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/, we achieve a 6.5% Word Error Rate on LibriSpeech's test-clean set.
Why can't I speak directly to DeepSpeech instead of first making an audio recording?
We provide inference tools as a way to easily test the system, but building on top of them is open to anyone. Dealing with a more interactive UX is out of scope for the current target of those tools.
I would like to use this to send voice commands to my Linux Desktop to run commands, open programs and transcribe emails for example.
This requires integration with your system, and the same answer as above applies. Anyone is welcome to contribute that kind of tooling, however.
What is the process of making "custom trained DeepSpeech engines" available to the end user on web, Android apps and iOS apps?
Use TFLite for Android and iOS. For the web, we currently don't have a good answer: the last attempt that was made to leverage TensorFlowJS got blocked by missing support for some of the operations used in our graph (e.g. LSTM). Any explorations into testing TensorFlowJS again are welcome; feel free to open an issue to share your results or track your progress.
In what form (java lib, C lib, javascript, objectiveC, ...) is the "recognition engine" integrated into the mobile apps?
As of now, we don't have anything really tailored to be efficient on those platforms, but you should be able to rely on the C++ deepspeech library in the native_client/ subdirectory. We have received reports of people being able to link it with NDK apps on Android.
Why does the pretrained model always return an empty string? Am I doing it wrong?
Maybe, or maybe not. It's hard to give a definitive answer without more context. It might be more efficient to ask on our Discourse (https://discourse.mozilla.org/c/deep-speech) with more details on what you are doing: the system you are running on, the inference you are doing, how you sourced the audio and its format, the version of deepspeech used, etc.
Add your own question/answer
Please add your own questions and answers above, or ask questions below.
Don't edit this footer for questions; add them to the page with the edit button at the top.