There are many cloud-based speech recognition APIs available today, with the Google Cloud Speech API and the IBM Watson Speech to Text API among the most widely used. But what if you don’t want your application to depend on a third-party service? Or what if you want to create a speech recognition-based application that can work offline? Well, you should consider using Mozilla DeepSpeech.
DeepSpeech is an open-source, TensorFlow-based speech-to-text engine with reasonably high accuracy. Needless to say, it relies on state-of-the-art machine learning techniques.
Installing and using it is surprisingly easy. In this tutorial, I’ll help you get started.
To be able to follow along, you’ll need a computer with a recent version of Python 3 and the pip package manager installed.
Let’s start by creating a new directory to store a few DeepSpeech-related files.
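For example (the directory name here is just a suggestion):

```shell
# Create a working directory and move into it
mkdir speech
cd speech
```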
The easiest way to install DeepSpeech is with the pip tool. Make sure you have it on your computer by running the following command:
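Depending on your setup, the command may be named pip or pip3; invoking it through the Python interpreter sidesteps the ambiguity:

```shell
# Verify that pip is available and print its version
python3 -m pip --version
```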
And now, you can install DeepSpeech for your current user.
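The --user flag installs the package for the current user only, without needing administrator rights. (Note that the deepspeech package on PyPI has evolved since this tutorial’s 0.1.x era, so newer versions may behave differently.)

```shell
# Install DeepSpeech for the current user only
python3 -m pip install --user deepspeech
```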
DeepSpeech needs a model to be able to run speech recognition. You can train your own model, but, for now, let’s use a pre-trained one released by Mozilla. Here’s how you can download it:
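At the time of writing, Mozilla published the pre-trained model on the project’s GitHub releases page. The exact URL below corresponds to the 0.1.0 release and is an assumption on my part; check the releases page for the current file name before running it:

```shell
# Download the pre-trained model archive (about 1.4 GB)
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.1.0/deepspeech-0.1.0-models.tar.gz
```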
You’ll be downloading about 1.4 GB of data, so be patient.
Once the download is complete, extract the archive using the tar command.
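Assuming the archive is named as in the 0.1.0 release (adjust the file name to whatever you downloaded):

```shell
# Extract the model files into the models/ directory
tar -xvzf deepspeech-0.1.0-models.tar.gz
```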
You should now have the following files:
```
models/
models/lm.binary
models/output_graph.pb
models/trie
models/alphabet.txt
```
At this point, I suggest you try running the deepspeech command with the -h option to make sure that it is ready.
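If the installation put the CLI on your PATH (pip’s --user installs land in ~/.local/bin on Linux, which you may need to add to your PATH), this is just:

```shell
deepspeech -h
```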
You should see the following output:
```
usage: deepspeech [-h] model audio alphabet [lm] [trie]

Benchmarking tooling for DeepSpeech native_client.

positional arguments:
  model       Path to the model (protocol buffer binary file)
  audio       Path to the audio file to run (WAV format)
  alphabet    Path to the configuration file specifying the alphabet
              used by the network
  lm          Path to the language model binary file
  trie        Path to the language model trie file created with
              native_client/generate_trie

optional arguments:
  -h, --help  show this help message and exit
```
We’re almost ready now. All we need is a sound file containing speech. DeepSpeech, however, can currently work only with signed 16-bit PCM data. So, you’ll need an audio conversion utility that can convert a regular MP3 to a WAV file. Audacity is good enough.
Just launch Audacity, open your sound file, and then select File > Export Audio. In the dialog that pops up, choose WAV (Microsoft) signed 16-bit PCM.
Finally, you can run the DeepSpeech CLI and pass your WAV file as one of its command-line arguments.
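Assuming your converted file is named myaudio.wav (a hypothetical name) and the model files are in the models/ directory, the arguments follow the usage shown earlier: model, audio, alphabet, and then the optional language model and trie files:

```shell
deepspeech models/output_graph.pb myaudio.wav models/alphabet.txt models/lm.binary models/trie
```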
If there are no errors, you should see something like this:
```
Loading model from file models/output_graph.pb
2018-02-10 16:27:24.676863: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.300s.
Running inference.
there are a lot of people on this planet how do we go about making sure that every single one of them is virtuous
Inference took 35.881s for 15.319s audio file.
```
Congratulations! You’ve managed to run DeepSpeech and convert the speech in a sound file to text. I’m sure you’ll be impressed by its accuracy.
The CLI is usually not enough if you want to use DeepSpeech programmatically. So, let me now show you how to work with it in a Python script.
Start by creating a new file called mystt.py and opening it with your favorite text editor.
Now, you can import the DeepSpeech library with the following line:
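In the 0.1.x releases of the deepspeech package, the Model class lived in the deepspeech.model module (later releases moved it, so adjust the import if you’re on a newer version):

```python
# Import the Model class from the DeepSpeech package (0.1.x layout)
from deepspeech.model import Model
```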
Next, you’ll need SciPy’s wavfile module to handle WAV files.
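The module lives in the scipy.io package:

```python
# SciPy's wavfile module reads a WAV file into a NumPy array
from scipy.io import wavfile
```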
Our script will need the locations of the model, alphabet, and sound files, so it’s a good idea to pass them all to it as command-line arguments. To be able to handle those arguments, you will have to import a module that can read them.
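A minimal choice (an assumption on my part; the standard argparse module would work equally well) is sys:

```python
# sys.argv will hold the model, alphabet, and audio file paths
import sys
```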
You must now initialize a Model instance using the locations of the model and alphabet files. The constructor also expects the number of Mel-frequency cepstral coefficient (MFCC) features to use, a size for the context window, and a beam width for the connectionist temporal classification (CTC) decoder. The values of those numbers must match the values used during training. If you are using the pre-trained model from Mozilla, here’s what you can use:
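Continuing mystt.py, and assuming the hyperparameter values documented for the 0.1.x pre-trained model:

```python
# Hyperparameters used to train Mozilla's pre-trained 0.1.x model
N_FEATURES = 26   # number of MFCC features
N_CONTEXT = 9     # size of the context window, in frames
BEAM_WIDTH = 500  # beam width for the CTC decoder

# sys.argv[1] is the model file, sys.argv[2] the alphabet file
ds = Model(sys.argv[1], N_FEATURES, N_CONTEXT, sys.argv[2], BEAM_WIDTH)
```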
Next, you can read the WAV file using the read() function available in the wavfile module.
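Continuing the script, with the audio path as the third command-line argument:

```python
# read() returns the sample rate and the samples as a NumPy array
fs, audio = wavfile.read(sys.argv[3])
```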
Finally, to perform the speech-to-text operation, use the stt() method of the model. It returns the result as a string, which the script stores in a processed_data variable. You are free to do anything you want with it. For now, let’s just write it to a file.
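In the 0.1.x API, stt() takes the raw samples along with the sample rate (the output file name below is my choice, not from the original):

```python
# Run inference; stt() returns the recognized text as a string
processed_data = ds.stt(audio, fs)

# Save the transcription to a text file
with open("output.txt", "w") as f:
    f.write(processed_data)
```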
And that’s all! You can run your Python script and expect the same result.
You now know how to use the Mozilla DeepSpeech project in your applications. On my computer, which has 4 GB of memory and 4 cores, the recognition operations were quite fast. The accuracy, too, was quite high: not as high as Google’s Cloud Speech API, but usable nonetheless.