How to Install and Use Mozilla DeepSpeech


There are many cloud-based speech recognition APIs available today, with the Google Cloud Speech API and the IBM Watson Speech-to-Text API among the most widely used. But what if you don’t want your application to depend on a third-party service? Or what if you want to create a speech recognition-based application that can work offline? Well, you should consider using Mozilla DeepSpeech.

DeepSpeech is an open source, TensorFlow-based speech-to-text engine with reasonably high accuracy. Under the hood, it uses state-of-the-art machine learning techniques.

Installing and using it is surprisingly easy. In this tutorial, I’ll help you get started.

Prerequisites

To be able to follow along, you’ll need:

  • A computer running Ubuntu 16.04 or higher. You are free to use a Google Compute Engine VM or a DigitalOcean Droplet.
  • A basic understanding of Python 2.7

1. Installing DeepSpeech

Let’s start by creating a new directory to store a few DeepSpeech-related files.

mkdir speech
cd speech

The easiest way to install DeepSpeech is to use the pip tool. Make sure you have it on your computer by running the following command:

sudo apt install python-pip

You can now install DeepSpeech for your current user:

pip install deepspeech --user

DeepSpeech needs a model to be able to run speech recognition. You can train your own model, but, for now, let’s use a pre-trained one released by Mozilla. Here’s how you can download it:

wget https://github.com/mozilla/DeepSpeech/releases/download/v0.1.1/deepspeech-0.1.1-models.tar.gz

You’ll be downloading about 1.4 GB of data, so be patient.

Once the download is complete, extract it using the tar command.

tar -xvzf deepspeech-0.1.1-models.tar.gz

You should now have the following files:

models/
models/lm.binary
models/output_graph.pb
models/trie
models/alphabet.txt

At this point, I suggest you try running the deepspeech command with the -h option to make sure that it is ready.

deepspeech -h

You should see the following output:

usage: deepspeech [-h] model audio alphabet [lm] [trie]

Benchmarking tooling for DeepSpeech native_client.

positional arguments:
  model       Path to the model (protocol buffer binary file)
  audio       Path to the audio file to run (WAV format)
  alphabet    Path to the configuration file specifying the alphabet used by
              the network
  lm          Path to the language model binary file
  trie        Path to the language model trie file created with
              native_client/generate_trie

optional arguments:
  -h, --help  show this help message and exit

2. Using the DeepSpeech CLI

We’re almost ready now. All we need is a sound file containing speech. DeepSpeech, however, currently works only with signed 16-bit PCM data, so you’ll need an audio conversion utility that can turn a regular MP3 into such a WAV file. Audacity works well for this.

Just launch Audacity, open your sound file, and then select File > Export Audio. In the dialog that pops up, choose WAV (Microsoft) signed 16-bit PCM.

The Audacity export dialog
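If you’re working on a headless server, like the VMs mentioned in the prerequisites, Audacity isn’t practical. In that case, a command-line converter such as FFmpeg can do the same job; it isn’t part of the original setup, so treat this as an optional alternative. Assuming FFmpeg is installed, the following converts an MP3 to a mono, signed 16-bit PCM WAV file at 16 kHz, which is what Mozilla’s pre-trained model is generally assumed to expect:

ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 /tmp/changed.wav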

Finally, you can run the DeepSpeech CLI and pass your WAV file as one of its command-line arguments.

deepspeech models/output_graph.pb /tmp/changed.wav models/alphabet.txt

If there are no errors, you should see something like this:

Loading model from file models/output_graph.pb
2018-02-10 16:27:24.676863: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.300s.
Running inference.
there are a lot of people on this planet how do we go about making sure that every single one of them is virtuous
Inference took 35.881s for 15.319s audio file.
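As the help output above shows, the lm and trie arguments are optional. The extracted models/ directory contains both files, and passing them lets DeepSpeech use the language model while decoding, which usually improves the transcription:

deepspeech models/output_graph.pb /tmp/changed.wav models/alphabet.txt models/lm.binary models/trie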

Congratulations! You’ve managed to run DeepSpeech and convert speech in a sound file to text. I’m sure you’ll be amazed by its accuracy.

3. Using DeepSpeech as a Library

The CLI is usually not enough if you want to use DeepSpeech programmatically. So, let me now show you how to work with it in a Python script.

Start by creating a new file called mystt.py and opening it with your favorite text editor.

Now, you can import the DeepSpeech library with the following line:

from deepspeech.model import Model

Next, you’ll need SciPy’s wavfile module to handle WAV files.

import scipy.io.wavfile as wav

Our script will need the locations of the model, alphabet, and sound files, so it’s a good idea to pass them all to it as command-line arguments. To handle those arguments, you’ll have to import sys.

import sys
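Optionally, you can add a small guard so the script fails with a helpful message when an argument is missing. This check isn’t part of the original walkthrough, but it makes mistakes easier to spot:

# Optional: bail out early if any of the expected arguments are missing.
if len(sys.argv) != 4:
    print('Usage: python mystt.py <model> <alphabet> <wav-file>')
    sys.exit(1)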

You must now initialize a Model instance using the locations of the model and alphabet files. The constructor also expects the number of Mel-frequency cepstral coefficient (MFCC) features to use, the size of the context window, and a beam width for the connectionist temporal classification (CTC) decoder. These numbers should match the values used during training. If you are using the pre-trained model from Mozilla, here’s what you can use:

ds = Model(sys.argv[1], 26, 9, sys.argv[2], 500)
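Those magic numbers are easier to keep track of if you name them. Here’s the same call with constants that mirror the description above; the values are the ones recommended for Mozilla’s pre-trained model:

N_FEATURES = 26   # number of MFCC features
N_CONTEXT = 9     # size of the context window
BEAM_WIDTH = 500  # beam width for the CTC decoder

ds = Model(sys.argv[1], N_FEATURES, N_CONTEXT, sys.argv[2], BEAM_WIDTH)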

Next, you can read the WAV file using the read() method available in wavfile.

fs, audio = wav.read(sys.argv[3])
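If you want to catch conversion mistakes early, you can also check the sample rate here. The 16 kHz figure below is an assumption about Mozilla’s pre-trained model, so verify it against the release notes before relying on it:

# Optional sanity check: the pre-trained model is assumed to expect 16 kHz audio.
if fs != 16000:
    print('Warning: expected 16000 Hz audio, got %d Hz' % fs)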

Finally, to perform the speech-to-text operation, use the stt() method of the model.

processed_data = ds.stt(audio, fs)

The processed_data variable will contain the result of the operation in the form of a string. You are free to do anything you want with it. For now, let’s just write it to a file.

with open('/tmp/data.txt', 'a') as f:
    f.write(processed_data)
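For reference, here’s the complete mystt.py script, assembled from the snippets above:

from deepspeech.model import Model
import scipy.io.wavfile as wav
import sys

# Initialize the model with the values recommended for Mozilla's pre-trained model.
ds = Model(sys.argv[1], 26, 9, sys.argv[2], 500)

# Read the WAV file and run speech-to-text on it.
fs, audio = wav.read(sys.argv[3])
processed_data = ds.stt(audio, fs)

# Append the transcription to a text file.
with open('/tmp/data.txt', 'a') as f:
    f.write(processed_data)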

And that’s all! You can run your Python script and expect the same result.

python mystt.py models/output_graph.pb models/alphabet.txt /tmp/input1.wav

Conclusion

You now know how to use the Mozilla DeepSpeech project in your applications. On my computer, which has 4 GB of memory and 4 CPU cores, the recognition operations were quite fast. The accuracy, too, was quite high: not as high as that of Google’s Cloud Speech API, but usable nonetheless.

If you found this article useful, please share it with your friends and colleagues!