DeepSpeech: Mozilla's speech recognition engine


Mozilla nowadays works not only on its popular web browser, but also on a variety of projects under its umbrella, and today we will talk about one of them: DeepSpeech. This is a speech recognition engine that implements the speech recognition architecture of the same name proposed by Baidu researchers.

DeepSpeech stands out for offering trained models, sample audio files, and command-line recognition tools, making it easy to integrate speech recognition into your own programs. Ready-made modules are provided for Python, NodeJS, C++ and .NET, while external developers have also prepared separate modules for Rust and Go.
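
As a rough illustration, this is how a transcription might look with the Python module. The model path and WAV file below are assumptions, and the constructor's arguments have changed between releases, so check the documentation of the version you install:

    # Minimal sketch: transcribe a 16 kHz, 16-bit mono WAV file with the
    # DeepSpeech Python bindings (pip install deepspeech). Paths are illustrative.
    import wave
    import numpy as np
    from deepspeech import Model

    # The 0.6 series also expected a beam width as the second argument;
    # later releases dropped it.
    model = Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)

    with wave.open("audio/sample.wav", "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)

    print(model.stt(audio))  # plain-text transcription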

A finished model is delivered only for English, but for other languages the system can be trained, following the attached instructions, using the voice data collected by the Common Voice project.

About DeepSpeech

DeepSpeech is much simpler than traditional systems while at the same time providing higher-quality recognition in the presence of extraneous noise.

The development does not use traditional acoustic models or the concept of phonemes; instead, it relies on a well-optimized machine-learning system based on a neural network, which eliminates the need to develop separate components to model deviations such as noise, echo, and speaker characteristics.

The flip side of this approach is that, to obtain high-quality recognition and to train the neural network, the DeepSpeech engine requires a large amount of heterogeneous data dictated in real conditions by different voices and in the presence of natural noise.

The Common Voice project created by Mozilla is responsible for collecting such data, providing a verified data set with 780 hours in English, 325 in German, 173 in French and 27 hours in Russian.

The end goal of the Common Voice project is to accumulate 10 thousand hours of recordings of the various pronunciations of phrases typical of human speech, which would achieve an acceptable level of recognition errors. In its current form, the project's participants have already dictated a total of 4.3 thousand hours, of which 3.5 thousand have passed verification.

To train the final English model for DeepSpeech, 3816 hours of speech were used, covering, in addition to Common Voice, data from the LibriSpeech, Fisher and Switchboard projects, as well as around 1700 hours of transcribed radio program recordings.

When using the ready-to-download English model, the recognition error rate of DeepSpeech is 7.5% when evaluated with the LibriSpeech test suite. By way of comparison, the error rate of human recognition is estimated at 5.83%.
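
These figures refer to the word error rate (WER). As a minimal sketch of how such a figure is obtained, the edit distance between the reference transcript and the hypothesis is divided by the number of reference words:

    # Illustrative WER computation via word-level edit distance.
    def word_error_rate(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution
        return dp[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167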

DeepSpeech consists of two subsystems: an acoustic model and a decoder. The acoustic model uses deep machine-learning methods to calculate the probability of certain characters being present in the input sound. The decoder uses a beam search algorithm to convert the character probability data into a text representation.
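
As an illustration of the decoder's role, here is a heavily simplified beam search over a matrix of per-timestep character probabilities. The real decoder also handles CTC blank symbols and consults a language model, which this toy sketch omits:

    import math

    def beam_search(probs, alphabet, beam_width=3):
        # beams maps a text prefix to its log probability
        beams = {"": 0.0}
        for step in probs:
            candidates = {}
            for prefix, logp in beams.items():
                for idx, p in enumerate(step):
                    if p <= 0.0:
                        continue
                    new_prefix = prefix + alphabet[idx]
                    new_logp = logp + math.log(p)
                    if new_logp > candidates.get(new_prefix, float("-inf")):
                        candidates[new_prefix] = new_logp
            # keep only the most probable prefixes
            beams = dict(sorted(candidates.items(),
                                key=lambda kv: kv[1], reverse=True)[:beam_width])
        return max(beams, key=beams.get)

    alphabet = ["a", "b", " "]
    probs = [[0.6, 0.3, 0.1],
             [0.2, 0.7, 0.1]]
    print(beam_search(probs, alphabet))  # -> "ab" for this toy input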

About the new version of DeepSpeech

DeepSpeech is currently at version 0.6, in which the following changes stand out:

  • A new streaming decoder is provided that offers greater responsiveness and does not depend on the size of the processed audio data (see the sketch after this list).
  • Changes have been made to the API and work has been done to unify function names. Functions have been added to obtain additional timing metadata, making it possible not only to receive a text representation as output, but also to trace the binding of individual characters and sentences to positions in the audio stream.
  • Support for using the CuDNN library to optimize work with recurrent neural networks (RNN) has been added to the toolkit for training modules.
  • The minimum requirements for the TensorFlow version have been raised from 1.13.1 to 1.14.0.
  • Support for TensorFlow Lite has been added, which reduces the DeepSpeech package size from 98 MB to 3.7 MB.
  • The language model has been transferred to a different data structure format, allowing files to be memory-mapped at load time.
  • Support for the older format has been discontinued.
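
The following is a rough sketch of the streaming decoder and the timing metadata, written against the 0.6-era Python bindings, where the stream context is passed back to the model (later releases reorganized this API); the file path and chunk size are illustrative assumptions:

    import wave
    import numpy as np
    from deepspeech import Model

    model = Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)  # illustrative path

    with wave.open("audio/sample.wav", "rb") as wav:  # 16 kHz, 16-bit mono expected
        frames = wav.readframes(wav.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)

    # Streaming: feed the audio in chunks and ask for intermediate results
    # without waiting for the whole recording.
    stream = model.createStream()
    chunk_size = 1024
    for offset in range(0, len(audio), chunk_size):
        model.feedAudioContent(stream, audio[offset:offset + chunk_size])
        print("partial:", model.intermediateDecode(stream))
    print("final:", model.finishStream(stream))

    # Timing metadata: each recognized character carries its position in the audio.
    metadata = model.sttWithMetadata(audio)
    for item in metadata.items:
        print(item.character, item.start_time)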

The implementation is written in Python using the TensorFlow machine learning platform and is distributed under the free MPL 2.0 license. It is supported on Linux, Android, macOS and Windows. Performance is sufficient to use the engine on LePotato, Raspberry Pi 3 and Raspberry Pi 4 boards.

