Mozilla Introduces DeepSpeech 0.9 Speech Recognition Engine

Mozilla has published the release of the DeepSpeech 0.9 speech recognition engine, which implements the speech recognition architecture of the same name proposed by Baidu researchers.

The implementation is written in Python using the TensorFlow machine learning platform and is distributed under the free MPL 2.0 license.

About DeepSpeech

DeepSpeech consists of two subsystems: an acoustic model and a decoder. The acoustic model uses deep machine learning techniques to calculate the probability that certain characters are present in the input sound.

The decoder uses a beam search algorithm to transform the character probability data into a textual representation. DeepSpeech is much simpler than traditional systems, and at the same time it provides higher-quality recognition in the presence of extraneous noise.
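
To make the decoder's role concrete, here is a minimal, illustrative beam search over a matrix of per-character probabilities. This is a toy sketch, not the project's actual decoder (which is a CTC beam search implemented in C++ with an external language-model scorer); the alphabet and probabilities are invented for the example.

    import numpy as np

    # Toy alphabet; the real DeepSpeech decoder also handles a CTC "blank"
    # symbol and consults a language-model scorer, both omitted here.
    ALPHABET = [" ", "a", "b", "c"]

    def beam_search(char_probs, beam_width=3):
        """Keep the `beam_width` most probable partial transcripts per step.

        char_probs: array of shape (timesteps, len(ALPHABET)) holding the
        per-character probabilities emitted by the acoustic model.
        """
        beams = [("", 0.0)]  # (transcript, log probability)
        for step in char_probs:
            candidates = []
            for text, logp in beams:
                for i, p in enumerate(step):
                    candidates.append((text + ALPHABET[i],
                                       logp + np.log(p + 1e-12)))
            # Prune to the best `beam_width` hypotheses.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)
            beams = beams[:beam_width]
        return beams

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        probs = rng.dirichlet(np.ones(len(ALPHABET)), size=5)  # fake output
        for text, logp in beam_search(probs):
            print(f"{text!r}  log p = {logp:.2f}")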

The development does not use traditional acoustic models and the concept of phonemes; instead, a well-optimized neural network-based machine learning system is used, which eliminates the need to develop separate components to model various anomalies such as noise, echo, and speech characteristics.

The kit offers trained models, sample sound files, and command-line recognition tools.
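
As a rough sketch of how the kit is used from Python: the snippet below assumes the `deepspeech` package and the downloadable 0.9 model files, with the file names used here as placeholders for whatever you actually download.

    import wave
    import numpy as np
    import deepspeech  # pip install deepspeech

    # Placeholder names for the downloadable 0.9 model and scorer files.
    model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    # DeepSpeech expects 16 kHz, 16-bit, mono PCM audio.
    with wave.open("audio.wav", "rb") as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()),
                              dtype=np.int16)

    print(model.stt(audio))  # prints the recognized text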

The finished model is supplied for English and Chinese only. For other languages, you can train the system yourself by following the included instructions, using the voice data collected by the Common Voice project.
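
Training for another language is driven by the repository's DeepSpeech.py script. The sketch below shows roughly what an invocation looks like; the flag names follow the project's training documentation as we understand it, and the CSV paths are placeholders for files produced by importing Common Voice clips.

    import subprocess

    # Hypothetical paths: CSVs produced by the project's Common Voice
    # importer, plus an alphabet file for the target language.
    subprocess.run([
        "python3", "DeepSpeech.py",
        "--train_files", "clips/train.csv",
        "--dev_files", "clips/dev.csv",
        "--test_files", "clips/test.csv",
        "--alphabet_config_path", "alphabet.txt",
        "--epochs", "30",
        "--export_dir", "exported-model/",
    ], check=True)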

When the ready-made English model offered for download is used, the recognition error rate in DeepSpeech is 7.06% when evaluated with the LibriSpeech test suite.
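
For reference, the figure quoted here is a word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference transcript, divided by the reference length. A self-contained sketch of the metric:

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + insertions + deletions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance over words, via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the cat sat on the mat",
                          "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.167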

For comparison, the human recognition error rate is estimated at 5.83%.

In the proposed model, the best recognition results are achieved with a clean recording of a male voice with an American accent in an environment free of extraneous noise.

According to the author of the Vosk continuous speech recognition library, the disadvantages of the Common Voice set are the one-sidedness of the speech material (the predominance of men in their 20s and 30s, and the lack of material with the voices of women, children, and the elderly), the lack of vocabulary variability (repetition of the same phrases), and the distribution of recordings as MP3, which is prone to distortion.

Disadvantages of DeepSpeech include poor performance and high memory consumption in the decoder, as well as the significant resources required to train the model (Mozilla uses a system with 8 Quadro RTX 6000 GPUs, each with 24 GB of VRAM).

The downside of this approach is that, for high-quality recognition and training of the neural network, the DeepSpeech engine requires a large amount of heterogeneous data dictated in real conditions by different voices and in the presence of natural noise.

This data is compiled by the Common Voice project created at Mozilla, which provides a verified data set with 1,469 hours of English, 692 of German, 554 of French, 105 of Russian, and 22 of Ukrainian.

When training the final English model for DeepSpeech, data from the LibriSpeech, Fisher, and Switchboard projects is used in addition to Common Voice, along with approximately 1,700 hours of transcribed radio program recordings.

Among the changes in the new branch, the highlight is the ability to force the weighting of selected words during the decoding process.
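
In the Python bindings this word weighting is exposed as "hot-word" boosting. A minimal sketch, assuming the 0.9 API and the same placeholder model files as above:

    import deepspeech

    model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    # Bias decoding toward (or away from) specific words; the boost value
    # raises or lowers a word's score during the beam search.
    model.addHotWord("firefox", 7.5)   # more likely to appear in transcripts
    model.addHotWord("fire", -5.0)     # less likely
    model.clearHotWords()              # remove all boosts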

Also highlighted are support for the Electron 9.2 platform and an optional implementation of the layer normalization mechanism (Layer Norm) when training the neural network.

Getting DeepSpeech

Performance is sufficient to use the engine on LePotato, Raspberry Pi 3, and Raspberry Pi 4 boards, as well as on Google Pixel 2, Sony Xperia Z Premium, and Nokia 1.3 smartphones.

Ready-made modules are offered for Python, NodeJS, C++, and .NET to integrate speech recognition functions into your programs (third-party developers have separately prepared modules for Rust, Go, and V).
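
For integration, the bindings also expose a streaming interface that yields intermediate transcripts as audio arrives. A minimal sketch with the Python module, assuming the placeholder model file from above and 16 kHz, 16-bit mono input:

    import wave
    import numpy as np
    import deepspeech

    model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
    stream = model.createStream()

    # Feed PCM audio in chunks, e.g. from a file or a microphone callback.
    with wave.open("audio.wav", "rb") as wav:
        while True:
            chunk = wav.readframes(1024)
            if not chunk:
                break
            stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
            print(stream.intermediateDecode())  # partial transcript so far

    print(stream.finishStream())  # final transcript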

