
10 Good Open Source Speech Recognition Systems [2020]

A speech-to-text (STT) system does what its name implies: it transforms spoken words, captured as audio, into text files that can be used later for any purpose.

Speech-to-text technology is extremely useful. It can be used for many applications, such as automating transcription, writing books and other texts using only your voice, enabling complex analyses of the generated text files and a lot of other things.

In the past, speech-to-text technology was dominated by proprietary software and libraries; open source alternatives either didn't exist or came with extreme limitations and no community around them. This is changing: today there are a lot of open source speech-to-text tools and libraries that you can use right now.

Here we list 10 of them.

Open Source Speech Recognition Libraries

  1. Project DeepSpeech: This project is made by Mozilla, the organization behind the Firefox browser. It's a 100% free and open source speech-to-text library that applies machine learning, using the TensorFlow framework, to fulfill its mission. In other words, you can use it to train models yourself to enhance the underlying speech-to-text technology and get better results, or even to bring it to other languages if you want. You can also easily integrate it into other machine learning projects you have on TensorFlow. Sadly, the project currently only supports English by default. Bindings are available for several languages, including Python (3.6). However, after the recent Mozilla restructuring, the future of the project is unknown; it may be shut down (or not) depending on what Mozilla decides.
  2. Kaldi: An open source speech recognition toolkit written in C++ and released under the Apache 2.0 license. It works on Windows, macOS and Linux, and its development started back in 2009. Kaldi's main advantage over some other speech recognition software is that it's extensible and modular: the community provides tons of third-party modules that you can use for your tasks. Kaldi also supports deep neural networks and offers excellent documentation on its website. While the code is mainly written in C++, it's "wrapped" by Bash and Python scripts, so if you are looking just for basic speech-to-text conversion, you'll find it easy to accomplish via either Python or Bash. You may also wish to check Kaldi Active Grammar, a Python project that ships a pre-built Kaldi engine with trained English models ready for use.
  3. Julius: Probably one of the oldest speech recognition projects around; its development started in 1991 at Kyoto University, and it then continued as an independent project in 2005. A lot of open source applications use it as their engine (think of KDE Simon). Julius's main features include its ability to perform real-time STT, low memory usage (less than 64 MB for 20,000 words), the ability to produce N-best/word-graph output, the ability to work as a server unit and a lot more. The software was mainly built for academic and research purposes. It is written in C, and works on Linux, Windows, macOS and even Android (on smartphones). Currently it supports only the English and Japanese languages. It is probably easy to install from your Linux distribution's repositories; just search for the julius package in your package manager.
  4. Wav2Letter++: If you are looking for something modern, then this one is for you. Wav2Letter++ is an open source speech recognition toolkit released by Facebook's AI Research team in December 2018, under the BSD license. Facebook describes it as "the fastest state-of-the-art speech recognition system available". The concepts this tool is built on make it optimized for performance by default; Facebook's also-new machine learning library Flashlight is used as its underlying core. Wav2Letter++ first requires you to build a training model for the language you desire in order to train the algorithms yourself. No pre-built support for any language (including English) is available; it's just a machine-learning-driven tool for converting speech to text. It is written in C++, hence the name (Wav2Letter++).
  5. DeepSpeech2: Researchers at the Chinese giant Baidu are also working on their own speech-to-text engine, called DeepSpeech2. It's an end-to-end open source engine that uses the PaddlePaddle deep learning framework for converting both English and Mandarin Chinese speech into text. The code is released under the BSD license. The engine can be trained for any language you desire; the models are not released with the code, so you'll have to build them yourself, just like with the other software. DeepSpeech2's source code is written in Python, so it should be easy to get familiar with if that's the language you use.
  6. OpenSeq2Seq: Developed by NVIDIA for training sequence-to-sequence models. While it can be used for much more than speech recognition, it is a good engine for this use case nonetheless. You can either build your own training models with it, or use the Jasper, Wave2Letter+ and DeepSpeech2 models it ships by default. It supports parallel processing across multiple GPUs or multiple CPUs, with heavy use of NVIDIA technologies such as CUDA and NVIDIA graphics cards. Check its speech recognition documentation page for more information.
  7. Fairseq: Another sequence-to-sequence toolkit, developed by Facebook and written in Python on top of the PyTorch framework. It also supports distributed training, and can even be used for translation and more complex language processing tasks.
  8. Vosk: One of the newest open source speech recognition systems, as its development only started in 2020. Unlike the other systems in this list, Vosk is ready to use right after installation: it supports 10 languages (English, German, French, Turkish…) with portable 50 MB models already available to users (larger models, up to 1.4 GB, are also available if you need them). It also works on Raspberry Pi, iOS and Android devices, and provides a streaming API which allows you to connect to it and do your speech recognition tasks online. Vosk has bindings for Java, Python, JavaScript, C# and Node.js.
  9. Athena: An end-to-end engine for automatic speech recognition (ASR). Written in Python and licensed under the Apache 2.0 license. It supports unsupervised pre-training and multi-GPU processing, and is built on top of TensorFlow.
  10. ESPnet: Written in Python on top of PyTorch. It also supports end-to-end ASR, and follows Kaldi's style for data processing, which makes migrating from Kaldi to ESPnet easier. ESPnet's main selling point is the state-of-the-art performance it achieves in many benchmarks, along with its support for other language processing tasks such as text-to-speech (TTS), machine translation (MT) and speech translation (ST). Licensed under the Apache 2.0 license.
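Several of the engines above rank multiple candidate transcriptions instead of committing to a single one (Julius's N-best output, Kaldi's word graphs). As a rough, self-contained illustration of the idea, here is a toy beam search over per-step word scores. The words and probabilities below are made up for the example; real engines derive such scores from acoustic and language models:

```python
import heapq
from math import log

def n_best(step_scores, n=3, beam=5):
    """Toy beam search: step_scores is a list of {word: probability}
    dicts, one per time step. Returns the n highest-scoring word
    sequences together with their log-probabilities."""
    beams = [(0.0, [])]  # (log-probability, word sequence so far)
    for scores in step_scores:
        # extend every surviving hypothesis with every candidate word
        candidates = [
            (lp + log(p), seq + [word])
            for lp, seq in beams
            for word, p in scores.items()
        ]
        # keep only the `beam` best partial hypotheses
        beams = heapq.nlargest(beam, candidates, key=lambda c: c[0])
    return beams[:n]

# Made-up per-step word probabilities for a two-step utterance.
steps = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.7, "beach": 0.3},
]
for lp, seq in n_best(steps, n=3):
    print(" ".join(seq), round(lp, 3))
```

Instead of one answer, the caller gets a ranked list ("recognize speech", "wreck a nice speech", …), which is exactly what makes N-best output useful for downstream rescoring or correction.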

Which Open Source Speech Recognition System to Choose?

If you are building a small application which you want to be portable everywhere, then Vosk is your best option: it has Python bindings, works on iOS, Android and Raspberry Pi too, and supports 10 languages. It provides large models if you need maximum accuracy, and small 50 MB ones for portable applications.

If, however, you want to train and build your own models for more complex tasks, then any of Fairseq, OpenSeq2Seq, Athena and ESPnet should be more than enough for your needs; they are the most modern, state-of-the-art toolkits.

As for Mozilla's DeepSpeech, it lags behind the other competitors in this list in features, and isn't cited much in speech recognition research compared to the others. Its future is also concerning after the recent Mozilla restructuring, so one may want to stay away from it for now.

Julius and Kaldi, on the other hand, are traditionally well cited in the academic literature.

Conclusion

The speech recognition field is starting to become driven mainly by open source technologies, a situation that seemed far-fetched only a few years ago. Current open source speech recognition software is modern and bleeding-edge, and you can use it for any purpose instead of depending on Microsoft's or IBM's toolkits.

If you have any other recommendations for this list, or comments in general, we’d love to hear them below!

13 Comments
Pierre Mainstone-Mitchell

Is the Android speech to text app going to be ported to, at least, Linux (which I use)? I have it on my phone and it’s really good!

Also are there any text to speech programs available, again for at least Linux?

UbisoftP

I could be mistaken, but I believe your Android phone sends the audio to a Google server, which performs the speech to text conversion and then sends the result back to your phone.

M.Hanny Sabbagh

As far as I know, nobody is working on porting individual applications from Android to GNU/Linux.

There’s a program called KDE Simon, you can check for it.

Bob Putnam

There’s a Chrome browser extension that works extraordinarily well.

Lootosee

All these projects seem pretty useless if they aren’t packaged in an executable or binary format for use on a particular OS. Short of techie or geek types, regular people are not going to tweak or compile source code. The Windows OS already has SAPI, so what is the incentive to try one of these projects? These projects are not making themselves accessible to the masses.

Roger

Programmers will take these projects and from them, develop easy-to-use projects.
These projects are the vital first step. Almost nobody can do both cutting-edge neural-network research, and user-friendly GUIs. They’re not the same skill set.

We should all be very grateful that the developers of these projects have released them as free software so that other developers can build on them.

M.Hanny Sabbagh

Those projects are simply not for regular people; they are for programmers and those who are building a system that requires speech recognition, who can then use these systems instead of the proprietary ones.

Sarah

All fine and good, except even for stuff like pocketsphinx, nobody bothers to explain how to write out a terminal command for it in Linux.

Rather than saying it's for programmers, why not say "it's for a subset of programmers who can self-learn their own terminal commands".

I am a programmer, and there are no tutorials on it worth anything.

David Roper

I want a program I can talk into a microphone and ASCII text will be formed. I am not a programmer. Is there one?

Christian

This article provided me a good starting point, thank you for publishing it. For future me's I'd like to add a reference to https://github.com/alphacep/vosk-api (found it through https://cmusphinx.github.io/wiki/arpaformat/, haven't tried it yet). As a native German speaker I like the fact that VOSK seems to support German (as well as English, "French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese. More to come."). It uses Kaldi underneath. The list of models can be found under https://alphacephei.com/vosk/models.

Roger

Useful article as far as it goes, but I expected to see some information about how well the different packages work: how accurate they are at transcribing text, and so on. Of course the exact accuracy percentage will depend on the speaker's accent and other factors, but it would still be useful to give some rough measurements.

Geoff

I’ll second that. I wonder if there’s some benchmark for this (perhaps a set of famous speeches, or sample of youtube videos) which could be run against the various packages, to evaluate them.

Aaron Chantrill

This article did a good job of listing what is available right now, and the basic pros and cons of each. If you just download any of these with the default models and just try to talk to it, you are going to be disappointed/amused. But if you can limit your vocabulary and language model and then adapt the acoustic model to your voice, you'll get much better results quickly. Understanding what someone is saying requires a lot of different skills. You are creating meaning as you listen which allows you to anticipate what you expect to hear and fill…
