Getting Started with Mozilla DeepSpeech on Windows
Learn to install and run Mozilla's DeepSpeech for speech-to-text conversion on Windows, and discover the benefits of the Common Voice project.
Real-time Speech to Text with DeepSpeech - Getting Started on Windows and Transcribe Microphone Free
Added on 01/29/2025

Speaker 1: Hello everyone, in this video I'm going to talk about DeepSpeech, an open-source speech-to-text engine by Mozilla, based on deep learning, which lets us convert speech audio files into text, and I'm going to show you how to get up and running on Windows. So without further ado, let's get started. The first thing we need is a model, and while training your own model is possible, it is very computationally expensive. So for this video, and in general for most of your projects, you should start with a pre-trained model, which you can find in the DeepSpeech repository on GitHub under Releases. At the time of this video we are looking for release 0.7.3, and we scroll down and download two files, the .pbmm acoustic model and the .scorer file. Depending on your connection this can take a few minutes, so I recommend you start the download first. While our models are downloading, the next thing we should do is install DeepSpeech. What I prefer to do is create a new folder inside the Documents folder called, for example, DeepSpeech, and then cd into it. Next we should make sure that we have Python 3.6 installed, and the easiest check is to type python in the terminal and look at the version it reports. This is crucial because Python 3.7 and 3.8 are not compatible with DeepSpeech as of now. If you don't have the right version, go to the Python download page, look for a 3.6 release such as 3.6.7 or 3.6.9, then download and install it. Once we've checked the version, we are ready to create a new virtual environment, and the command to do that is python -m venv . — which means "create the virtual environment here". Alright, it seems like it created the environment.
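The steps described above, sketched as a Windows command-prompt session. The download URLs and filenames are taken from the v0.7.3 release page as an assumption; they will differ for other releases:

```shell
:: Download the pre-trained acoustic model and scorer (assumed v0.7.3 asset names)
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.7.3/deepspeech-0.7.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.7.3/deepspeech-0.7.3-models.scorer

:: Create a working folder and check the Python version (must report 3.6.x)
mkdir "%USERPROFILE%\Documents\DeepSpeech"
cd /d "%USERPROFILE%\Documents\DeepSpeech"
python --version

:: Create a virtual environment in the current folder; the venv prompt
:: will pick up the folder name, so the prompt later shows (DeepSpeech)
python -m venv .
```

Creating the environment with `.` rather than a named subfolder is what makes the folder name appear in the prompt after activation.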
Now we have to activate that environment, and the command to do that is Scripts\activate. As you can see, we now have (DeepSpeech) at the beginning of our prompt line, which means we are inside that environment. Next we are going to pip install deepspeech — be patient, because it usually takes a while. Alright, as you can see we now have DeepSpeech installed, and this is already enough to convert audio files into text, but in this video I wanted to show you how to use your microphone to convert speech to text in real time. To do that we first need one more thing, which can be found in the DeepSpeech-examples repository, also on GitHub. In particular we want the mic_vad_streaming folder, and we want to copy it into the DeepSpeech folder we created earlier. As you can see, if I type dir I now have the mic_vad_streaming folder. I'm now going to cd into it and install all the required dependencies with pip install -r requirements.txt — wait a bit, because it takes a while. Now that the dependencies are installed we can try the example. To do that we type python, then the name of the script, mic_vad_streaming.py, then -m followed by the path of the .pbmm model we downloaded earlier, then -s followed by the path of the .scorer file. Then we can start: "Hello everyone. Goodbye." As you can see, my speech was converted into text. We can experiment with this a lot, and I'm going to show you a cool trick at this point. If we open the mic_vad_streaming.py script and go down to about this line right here, 194, we have the conversion between the stream and the text. One very interesting thing is that if we change finishStream to finishStreamWithMetadata, save, and run it again, the script will now give us time information for each token, and this is super cool — I'm going to show you. "Hello everyone."
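Inside the DeepSpeech folder, the install-and-run sequence looks roughly like this. The relative model paths are an assumption based on the folder layout described above (models downloaded next to the DeepSpeech folder's contents, example copied into a subfolder):

```shell
:: Activate the virtual environment created in the DeepSpeech folder
Scripts\activate

:: Install the DeepSpeech Python package into the environment
pip install deepspeech

:: After copying mic_vad_streaming here from the DeepSpeech-examples
:: repository, install its dependencies
cd mic_vad_streaming
pip install -r requirements.txt

:: Run the example, pointing at the model files downloaded earlier
python mic_vad_streaming.py -m ..\deepspeech-0.7.3-models.pbmm -s ..\deepspeech-0.7.3-models.scorer
```

The `-m` flag takes the acoustic model and `-s` the scorer; speech from the default microphone is then transcribed as you talk.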
As you can see, for each letter it prints the time at which that letter was detected, and this is super useful for many applications. For example, if you want to add subtitles to videos, you now have the exact moment at which each letter should appear. I'm going to show you a nice little project in one of the next videos, but for now you can experiment with this as much as you want. Alright, before leaving you, I'm going to show you another very cool thing related to DeepSpeech, which is the Common Voice project. Most of the cloud speech APIs, such as the one from Google, use big datasets that are proprietary and not free for everybody to use. This makes creating open-source speech-to-text engines such as DeepSpeech very difficult, so Mozilla created the Common Voice project, in which you can donate your voice and also validate clips recorded by other people. This is super useful, and if you have some spare minutes during the day I highly suggest you do it. Another very useful thing is that, because they are building this huge dataset and they built this platform, they offer the same thing for many languages. I personally contributed quite a bit to the Italian dataset because, as of now, there are no good Italian datasets, and if you are not a native English speaker I highly suggest you contribute to Common Voice in your native language, because it really helps the project, and eventually we will get open-source speech-to-text engines for everybody. An open-source speech-to-text engine is a great thing to have because, as of now, we only get good performance from cloud services such as the Google Speech API, which runs only in the cloud and is proprietary. So two good things about DeepSpeech are that it is not proprietary — you can do whatever you want with it — and that you can run it locally, which is very good for many applications.
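To give an idea of what the per-letter timing enables for subtitling, here is a small, hypothetical helper (not part of DeepSpeech) that groups (character, start_time) pairs — the shape of the token data that finishStreamWithMetadata exposes in the 0.7 Python API — into word-level timestamps:

```python
def words_with_times(tokens):
    """Group per-character (text, start_time) pairs into (word, start, end) tuples.

    `tokens` mimics the token stream from DeepSpeech 0.7's
    finishStreamWithMetadata(): one entry per detected character, with the
    time in seconds at which it was heard; a space ends the current word.
    """
    words = []
    chars, word_start, last_time = [], None, None
    for text, start_time in tokens:
        if text == " ":
            if chars:  # close the word in progress
                words.append(("".join(chars), word_start, last_time))
                chars = []
        else:
            if not chars:  # first character of a new word
                word_start = start_time
            chars.append(text)
            last_time = start_time
    if chars:  # flush the final word
        words.append(("".join(chars), word_start, last_time))
    return words
```

For example, tokens spelling "hi you" produce one (word, start, end) tuple per word, ready to be formatted as subtitle cues:

```python
tokens = [("h", 0.10), ("i", 0.18), (" ", 0.30),
          ("y", 0.42), ("o", 0.50), ("u", 0.61)]
words_with_times(tokens)  # [("hi", 0.10, 0.18), ("you", 0.42, 0.61)]
```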
The thing is, the performance of DeepSpeech is not yet as good as its proprietary counterparts, so please contribute to these great projects if you have a bit of time, because eventually we will all benefit from them. So thank you very much, I hope you liked this video; if you did, please consider subscribing to the channel and leaving a like, because it really helps, and I hope to see you next time.
