Speaker 1: Hi, my name is Vic, and today we're going to build a live speech recognition system that works with your microphone. You'll be able to press a button, record some audio, and have that audio automatically converted into a text transcript. And the amazing thing is that it all runs on your local machine using your CPU, so you don't need a fancy computer or a GPU to make this work. We'll start by building a Jupyter widget that lets you press a button to start and stop recording audio, and then we'll build a voice recognition pipeline that converts that audio into a text transcript. Let's get started.

We'll end up with a Jupyter notebook that has two interactive buttons, one that says Record and one that says Stop. When we click the Record button, like I just did, the system starts recording your microphone and transcribing it live. As you can see, the transcript has appeared, and as I keep talking, it continues to add to the transcript. When I'm done talking, I can just hit Stop, and the live transcription stops. This system can be used for almost anything where you want to record your microphone: live events, meeting notes, or interviews.

To create this system, we're going to do three things. First, we'll create Jupyter widgets, like I mentioned: interactive buttons and an output area. Then we'll install a library called PyAudio that we can use to record our microphone in the background. Then we'll install a library called Vosk that does the speech recognition for us, and we'll add its output to our Jupyter widget. These three components together give us the system I just showed you for generating live transcripts of your microphone audio. So enough setup, let's dive in.

I've created a fresh Jupyter notebook inside JupyterLab to code this. You can use Jupyter Notebook, JupyterLab, or any other IDE that lets you create notebooks. Unfortunately, you can't run this on Google Colab or anywhere else in the cloud, since it needs access to your local microphone. The first thing we'll do is pip install ipywidgets; these are the interactive widgets that create the buttons for us. I'll run that, and since you can see I've already installed it, I'll delete the cell. Then we can import it: import ipywidgets as widgets.

Next we'll create our first button by calling widgets.Button. The description is the text that appears on the button, so that will be "Record". We'll set disabled to False, since we want the button to be enabled, and button_style to "success", which determines the color of the button; this makes it green. Then we'll add a little microphone icon to the button. To actually display the widget, we import a function from IPython.display called display, call it, and pass in our widget, which renders it for us. We now have a Record button, and we can click it, but it doesn't do anything yet, because we haven't wired up anything to run when we click it.
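Condensed into code, the button-creation step might look like this minimal sketch; the "microphone" icon name is one reasonable choice from the Font Awesome set that ipywidgets uses:

```python
import ipywidgets as widgets
from IPython.display import display

# Green "Record" button with a microphone icon
record_button = widgets.Button(
    description="Record",
    disabled=False,
    button_style="success",  # makes the button green
    icon="microphone",
)

display(record_button)
```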
We'll do that in a little bit, but first let's set up our other widgets. The next widget we need is our Stop button, which we set up in a very similar way, so I'll copy and paste this code and rename it stop_button. Then I'll change the description to "Stop". Instead of the "success" style, we'll use the "warning" style, which gives it that orange color, and the icon will be a stop icon. We pass this into display as well to render it, and if we run the cell again, we now see both buttons.

We also need an output widget; this is the widget that will show the transcript as it's generated. We create it by calling widgets.Output(), which gives us an instance of an output widget, and we pass that into the display function too. You can't really see the output widget, since there's no text in it right now, but it's right below the Stop button.

We now have all of the widgets we need; they just don't do anything yet, and the fun part of this project is making them do something. We need to write two functions: one that runs when we click Record, and another that runs when we stop recording. We'll call the first one start_recording, and it takes a data argument; when you click Record, Jupyter automatically passes in this data parameter, which is why the function has it. For now its body is just pass. Then we'll define stop_recording(data) and pass on that one too. Next we wire up the widgets to call those functions when clicked: record_button.on_click(start_recording), so clicking the Record button calls that function, and stop_button.on_click(stop_recording) for the Stop button. I'll run this; if you click Record, it now calls the function, although the function doesn't do anything, so nothing happens, and clicking Stop calls stop_recording.

Now we need to make these functions actually do something. In a normal Python program, say one where you assign a value to x and then print x, each line of code executes sequentially: the first line runs, then the second, and then the program is finished. That's the typical way we execute programs. For this project, though, we need to listen to the microphone in the background and transcribe in the background, so we need to perform two tasks at the same time: one task to continuously record the microphone, and another to transcribe the audio into text. To do this, we create threads. Threads are a Python concept; they're essentially functions that run in the background, so they don't interrupt your main program but can do something like record audio continuously. To use them, we import Thread from the threading module. We're also going to need to pass messages to the threads: when we want a thread to stop, we need to send it a message, and the thread recording the microphone needs to pass the data it records to the thread doing the transcription.
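Here's a minimal sketch of the Stop button, the output widget, and the click wiring described above, with the handler bodies still empty (they get filled in next):

```python
# Orange "Stop" button with a stop icon
stop_button = widgets.Button(
    description="Stop",
    disabled=False,
    button_style="warning",  # makes the button orange
    icon="stop",
)
display(stop_button)

# Output area where the live transcript will appear
output = widgets.Output()
display(output)

def start_recording(data):
    # Jupyter passes the clicked button in as "data"
    pass

def stop_recording(data):
    pass

# Run the handlers whenever the buttons are clicked
record_button.on_click(start_recording)
stop_button.on_click(stop_recording)
```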
So we need to import something else called a queue: from queue import Queue. A queue lets us pass messages between threads, and we'll create two of them. One, called messages, will basically tell the threads when to stop recording and when to stop transcribing. The other, called recordings, will store the audio data we record from the microphone and pass it to the transcription thread.

Now we can write some code in the start_recording function. We'll say messages.put(True); we're going to write the thread logic in a second, but this basically tells a thread to keep running and keep recording the microphone. Then we'll say with output: — using this context manager lets us write printed data directly into that output widget we created — and inside it, display "Starting". Then we'll create record = Thread(target=record_microphone). We have not written this function yet; we still need to write it, but what this does is create a thread that records the microphone. Then record.start() starts the thread and has it run in the background, continuously listening to the microphone and recording it. Then we'll create another thread called transcribe, whose target is a function called speech_recognition. We haven't written that function yet either, so this code won't work yet, but we need to write this code before we write the function, just so you can see how it works. Then we write transcribe.start().

So what start_recording does is put a message on the messages queue, and that message tells these threads to keep running. If that's not super clear yet, don't worry about it; when we actually write the code for record_microphone and speech_recognition, you'll see what the messages queue is doing. Then it creates two threads and runs them in the background: one to record our microphone audio, and another to transcribe the audio into text.

For stop_recording, we'll again say with output:, then messages.get(), which takes the message off the queue; start_recording puts a message called True onto the queue, and this just takes it off. Then we'll display "Stopped". All right, let me comment out the thread lines so you can see how this code works so far. I'll hit run; when we hit Record, we see "Starting", and when we hit Stop, we see "Stopped". Clicking Record runs start_recording, and clicking Stop runs stop_recording.

So that's basically everything we have to do to get the Jupyter widgets working. Our goal now is the next two pieces: writing the record_microphone function and writing the speech_recognition function. Let me run this, and next we'll go on and write our microphone function. To do this, we're going to need a library called PyAudio; let me show you that library real quick. It lets you interact with system audio devices like your speakers or your microphone.
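Assembled, the queue setup and the two handlers sketched above might look like this; note that record_microphone and speech_recognition don't exist yet and are written in the next two sections:

```python
from queue import Queue
from threading import Thread

messages = Queue()    # while non-empty, the threads keep running
recordings = Queue()  # carries recorded audio chunks to the transcriber

def start_recording(data):
    messages.put(True)
    with output:
        display("Starting...")
    # Two background threads: one records, one transcribes
    record = Thread(target=record_microphone)        # written below
    record.start()
    transcribe = Thread(target=speech_recognition,   # written below
                        args=(output,))
    transcribe.start()

def stop_recording(data):
    with output:
        messages.get()  # empty the queue, signaling both threads to exit
        display("Stopped.")
```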
So if you want to play a sound on your computer's speakers or record your microphone, which is what we want to do, PyAudio will help you do it. The installation is a little different depending on which OS you're using. I use a Mac, so I follow the Mac instructions here; there are separate instructions for Windows, and for Linux as well, so just keep in mind that it'll be a little different based on your OS. For me on Mac, I'll type an exclamation mark, which means run a command in the terminal, followed by brew install portaudio. I've already installed this, so I'm not going to run it. After that, I'll run pip install pyaudio, which installs the PyAudio package.

After this, we need to figure out which of our sound devices is the microphone we want. To do this, we first import pyaudio, then initialize our PyAudio interface (I missed the capital letter there); this initializes PyAudio's connection to our system sound devices. Then we loop through all of the sound devices on our system and print out their info: for i in range(p.get_device_count()). get_device_count() returns how many audio devices are connected to our system, and we print get_device_info_by_index(i) for each of them. When we're finished, we run p.terminate(), which just closes PyAudio's connection to our audio devices. Let's run this. You can see all of the audio devices connected to my computer: the first is my monitor, which has speakers; the second is my Cam Link, which I use to connect to my webcam; then I have my microphone. This is the microphone I want to record from, and its index is 2. So just remember the index of the microphone you want to record from and ignore the other devices you don't need; we'll be using that index in a second.

Now we can write the function that records our microphone, but before we do, there are a couple of constants to define. These determine how our audio is recorded, and all of them are values that work well for speech recognition. Audio can be recorded in multiple channels; if you listen with headphones, you might notice your left headphone sounds a little different from your right because they're different audio channels. Speech recognition works best with single-channel audio, so I'm setting channels to 1. Frame rate determines how high quality the recording is: it's how quickly the audio signal is sampled. I'm setting the frame rate to 16,000; again, that's a good default for speech recognition. Then I'm setting something called record seconds, which is how many seconds we record before sending the audio off for transcription, so every 20 seconds we'll generate a transcript. We'll talk a little later about how to get this closer to truly live; for now there will be a small delay between you saying something and it being transcribed, but there are a couple of things you can do to fix that, and I'll talk about them at the end. Finally, we define the audio format, which is the format we record our audio in.
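The device-enumeration step described above, as runnable code:

```python
import pyaudio

p = pyaudio.PyAudio()  # open a connection to the system's audio devices

# Print info for every audio device; note the "index" of your microphone
for i in range(p.get_device_count()):
    print(p.get_device_info_by_index(i))

p.terminate()  # close the connection again
```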
I'm going to use pyaudio.paInt16, which again is what we'll use with our speech recognition engine, and then we'll define the sample size. Now we can write our function to record from the microphone. It takes a parameter called chunk, which defines how many frames we read from the microphone at a time.

All right. We again initialize PyAudio, then create a stream; this connects to our microphone and records. We specify format=AUDIO_FORMAT, channels=CHANNELS, and rate=FRAME_RATE, which are just the options for how we want to record. We pass input=True, which specifies that we're recording from a microphone, and then input_device_index; you should use the index of your microphone here. Mine was at index 2, so use whatever index your microphone has. Finally, frames_per_buffer=chunk determines how often we read audio data from the microphone.

Then we create a list called frames, which will store all of the audio recorded from our microphone, and we write while not messages.empty():. Recall that when we started and stopped recording earlier, we put data onto the messages queue and took data off. What this loop does is keep recording as long as there is a message in the messages queue; once the message comes off the queue, recording stops. So what the Stop button is doing is taking the message off the queue, which means recording will stop. Communicating between threads is tough, which is why we have to use the queue to do it. Inside the loop, stream.read(chunk) reads 1,024 audio frames at a time from the microphone, and we append that to our frames list.

Then we say: if the length of frames is greater than or equal to frame rate times record seconds, divided by chunk — frame rate being the number of frames recorded per second — then add our audio data to the recordings queue and reset frames to an empty list. Let me talk through what this is doing. It's saying: if we've recorded at least 20 seconds of audio (you'll remember we want to send our audio out for transcription every 20 seconds), then put a copy of frames onto the recordings queue, where it'll be picked up by our other thread, and then empty out the frames list. So basically, every 20 seconds we pass the audio we've recorded to our transcription engine and start recording the next 20 seconds, as long as we haven't told it to stop recording.

Once we have told it to stop recording, we call stream.stop_stream(), stream.close(), and then p.terminate(). It's really important to add these three lines, because PyAudio actually opens a connection to your microphone, and if you don't close that connection, it can cause some weird system issues that are hard to debug. So just make sure you write these lines. Then we can go ahead and run this.
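Putting the constants and the recording loop together, a sketch of the function might look like this; the device index of 2 is specific to my machine, so swap in your own:

```python
CHANNELS = 1                    # mono works best for speech recognition
FRAME_RATE = 16000              # samples per second
RECORD_SECONDS = 20             # hand audio to the transcriber every 20 s
AUDIO_FORMAT = pyaudio.paInt16  # 16-bit samples
SAMPLE_SIZE = 2                 # bytes per paInt16 sample

def record_microphone(chunk=1024):
    p = pyaudio.PyAudio()
    stream = p.open(format=AUDIO_FORMAT,
                    channels=CHANNELS,
                    rate=FRAME_RATE,
                    input=True,
                    input_device_index=2,  # replace with your mic's index
                    frames_per_buffer=chunk)

    frames = []
    while not messages.empty():           # run until Stop empties the queue
        frames.append(stream.read(chunk))

        # Once RECORD_SECONDS worth of frames has accumulated,
        # hand a copy to the transcription thread and start over
        if len(frames) >= (FRAME_RATE * RECORD_SECONDS) / chunk:
            recordings.put(frames.copy())
            frames = []

    # Always release the microphone, or you may hit odd system issues
    stream.stop_stream()
    stream.close()
    p.terminate()
```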
And we now have a function to record our microphone audio: when we click Record, the thread runs record_microphone in the background. Now all we have to do is write the transcription function, which actually turns the audio into text. Let's go ahead and do that.

We'll be using a couple of libraries here, so let me talk you through what we'll be doing. We're going to use a Python library called Vosk to do our speech recognition. Vosk has pre-trained models that support 20-plus languages and dialects. We'll be using an English model because we're recording English audio, but feel free to use a different one. The amazing thing about Vosk is that it works offline: all the speech recognition happens on your own computer, and it runs even on devices without a GPU — it can even run on a Raspberry Pi or an Android phone. In this case we'll run it on a computer, but you don't need a really fancy machine or a GPU. And it installs easily with just pip install vosk.

Looking at the Vosk models page, you can see there are several models for each language. For English there are three; we'll use the large one in this video. It's a really big model, so if you don't feel like downloading a 1.8-gigabyte file, feel free to use the small one instead, and I'll show you how when we get to that code.

The unfortunate thing about Vosk is that it outputs transcripts without any punctuation: no capitalization, no periods, nothing. So we need another model to add the punctuation back in. If you scroll down the models page to the punctuation section, you'll see the Vosk recasepunc model; if you want punctuation, you'll also need to download this. It's not strictly necessary — if you're okay with a transcript without punctuation, you can skip it — but I'm going to show you how to use it, and I recommend doing it. You just click right there to download it.

Okay, let's head back to our notebook to install all the packages and run our speech recognition. First, pip install vosk, which I've already done. Then pip install transformers, which is needed by recasepunc, the model that adds the punctuation back into the transcript. Then install PyTorch with pip install torch, another requirement of the recasepunc model. Those are the three installs.

Then we import a few packages. We import subprocess, which we'll use to call our punctuation model; it's a standard-library module that spawns a separate process so you can run commands on the command line. We import json, and from vosk we import Model and KaldiRecognizer. Then we can write model = Model(model_name=...), and I'll copy and paste in the name of the large English model. That's the 1.8-gigabyte model; if you want the smaller one, just use its name instead — there was that 40-megabyte model, and you can definitely use that.
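A sketch of the imports and model loading; the exact model names here are the ones listed on the Vosk models page at the time of writing, so check that page if the download fails:

```python
import json
import subprocess
from vosk import Model, KaldiRecognizer

# The large English model (~1.8 GB). If you'd rather not download that,
# the small "vosk-model-small-en-us-0.15" (~40 MB) also works here.
model = Model(model_name="vosk-model-en-us-0.22")
```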
For me on Mac, with the latest version of Vosk as of now, it automatically downloads the model when I pass in model_name. If it doesn't download automatically for you, you may have to go back to that models page and download it manually. Next we create a recognizer, which uses the model to do the actual speech recognition; we pass in our model and the frame rate of our audio, which was 16,000 hertz. Then we say rec.SetWords(True), which gives us confidence levels for each individual word: if I say a word and the system isn't confident I actually said the word it thinks I said, it gives you a probability that the transcription is correct.

Now we can write our speech_recognition function. It takes the output widget we created earlier (if you remember, we created it up near the top), which is what we'll use to display the transcript live. We write: while not messages.empty(): frames = recordings.get(). The messages.empty() check just makes sure we haven't clicked stop; once we click Stop, the messages queue is empty and this loop won't run. And recordings.get() — you'll remember that the record_microphone function puts our audio data into the recordings queue — pulls the data off that queue, grabbing our microphone audio from the other function so we can use it in our speech recognition engine.

Then we call rec.AcceptWaveform(). Our frames are several separate chunks (remember, we read 1,024 frames at a time from the microphone), so we join all of the chunks together into one single binary string and pass that in. Then we say result = rec.Result(), and text = json.loads(result)["text"]. For whatever reason, Vosk returns its results in JSON format, so we use the json library to load the result and get the "text" key.

The next thing is to add punctuation to our transcript. Let me open up my files here. Earlier I showed you that recasepunc model on the Vosk models page; I extracted it into a recasepunc folder. You can see the notebook I'm working from is called microphone2, and the recasepunc folder is right here, so make sure it's in the same directory as your notebook. When I click in, I should see at least this checkpoint file, which is the actual model we'll use to add punctuation, and this recasepunc.py file, which is what we'll call to actually run the model. Make sure you see these two files. Again, this is optional; if you don't feel like downloading it, just ignore the next couple of lines of code.

So we'll write cased = subprocess.check_output(...), calling python recasepunc/recasepunc.py and passing in our checkpoint, which is the trained model. We run it in the user's shell with shell=True, set text=True, which means we're passing text in and out rather than bytes, and set the input to our actual transcript, the text variable. This takes our transcript with no punctuation and adds punctuation.
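Assembled, the transcription side might look like this sketch. One caveat: the transcript above doesn't spell out the exact recasepunc command line, so the `predict` subcommand here is taken from that project's README and should be treated as an assumption:

```python
rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True)  # include per-word confidence scores in the output

def speech_recognition(output):
    while not messages.empty():
        frames = recordings.get()  # audio handed over by record_microphone

        # Join the 1024-frame chunks into one binary blob for Vosk
        rec.AcceptWaveform(b"".join(frames))
        text = json.loads(rec.Result())["text"]

        # Optional: restore casing and punctuation via recasepunc
        # (command form assumed from the recasepunc README)
        cased = subprocess.check_output(
            "python recasepunc/recasepunc.py predict recasepunc/checkpoint",
            shell=True, text=True, input=text)

        output.append_stdout(cased)
```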
The reason I'm using subprocess.check_output is that I don't want to write a lot of code to do the recasing myself, and this file has already been created to do it: call it on the command line and it does the recasing. But if you click into the recasepunc folder on the left and open recasepunc.py, you can see it's a Python file. So if you wanted to make the code we're writing a bit more efficient, you could import the same modules that file uses and preload the model, so we're not loading and running it every time. Right now, every 20 seconds of speech recognition reloads the model and reinitializes everything, which is not super efficient. If, instead of calling this as a command-line command, you just wrote the Python to do the recasing directly, it would be a lot faster, and you could probably reduce the record-seconds setting to two or three, which would feel a lot more live in terms of the transcription.

Okay, so we add in our casing, and then we write output.append_stdout(cased), which adds our transcript to the output widget. All right, that is actually all the code we need. I'll run these two functions; the pause you see is just Vosk loading the pre-trained speech recognition model. I can go back up here, run the widgets again, and now when I hit Record, it says "Starting", and as I talk, it starts transcribing what I'm saying live, which is really, really cool.

While we're waiting, let me explain all of this code again. We started by creating Jupyter widgets; widgets are just interactive elements we can add to notebooks. In this case, we created a Record button to start recording, a Stop button to stop, and an output widget to show the live transcription. We then wrote a function called start_recording that initializes two background threads: one to record our microphone continuously, and the other to transcribe the audio into text. These two need to be threads because they both run at the same time in the background. We created a messages queue and added a value to it; while there's a value on the queue, the recording and transcription keep running, and when we click Stop, we take the value off the queue, which signals both threads to stop running. Then we hooked up our widgets properly, so clicking them starts and stops recording. And you can see our transcript is actually showing up as I'm talking, which is really cool — and it's pretty accurate, also very cool.

Then we used PyAudio to find the microphone we need to record from; in my case, it was index 2. We wrote a function called record_microphone that opens a stream to our microphone and reads audio frames, 1,024 at a time, until we click Stop. Once it has 20 seconds of audio, it passes the recording over to the next function, which does the speech recognition. Then we used Vosk to actually do the speech recognition.
So again: while our queue is not empty, we continuously grab recordings from the queue, pass them into our speech recognition engine, and add the results to our output widget using output.append_stdout.

All right, there's a lot you can do if you want to keep extending this. The first thing is to make it more efficient: it's not quite real time now, and a big reason is that the recasing command has to reload the model every time. I mentioned how you can fix that, or you could just leave the casing step out entirely, and it would run a lot faster. Another thing you could do is live translation: you could add a translation model here to translate from English into another language or vice versa. We installed a library called Transformers earlier, and you can use it for translation too. If you jump over to the Hugging Face Hub, which is connected to the Transformers Python library, you can see they have a lot of models that do translation, and one of them is t5-base. You could use a model like that to translate between languages, and if you click Translation on the Hub, you'll see a lot of other translation models to try. So you could actually build a live translation system.

In a previous video, I also showed you how to do live summarization, so one extension is summarizing what people are saying in real time, which is very cool: if you're listening to a long lecture, or a long interview, and you want to catch up on what happened earlier, it gives you something quick to scan. So there's a lot you can do with this, and I hope this was a good overview that showed you everything you need to get started and run it.
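As a hedged sketch of that translation idea, assuming the t5-base checkpoint from the Hub (which covers English to German, French, and Romanian), you could run each finished transcript chunk through a Transformers pipeline before appending it to the output widget:

```python
from transformers import pipeline

# Load an English -> French translation pipeline backed by t5-base
translator = pipeline("translation_en_to_fr", model="t5-base")

chunk = "The meeting starts at noon."  # e.g. the cased transcript text
print(translator(chunk)[0]["translation_text"])
```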