Speaker 1: I've used the SpeechRecognition library extensively in many of my projects on this channel, most notably in one of the most popular videos on my channel, where I made my own Alexa or Siri; you can watch that by clicking here. The problem with the SpeechRecognition library is that it relies on an internet connection, so you have to be online for it to send your audio out to the model and get back what you said. But many of my future projects will require offline speech recognition, so we're going to look at the Python library called Vosk. It uses a model that you download to your own computer and does speech recognition the same way the SpeechRecognition library does. So let's check that out. But first, welcome to the 175th video on my channel, where I'm building my own digital assistant named Shane, like Jarvis from the Iron Man movies and comics. Please consider subscribing to my channel if you're interested in watching me make my own digital assistant.

So now let's check out the code we're going to use. Offline speech recognition is pretty important, because when I use the Internet of Things, or if I build some sort of wearable suit, a lot of that will just be an interconnected local area network between the devices, not necessarily the internet. So having an offline capability is definitely important. And besides, we're going to want a digital assistant, or an AI if you will, that is redundant and can work in a dystopian world where there is no internet. And that's what we'll do today.

The first thing you're going to do is pip install Vosk. If you're using PyCharm, you go to File, Settings, then your project, whatever it's called, and then Interpreter. I get a lot of comments about this, so ignore "Jarvis"; that's just the original name I gave the project. You go here, simply type in vosk, and pip install it. This is a Python 3.7 project, although I also have 3.8 installed, and it's important to know which version your project uses, because before you pip install Vosk, depending on whether this is a brand new project in PyCharm, you may need to pip install PyAudio first. If you're on a Windows machine, pip won't let you install PyAudio directly. I'll leave a link in the description, but you need to go to this Windows wheel site, the "Python Extension Packages for Windows" page, and you'll see the PyAudio builds listed here. Download the one that matches the version of Python you're using. I'm using 3.7 for this project, I also have 3.8 installed on my computer, and I'm on a 64-bit Windows machine, so I would use this wheel right here. Go ahead and download it, and once it's in your Downloads folder, find whichever one is right for you. I'll just click on one of these, but I'm not going to reinstall it. See, I've already downloaded this one before, so mine got renamed; yours won't have a "-1" on it. Just copy the file name right here. Now go to the folder the wheel is in, click up in the address bar, and type cmd. That opens a command prompt already sitting in that folder, which is very important: if you just open Command Prompt normally, you'll start in your C drive, or whatever your initial drive is called.
So then you type pip install, paste in that file name, make sure it ends in .whl, and that installs PyAudio. That's how you install a wheel file. Once you've done that, you can go back to your Python interpreter and pip install Vosk. Once that's done, you're going to go to this website, which I'll put a link to in the description, and download whatever model you want to use. It supports something like 27 languages, Indian English, Chinese, Russian, French, and you can see some of the file sizes are rather large. Assuming you want English, just go to this Vosk model and download it. It comes down as a zip file, so assuming you're using Windows, click on it, choose Extract All, and it'll extract into your folder. What I did then was take the extracted folder and move it into my PyCharm project here; see how it says Projects, Jarvis. So let's look at that, and it's right here. Right-click on it and go to Properties, or actually double-click into it, because the model files sit inside a nested folder. Then go up to the address bar and copy the path, and make sure you're inside the actual folder that contains all the files; you can see it has doubled-up folders here, so you'll have to copy this inner path if yours looks the same.

All right, now that we have that path copied, first you're going to write: from vosk import Model, capital M, and KaldiRecognizer, capital K and R. Then you're going to import pyaudio. Then we're going to do model equals Model, capital M, and in parentheses you pass the path to your model folder as a string. Like I said, go into the folder where the model files are, copy the path, paste it in here, and put an r right before the string. The r makes it a raw string, so the backslashes in the Windows path are kept exactly as they are. Make sure you add this r: if you forget it, or you didn't copy the right path to your model, you'll get an error while it's trying to load the model. Then we're going to do recognizer equals KaldiRecognizer, and we're going to pass it this model and then the sample rate, which is 16,000. Then I'm going to set up my mic: mic equals pyaudio.PyAudio, capital P, capital A (I just like to use PyCharm's autocomplete so I don't mess that up), then call it. Then we're going to open a stream, so stream equals mic.open. The format is going to be pyaudio.paInt16; that's not "paint," it's just pa and then Int16. Channels equals 1, and that's an integer, don't put it in a string. Rate equals 16,000, which matches the recognizer. Input equals True. Frames per buffer, don't worry about this, just copy it, it's 8192. Then we're going to start a loop so it's continuously listening: while True. I have no break in there, so you have to actually end the program yourself, though we could add a keyboard interrupt if we wanted to. Then data equals stream.read, 4,096; and before the loop we call stream.start_stream to start it. Don't ask me where 4,096 comes from; it's half the buffer size, now that I look at it. Then comes an if, which we'll get to next.
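Putting the setup described so far into code, here's a minimal sketch. It isn't the exact file from the video, just one way of wiring up what's described; in particular, the model folder path below is only a placeholder, so point it at wherever you extracted your own Vosk model.

```python
from vosk import Model, KaldiRecognizer
import pyaudio

# r"..." makes this a raw string, so the backslashes in the Windows path are kept as-is.
# The path below is just a placeholder -- use the folder you extracted your model into.
model = Model(r"C:\Users\me\PycharmProjects\Jarvis\vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(model, 16000)   # the model plus the sample rate, 16,000 Hz

mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16,    # 16-bit audio ("pa" + "Int16", not "paint")
                  channels=1,                # an integer, not a string
                  rate=16000,                # matches the recognizer's sample rate
                  input=True,
                  frames_per_buffer=8192)
stream.start_stream()
# The listening loop itself is sketched after the next part.
```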
So: if recognizer, which we set up earlier, dot AcceptWaveform, that's a capital A and W, and we pass it data, which is what you just read from the stream. If it recognizes that as actual speech, then text equals recognizer.Result, called as a function, and then we're going to print this text. I'm printing it from the 14th character up to the third character from the end, and let me show you what that's about. Let's also print the full text so you can see what I'm doing here. Result gives you back a string that looks like a dictionary, and I got kind of tripped up on this, so I just removed all the stuff I didn't want. So let's run this. What it's going to do down here is load the model first, and then it's going to wait and listen. Test. Input overflowed. Okay, note to self: I can't record my video and use the same microphone to run this code at the same time. So what I did was pause the camera and then run it. So it loads the model, and I printed both, right? Let's go back to the code. I printed the full text, and then I printed the text with the string trimmed down. What Result hands you is this string right here: a curly brace, then "text" in quotes like a dictionary key, a couple of spaces, a colon, a couple more spaces, and then what you said in quotes. So I said "test," and it printed the whole thing, and then it printed just "test." I said "test" again, and again it printed the whole text, so that's what you get out of Result. Then I tried to index into it like a dictionary, and it doesn't like that, because it's a string, not a dictionary or a list. So I just trimmed it manually; if you have a better way, please let me know. Then I said "Shane," which is the name of my digital assistant, and you see Shane here. Then I said "subscribe to my channel," just a random thought. And then I said "goodbye world," and you can see that by adding this trim right here, I get just the result. So I'm super excited that this works, because it's going to be valuable in a lot of future projects. I hope you enjoyed this video and I hope it's useful to you. If it is, drop a like, and thanks for watching. Goodbye world.
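And here's a rough sketch of that listening loop, continuing from the setup sketch earlier. It shows both the full Result string and the manual [14:-3] trim from the video, plus one cleaner alternative: since Result returns a JSON string, you can parse it with json.loads instead of slicing it by hand.

```python
import json

while True:
    data = stream.read(4096)                  # half the buffer size; there's no break, so stop the program yourself
    if recognizer.AcceptWaveform(data):       # True once Vosk has recognized a full phrase
        text = recognizer.Result()            # a JSON string, e.g. '{\n  "text" : "test"\n}'
        print(text)                           # the whole string, as shown in the video
        print(text[14:-3])                    # the manual trim used in the video
        print(json.loads(text)["text"])       # parsing the JSON instead of slicing it
```

As a side note on the "input overflowed" error: PyAudio's stream.read also accepts an exception_on_overflow=False argument, which can suppress that error when something else (like recording software) is competing for the microphone, though whether you want to swallow overflows depends on your setup.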