Achieving Offline Speech Recognition with Vosk Toolkit
Learn about the Vosk toolkit for offline speech recognition, supporting 20+ languages and easy implementation in various programming languages.
File
Vosk Speech Recognition Made Easy
Added on 01/29/2025

Speaker 1: If you want to do speech recognition on your computer without a constant internet connection, or even a super high-end computer (I mean, 4 gigs of RAM is enough), you can do it on your own device without connecting to an external server and sending your voice data to that server. If you're a bit paranoid about those things, the Vosk toolkit is the best option you can get. It can do transcriptions for more than 20 languages and dialects, as you can see on the website. We've got English, Indian English, German, French, Spanish, Portuguese (my language), Chinese, Russian, and a lot more.

Something really interesting about this particular toolkit, or library, I don't know, is that the models can be as small as 50 megabytes. Sorry, I'm opening Visual Studio. For English, you can get a 40-megabyte model, which is really small for this kind of software. I can download it right now and do a transcription. I've already done it in C, because this toolkit is written in C. It has bindings for pretty much any language: you can do transcriptions in C#, Go, an iOS application, Java, Kotlin, Node.js, Python, Ruby, ROS, and so on. There are a lot of different options.

As you can see, this is not software that you can just download and run on your computer. You need a bit of, how can I say, knowledge about programming languages. If you don't know anything about that, you can just download Python, run pip install vosk, and follow a simple instruction, which is basically typing the name of the file you want to transcribe. Sorry, my English is garbage. And that's it. It's pretty easy.

But one real drawback of this toolkit is that you can only transcribe WAV files. You can't use MP3 files or any other kind of encoding, only WAV. I think it's because of the compression that MP3 has. And the WAV file also has to be in a, how can I say, specific encoding.
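The WAV requirement can be checked before handing a file to the recognizer. The following is a minimal sketch using only Python's standard `wave` module. It assumes the expected input is mono, 16-bit, uncompressed PCM, which is the condition Vosk's own Python sample script tests for; the function name `check_wav_for_vosk` is made up for this illustration and is not part of Vosk.

```python
import wave


def check_wav_for_vosk(path):
    """Return a list of format problems; an empty list means the file looks usable.

    Assumption: the recognizer expects mono, 16-bit, uncompressed PCM WAV input.
    """
    problems = []
    try:
        with wave.open(path, "rb") as wf:
            if wf.getnchannels() != 1:
                problems.append(f"expected mono, found {wf.getnchannels()} channels")
            if wf.getsampwidth() != 2:  # 2 bytes per sample means 16-bit audio
                problems.append(f"expected 16-bit samples, found {wf.getsampwidth() * 8}-bit")
            if wf.getcomptype() != "NONE":
                problems.append(f"expected uncompressed PCM, found {wf.getcomptype()}")
    except (wave.Error, EOFError):
        # Raised for files that are not PCM WAV at all, such as a renamed MP3.
        problems.append("not a readable PCM WAV file")
    return problems
```

Calling `check_wav_for_vosk("example.wav")` (a placeholder file name) returns an empty list when the file matches, and one message per mismatch otherwise, so an MP3 or a stereo recording is caught before any transcription is attempted.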
It's not really an encoding, but it has to be a 16-bit file with, I don't know, I don't remember. I'm just going to put it on the screen. But aside from that, it's pretty easy to use.

I've done my implementation in C because it's just the most convenient way to do it, but you can do it in pretty much any language. I'm just going to load an example that I did before. Yeah, this is a simple example that I found on the GitHub page. I went with C because I didn't want to download any bindings for another language. But if you're using any other language, you can just use its example code, because it's pretty simple.

All it's doing is loading the model, which I downloaded and placed in the same folder as my project; I believe it's the 124-megabyte model. Here I'm loading the WAV file, I mean, the audio that we want to transcribe. Then it just invokes this function, which accepts the recognizer and chunks of data from the audio file and spits out the result. It's not very complicated. And here it's just freeing the memory that it allocated for the model and the recognizer.

I have already pre-recorded a simple example, which is a WAV file with a bit depth of 16 bits and a sample rate of 16 kilohertz, I believe. And yeah, that's it. I'm just going to run it. It's going to be a bit slow because my computer is not that fancy, but for a speech recognition tool it's pretty fast, faster than you'd expect. Look at this: that is, this is a simple example of speech recognition using the Vosk toolkit. Okay, I said, "This is a simple example of speech recognition using the Vosk toolkit."

But as you can see on the page, what is it? Yeah, you can kind of tweak these models and restrict decoding to specific vocabularies. Most small models allow dynamic vocabulary reconfiguration. Big models are static; their vocabulary cannot be modified at runtime. That's basically it.
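The steps walked through above (load the model, open the WAV file, feed chunks to the recognizer, read the result) look much the same in the Python binding. This is a rough sketch, not the C example from the video. `Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result`, and `FinalResult` come from the `vosk` Python package; the function names `transcribe` and `vosk_recognizer_factory`, the chunk size, and the file and folder names in the usage note are assumptions made for illustration. The recognizer is passed in through a factory so that the loop itself depends only on the standard library.

```python
import json
import wave


def transcribe(wav_path, make_recognizer, chunk_frames=4000):
    """Feed a WAV file to a recognizer in chunks and return the recognized text."""
    pieces = []
    with wave.open(wav_path, "rb") as wf:
        # The recognizer needs to know the sample rate of the audio it will receive.
        rec = make_recognizer(wf.getframerate())
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            # AcceptWaveform returns a truthy value once an utterance is finalized.
            if rec.AcceptWaveform(data):
                pieces.append(json.loads(rec.Result()).get("text", ""))
        # Flush whatever audio is still buffered after the last chunk.
        pieces.append(json.loads(rec.FinalResult()).get("text", ""))
    return " ".join(p for p in pieces if p)


def vosk_recognizer_factory(model_dir):
    """Return a factory that builds Vosk recognizers for a given sample rate.

    Requires `pip install vosk` and an unpacked model folder at `model_dir`.
    """
    # Imported here so that `transcribe` can be used and tested without Vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(model_dir)
    return lambda sample_rate: KaldiRecognizer(model, sample_rate)
```

With a model unpacked into a folder named `model` and a recording named `example.wav` (both placeholder names), the call would be `print(transcribe("example.wav", vosk_recognizer_factory("model")))`. Unlike the C example, there is no explicit step for freeing the model and recognizer; the Python objects release them when they are garbage collected.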
You can change the code and the models a bit to fit your vocabulary and, you know, to recognize things like this. Because here I said "the Vosk toolkit", and maybe it's my accent, which is really weird since I'm not a native speaker. I just picked up whatever this accent is. But it's basically that. It's a really simple toolkit.

It's really good for, for example, one of the things that I'm trying to do. If you're watching this video, you probably know ChatGPT. I'm trying to use Vosk recognition to talk to ChatGPT with my voice, and it kind of fits that sort of thing. And if you're not pleased with these results, you can just use the bigger models, which are way more effective. I haven't tried the 2.3-gigabyte model. I think I've used the 1.8-gigabyte one before; I don't remember.

But that's basically it. If you want to know more about the Vosk toolkit, I will leave the links to this tool down below in the description. And if you want, I can make more videos about the Vosk toolkit using a different language. I can do it in Node.js, or in Python, which is a very simple process. In Rust, I think I've already done that, but I can do it for you. And that's it. I hope you liked the video and, yeah, see you next time.
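The vocabulary tweak mentioned for the small models can be sketched in Python as follows. In the Python binding, `KaldiRecognizer` accepts an optional third argument: a JSON-encoded list of phrases that restricts what can be recognized, with `"[unk]"` standing in for anything outside the list. The helper names `build_grammar` and `make_restricted_recognizer` are invented for this illustration, and the sketch assumes a small model that supports dynamic vocabulary.

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Return a JSON-encoded phrase list suitable as a recognizer grammar."""
    items = [p.strip().lower() for p in phrases if p.strip()]
    if allow_unknown:
        # "[unk]" lets the recognizer flag speech that falls outside the list.
        items.append("[unk]")
    return json.dumps(items)


def make_restricted_recognizer(model_dir, sample_rate, phrases):
    """Build a recognizer limited to `phrases`.

    Requires `pip install vosk` and a small model with a dynamic vocabulary.
    """
    # Imported lazily so build_grammar can be used without Vosk installed.
    from vosk import Model, KaldiRecognizer

    return KaldiRecognizer(Model(model_dir), sample_rate, build_grammar(phrases))
```

One caveat: a phrase list selects among words the model already knows, so a word that is missing from the model's vocabulary is not likely to be recognized just because it appears in the list.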
