Setting Up Whisper for Raspberry Pi Speech-to-Text
Learn to set up Whisper for real-time speech-to-text on Raspberry Pi, compare models, and speed things up with faster-whisper. Includes setup advice and tips.
You won't believe how fast it is: Raspberry Pi Speech-to-Text
Added on 01/29/2025

Speaker 1: You want to know the best model for speech-to-text on Raspberry Pi? It's Whisper. Okay, work is done, let's go home. I know that many of you are finding this video while coding the next big thing. What I'm going to do is give you a really quick, no-bullsh** guide on how to easily set up the Whisper model for real-time inference on a Raspberry Pi or any other computing system. Except maybe the abacus, that won't work. And then we're going to make it faster. After that, you can stick around for a little bit of important advice and the follow-up.

If you are starting from a fresh Raspberry Pi OS image, that's going to be the latest Bookworm. The first thing you'll need to do, if it's a Lite image, is install a couple of things with sudo apt-get. So first run sudo apt-get update. You're going to need python3-pip and git. Once those are installed, you're going to git clone my fork of the whisper.cpp Python bindings. If you go to the upstream repository, it looks like it's being updated and everything is great, but that's just an illusion. Go to Issues and you'll see that the current state of that repository is complete FUBAR. I fixed that for you. And by the way, if you're using it, please let me know in the comments, because if I think that nobody's using it, I'm probably not going to be maintaining it much, and it's going to be broken again. Let's go and do a git clone of that.

What we're going to do is go into the repository and execute the commands from the readme. On a new Raspberry Pi, you'll need to create a virtual environment, because the system Python will not allow you to install the build package that you need. You'll do something like... All right, so after creating that virtual environment, we're going to source it. So we'll do source whisper/bin/activate. And from there, run the build command for the wheel. It's going to take some time.
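The steps dictated above might look something like the following on a fresh Bookworm Lite image. The fork URL isn't spelled out in the transcript, so the clone target below is a placeholder, and the exact build command comes from the fork's readme; this is a sketch, not the verbatim commands.

```shell
# Install the basics the video mentions (Lite image).
sudo apt-get update
sudo apt-get install -y python3-pip git

# Placeholder URL -- substitute the fork of the whisper.cpp Python bindings.
git clone https://github.com/<your-fork>/whispercpp.git
cd whispercpp

# Bookworm's system Python refuses system-wide pip installs (PEP 668),
# so create and activate a virtual environment first.
python3 -m venv whisper
source whisper/bin/activate

# Build the wheel per the repo's readme (python -m build is a common choice),
# then install whatever lands in dist/.
pip install build
python -m build
pip install dist/*.whl
```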
You're going to grab your favorite beverage, whatever is applicable to your time of day, and I think once you're back, it's going to be done. Now you're able to run the next command and install the wheel. And we're going to run the streaming example with a quantized model. All right, this is the transcription test. Everything is going great so far. It can recognize everything I say. Well, almost everything. Yeah, I'd say it's pretty good. It's a really small model, and it was running faster than real time, which we can actually confirm by running the inference on a file.

But I want to do something else. While researching for this video, I found out that whisper.cpp is not the fastest. The fastest one is a package called faster-whisper, and it does exactly what it says on the tin. Go to their GitHub repository, and you can just do pip install faster-whisper. Then I have a script I shared as a GitHub gist, so you're able to run the same script as I do. It uses two loaders: first the whisper.cpp Python binding, and then faster-whisper, to transcribe an 11-second file, which is the JFK speech, and report the results. So let's do that first of all with the unquantized float32 model. We're going to run the Python script.
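The gist itself isn't shown in the transcript, but the faster-whisper half of that benchmark could be sketched roughly like this. It assumes `pip install faster-whisper` has been done and a `jfk.wav` file exists locally; the helper name `real_time_factor` is mine.

```python
import time

def real_time_factor(audio_seconds: float, elapsed_seconds: float) -> float:
    """How many times faster than real time the transcription ran."""
    return audio_seconds / elapsed_seconds

def benchmark(wav_path: str = "jfk.wav", model_name: str = "tiny.en") -> str:
    # Assumes the faster-whisper package is installed.
    from faster_whisper import WhisperModel

    # float32 matches the "unquantized" run in the video.
    model = WhisperModel(model_name, device="cpu", compute_type="float32")

    start = time.perf_counter()
    segments, info = model.transcribe(wav_path)
    # transcribe() returns a lazy generator; consuming it runs the inference.
    text = " ".join(seg.text.strip() for seg in segments)
    elapsed = time.perf_counter() - start

    print(f"{elapsed:.1f}s for {info.duration:.0f}s of audio "
          f"({real_time_factor(info.duration, elapsed):.1f}x real time)")
    return text
```

With the numbers from the video (11 seconds of audio in 1.5 seconds), that works out to roughly a 7x real-time factor.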

Speaker 2: And so, my fellow Americans, ask not what your country...

Speaker 1: All right, as you can see, the unquantized whisper.cpp tiny English model takes about 11 seconds for an 11-second file, which is really close to real time. But faster-whisper, it's mind-blowing. It's just 1.5 seconds for an 11-second file. I mean, this is not just real time. It's basically, what, seven times faster than real time. I couldn't believe my eyes. But anyway, this is real, and you can rerun the test. Let's also take the quantized version of the model with whisper.cpp. It's going to perform slightly better. That will give us real-time inference with whisper.cpp, but it's still not going to be as fast. All right, so we can see that it's slightly over nine seconds for an 11-second file, so this is real time. Still pretty good quality. I mean, it's really great transcription quality. But it's not the same as 1.5 seconds, you know?

Now, how to use it. For the whisper.cpp Python binding it's really simple, easy-peasy. You get the segments in a callback, and then it's up to you what to do with them. Here, for example, I check if a specific string is found in the segment, and then I pause the audio stream, and, like in another video, I pass that segment to a language model. For faster-whisper, what you'll need to do is create a Python script that captures the audio from the microphone. You'll probably want voice activity detection to make sure you're not wasting compute and time on transcribing silence. Then you just pass the recorded audio chunks into the faster-whisper model, like that, and you'll get your results.

Now for that important piece of advice I told you about earlier. What about other programming languages? Like, okay, I gave you the Python stuff, but what about C++? Well, whisper.cpp, for example, is written in C++, so you just take the headers and use the functions I've shown you, so that's easy.
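The microphone-plus-VAD flow described above could be sketched like this. It assumes the `sounddevice` and `faster-whisper` packages are installed; the function names and chunk length are my own, and faster-whisper's built-in `vad_filter` stands in for the voice activity detection the speaker mentions.

```python
import numpy as np

def record_chunk(seconds: float = 5.0, sample_rate: int = 16000) -> np.ndarray:
    """Record a mono float32 chunk from the default microphone."""
    import sounddevice as sd  # assumed installed: pip install sounddevice
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording finishes
    return audio.flatten()

def transcribe_chunk(audio: np.ndarray, model) -> str:
    """Transcribe one audio chunk with a faster-whisper WhisperModel.

    vad_filter=True lets faster-whisper skip silent stretches, so the Pi
    isn't wasting cycles transcribing nothing.
    """
    segments, _info = model.transcribe(audio, vad_filter=True)
    return " ".join(seg.text.strip() for seg in segments)

# Usage sketch:
#   from faster_whisper import WhisperModel
#   model = WhisperModel("tiny.en", device="cpu", compute_type="int8")
#   while True:
#       print(transcribe_chunk(record_chunk(), model))
```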
There are bindings for other programming languages, but I cannot vouch for the current state of those bindings. If you want to know how I use that Whisper.cpp real-time transcription and combine it with the language model on the Raspberry Pi, follow on to the next video.
