Enhance Raspberry Pi 5 with Whisper for Live Transcription
Explore faster speech transcription, portable voice changers, and microphone setups on Raspberry Pi 5 using Whisper and Faster Whisper tools.
You asked for it - and I delivered: live speech transcription with OpenAI Whisper STT
Added on 01/29/2025

Speaker 1: My video on running OpenAI Whisper on Raspberry Pi was a great success. I got tons of comments and suggestions from you. Today, I'm doing the follow-up, because I love you so much. Also, the algorithm. I'll run Whisper.cpp and Faster Whisper on the new Raspberry Pi 5 with different microphones, and I will tell you how to run live speech transcription with microphone input. And as a bonus, we're going to create a portable voice changer device.

As we found out in the last video (feel free to re-watch it after finishing this one; I'll leave it on the end screen), the fastest implementation of Whisper is currently Faster Whisper. Whisper.cpp gets really close depending on the platform and optimizations, but on Raspberry Pi it still noticeably lags behind, as you can see in these tests. The second time shown is the Faster Whisper timing, and it's much faster. If you want to use Whisper.cpp, I'm maintaining Python bindings for it. I even set up a CI here, so it's all very serious. To get that extra speedup, you want to play with the audio context parameter, setting it to the window size divided by 30, multiplied by 1500, plus 128. For example, for the standard JFK benchmark, that brings the transcription time from about 9 seconds with the float32 tiny model down to 3 seconds. What is this mysterious audio context parameter, and what are these magical numbers? I don't want to go into a very long and boring explanation right here, so I wrote a public post on my Patreon page. If you want to know, check it out. You don't have to be a member to read it; it's free access.

A quick word from our sponsors: I don't have any. So, as you heard just now, I do have a Patreon page. Feel free to support me there if you have disposable income. If not, it's okay. There are other ways you can help. You can like this video to please the algorithmic overlords, you can share it where you find it relevant, and you can even comment. That also counts. All those things mean a lot for a small YouTube channel like mine.

I wrote bare-bones real-time transcription code with the Faster Whisper package for you here. It really comes down to getting the audio with PyAudio and sending it over to WhisperModel in chunks. The obvious disadvantage of this bare-bones implementation is that you will lose some audio while the model inference is running. No bueno. To do this properly, we need to have audio recording and model inference in two separate processes. I didn't want to reinvent the wheel, and I found that for Faster Whisper there are multiple packages available that do roughly what I wanted: WhisperStreaming, WhisperLive, and Faster Whisper Transcriber. I tested all three for you to see if they work well. Starting with the last one, Faster Whisper Transcriber is actually a GUI application with the possibility of copying and pasting the transcription to the system clipboard, which is not exactly what we need here. I want to run it headless, as part of a robot setup, for example. So I'll skip this one, leaving us with two options: WhisperStreaming and WhisperLive. WhisperStreaming looks like a research project. It's very much self-contained, only a few files here. It's pretty easy to understand and it does seem to work, but there is no readily available client-side code. As you can see, the author is just using arecord and piping the raw audio with the netcat tool. I also noticed that it starts lagging after some time, maybe after 30 seconds of inference. If you want to use it as a base for your own work, I think it could actually be a good start.
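The bare-bones script mentioned above is not reproduced in the transcript, so here is a minimal sketch of that approach, assuming faster-whisper and PyAudio. The model size, chunk length, and audio settings are illustrative choices, not the speaker's exact code.

```python
# Minimal sketch (not the speaker's actual code) of the bare-bones approach:
# grab microphone audio with PyAudio and hand it to faster-whisper in chunks.
import numpy as np
import pyaudio
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000        # Whisper models expect 16 kHz mono audio
FRAMES_PER_BUFFER = 1024
CHUNK_SECONDS = 5          # transcribe roughly every 5 seconds (assumption)

model = WhisperModel("tiny", device="cpu", compute_type="int8")

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                 input=True, frames_per_buffer=FRAMES_PER_BUFFER)

try:
    while True:
        # Collect about CHUNK_SECONDS of audio from the microphone
        frames = [stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                  for _ in range(int(SAMPLE_RATE * CHUNK_SECONDS / FRAMES_PER_BUFFER))]
        audio = np.frombuffer(b"".join(frames), dtype=np.int16).astype(np.float32) / 32768.0

        # Blocking inference: anything spoken during this call is lost,
        # which is exactly the drawback pointed out above.
        segments, _info = model.transcribe(audio, beam_size=1)
        for segment in segments:
            print(segment.text.strip(), flush=True)
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```

Because transcribe() blocks, audio arriving while it runs never reaches the model, which is why recording and inference need to live in separate processes.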
WhisperStreaming is worth having a look at, but WhisperLive actually has significantly more stars on GitHub and is a more feature-complete piece of work. It uses WebSockets to maintain the connection between server and client and runs much more stably. The only problem is some strange design decisions. For one, there is no way to directly get the text out: if you want to send it to an LLM, you can only see it on the screen, and there are multiple issues here like this one, with people asking that question. I quickly added a callback to WhisperLive and will be making a PR to get it upstreamed.

Let's have a really quick look at the changes that I made. First of all, on the server side, I changed the way we use voice activity detection. More importantly, I added an end-of-speech flag: when we notice there is no speech for three frames, we send this flag over to the client, and then we know that the current transcription is over. On the client side, I added the callback, as you can see. There is a default callback, which does exactly what was done before, just printing out the text. I also added a paused flag, which pauses the recording. We need that because we don't want the recording to be happening all the time; we want it to stop while an LLM, or something else, processes the output. And I wrote a simple example client, which implements a slightly more complicated callback than the default one: it checks if the end-of-transcription flag is set, and if we do indeed have the final transcription, we output it using the Piper text-to-speech.

All right, so let's put it all together and make the portable voice changer. I'm going to need all the processing power I can get, and I'll use the new Raspberry Pi 5, which is substantially faster when it comes to speech processing. We're starting with a fresh Raspberry Pi OS image, the 64-bit Lite image for that matter. On that image, first, we need to do sudo apt-get update and sudo apt-get install git. The second thing you'll need to do is create a virtual environment for Python with python -m venv. Okay, and then activate this virtual environment. Great, we're in. After that, in that virtual environment, install Piper TTS and my fork of WhisperLive with pip. Okay, and after it's all done, let's have a look at the example I have here in my fork. The example client is exactly the same one: we create a transcription client, then we start the recording here, and we have our callback here. On the left side, let's run the server command, and on the right side, let's run our example client. All right, let's try saying something. This is actually pretty good. Let's try saying a longer sentence.
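The example client from the fork is not included in the transcript. The sketch below shows roughly what a client along those lines could look like: TranscriptionClient is the upstream WhisperLive client class, but the callback parameter, its (text, is_final) signature, and the end-of-speech handling are fork-specific details reconstructed from the description above, so treat those names as assumptions. The Piper TTS step is left as a placeholder.

```python
# Rough sketch of an example client in the spirit of the fork described above.
# The callback mechanism and its signature are assumptions based on the video,
# not part of upstream WhisperLive.
from whisper_live.client import TranscriptionClient

def on_transcription(text, is_final):
    # is_final corresponds to the end-of-speech flag the server sends
    # after a few frames of silence.
    if not is_final:
        return
    print("Final transcription:", text)
    # Here you would pause recording and hand the text to Piper TTS
    # (or an LLM) before resuming.

client = TranscriptionClient(
    "localhost", 9090,         # WhisperLive server started separately
    lang="en",
    model="tiny",              # the tiny model was enough in the demo
    use_vad=True,
    callback=on_transcription  # fork-specific parameter (assumption)
)

client()  # streams microphone audio to the server until interrupted
```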

Speaker 2: What is the weather like today?

Speaker 1: All right, great. This is working really well, and we're going to try saying an even longer sentence to see whether today's sunny.

Speaker 2: All right, great. This is working really well, and we're going to try saying an even longer sentence to whether today's sunny.

Speaker 1: Yeah, lovely. It's actually working really well, even with the tiny model. A quick word about other microphones: just now, I tested with the ReSpeaker 2-Mics Raspberry Pi HAT, but I also tested with a standalone USB microphone, and it worked great as well. Many of you had issues with SDL2 for the Whisper.cpp streaming example, and yeah, SDL2 can be quite finicky. I think you'll have more success with WhisperLive, which uses PyAudio, and as long as arecord and aplay work for you, specifically arecord, you can use any microphone that you like. Here is the original video about running Whisper.cpp on Raspberry Pi 4, and in case you missed it, the video about robots that took my job.
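As a footnote to the microphone point above: since WhisperLive captures audio through PyAudio, a quick way to confirm that your USB or HAT microphone is visible is to list PyAudio's input devices. This uses only standard PyAudio calls, nothing WhisperLive-specific.

```python
# List every capture device PyAudio can see; if your microphone shows up here
# (and arecord works), WhisperLive should be able to use it.
import pyaudio

pa = pyaudio.PyAudio()
for index in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(index)
    if info.get("maxInputChannels", 0) > 0:
        print(f"{index}: {info['name']} "
              f"({int(info['defaultSampleRate'])} Hz, "
              f"{info['maxInputChannels']} input channel(s))")
pa.terminate()
```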
