Speaker 1: Hello, today we are going to test OpenAI's Whisper audio transcription models on a Raspberry Pi 5. The main goal is to find out whether a Raspberry Pi 5 is capable of transcribing audio from a microphone in real time. There are different types of models from different vendors, and you can even build your own, but in this video I will review OpenAI Whisper and its range of models, tiny, base, small, medium, and large, and compare the results.

Let's start by understanding what real-time transcription means, using the following example. When Rick says something, a sound wave travels to a microphone. The microphone converts it into an electrical signal and sends it to a Raspberry Pi. The Raspberry Pi writes this input stream into a WAV file, an uncompressed audio format. It could write all the data into one big file, but I want to transcribe audio while someone speaks, not a day later. So after every 10 seconds of recording, I close the file and add it to a queue, and the recorder continues writing the stream into a new file. The audio transcriber then takes the first element from the queue and transcribes it, saves the resulting text from Whisper into a file or a database, and grabs the next audio file from the queue. Really straightforward. But as usual, it's not so simple.

Real-time transcription can only work if the recognition time is shorter than the recording chunk, in our case 10 seconds. Then the queue will hold one or two files at most, and everything will go smoothly. However, if transcription takes longer than a recording chunk, the queue will constantly grow and never be fully drained.
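To make this producer-consumer design concrete, here is a minimal Python sketch of the pipeline. It is not the actual code from my repository; the file names, chunk length, sample rate, and model choice are illustrative assumptions.

```python
# Minimal sketch of the recording/transcription pipeline described above.
# Illustrative only: file names, chunk length, and model are assumptions.
import queue
import threading
import time
import wave

import sounddevice as sd  # pip install sounddevice
import whisper            # pip install openai-whisper (also needs ffmpeg)

CHUNK_SECONDS = 10
SAMPLE_RATE = 16_000      # Whisper works with 16 kHz mono audio
audio_queue: "queue.Queue[str]" = queue.Queue()

def recorder() -> None:
    """Producer: record 10-second WAV chunks and enqueue their paths."""
    index = 0
    while True:
        frames = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                        samplerate=SAMPLE_RATE, channels=1, dtype="int16")
        sd.wait()  # block until the chunk is fully recorded
        path = f"chunk_{index:05d}.wav"
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)           # int16 = 2 bytes per sample
            wav.setframerate(SAMPLE_RATE)
            wav.writeframes(frames.tobytes())
        audio_queue.put(path)
        index += 1

def transcriber() -> None:
    """Consumer: take chunks off the queue and transcribe them."""
    model = whisper.load_model("tiny.en")  # assumption: English-only tiny model
    while True:
        path = audio_queue.get()
        start = time.monotonic()
        result = model.transcribe(path)
        print(f"{path} ({time.monotonic() - start:.1f}s): {result['text']}")

threading.Thread(target=recorder, daemon=True).start()
transcriber()
```

The whole experiment below boils down to one property of this loop: the pipeline only keeps up if transcribe() finishes faster than CHUNK_SECONDS.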
The first restriction is that the OpenAI Whisper model can run only on a CPU or on NVIDIA CUDA graphics cards. Unfortunately, the Raspberry Pi has no NVIDIA chips, and Whisper doesn't work with TPUs like the Google Coral. So we will run it on the CPU only. The Raspberry Pi 5 has a 64-bit 2.4 GHz quad-core ARM processor. Can a Raspberry Pi accomplish real-time transcription with Whisper on a CPU? That's what we are going to find out next.

I'll use the latest Debian image and install it using the Raspberry Pi Imager. There is nothing special here; everything is straight out of the box. Then I'll insert the SD card into the Raspberry Pi and plug in an external USB microphone. Just like that. Next, I'll connect to the Raspberry Pi via SSH. The username and password are set by the Raspberry Pi Imager, so I'll use those. Then I'll clone my repository, which has all the required code and scripts. You can find the link to this repository in the video description. Once it's downloaded, I'll navigate to the OpenAI Whisper Raspberry Pi folder and then to its subfolder named System. The script install.sh will set up all the dependencies, mainly several Python libraries like NumPy and Torch, which Whisper requires. This process will take some time, but I'll speed it up for the video. When you see this line, it's ready to go.

Now let's talk about the Whisper AI models. They come in various sizes: large, medium, small, base, and tiny. For the first test, I will use the medium English-only model. The large model requires 10 GB of memory, but I have a budget version of the Raspberry Pi with only 4 GB, so it won't run.

For this experiment, I'll open several terminal windows. The first window will run and show the AI transcription process. The second window will handle the audio recording. The third window will display the transcribed text. The fourth window will show memory usage and CPU information. And a YouTube video will play some recording for transcription.

I'm opening my repository's Python folder in all terminal windows. To start the transcription queue, I'll run a Python script called daemon_ai with the integer 3 as an argument. It will download the medium model and attempt to open it. However, the Raspberry Pi freezes. None of the terminals respond; I even tried to open a new connection, and that doesn't work either. I'll have to reboot it manually, and after re-establishing the SSH connection, I'll demonstrate why this happens by running the htop command and the Python script again. You can see that the model consumes all 4 GB of memory plus the swap file, eventually causing the system to freeze. That's it for the medium model: it requires a more powerful device. There is no way to run it on my Raspberry Pi.

Let's move on to the small model. I'll use the same terminals, but this time with an argument of 2. The model takes some time to download, so I'll speed this up. Initialization of the model takes about 2 minutes. It's a one-time action, but still significant. Half of the memory is used, but we still have 2 GB free. The AI queue indicates there is no audio yet, so I'll start the audio recording in the second window. For that, I need to run the daemon_audio.py file. It recognized the microphone, so we are good to go. Now I can start the video for transcription. The audio recorder creates chunks that are 10 seconds long. These recordings are added to a queue and stored in a data folder for the recording date. We can open this in the third window. The text will appear here soon, indicating that the AI has processed the first item from the queue. Now we wait until we have enough data to draw conclusions.

Here is the transcription. The full text of the video chunk was saved into a file in the data folder. Let's review how fast it was processed on the Raspberry Pi. As we can see, there are 10 elements in the queue waiting for transcription. Processing a 10-second audio chunk takes over 30 seconds. In other words, I recorded 3 audio chunks while Whisper transcribed 1. That's unfortunate. It works, but these timings make live transcription impossible, because the queue will grow indefinitely.

Let's move on to the base model. I am using the same set of windows and the same daemon_ai command, but this time with the base model's index of 1. The library downloads the required files the first time. In the htop output, we see that the system and Whisper consume 800 MB of memory, which is pretty low. Now I can start the same audio recording process in the second window. I need to wait for the first output from the transcriber, and I'll speed things up again to collect some data. OK, let's look at the transcription. We have far more transcribed text than in the previous case, and that's a good sign. If we look at the queue, we see that it has only 3 elements, compared to 10 in the previous case. Sometimes fewer, sometimes more, but overall the queue is still growing. So even though recording and transcribing take approximately the same time, this slight difference makes it unsuitable for live transcription. Theoretically, we could win back this time on I/O operations, for example by writing to memory instead of the SD card; that's not the topic of this video, but I sketch the idea below. So the verdict: kind of possible, but it needs adjustments.
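A quick hedged aside on that I/O idea. On Raspberry Pi OS, as on most Linux systems, /dev/shm is a RAM-backed tmpfs mount, so pointing the recorder's output directory there would keep the WAV chunks off the SD card entirely. The directory and helper below are hypothetical illustrations, not code from the repository.

```python
# Hypothetical tweak: keep the 10-second audio chunks in RAM (tmpfs)
# instead of on the SD card. /dev/shm is RAM-backed on standard Linux,
# so writes there avoid slow SD-card I/O (and SD-card wear).
from pathlib import Path

CHUNK_DIR = Path("/dev/shm/whisper_chunks")  # assumption: illustrative path
CHUNK_DIR.mkdir(parents=True, exist_ok=True)

def chunk_path(index: int) -> Path:
    """Where the recorder would write its next WAV chunk."""
    return CHUNK_DIR / f"chunk_{index:05d}.wav"
```

Since tmpfs contents vanish on reboot, the transcribed text would still need to be written to the SD card or a database.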
Let's move on to the final test: the tiny model. Again, the same set of windows and the same commands. The index of the tiny model is 0. Speeding up the download and looking at the htop output, we now have approximately 700 MB used, similar to the base model. Next, we start the video and collect some data.

Finally, we can review the results. Let's start with the text. We now have even more text than in the previous test. I ended the video at exactly the same moment, but here we have a few more sentences. Next, let's look at the queue. It has zero elements waiting for their turn. Transcribing a 10-second chunk takes approximately 6 seconds on average. So it looks like we can transcribe faster than real time. It's a win.
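If you want to reproduce this comparison, here is a minimal benchmark sketch. It times each model on a single WAV file and reports the real-time factor, processing time divided by audio duration, where anything below 1.0 means faster than real time. The file name is a placeholder.

```python
# Minimal benchmark sketch: real-time factor per Whisper model size.
# A factor below 1.0 means the model transcribes faster than real time.
import time
import wave

import whisper  # pip install openai-whisper (also needs ffmpeg)

WAV_PATH = "sample_chunk.wav"  # placeholder: any recorded 10-second chunk

with wave.open(WAV_PATH, "rb") as wav:
    audio_seconds = wav.getnframes() / wav.getframerate()

for name in ("tiny.en", "base.en", "small.en"):  # medium/large won't fit in 4 GB
    model = whisper.load_model(name)
    start = time.monotonic()
    model.transcribe(WAV_PATH)
    elapsed = time.monotonic() - start
    print(f"{name}: {elapsed:.1f}s for {audio_seconds:.1f}s of audio, "
          f"real-time factor {elapsed / audio_seconds:.2f}")
    del model  # release memory before loading the next model
```

On the timings from this video, the tiny model lands at a factor of about 0.6, the base model hovers just above 1.0 (its queue still grew slowly), and the small model sits above 3.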
Also, if we look at the transcription quality, we find that it understood everything fine. Of course, Robert Downey Jr. has excellent pronunciation, which made it easy for the tiny model. But I am not trying to test the model's quality; I am testing performance, and the performance is good. We can use a Raspberry Pi 5 and OpenAI Whisper for live transcription without an additional graphics card. And that's cool. Let me know what you think in the comments. Thanks for watching.