Exploring Whisper Model's Streaming ASR Capabilities
Learn how OpenAI's Whisper model can be adapted for streaming ASR, and the limitations it faces when processing continuous speech input efficiently.
Can Whisper be used for real-time streaming ASR?
Added on 01/29/2025

Speaker 1: Can the OpenAI Whisper model do streaming ASR? My name is Bai. I'm a machine learning engineer with a PhD in natural language processing, and today I will answer this question. If you haven't seen the Whisper model, it is a speech-to-text model trained on 680,000 hours of data covering about 100 languages. The architecture is an encoder-decoder transformer, it comes in five different sizes, and it has been growing in popularity recently because it is robust to noise and accents. So the question of the day is: can this model do streaming speech recognition?

First of all, what do we mean by streaming ASR? Automatic speech recognition comes in two forms, batch and streaming. Batch means the model takes a complete recording and produces all of the text afterwards. Streaming ASR means the model produces output while the speaker is still talking, with a delay of no more than a few seconds. For example, if you're listening to a live broadcast, you don't want to record the entire broadcast first and only then produce the subtitles; you want them to appear at most a few seconds after the words are spoken. Generally, you would expect streaming ASR to be a few percentage points lower in accuracy than batch ASR, because the model does not have access to all the context, only the audio up to the current word.

The question of how to do streaming ASR came up when I was building Voice Writer. This is an AI writing assistant that works in two steps. In the first step, you just speak your thoughts without worrying too much about grammar. In the second step, the AI corrects the grammar for you. I've been finding it super helpful for writing all kinds of things, including emails, blog posts, and Slack messages. You can try it out for free at the link here, and I'll also post a link in the description of the video. Now back to the video.

First of all, why is this even difficult? Why is it not straightforward to use the Whisper model for streaming ASR? The issue is that Whisper is trained to process input audio that is 30 seconds long. If the audio is shorter than 30 seconds, you can pad it, so that is not much of a problem. But there is no way to feed Whisper anything longer than 30 seconds. So the first thing you might think of is: what if you just split the audio into 30-second chunks and process each chunk one at a time? The first problem is that you might split in the middle of a word, and that word will probably not be recognized correctly. The second problem is that the latency will be really high: you wait for audio to fill up the 30-second buffer, and only then do you process the entire chunk, so the latency for the first word can be up to 30 seconds.

Fortunately, there is an open-source repo called Whisper Streaming that handles this for you and turns the Whisper model into a streaming ASR system. There are instructions on how to set it up, so let's try it out. I run this command to start the model, using the Whisper small model, then I go to this tab, run another command, and just start speaking. You can pretty quickly see that it picks it up, and it's quick and responsive. It even gives timestamps, so it's a pretty cool demo. Now let's talk about how this works.
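To make the two problems with naive chunking concrete before going further, here is a minimal sketch of that approach, assuming the open-source `openai-whisper` Python package. The file name is only a placeholder; the point is that words can be cut at chunk boundaries, and the first word of each chunk is delayed by up to 30 seconds while the buffer fills.

```python
# Naive fixed-window chunking (the approach criticized above, not how
# Whisper Streaming works). Assumes the open-source openai-whisper package.
import whisper

SAMPLE_RATE = 16_000      # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 30        # Whisper's fixed input window

model = whisper.load_model("small")
audio = whisper.load_audio("broadcast.wav")   # placeholder file name

chunk_samples = CHUNK_SECONDS * SAMPLE_RATE
for start in range(0, len(audio), chunk_samples):
    # Each chunk is cut blindly, so a word can be split at the boundary,
    # and nothing is emitted until a full 30 seconds has been buffered.
    chunk = whisper.pad_or_trim(audio[start:start + chunk_samples])
    result = model.transcribe(chunk, fp16=False)
    print(result["text"])
```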
So we start with this audio file, and we feed chunks of increasing size into Whisper. The size of these chunks is configured by the minimum chunk size parameter, which is 1 second by default. In each iteration, we grow the buffer by 1 second and feed all of it into Whisper. This process continues until we hit an end-of-sentence marker, like a period or a question mark. Then we move the buffer forward and start the process again. So each piece of audio is processed by Whisper multiple times, as many times as it takes until we reach the end of the sentence. The reason for this is that Whisper is trained on sentences, so it gives the best results when the start of the audio aligns with the start of a sentence. That's why the buffer only moves forward when the previous sentence is complete and we're beginning a new one. This way, Whisper never has to start transcribing from the middle of a sentence, which would give suboptimal results.

In many applications, it is useful to have some results available as soon as possible, even if they're not completely accurate. In Voice Writer, I show the incomplete results in grey, whereas the confirmed results are in black. The grey part of the sentence is unconfirmed, so it may change as the model gets new information, but the black part is confirmed and permanent. This works using an algorithm called local agreement with n = 2, which means that for a token to be confirmed, it needs to be generated in two consecutive audio buffers. Let's look at an example. Say in the first step, the Whisper model outputs the three tokens "if you like", and nothing is confirmed at this step. In the second step, the model produces more tokens, but only the first two agree with the previous step, so those two are confirmed. In the third and fourth steps, more tokens are generated, but at each step a token is not confirmed until it has been generated in two consecutive chunks. So anything in the grey part can still change given new information. For example, the word "view" might change to "video" once the model hears the rest of the sentence. But anything in the black part is permanent: even if the model wants to change it after more iterations, it cannot be changed anymore.

One last thing this system does when it starts a new sentence is feed the previous sentence into the model as prompt tokens. This is something the Whisper model supports: you can give it prompt tokens before it starts generating, and this tends to improve accuracy a little, because more context is always good.

So in summary, the algorithm comes down to three basic ideas. First, we feed longer and longer consecutive audio buffers into Whisper. Second, we emit tokens as soon as they are confirmed by two iterations. Third, we scroll the audio buffer forward whenever a sentence is completed. It's a pretty simple algorithm that you can apply to any speech-to-text model that does not support streaming and basically turn it into a streaming model. If you like, you can check out all the details in this paper, which I'll link in the description.

One of the limitations comes from the fact that Whisper was not really designed to be a streaming model, and because of this, the algorithm assumes that each audio buffer has to start at the beginning of a sentence.
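To make the three ideas concrete, here is a simplified sketch of the wrapper loop. It is not the actual whisper_streaming implementation: `StreamingWrapper`, `accept_audio`, and `transcribe_fn` are names used only for illustration, and the scrolling step is simplified (the real implementation uses word timestamps to keep the audio that comes after the finished sentence).

```python
# Simplified sketch of: (1) growing buffer, (2) LocalAgreement-2,
# (3) scrolling on sentence end. Illustrative only.
from typing import Callable, List
import numpy as np

SAMPLE_RATE = 16_000


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest common prefix of two token lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


class StreamingWrapper:
    """Wraps a batch transcribe function into a streaming-style loop."""

    def __init__(self, transcribe_fn: Callable[[np.ndarray], List[str]],
                 min_chunk_s: float = 1.0):
        self.transcribe_fn = transcribe_fn            # e.g. Whisper on a buffer
        self.min_chunk = int(min_chunk_s * SAMPLE_RATE)
        self.pending = np.zeros(0, dtype=np.float32)  # audio not yet processed
        self.buffer = np.zeros(0, dtype=np.float32)   # current sentence buffer
        self.prev_hypothesis: List[str] = []          # previous iteration's output
        self.confirmed: List[str] = []                # confirmed tokens so far

    def accept_audio(self, new_audio: np.ndarray) -> List[str]:
        """Feed incoming audio; return tokens newly confirmed this step."""
        self.pending = np.concatenate([self.pending, new_audio])
        if len(self.pending) < self.min_chunk:
            return []  # wait until at least min_chunk_s of new audio arrives

        # Idea 1: grow the audio buffer and re-transcribe all of it.
        self.buffer = np.concatenate([self.buffer, self.pending])
        self.pending = np.zeros(0, dtype=np.float32)
        hypothesis = self.transcribe_fn(self.buffer)

        # Idea 2 (LocalAgreement-2): a token is confirmed only once it
        # appears in two consecutive hypotheses, i.e. in the common
        # prefix of the previous and current outputs.
        agreed = common_prefix(self.prev_hypothesis, hypothesis)
        newly_confirmed = agreed[len(self.confirmed):]
        self.confirmed.extend(newly_confirmed)
        self.prev_hypothesis = hypothesis

        # Idea 3: once the confirmed text ends a sentence, scroll the
        # buffer forward so the next buffer starts at a sentence boundary.
        if self.confirmed and self.confirmed[-1][-1] in ".?!":
            self.buffer = np.zeros(0, dtype=np.float32)
            self.prev_hypothesis = []
            self.confirmed = []

        return newly_confirmed
```

Any batch speech-to-text call that maps an audio buffer to a list of word tokens can be plugged in as `transcribe_fn`, which is why the same trick works for models other than Whisper.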
Because each buffer has to start at a sentence boundary, if a sentence is quite long, the beginning of that sentence is fed into the model and processed many times. This is inefficient, and it would not be necessary if the model had been trained from the beginning to do streaming ASR. If we look at an architecture designed specifically for streaming speech recognition, it looks a little different. Here is a model that was proposed in 2021. At each step, it predicts a token, and it has access to a fixed amount of past context and future context. During training, this rule is enforced with an attention mask that is mostly zeros, meaning the model cannot use information outside of this fixed window. During prediction, the model predicts each token given that limited, fixed context. To make the next prediction, the entire receptive field moves forward by one chunk, but its size stays fixed. This way, the beginning of the sentence is never processed multiple times. However, this is not possible with the Whisper model, because it would require modifying the architecture and retraining the model, and we do not have access to Whisper's training data.

That's it for this video. I hope you enjoyed my explanation of how Whisper can be turned into a streaming model. If you enjoyed this video, please leave a comment, hit the subscribe button, and ring the bell icon to stay notified when I release future videos. Goodbye.
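As a rough illustration of the fixed-context idea described above (not the exact architecture of the 2021 model), here is how such an attention mask could be constructed: each frame may attend only to a fixed number of past and future frames, so the receptive field slides forward without ever reprocessing the start of the sentence. The function name and parameters are chosen for illustration.

```python
# Illustrative fixed left/right context attention mask for a streaming
# ASR encoder; not taken from any particular implementation.
import numpy as np


def streaming_attention_mask(num_frames: int,
                             left_context: int,
                             right_context: int) -> np.ndarray:
    """mask[i, j] == 1 iff frame i is allowed to attend to frame j."""
    mask = np.zeros((num_frames, num_frames), dtype=np.int8)
    for i in range(num_frames):
        lo = max(0, i - left_context)                 # limited past context
        hi = min(num_frames, i + right_context + 1)   # limited future context
        mask[i, lo:hi] = 1
    return mask


# With 8 frames, 3 frames of past context, and 1 frame of future context,
# frame 5 may only attend to frames 2 through 6.
print(streaming_attention_mask(8, left_context=3, right_context=1))
```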
