Explore OpenAI Whisper: ASR and Multilingual Transcription
Discover how to use OpenAI's Whisper for automatic speech recognition and build a multilingual web app with Gradio. Live demo included in the tutorial.
OpenAI Whisper - Multilingual AI Speech Recognition Live App Tutorial
Added on 01/29/2025

Speaker 1: In this Python tutorial, we're going to talk about a very newly released model from OpenAI. OpenAI seems to be living up to their name, and this model is completely open source. This is called Whisper, and it is an ASR, Automatic Speech Recognition, model. And this was just launched a few hours ago. So I'm going to show you how you can use Whisper in your Python code. I'm going to show you a Colab demo; this is just code taken from the Whisper repository. Finally, we're going to build a web application using Gradio that takes recorded audio and then does automatic speech recognition for you using Whisper. And we're going to demo it using multiple languages. One, of course, English in a US accent. Second, English in my accent. Third, Tamil, which is my language, which is not English, of course. I'm going to do all these things. I was so excited that I could not wait to make a video about this thing. This is quite impressive from what I've seen, so I wanted to show you how impressive it is. Let's get started with the video.

The first thing is, I would like to quickly show you the demo of the Gradio application that I've built. It's a very simple application. If you see, we have got a title, we have got a button that says record from microphone, and then we have got an output. So let's see how it works for my own audio. So I can say: this is a quick demo. I wanted to just make this video. I could have never slept without making this video. I'm super impressed by this new model, Whisper from OpenAI. And it is an ASR, automatic speech recognition. And you know, like most of the times when I speak in Indian English, things do not work out, and most softwares do not do a good transcription of what I speak. But let's see what happens here. We have the audio here, and because this is live, you can see that it literally took a couple of seconds. And it says: this is a quick demo. I just wanted to make a video. I could have never slept without making this video. I'm super impressed by this new model, Whisper from OpenAI. And it is an ASR, automatic speech recognition. And you know, like most of the times when I speak in Indian English, things do not work out. And most softwares do not do a good transcription of what I speak. But let's see what happens.

I could not believe it. I'm not kidding. I could not believe that I could find an unpaid solution, not an API-based solution, a model running on Google Colab, that can do a very good transcription of my Indian English. I am super impressed. I'm blown away. And with this, let's start from the start, which is today, sometime a few hours back, when OpenAI made a very new announcement. You can see 21st September (I live in India, so it's 22nd September for me, but 21st September): Introducing Whisper. "We have trained and are open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy on English speech recognition." What impressed me with Whisper is that it's not just English; it can also work with different accents and different languages. So you can see it's multilingual, multitask. And you can read more about it in the paper, you can see the demo, all these things. But you know how impressive it is already. The good thing with this team is that they actually went ahead and shared everything on GitHub, including model weights. So you can see a blog, a paper, a model card, and a Google Colab demonstration.
And that's something that OpenAI has not done in recent times, so kudos to the team. And you can see a very well-written example that just simply works. How many projects get launched these days where you have got a couple of lines of Python code that you can just copy-paste and it runs? That's, again, another kudos to the team. Now, Whisper can not just do transcription; it can also detect what language it is if you do not specify the language. And with that, kudos and credits to the team, and thanks for sharing the Colab notebook. I'm going to take you into the Google Colab notebook and show you how you can run Whisper in your Python environment. Currently, I'm running it on GPU. I don't know the inference time on CPU, but on GPU this is pretty, pretty good. And the Whisper that I'm running is the base model. In the paper, you can see the differences between the different models, and even in the model card you can see the difference: there is a tiny model, a base model, a small model, a medium model, and a large model. So currently I'm running the base model, but you can experiment with different models. And this is quite early; I'm going to make a lot of videos on Whisper, but this is quite early, so that's why I'm sticking to a very simple example.

The first step, after you copy this notebook (this notebook will be in the YouTube description, or on GitHub; please make sure that you like and subscribe to the video, 90% of our viewers do not subscribe and it means a lot), is to install Whisper directly from GitHub. Once you install Whisper, the next thing that you can do is import Whisper and load the base model, and you can see it has downloaded a 139-megabyte model. And then we have loaded the base model. Now the base model is a model object. The next thing I wanted to do is validate which device it is using. It looks like it is auto-recognizing the environment that we are in, so you can see that model.device is cuda.

And now I want to show you the audios that I'm going to transcribe, or do speech recognition on. The first one is one of my very favorite moments or dialogues in a movie. That is from Batman Begins, 2005 I think, where it says, "it's not who you are underneath, it's what you do that defines me." I'm going to just play it. You might hear a faint voice, but it is the female lead saying that thing to Batman. The next is an audio from an Indian Tamil movie, my language, the language that I speak. It's also one of the popular movie dialogues in Tamil; it's from a movie called Pokki. So let's listen to it. Yeah, I mean, this used to be a huge trend when I was in university around that time. So at this point, you have listened to a native English-speaking audio and a Tamil audio, which is a totally different language for which you don't have a lot of ASR solutions.
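For reference, here is a minimal sketch of the setup just described, following the Whisper README: install from GitHub, load the base model, and check which device it landed on. Nothing here is specific to this video's notebook.

```python
# Install Whisper straight from the GitHub repository (in a notebook cell):
#   pip install git+https://github.com/openai/whisper.git
import whisper

# Load the base checkpoint; "tiny", "small", "medium", and "large" are the other sizes.
model = whisper.load_model("base")

# Whisper puts the model on the available device automatically;
# on a GPU Colab runtime this prints a CUDA device, otherwise cpu.
print(model.device)
```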
Now, let's start with maybe English. I mean, I want to blow your mind step by step; I don't want to blow it all at once, right? But maybe I did that at the start, and that's okay. So what do you have to do? First you need to load the audio with whisper.load_audio; that helps you load the audio. The next thing you need to keep in mind is that Whisper works only on clips of a maximum of 30 seconds. So you need to figure out some way around that; OpenAI itself has got a helper class that can help you. Whatever audio you have got, you need to clip it into 30-second clips. That's mandatory. If you go to OpenAI's Colab, you will see a helper class where you can pass in a bigger audio and it will chunk it into 30-second pieces, and then you can use it like that. That helper script is available there, but for now I'm not going in that direction; I'm trying to fit my audio within 30 seconds, and the function pad_or_trim will help you trim that audio to within 30 seconds. That's what we are doing here.

Once we do that, we need to make a log-Mel spectrogram, and then we need to move it to the model's device. Right now the device is the GPU; had it been CPU, then CPU. And now we are trying to detect the language. This is not mandatory all the time, but let's say we want to detect the language in which the content has been spoken. You take the spectrogram and feed it in, it detects the language, and the highest-probability language is what you're going to print. And finally, you're going to decode it. Right now I'm going with the very basic options, so that's why I'm not specifying anything. You pass the model, the spectrogram, and finally the options, if there are any. You take the result and print the text of the result. That's it. It's a very simple API. I mean, I'm super impressed, once again; you might hear me saying again and again that I'm super impressed.

I'm going to run this code. You can see this in real time, I'm not editing this video, so you can see literally how many seconds it took. I mean, by the time I finish: "it's not who you are underneath." Actually, I think it says, "it's what you do that defines you." But if you listen to the audio, it's not a TV clip, it's not an interview. I mean, of course it's a movie, it's made with professional recording equipment, but it's not like a podcast video. You can literally hear background noise, and there is music, but despite that, it made this transcription possible. So now we know that it works super fine for English. I did see a couple of articles, or comments, that said the word error rate doesn't necessarily match the state of the art, primarily because OpenAI is not just trying to reach the highest accuracy possible, but is also exploring different accents, multiple languages, and multiple tasks. Having said that, I still consider it to be state of the art.

So let's copy and paste the Tamil one. Now, again, like I said, it is Tamil, so if I play this, I can actually tell you what it says. If you do not know the language, that's fine. But if I run this, you can see it has detected very quickly that it is Tamil, and then it prints the Tamil transcription. It has not done the best job that a human being would do here, but somebody could read it and actually fix it, because a native speaker would not get misguided by this transcription. And this is gold. The reason why it is gold is that subtitling is a huge industry with a lot of human effort, and this level of accuracy is something that I have not seen in a lot of models. Maybe my knowledge is limited, but still, even for that matter, this is really a great option.
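Putting those steps together, here is a minimal sketch of the transcription pipeline just walked through, following the example in the Whisper repository. The clip file name is a placeholder, and on a CPU-only machine you would likely want DecodingOptions(fp16=False).

```python
import whisper

model = whisper.load_model("base")

# Load the clip and pad/trim it to the 30-second window Whisper expects.
audio = whisper.load_audio("movie_clip.mp3")   # placeholder file name
audio = whisper.pad_or_trim(audio)

# Make a log-Mel spectrogram and move it to the same device as the model.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language and print the most probable one.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode with default options and print the recognized text.
options = whisper.DecodingOptions()            # on CPU: whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```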
Now, if you ask me as a programmer: fine, this is good, your code works, you've tried it with different models, it works fine, so what do you want to do next? I would simply say, you know, I want to build a web application where the user can record live and then it gets transcribed. And this is what I want to do. I'm literally going to do that in front of you using Gradio, with very few lines of Python code and a small modification of this code. So you can drop off at this point if you do not want the web application part. At this point, you have learned how to use Whisper in your Python code to do automatic speech recognition. But if you want to go one step further, let's build a web application.

First, I'm installing Gradio; of course, pip install gradio. The next thing is I'm loading Gradio, in this case import gradio as gr. And then I'm literally copying this entire code and putting it inside a function. If you are somebody who's been watching our videos for quite a while, you know Gradio requires three things: an input, an output, and a function. The function is going to be called when something happens. So transcribe is a function that takes an audio input and returns text. Your input is audio, your output is text, and transcribe is the function. And I could actually return more things; for example, I could return the language as well. But for now, I'm just sticking to the existing thing. So inside the function I'm loading the audio, trimming it to 30 seconds, making a spectrogram, moving it to the model's device, detecting the language (this is not going to have any impact on the output), then decoding it and returning the text. That's it.

And the entire application is going to look like this (see the sketch after this segment). This is a Gradio Interface; I'm using the older Gradio Interface, not Blocks. There's a title. There is the function called transcribe, which we just created. There is an input that says I want an audio input, the source should be the microphone, and the type should be filepath; basically, you want microphone input. The output should be a textbox, because you want text out. And then you set live, which means you don't necessarily have to hit submit; it's going to transcribe live. Then you launch it.

So let's see. I'm going to first try English, and then maybe Tamil and my broken Hindi. Okay, let me detach it because there is nothing there. So I'm going to record from the microphone at this point. And what I want to say is: let's see, this is my first attempt at, this is not necessarily my first attempt, right? Like I already tried. And, and, and let's see how it is going to work out. Let's see, stopping it. You can see the audio. This is my first attempt at this, not necessarily my first time, right? And you see a question mark. I mean, how many ASR solutions work like this? Maybe I'm obsessed with this, but like I tried, and let's see how it's going to work out. Look at the punctuation. I mean, this is amazing. So what I'm going to do next is, I was actually watching a video where there are two movie actors, sorry, movie directors, discussing, and I'm going to just see how it is going to capture that. So I'm going to record from the microphone.
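Before the recording demos continue, here is a minimal sketch of the Gradio app just described. The title string and variable names are assumptions, and the component arguments follow the older Gradio 3.x Interface API that was current at the time (newer Gradio versions use sources=["microphone"] instead of source).

```python
import gradio as gr
import whisper

model = whisper.load_model("base")

def transcribe(audio_path):
    # Same Whisper pipeline as before, wrapped in a function Gradio can call.
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)       # language is detected but not returned here
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    return result.text

gr.Interface(
    title="Whisper ASR Demo",                   # assumed title text
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
    live=True,                                  # transcribe without pressing submit
).launch()
```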

Speaker 2: And there's a moment where he looks at him. It's almost like he's talking to somebody who's now settled down, accepted that this is life kind of a thing, right? The last sequence, you're right. He's in a good space. They're probably hiding.

Speaker 1: Okay, stop it. And it reads: "from there is a moment where he looks at, he's almost like talking to everybody they're probably hiding from. Is there a good space?" The movie director was also talking in Indian English, and this is pretty amazing. Now, what I'm going to do next is a demo in Tamil, my language. So I'm going to say something, let's see. So intentionally I've mixed Tamil and English, multiple languages. Let's see. What I've realized is that when you mix English with Tamil, it kind of gets mixed up. So what I'm going to do is say something only in Tamil, but let's see. So, let's see. That did not go very well, but what I was trying to say is that I'm a little coder, like a computer engineer. I think it works fine. Let me try in Hindi now, maybe. Oh, I think it assumed it is Arabic. Let me try again. It didn't work out fine; that's probably because my Hindi is quite bad. Maybe a subscriber who knows Hindi well can try it out. But overall, I'm super impressed.

This entire notebook, I'll share it on GitHub, and you can take whatever part of the code you want, use it anywhere you want, deploy the application. But this is super impressive. I might even actually, diligently, use it to create subtitles and other stuff for my videos, because I believe the transcription is really good for my accent, and it also works fine for Tamil, like I said. And the other thing is you can go from one language to another. One more thing that I saw OpenAI mentioning is that you can do any language to English; you have that option, you can do that as well (see the sketch below). And this entire thing is super impressive. I'm definitely blown away by the way they've released this: very simple, very detailed, there is a model, a Colab notebook, the model weights are shared, and there is a demo that just simply works. So yeah, big, big kudos and thanks to the OpenAI team. If you have any question, let me know in the comment section. Otherwise, I hope you found this video helpful in learning how to use OpenAI Whisper and build a web application that can do live transcription for you in any language that you speak. I don't have the list of languages, but many languages that you speak. See you in the next video.
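As a footnote to the any-language-to-English point above, here is a minimal sketch using Whisper's higher-level transcribe() helper with the translate task; the audio file name is a placeholder.

```python
import whisper

model = whisper.load_model("base")

# task="translate" asks Whisper to translate the speech into English
# instead of transcribing it in the original language.
result = model.transcribe("tamil_clip.mp3", task="translate")  # placeholder file name
print(result["text"])   # English translation of the spoken audio
```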
