Exploring Whisper: OpenAI's Multitask AI Model
Dive into Whisper by OpenAI for speech recognition and translation, analyzing its capabilities through real-world examples and model variations.

Speaker 1: Hi there. Welcome to today's video. My name is Tribjot Kaur, and in this video we are going to discuss another great AI model from OpenAI called Whisper. As you probably know, ChatGPT from OpenAI is getting a lot of attention. ChatGPT is a text generation model: we give it a prompt in the form of text and it gives us the answer also in the form of text. Whisper, however, is a speech recognition model, and in this video I am going to go over the specifications of the model, but I will spend most of the time evaluating Whisper on real-world examples. So without further ado, let's get started. What you are looking at here is the GitHub page of the Whisper model, and it is open source, which is always great. If you look at the model overview, the core of Whisper is a transformer model: this is the encoder part of the transformer, and this is the decoder part. It takes a log-Mel spectrogram, which is derived from the audio, as input. That's really the Whisper model. The beauty of Whisper, however, is that it's a multi-task model, which means it's capable of doing more than one task. From a given audio, it can do English transcription, so it can transcribe English audio into text; it can translate from other languages into English; it can detect which language is spoken in the audio; it can transcribe in a non-English language; and it can tell when the audio does not contain any speech. That is what is meant by a multi-task model, and it is trained on a lot of data, about 680,000 hours, and the authors argue that because this data is very diverse, the model is better than most models we have today, which are not trained on this much data. Now let's look at the bottom of the diagram. Because the Whisper model needs to do a lot of tasks, it defines a set of special tokens, and the decoder makes its predictions conditioned on them. So let's say we're given an audio and the model starts transcribing it. First it needs to detect which language is in the audio, and if there is no speech in the audio, it knows there is no speech. Once we have the language tag, there are two tasks possible: the user can ask the model either to transcribe the audio or to translate it. If the task is to transcribe, it transcribes the audio; if the task is to translate, it goes ahead and translates it. This is the overall Whisper model, and this page has a lot of examples. Before we go into the examples, one other thing I want to point out is that there are five different versions of the Whisper model with different sizes. Of course, their performance will differ, and we will test these models in the examples that follow.
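As a reference for following along, here is a minimal sketch of loading one of those checkpoints with the openai-whisper package from the GitHub page; the choice of checkpoint below is just an example.

```python
# Minimal sketch: install and load a Whisper checkpoint.
# In a Colab notebook you would typically install it first with:
#   !pip install -U openai-whisper
import whisper

# The repository ships several checkpoint sizes (tiny, base, small, medium, large, ...).
print(whisper.available_models())

# Larger checkpoints are slower but generally give a lower word error rate.
model = whisper.load_model("tiny")  # swap in "medium" or "large" to compare
```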
On that page they also show the word error rate for different languages. If you look at some of the high-resource languages such as English, Spanish, Italian, and German, the word error rate is pretty low, which means the model is good for these languages; it is able to transcribe and translate them. Then if you look at some of the languages at the bottom, like Nepali and Marathi, the word error rate is high, which means the model is not as good for those languages. There could be several reasons, for example maybe these languages didn't have as much representation in the training data as the others, but the point is that the model performs differently for different languages. The examples that I'm going to use today are adapted from their Colab example, but I will be using real-world data that I collected myself to evaluate the model. Now, before we actually go into the code, I highly suggest that, if you have time, you read the paper. The paper that goes along with the model is called "Robust Speech Recognition via Large-Scale Weak Supervision", and it goes into the details of how they collected the data, how they trained the model, why it is called Whisper, and a lot of other good stuff. So please read the paper; it will give you a better understanding of the model. Without further ado, we will start with the model evaluation and go into the code. Okay, so the other two languages that I use besides English to test the translation capability are the two Indian languages Hindi and Punjabi, but you could test the model on other languages as well, based on whatever languages the model supports. The notebook is structured as follows: first we install and import all the necessary packages, install Whisper, and load the models; then we test each of the capabilities. First we evaluate how well the model performs on English transcription, then we test language detection and transcription for non-English audio (again using Punjabi and Hindi), and finally translation from Punjabi and Hindi to English. So let's start from the beginning. Once we have imported the packages and installed Whisper, we load the models. Here we're going to test different versions of the model: the tiny model, the medium model, and the large model. Loading these models does take a little bit of time, so just be patient. Once they are loaded, we can start evaluating the model on the different tasks. One thing I want to point out is that when I tested the model in Google Colab, audio in WAV format didn't work for some reason, so I had to convert all of my audios to MP3. If you want to test it with WAV files, you can go ahead and try, but for me it didn't work, so I converted everything to MP3. The conversion is actually pretty easy using pydub; there are really just two lines you need to know to convert an audio file from WAV to MP3. Because I have more than one audio in my folder, I wrap them in a loop, but it's just those two lines, roughly as sketched below.
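Here is a minimal sketch of that conversion loop, assuming a local folder of .wav files; the folder name is a placeholder, and pydub itself relies on ffmpeg being installed.

```python
# Sketch of the WAV -> MP3 conversion described above.
# The two essential lines are AudioSegment.from_wav() and .export().
from pathlib import Path
from pydub import AudioSegment  # pydub uses ffmpeg under the hood

for wav_path in Path("audios").glob("*.wav"):  # "audios" is a placeholder folder
    sound = AudioSegment.from_wav(str(wav_path))                    # read the WAV file
    sound.export(str(wav_path.with_suffix(".mp3")), format="mp3")   # write it out as MP3
```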
Okay, so now let's say the audios are ready to be transcribed. First I'm going to deal with the English data: I have audios that contain English speech, and some audios that don't contain any speech at all, and I'm going to test all of them. So for transcription, how do we use the Whisper model? Whisper has a built-in function called transcribe, which takes the audio and, optionally, some options. In this case the language is English and the task is to transcribe. So we call this built-in function, give it the audio, and tell the model what to do, and it returns a result whose text field is the transcription of the audio. It's really that simple. As you can see, I have a variable called whisper model, and this could be any of the models, tiny, medium, or large, whichever one you want to test. So this is really all the code you need for transcription. Then I just iterate through all of my audio files and transcribe each of them using the transcribe function. One other important thing that I forgot to mention: once we finish the transcription, there has to be a way to know whether the transcription is accurate or not. For that, we have to label the audio files, that is, provide the actual transcription, in order to evaluate how good the model is. Obviously you won't have to provide labels when you are doing inference, but here we are evaluating and testing the model, so we want to compare how good the results are. In my case, I created a label for each audio containing the words actually spoken in it, and then I compared the Whisper results to that reference text. First we are going to evaluate the tiny Whisper model using the word error rate. As I mentioned earlier, to compute the word error rate we first have to normalize the text, which means removing all the punctuation and lowercasing everything. I use the built-in normalizers from Whisper to normalize my text. Here is what the results look like. My first three audios didn't contain any speech, just some random noise like a knock on the door or a telephone ringing, and the Whisper model correctly predicted that, meaning it knew there was no speech, so the output text is empty. My other three audios did contain speech, and here you can see the actual words in each one; for example, the first one is "hi, how are you? hello, good morning". As you can see, the Whisper prediction is not that great: it's able to predict some of the words, but not all of them. So now we want to evaluate it using the word error rate. First we normalize, and let's look at what the normalization does. Here is the actual text and the Whisper prediction, and here they are again after normalization. If you compare the cleaned Whisper prediction to the raw Whisper prediction, normalization removed everything such as punctuation and question marks and converted everything to lower case. That's what normalization means. Then if we look at the word error rate, it's 60%, which is pretty high. If we compare that to the numbers Whisper reports for different languages, the worst case there is 47%, and in my case it's 60% for just six audios of English speech, so it's pretty high. A rough sketch of this evaluation loop is shown below.
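Here is a minimal sketch of that evaluation loop. The transcribe call and the EnglishTextNormalizer come from the Whisper package; the file names, reference labels, and the use of jiwer to compute the word error rate are assumptions for illustration.

```python
import whisper
from whisper.normalizers import EnglishTextNormalizer
import jiwer  # assumed WER library; install with: pip install jiwer

whisper_model = whisper.load_model("tiny")  # or "medium" / "large"
normalizer = EnglishTextNormalizer()

# Hypothetical file names and reference labels, purely for illustration.
files_and_labels = [
    ("speech_01.mp3", "Hi, how are you? Hello, good morning."),
    ("speech_02.mp3", "Welcome to today's video."),
]

references, predictions = [], []
for path, label in files_and_labels:
    result = whisper_model.transcribe(path, language="en", task="transcribe")
    references.append(normalizer(label))            # normalized reference text
    predictions.append(normalizer(result["text"]))  # normalized model output

print("WER:", jiwer.wer(references, predictions))
```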
So now let's look at how good the medium model is. As we know, there are different versions of the model, and the medium model is bigger than the tiny model. If we look at the performance of the medium model on the same audios as before, with the same preprocessing and everything, the WER with the medium model is 24%, which is an improvement over the tiny model. This just shows that different versions of the model perform differently on the same audio, and intuitively the larger models perform better, at least in this case. All right, that was the example of English transcription. Now let's move on to the second task: if we give the model an audio containing speech in a language other than English, we want it to detect which language it is and then transcribe the speech in that same language. The languages that I use here are Punjabi and Hindi. Again, because my audios were in WAV format, I first converted them to MP3. Now let's look at the example. First we load the audios. Another important thing is that we have to fit the audio into a 30-second window. When we used the transcribe function we didn't have to do this, because transcribe takes care of it for us, but when we use the other features, such as translation or language detection, we have to explicitly trim or pad the audio to 30 seconds. Here we can again use the built-in function from Whisper called pad_or_trim, and it will do that preprocessing for us. So let's do that. Now we have all the audios we want to test; in my case there are four, and I'll explain what each of them contains as we go through the example. First I have an audio that contains speech in Punjabi, and because I collected my own data I can tell you what it says: the Punjabi speech says "I hope you liked the video" (that is the English translation of what is in the audio). When I gave it to the Whisper model, it was able to detect that the language in the audio was Punjabi, but it was not able to understand what was said. Even if you can't read Punjabi, you can see that the output just repeats the same letter, sassa (the Punjabi "s"), over and over. So I would say the model performs really poorly on Punjabi: I gave it a simple sentence and it couldn't properly recognize what was in it. Now let's look at an example in Hindi. In Hindi I gave it the same sentence, "Umeed hai ke aapko ye video pasand aayi" ("I hope you liked this video"), and it predicted that accurately: first it detected that the language in the audio was Hindi, and then it transcribed it properly. A sketch of this detect-and-transcribe step is shown below.
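Putting the pieces of this step together, here is a minimal sketch of language detection followed by transcription using the lower-level Whisper calls; the file name is a placeholder.

```python
import whisper

model = whisper.load_model("medium")

# Load the audio and pad/trim it to the 30-second window the model expects.
audio = whisper.load_audio("punjabi_clip.mp3")  # placeholder file name
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that the encoder takes as input.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Ask the model which language is most probable in this clip.
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Decode (transcribe) the clip; by default this transcribes in the detected language.
result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)
```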
So it's pretty good for Hindi, at least for this sentence. Then I wanted to investigate the Punjabi case more, so I collected two more audios. In the first one, I spoke the words very slowly. Basically I wanted to say "how are you" in Punjabi, which is "tusi kime ho", just three words. When I gave that to the Whisper model, interestingly, it was not able to detect the language properly: it said the language in the audio was Hindi, and it transcribed what I said, "tusi kime ho", in the Hindi script. So these are actually Punjabi words, but they are transcribed as Hindi. I'm not totally surprised, because spoken Hindi and Punjabi do have some similarities, so I can understand why the model is confused; the Punjabi training data was probably not as plentiful as the Hindi data, which could be one reason. But this was an interesting finding. Then lastly, I gave it another audio in Punjabi. In this case I said the same thing as before, "tusi kime ho", which means "how are you", but in a normal tone, not speaking the words slowly, just saying it as I normally would. Here the same thing happened: it thought the language was Hindi, and it transcribed it in the Hindi script as "tusi kime ho". The transcription is actually pretty accurate with respect to what I said; it's just written as Hindi. So what I conclude from this is that the model does not perform as well for the Punjabi language. That's what I need to work on next, because Punjabi is my native language, so more videos on that later. But let's continue with the Whisper model. The fourth thing I want to test, which is another task the Whisper model can do, is translation: we want it to translate from Punjabi and Hindi to English. I gave it the same audios that I used in the non-English transcription and language detection example. Here I give it an audio, I specify which language is in the audio, and I specify the task as well; in this case, I want it to translate. Then I can simply use the transcribe function, just as I did for the English transcription. The only difference is the options field, where the task is now translate instead of transcribe. If I look at the results, the Hindi and Punjabi audios both said the same thing, "Umeed hai ke aapko ye video pasand aayi", which in English is "I hope you liked this video", and it is very good at translating that from both Hindi and Punjabi to English. This is pretty strange, because it is able to properly translate Punjabi even though it wasn't able to transcribe it in Punjabi; just one of the differences I noticed. Then there is another interesting example. In the Punjabi audio where I said "tusi kime ho", meaning "how are you", interestingly, the model did not translate it; it transcribed it, so the output says "tusi kime ho", which is just the Punjabi words written in English letters. A sketch of the translation call is shown below.
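The translation step described above is the same transcribe call used earlier with the task switched to translate; here is a minimal sketch, where the file name and language code are placeholders.

```python
import whisper

model = whisper.load_model("medium")

# Same transcribe() call as for English, but the task is now translation into English.
# "punjabi_clip.mp3" and language="pa" are placeholders; the language can also be
# omitted and the model will try to detect it.
options = dict(language="pa", task="translate")
result = model.transcribe("punjabi_clip.mp3", **options)

print(result["text"])  # should contain the English translation of the speech
```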
So I would say I can't conclude anything concrete from these results; they are just interesting, to say the least, and there is much more work to do for languages such as Punjabi, which is my research project, so stay tuned. For now, that's pretty much what I wanted to show. So if we go back to the Whisper model to summarize: it's a speech recognition model that can do many tasks, including English transcription, any-to-English speech translation, and even transcription in non-English languages, and it can detect when the audio does not contain any speech. That's the Whisper model in a nutshell. If you want to learn more, I encourage you to read the paper, which is very good. That's all for this video. Lastly, if you enjoy the videos on this channel, please consider subscribing; your support means a lot to me and helps me continue making these videos, and I hope you will continue to support me and enjoy them. Thank you, have a good day, and bye.
