Exploring Whisper: Multilingual Speech Recognition
Discover Whisper, an AI speech recognition model by OpenAI, supporting multiple languages and accents, with insights into its architecture and performance.
File
whisper ai explained ("explanation of the Whisper AI program") chatgpt elonmusk openai
Added on 01/29/2025

Speaker 1: Hey, what's up guys, this is Akshay from AES Learning, and today in this video we'll be looking at Whisper. Let's first get a brief idea of what Whisper is, then we'll look at its codebase, its GitHub repository, and its research paper, and finally we'll try the hosted Whisper model. Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. So the data source is the web, it is trained on 680,000 hours of audio data, it is multilingual, meaning it works for different languages, and it is also multitask.

It adds robustness to accents. What we mean by accents is, say, a British accent versus an Indian accent: the same way of speaking English, but spoken differently. It is also robust to background noise and technical language. So it has use cases where it is already good, and it can probably be pushed further for other kinds of sound analysis involving background noise and accents. It has enabled transcription in multiple languages: you speak something and you get a transcription of it. And OpenAI has open-sourced it for further building and development.

Now, this is the internal system architecture of Whisper. The audio inputs are passed in as a log-Mel spectrogram, and then we have an encoder-decoder architecture, and this is how the audio input is mapped to the transcribed output. There is a Conv1D layer, then another Conv1D with a GELU activation, and various other layers, but mainly it is an encoder-decoder Transformer architecture. It takes 30-second chunks of audio input: whatever input you pass it is split into 30-second chunks, and each chunk is fed to the encoder-decoder as a log-Mel spectrogram. The decoder predicts the corresponding text caption, which is the audio-to-text conversion.

Special tokens have been added that enable further tasks: language detection, so it will tell you which language was spoken in the audio; phrase-level timestamps, so it timestamps the audio phrase by phrase; multilingual speech transcription, so you can pass it any supported language and it does the transcription for you; and speech-to-English translation. In the last part of this video we will also see a demo of Whisper, so if you have already read an article about Whisper, you can jump straight to the last part. I hope you know the difference between translation and transcription; they are not the same thing.
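For reference, the flow just described (30-second chunk, log-Mel spectrogram, language-detection special tokens, decoder output) maps directly onto the lower-level API of the open-sourced whisper Python package; this sketch follows the package's README, with the model size "base" and the file path "audio.mp3" as placeholder choices:

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window Whisper expects.
audio = whisper.load_audio("audio.mp3")  # placeholder path
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language detection, via the special tokens mentioned above.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# The Transformer decoder predicts the corresponding text caption.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```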
Then let's talk a bit about the dataset. One third of Whisper's dataset was non-English, which shows that it is multilingual and shows the amount of importance given to other languages. But if one third was non-English, that means two thirds was English, so it still shows the significance given to the English language. After all, English is still prioritized in many things, and even here the rest of the languages together got just one third of the data, so the accuracy on those languages will definitely be lower compared to English.

That said, it does a good job of transcribing audio in its original language and translating it to English. I have tested it myself: it works well for English, while for some other languages, like Marathi, I felt the transcription was not up to the mark; for Hindi it worked well for me. It does a good job of learning the speech-to-text task and outperforms the supervised state of the art on CoVoST 2 to-English translation, zero-shot. Basically, what has happened here is that they did not fine-tune Whisper on any one dataset; they trained it on a diverse set of datasets, because of which it performs well overall when tested across datasets compared to other models. For any one specific task, a specialized model might be better than Whisper, but on the overall spectrum, if you want to use just one model for your application, that is where Whisper beats all of them. So we can say Whisper is acting like a jack of all trades, but it is not yet a master of any one: there is a very popular benchmark called LibriSpeech, and Whisper did not beat the models specialized on it. But across diverse datasets it was good compared to other models. So, jack of all trades, but maybe not master of one. That is the gist of what was on the slide for us.

Now, this is the paper, the open research paper by OpenAI, titled "Robust Speech Recognition via Large-Scale Weak Supervision". It's a big paper, 28 pages. This diagram goes into much more depth; it shows the entire flow of how the training happens. So if you are more research-oriented and want to do something research-like, you can explore it, and wherever there are pitfalls, maybe you can think of improving them. This is where Whisper is hosted by Hugging Face; we'll come back to this at the end. I'll just give you a brief look at the source code. This is the GitHub repo. It is still very new, open-sourced just two or three days back, and there are already three pull requests. This is the source code: you can download it and play with it. And this is the model card, where you will find many more details about the model. It is released in five sizes: tiny, base, small, medium, and large. Accordingly, you can read more there for details like what languages it supports and the architecture in much more depth. I'll share all these links in the video description, so you can refer to them from there. Lastly, the moment we have all waited for: testing Whisper's hosted demo.
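Before the demo, for reference: loading any of those five released sizes and transcribing a file takes only a few lines with the open-sourced package. A minimal sketch, where "small" and "audio.mp3" are placeholder choices:

```python
import whisper

# Any of the five released sizes: "tiny", "base", "small", "medium", "large".
model = whisper.load_model("small")

# Transcribe in the audio's original language ("audio.mp3" is a placeholder path).
result = model.transcribe("audio.mp3")
print(result["text"])
```

Larger sizes trade speed and memory for accuracy, which matters most for the non-English languages discussed above.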

Speaker 2: Let's start with English. Today is a good day. Today is a good day. Let's try transcribing it.

Speaker 3: should work good

Speaker 2: Bang, one hundred percent accurate: "Today is a good day." Now let's try a Hindi one.

Speaker 3: Tumhara naam kya hai? (Hindi: "What is your name?")

Speaker 1: Bang on! Very good, actually very, very good. Lastly, let's try a Marathi one. When I tried it earlier, it did a transcription, not a translation, and it was not very accurate every time.

Speaker 3: So let's see if it works this time. Mast divas gela (Marathi: roughly, "the day went great"). Let's see if it works well.

Speaker 1: I hope so... and it does! This is really nice. The last time I tried Marathi it did a transcription, not a translation, but this time, for the first time, I'm seeing a translation. It is not 100% accurate, which might just be a glitch in the audio, but really nice, really great work. So that is what Whisper is. A huge shout-out to the OpenAI folks for the beautiful model they have open-sourced. I think the community can take it ahead from here, maybe fine-tune it further on their own regional languages and take Whisper to the next stage. So that's it for this video, guys. If you liked it and found it helpful, give it a like, share it with your crowd, and stay tuned to AES for more such amazing tech stuff. Take care. Peace out.
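For reference, the behavior seen in the Marathi demo (English output rather than same-language transcription) corresponds to the task option in the open-sourced Python package. A minimal sketch, where the file name, model size, and the explicit language hint "mr" for Marathi are illustrative assumptions; Whisper can also detect the language automatically:

```python
import whisper

model = whisper.load_model("medium")  # placeholder size choice

# task="transcribe" (the default) keeps the original language;
# task="translate" produces English text instead, as in the demo.
result = model.transcribe("marathi_clip.mp3", task="translate", language="mr")
print(result["text"])
```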
