Speaker 1: Hey, what's up guys, this is Akshay from AES Learning, and today in this video we'll be looking at Whisper. First we'll get a brief idea of what Whisper is, then we'll look at its codebase, the GitHub repository, and its research paper, and finally we'll try out the hosted Whisper model.

So, Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The data source is the web, it is trained on 680,000 hours of audio, it is multilingual, so it works across different languages, and it is also multitask. It adds robustness to accents (the same English spoken with a different accent, a British accent for instance), to background noise, and to technical language. So there are use cases where it is already good, and maybe it can be pushed further for other kinds of audio with heavy background noise and varied accents. It enables transcription in multiple languages: you speak something and you get a transcription of it. And OpenAI has open-sourced it for further building and development.

Now, this is the internal system architecture of Whisper. The audio input is passed in the form of a log-Mel spectrogram, and then we have an encoder-decoder based architecture that maps the audio input to the transcribed output. There is a Conv1D layer, then a GELU followed by another Conv1D, and several other layers, but it is mainly an encoder-decoder architecture. It works on 30-second chunks: whatever audio you pass in is split into 30-second chunks, converted to log-Mel spectrograms, and fed to the encoder-decoder, and the decoder predicts the corresponding text caption, that is, the audio-to-text conversion. Special tokens have been added that let the single model perform further tasks: language identification (it tells you which language was spoken in the audio), phrase-level timestamps, multilingual speech transcription (pass it any supported language and it will transcribe it for you), and to-English speech translation. In the last part of this video we'll also see a demo of Whisper, so if you have already read some article about Whisper, you can jump straight to that part. And I hope you know the difference between translation and transcription; they are not the same.
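To make that flow concrete, here is roughly what the pipeline looks like with the open-sourced Python package, a minimal sketch along the lines of the example in the repo's README, with "audio.mp3" as a placeholder file name:

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window Whisper expects.
audio = whisper.load_audio("audio.mp3")  # placeholder path
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The special tokens make language identification a built-in task.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode: the text decoder predicts the caption for this 30-second chunk.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```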
Now let's talk a bit about the dataset. One third of Whisper's dataset was non-English, which shows that it is multilingual and shows the amount of importance given to other languages. But if one third was non-English, that means two thirds was English, which still shows the significance given to the English language. After all, English is still prioritized in many things, and even here two thirds of Whisper's dataset was English while all the other languages together got just one third, so the accuracy on those languages will definitely be lower compared to English.

It does a good job of transcribing audio in its original language and translating it to English. I have tested it myself: it works well for English, while for some other languages, like Marathi, I felt the transcription was not up to the mark; Hindi worked well for me. It also does a good job of learning speech-to-text tasks and outperforms the supervised state of the art on CoVoST 2 to-English speech translation. What has happened here is that they have not fine-tuned Whisper on any one dataset; they trained it on a diverse set of datasets, because of which it performs well in an overall way when tested on other datasets, compared to other models. For any specific task, one of those specialized models might still be better than Whisper, but on an overall spectrum, say you want to use just one model for your application, that is where Whisper beats all of them. So we can say Whisper is acting like a jack of all trades, but it is not yet a master of any: on the very popular LibriSpeech benchmark, for instance, it did not beat the state of the art, yet when tested on diverse datasets it was better than the other models. Jack of all, maybe not master of one; that is the point, and that covers what was on the slide.

Now let's move on. This is the paper by OpenAI, titled "Robust Speech Recognition via Large-Scale Weak Supervision." It's a big paper, 28 pages. This diagram goes into much more depth and shows the entire flow of how the training happens, so if you are more research-oriented and want to dig in, you can explore it, and wherever there are pitfalls, maybe you can think about improving them. This is where Whisper is hosted on Hugging Face; we'll come back to this at the end. Let me give you a brief look at the source code: this is the GitHub repo. It is very new, open-sourced just two or three days back, and still there are three pull requests here. You can download the source code and play with it. And this is the model card, where you get many more details about the model. It has been released in five sizes: tiny, base, small, medium and large, and you can read there in much more depth about which languages it supports and what the architecture is. I'll share all these links in the video description, so you can refer to them from there.
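Before the demo, it may help to see what everyday usage looks like through the repo's high-level API. This is a minimal sketch, assuming a placeholder file name, where the `task` option switches between transcription and to-English translation:

```python
import whisper

# Any of the five released sizes can be loaded: tiny, base, small, medium, large.
model = whisper.load_model("small")

# Transcribe in the original spoken language ("speech.mp3" is a placeholder).
result = model.transcribe("speech.mp3")
print(result["language"], result["text"])

# Or translate the speech into English instead of transcribing it.
translated = model.transcribe("speech.mp3", task="translate")
print(translated["text"])
```

transcribe() handles the 30-second chunking internally, so it works on audio of any length.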
Speaker 1: Lastly, the moment we all waited for: testing Whisper's hosted API. So let's start with English. "Today is a good day. Today is a good day." Let's try transcribing it.
Speaker 2: It should work well.
Speaker 1: Bang on. 100% accurate. Let's try a Hindi word.
Speaker 2: Today is a good day. Tumhara naam kya hai? (What is your name?)
Speaker 1: bang on very good actually very very good lastly let's try Marathi one while try while trying previously it did a transcription not a translation and it was not very accurate every time so let's see if it works this time
Speaker 2: musta devas gilas Let's say if it works good.
Speaker 1: I hope so. It does! Bang on, this is really nice. The last time I tried Marathi it did a transcription, not a translation, but this time, for the first time, I'm seeing a translation. It is not 100% accurate, which might be a glitch in the audio, but it's really nice; really great work. So this is what Whisper is. A huge shout-out to the OpenAI guys for the beautiful model they have open-sourced, and I think the community can take it ahead from here, maybe fine-tune it further on their own regional languages (a rough sketch of what that could look like follows below) and take Whisper to the next stage. So that's it for this video, guys. If you liked it or found it helpful, give it a like, share it with your crowd, and stay tuned to AES for more such amazing tech stuff. Take care. Peace out.
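For anyone who wants to pick up that fine-tuning idea, here is one possible starting point. This is not anything from the video: it is a rough sketch of a single training step using the Hugging Face transformers port of Whisper rather than the original repo, and the checkpoint choice, the `load_example()` helper, and the 16 kHz audio/transcript pair are all assumptions for illustration:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Hypothetical choice of checkpoint; any released size could be substituted.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical helper: returns a 16 kHz float waveform from a regional-language
# dataset together with its reference transcript.
audio, text = load_example()

# Turn the waveform into log-Mel input features and the transcript into labels.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(text, return_tensors="pt").input_ids

# One gradient step: the model returns cross-entropy loss against the labels.
loss = model(input_features=inputs.input_features, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```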