Unlocking OpenAI Whisper: Easy Audio Transcription
Discover OpenAI's Whisper: an open-source tool for audio-to-text conversion in over 100 languages, using just 3 lines of code. Perfect for budding data scientists!
OpenAI Whisper Demo: Convert Speech to Text in Python. Learn Whisper in 10 Minutes in Hindi
Added on 01/29/2025

Speaker 1: If I say that you can write code such that, as I am talking to you now, this whole text starts to appear in front of you, you will say that it must be very difficult. But it is possible, because we see it happen: YouTube has an option that generates captions, and other services do it too. You might think this is not your cup of tea; well, now it is everyone's cup of tea. There will be only 3 lines of code, and through this code you can convert any audio file into a text file. That is, for any person and any audio, no matter how bad the recording is, the words will be extracted from it, and whatever conversation is going on there will be shown to you. This miracle maker is none other than OpenAI. Yes, the same OpenAI that introduced ChatGPT. And friends, a fun fact: ChatGPT is the application that reached 1 million users in just 5 days.

So let's start and introduce OpenAI's Whisper. Whisper means to whisper, and fittingly, it can hear even a whisper and convert it into text. It is open source, which means there is no charge involved, much as ChatGPT 3.5 is free to use, and it is very accurate at speech recognition: if you are speaking in English, it will recognize your voice very well. In addition, it supports more than 100 languages. You can see the documentation on the official website. The most important thing to know is that it performs automatic speech recognition and is trained on roughly 680,000 hours of audio, including recordings with so much background noise that a human cannot make out the voice, but a system, an application, can. It is an amazing system they have built, although it has some limitations: it can translate from your language into English, but going from English into another language is more difficult. Still, it is very good, and because you are an enthusiast who wants to study machine learning and become a data scientist, this basic functionality is important information for you.

So how does it work, if you are asked this question? It works on the same system as our transformers. Let me explain a little about its architecture. In a transformer there is an encoder block and a decoder block, and the architecture here is the same. But the input embedding here goes in as a log-Mel spectrogram, because the input is audio, that is, it comes in as a wave: whenever we talk, every word has its own wavelength, so the model takes its input in that form (a small sketch of this preprocessing appears at the end of this passage). A computer cannot understand a raw wave, so a 1D convolution is used; you may remember we used 2D convolutions when we talked about images, and we use a 1D convolution here because we are talking about audio. The rest, the encoder and the decoder, is architected like a transformer. If you want to understand it deeply, do read the paper; I will give you the link. So far, the system is designed for 30-second inputs: it works very well on 30 seconds. But if the audio is longer than 30 seconds, will it work or not? Yes, it will work, but the audio will be divided into chunks: one subtitle is generated for the first 30 seconds, another subtitle for the next 30 seconds, and so on. That is how it works.
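To make that front end concrete, the repository's README exposes helpers for exactly this preprocessing step. A minimal sketch, assuming a local file named audio.mp3 (the filename is just a placeholder):

```python
import whisper

model = whisper.load_model("base")

# load the audio and pad or trim it to the 30-second window the model expects
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# the log-Mel spectrogram is the "input embedding" the encoder receives
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# the model can detect the spoken language from the spectrogram alone
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```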
Please do read the paper: as a data scientist, as a machine learning analyst, you should know it, and you should build the habit of reading, because as I have told you, 30-40% of your work will be research alone.

So let's go and understand the code; with this, my Colab is also up. This is their GitHub repository. In it they describe Whisper and how the model was trained, about which I have already given you some brief information, and after that how to set it up. As I promised you, there will be only 3 lines of code. First of all, we clone the entire GitHub repository. I am not going to redo that work, because they have already done it and I have already cloned it. If you go down in the documentation, there are different types of models; for example, Tiny has 39 million parameters, and the more parameters, the more RAM you need. I am not using a GPU in this Colab because I don't need it right now; I am using RAM only. And see, RAM shows 1.79 GB because things are running in the background: about 1 GB is being used for this, and the rest of the RAM is being used for other things.

After that, what do we have to do? Because we are Python lovers, we will not work in the command-line interface, although you can run it like that on the command line for different languages. I will explain the 3-line code, which is better than that. Okay, I copied it, pasted it here, and did nothing else. First of all, we import whisper, which is the name of the library. Then we load the model in Whisper; and which model is it going to be? The base one. As I showed you earlier, it can be tiny or it can be large; they now have a large-v2 model, which is much larger. So it depends on the capacity of your system, and you can select any of these. I made an instance of the model, and after that, I transcribed: transcribe means whatever I give it, it converts what is inside into text. Now I pass verbose=True here; what happens with this is that whatever I am saying is generated along with every timestamp. And after that I print the result, that is, whatever text is in the result, I print it (a sketch of these three lines appears just below).

Now, because I am using Colab here and I already have an audio file in Colab, I will run this audio file first and show you. So, this is the audio file. Actually, this is a short clip that I was using in my YouTube video. I hope you can see it, and I also hope you can hear it, because I am re-recording it: "What are the assumptions of Linear Regression? 1. There should be a linear relationship between the independent and dependent variables. 2. There should be no correlation between any two independent variables. 3. Whenever we plot the residuals, they should form a bell-shaped curve with a mean of 0." So, I finished recording at about 30 seconds; you can see here, it has been 30 seconds.
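For reference, here is a sketch of those three lines as they would run in the notebook; the filename audio.mp4 stands in for whatever file you have uploaded to Colab, and "base" is just one of the available model sizes:

```python
import whisper

# "base" is one option; tiny, small, medium, and large trade RAM for accuracy
model = whisper.load_model("base")

# verbose=True prints each segment with its timestamps as it is decoded
result = model.transcribe("audio.mp4", verbose=True)

print(result["text"])
```

The command-line interface mentioned above does the same job, with something like `whisper audio.mp4 --model base`.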
So that is all, and it is in Hindi. Okay, now I run it once; I run it after deleting whatever the old output was. The fun thing here, and something I really liked (they do mention it anyway), is that I have given the .mp4 format here. Okay, mp4 is a video format, and it accepts the video format as well. Now, I had kept verbose as true, which means it divides the audio into chunks, going up to 30 seconds. Now see, one of interviewers' favourite questions: what are the assumptions of linear regression? Number 1: you can see it converted exactly what I was saying. Although it didn't write it in the Hindi script, it wrote it the way we type when we talk to our friends on WhatsApp. They have written wonderful open-source code and built a library on top of it.

Now I would like to tell you one more thing. Right now it is generating only 30 seconds, but if your video is 10, 15, 20 minutes, or 1 hour, whatever your video is, you can make audio directly from the video. Or you can directly fetch a YouTube video and make audio from it. I am going to make a separate video for that too, in which I will pick up content directly from YouTube, pull it onto my laptop, and as I speak, it will keep generating my subtitles in the same way. It's going to be an amazing video, but I'll make it after some time.

And apart from this, how is that going to happen? There is a very easy way. We don't take the first snippet; we have a second one in which what we are doing is trimming. If our video is one minute long, we divide it into 30-second pieces, and subtitles are generated for each 30-second chunk (a rough sketch of this idea appears below). So isn't it amazing how much progress is being made in large language models, and how well OpenAI is treating us: it is showing us all these things for free and teaching us. They have kept their entire code open source; all the code they have written, they are giving to us in the open. So the opportunity to learn is limitless; you are going to get everything for free. I am giving you all of my information for free because I am working with Whisper and the Japanese language: I am working with a Japanese audio company and I have to generate their subtitles, so I have used these models and seen their capability, and it is great. In the same way, when you become a data scientist, you will get to explore new things; you will get to see a lot of great things.
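Coming back to that trimming idea, here is a rough sketch of processing a longer file window by window using the repository's lower-level helpers. Note that model.transcribe already does this sliding-window chunking for you internally, and the filename long_audio.mp3 is just a placeholder:

```python
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("long_audio.mp3")

# 30 seconds of samples at Whisper's fixed 16 kHz sampling rate
CHUNK = 30 * whisper.audio.SAMPLE_RATE

for i, start in enumerate(range(0, len(audio), CHUNK)):
    # pad the last piece (or trim) so every window is exactly 30 seconds
    segment = whisper.pad_or_trim(audio[start:start + CHUNK])
    mel = whisper.log_mel_spectrogram(segment).to(model.device)
    # fp16=False keeps the sketch runnable on a CPU-only machine
    result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
    print(f"[chunk {i}] {result.text}")
```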
I hope you enjoyed my video, and I would like you to share it with as many people as possible, because I don't want this knowledge to stay limited. Sometimes I make mistakes in some things, because I am not used to making YouTube videos, so please forgive me for that, and if there is anything else, please let me know in the comments where I went wrong. Don't forget to subscribe and share, and thank you very much for giving your precious time to Data Science.
