Exploring Whisper: Free AI Tool for Transcriptions
Discover Whisper by OpenAI, a free AI model for transcribing audio and video in any language, and learn how to use it effectively.
How to transcribe audio and video for free using Whisper and Python: step-by-step tutorial
Added on 01/29/2025

Speaker 1: PLATZI EFFECTIVE PROFESSIONAL EDUCATION

Speaker 2: This time we are going to talk about Whisper, an artificial intelligence model from the OpenAI team capable of transcribing any audio or video in any language, and the best thing about this model is that it is totally free. It was trained on more than 680,000 hours of audio in different languages and runs on a fairly standard transformer architecture. And since it comes from the OpenAI team, they will surely use it to transcribe audio and video and build a much more robust version of GPT-3, or of GPT-4 if they start building it. Best of all, the OpenAI team also publishes something called the Word Error Rate, which simply measures how often the model gets words wrong in each language, and it shows that Spanish is the language where this model performs best.

Now, without further ado, let's go to the code and see how it works. For this case we are simply going to use a Google Colab notebook. I previously uploaded an mp3 file, a segment of a video taken from YouTube, and I simply go to the OpenAI repository and open our Colab notebook; this is also the link you can see in the resources of this video, so you can follow the instructions. These instructions start with the installation of Whisper, which I copy, paste and execute, and then its dependencies. Google Colab runs on an Ubuntu/Debian base system, so these commands work as-is. I execute this code too, and the installation begins.

The only thing I have to do to run this model is take these lines of code further down the page and copy them into Python. I create a new cell, remove what I don't need, and call it something like "Execute the model". Done; I paste the lines I copied. Now, Whisper has different models: the simpler the model, the worse its performance; the more complex, the better the performance, but the longer it takes. We can find these models directly in the documentation: Tiny, Base, Small, Medium and Large. We will use the Medium model for this case, so I just replace the name at this point. Then I give it the path of the file: we copy the path and put it here. We are not going to print the output yet; we will print it in another cell. And we execute.

This process will take a while, depending on your file and the model you choose, but we simply have to run this cell. What the notebook is doing at this moment is downloading the Medium model so it can use it; this model weighs 1.42 GB. Once this procedure finishes, the result is saved in a variable conveniently called result, and we can execute the print line: "Today we are going to talk about artificial intelligence and the management of Big Data." We have the whole transcript here, and not only the transcript but also the language it is in, which it detected automatically; at no point did I tell it the audio was Spanish. It also gives me punctuation marks, accents, commas, everything necessary for a very good transcript of that audio. Now I can simply take this result variable and print it, and we see that it contains other data.
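Before digging into those extra fields, here is a minimal sketch of the steps narrated so far, written as Google Colab cells. The install commands come from the openai/whisper repository README; the file name audio.mp3 and the /content path are placeholders for whatever file you uploaded, not names used in the video.

```python
# Cell 1 (Colab shell commands): install Whisper and its ffmpeg dependency.
# !pip install git+https://github.com/openai/whisper.git
# !sudo apt update && sudo apt install -y ffmpeg

# Cell 2: load a model size and transcribe the uploaded file.
import whisper

model = whisper.load_model("medium")             # tiny / base / small / medium / large
result = model.transcribe("/content/audio.mp3")  # path is a placeholder

# Cell 3: inspect the output.
print(result["language"])  # detected automatically, e.g. "es"
print(result["text"])      # full transcript, with punctuation and accents
print(result.keys())       # dict_keys(['text', 'segments', 'language'])
```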
For example, each segment has an id that marks its position in this little dictionary, a start and an end in seconds, and the text of what is said in that period of time. If we go back to the start of this whole dictionary, which is the result the Whisper model gives me, we find that it has text, which is the flat transcript we just saw, and also a key called segments. Inside these segments we have the id, start, end and text, and I could even load them into pandas. So we import pandas as pd and execute. Here I can build a pd.DataFrame from these segments, which are the organized pieces of this transcript. We do it this way, we execute, and we have all these variables in a pandas DataFrame. And since it is a DataFrame, I can say that I only want the id, the start, the end and the text, the ones we saw previously; I don't care about the rest of the columns. We execute, and we have an organized list: from second 0.0 to 10.76 this was said, and so on for every row. In this way we can use Whisper to make a transcript and manipulate the data directly with pandas.

In addition to this, we can create subtitles directly with utilities that Whisper ships with. These utilities can be found in the same repository: we open whisper, then the transcribe module, and there we find the utilities we are going to use now. So I import these utilities from whisper.utils; we execute, everything is in order. Then I go to the final part of the transcribe module, where these utilities are used, and copy and paste that part as-is. At this point I adjust the text a little. Perfect. Now I define the output directory, which is the root where this file is, /content, and I say the audio base name, let's just call it output. We execute these two small variables, and we also import os so I can access the file system. At this point I execute this line of code, and when I refresh I find an output in srt, txt and vtt format. srt and vtt are formats used for the subtitling of any video, while the txt format is just plain text with the full transcript of what was spoken in that video. All of this simply uses what is in the repo, totally free.

We already saw how easy it is to use Whisper to make transcripts of audio and video, and the best part is that it is totally free. Let's hope that the OpenAI team continues to surprise us with models of this type, totally free.
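As a reference, the pandas step described above looks roughly like this, assuming result is the dictionary returned by model.transcribe in the earlier sketch:

```python
import pandas as pd

# Each entry in result["segments"] is a dict with id, start, end and text,
# plus internal fields such as tokens, avg_logprob and no_speech_prob.
df = pd.DataFrame(result["segments"])

# Keep only the columns discussed in the video.
df = df[["id", "start", "end", "text"]]
print(df.head())
```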
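And a sketch of the subtitle export. The exact helper names have changed between Whisper releases: early versions exposed write_txt, write_srt and write_vtt in whisper.utils, which is what the video copies out of transcribe.py, while recent versions expose a get_writer factory instead. The sketch below uses get_writer; the output directory and audio path are assumptions.

```python
from whisper.utils import get_writer

output_dir = "/content"            # assumed output directory
audio_path = "/content/audio.mp3"  # placeholder for the uploaded file

# Subtitle layout options expected by recent writers; these are the defaults.
options = {"max_line_width": None, "max_line_count": None, "highlight_words": False}

# Writes audio.txt, audio.srt and audio.vtt into output_dir, each named
# after the audio file's base name.
for fmt in ("txt", "srt", "vtt"):
    writer = get_writer(fmt, output_dir)
    writer(result, audio_path, options)
```

The resulting .srt and .vtt files can be loaded by most video players as subtitle tracks, which matches what the video shows at the end.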

Speaker 1: "It's just that I don't have time, it's just that the courses are hard, it's just that learning online is very difficult." That's it, stop making excuses and start learning for free with this course on Platzi. Or are you going to tell me that clicking here is also super difficult?
