Speaker 1: Platzi, effective professional education.
Speaker 2: This time we are going to talk about Whisper, an artificial intelligence model from the OpenAI team capable of transcribing any audio or video in any language, and the best thing about this model is that it is totally free. It was trained on more than 680,000 hours of audio in different languages and simply passes through an ordinary transformer architecture, so this model is completely open, and it comes from the OpenAI team, which means OpenAI will surely use it to transcribe all those audios and videos and build a much more robust version of GPT-3, if they start creating GPT-4. And best of all, the OpenAI team also shows us something called Word Error Rate, which is simply how often the model gets words wrong across different languages, showing that Spanish is the language with the best performance in this model. Now, without further ado, let's go to the code and see how it works. For this case we are simply going to use a Google Colab notebook. I have previously uploaded an mp3 file of a segment from YouTube, and I just have to go to the OpenAI Whisper repository, you'll find the link in the resources of this video, and follow the instructions. These instructions start with the Whisper installation, which I copy, paste, and execute, and then these dependencies; Google Colab runs a system based on Ubuntu and Debian, so once the first step finishes I also execute that code and the installation begins. Then, the only thing I have to do to run this model is scroll down to these lines of code. I copy them to bring them into Python, create a cell here, clear this out, and run the model. Ready, I paste the lines I copied earlier. Whisper has different models: the simpler the model, the worse its performance; the more complex, the better its performance, but it takes much longer.
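The installation the speaker runs in Colab would look roughly like this. The commands below follow the Whisper README; in a Colab cell they would be prefixed with `!`, which is omitted here:

```shell
# Install Whisper from PyPI (the README also shows installing
# straight from the GitHub repository)
pip install -U openai-whisper

# Whisper needs ffmpeg to decode audio; Colab is Ubuntu/Debian-based,
# so apt is the right package manager
sudo apt update && sudo apt install -y ffmpeg
```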
These models can be found directly in the documentation: Tiny, Base, Small, Medium, and Large. We will use the Medium model for this case. I simply replace it at this point, and now I give it the path to the file: we copy the path, put it here, and we are not going to print the output yet, we will print it in another cell. We execute this process; it will take a little while, depending on your file and the model you choose, but we simply have to run this cell. What the notebook is doing right now is downloading the Medium model so it can use it; this model weighs 1.42 GB. Once this procedure finishes, the output is saved in a variable conveniently called result, and we can execute this line here. "Today we are going to talk about artificial intelligence and the management of Big Data": we have the whole transcript here, and not only the transcript but also the language, which it detected automatically, at no point did I tell it this was Spanish, and it also includes punctuation marks, accents, commas, everything necessary for a very good transcript of that audio. Now I can simply print this result variable and see that it carries other data: for example, an ID marking the position of each little dictionary, which indicates when it starts and ends in seconds, plus the text of what is said in that period of time.
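The result dictionary described here has roughly the following shape. The values below are illustrative stand-ins, not real Whisper output, but the keys (`text`, `language`, `segments`, and the per-segment `id`, `start`, `end`, `text`) match what `model.transcribe()` returns:

```python
# Illustrative mock of the dictionary returned by model.transcribe();
# a real call would be:  result = whisper.load_model("medium").transcribe("audio.mp3")
result = {
    "text": "Hoy vamos a hablar de inteligencia artificial...",
    "language": "es",  # detected automatically; we never specified it
    "segments": [
        {"id": 0, "start": 0.0, "end": 10.76,
         "text": "Hoy vamos a hablar de inteligencia artificial..."},
        # ...one dictionary per detected segment...
    ],
}

# The flat transcript and the detected language:
print(result["text"])
print(result["language"])

# Each segment records when (in seconds) its text was spoken:
first = result["segments"][0]
print(first["id"], first["start"], first["end"], first["text"])
```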
Also, if we go to the initial part of this dictionary, which is the result the Whisper model gives me, we find that it has the text, the flat text we have just seen, and also a variable called segments. Inside each segment there are simply the id, start, end, and text, and I could even load it into a pandas DataFrame, for example. So we import pandas as pd, we execute, and here I can write pd.DataFrame over these variables we have in segments, which are the organized segments of this transcript. We do it this way, we execute, and we already have all these variables in a pandas DataFrame. Likewise, since it is a DataFrame, I can say I only want the id, the start, the end, and the text, the ones we saw previously; the rest of the variables do not interest me. We execute, and we have an ordered list: from second 0.0 to second 10.76 this was said, and so on through every row. In this way we can use Whisper to make a transcript and do data manipulation directly with pandas. Beyond this, we could create subtitles directly with utilities that Whisper provides. We can find these utilities in the same repository: we go into whisper, then transcribe, and inside transcribe we find utilities like these here, which are the ones we are going to use. Let's see: I can import them, saying from whisper.utils we import these from here, we execute, and everything is in order. And what do I run with this? Let's go to the final part, which is where these Whisper utilities are used, these ones here; I copy and paste them as is. Then at this point I can tidy this text a bit, perfect, and now I can directly define the output directory.
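Loading the segments into pandas, as described above, can be sketched like this. The segment data is mocked here rather than coming from a real transcription, and real Whisper segments carry extra fields (tokens, temperature, avg_logprob, and so on), which is exactly why the column selection at the end is useful:

```python
import pandas as pd

# Mocked segments in the shape Whisper produces; avg_logprob stands in
# for the extra per-segment fields we want to discard
segments = [
    {"id": 0, "start": 0.0, "end": 10.76,
     "text": "Hoy vamos a hablar de inteligencia artificial...",
     "avg_logprob": -0.2},
    {"id": 1, "start": 10.76, "end": 15.30,
     "text": "y del manejo del Big Data.",
     "avg_logprob": -0.3},
]

# A list of dictionaries maps straight onto a DataFrame,
# one row per segment, one column per key
df = pd.DataFrame(segments)

# Keep only the columns we care about, as in the video
df = df[["id", "start", "end", "text"]]
print(df)
```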
I'm going to say that the output directory is the root where this file sits, so we copy it, it's in content, and we say the audio base name is, let's simply call it output. We execute these two small variables, we also import os so the code can access the file system, and at this point we can run this line of code directly. When I refresh, I find an output in srt, txt, and vtt format. srt and vtt are the formats used for subtitling any video, and I simply had to use what Whisper gives me to get them out here. The txt format is just plain text with the whole transcript of what was said in that video, and the vtt output is another subtitle format. All of this simply uses what we have in the totally free repo. We have now seen how easy it is to use Whisper to make audio and video transcripts, and the best part is that it is totally free. We hope the OpenAI team keeps surprising us with free models of this type.
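Whisper ships writer utilities for exactly this (in recent versions, `whisper.utils.get_writer("srt", output_dir)` returns a callable that writes the subtitle file), but the SRT format itself is simple enough to sketch by hand. This standalone version, fed mocked segments, shows the kind of file those utilities produce:

```python
def fmt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamps SRT requires."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render Whisper-style segments as an SRT subtitle document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_timestamp(seg['start'])} --> "
            f"{fmt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Mocked segments standing in for result["segments"]
segments = [
    {"start": 0.0, "end": 10.76, "text": " Hoy vamos a hablar de IA."},
    {"start": 10.76, "end": 15.30, "text": " Y del manejo del Big Data."},
]
print(segments_to_srt(segments))
```

Writing that string to output.srt gives a file any video player can load; vtt differs mainly in adding a `WEBVTT` header and using `.` instead of `,` in timestamps.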