Speaker 1: The subtitles in this video were generated by an artificial intelligence, specifically by Whisper. Whisper is a program you can feed an audio file, an MP3 for example, and it will spit out a text transcription of the words said in the video. And the most important thing, and the reason these things interest me for once, is that it's completely open source. It's made by OpenAI, but for once, instead of creating a page where you have to enter your phone number to use the service, they've published a GitHub repo that you can download, install with pip, and run on your own computer. In a moment I'm going to do a demo with audio taken from my own YouTube channel so you can see how easy the process is. I've needed practically nothing: my gaming PC, which has a two-year-old NVIDIA GTX 1660 and runs a Linux distribution, nothing more.

Being an open source product, the question is: does this actually work? It works so well it's almost scary. There you have the subtitles, if you're not seeing them already. In fact, to be transparent, in the description I'll link a text file containing exactly what Whisper spat out for me, so you can compare. I'll leave in the errors I had to correct, and you'll see they're minimal. Whisper has been trained on 680,000 hours of data to learn to transcribe and translate this well. If you take a calculator, that's 77 years of continuous listening, more than anyone will listen to and learn in their entire life. No wonder it works so well.

If you want to run Whisper yourself, what are you going to need? I'm telling you, it's not much. First, a computer, obviously. I've used Linux and macOS; I imagine it's also possible on Windows, although I haven't tried it yet. Python: not much more to say, it's easy to get. FFmpeg: this one is a bit peculiar, okay? It's a program for working with multimedia files; some of you may already know it. If you're on Linux or macOS, it's very easy to get with a package manager. Maybe Rust: you shouldn't need it, okay? But one of the dependencies asks for it, so even though pip tries to download a prebuilt binary, if it doesn't find one it will ask you to install Rust so it can compile locally. And most likely you'll also want a graphics card. It doesn't have to be very modern, okay? I'm using my 1660, which is hardly the latest technology, so if you have a more modern one, this will surely work even better.

Small parenthesis. Dani, is that a Mac Mini I see there? Yes, it's my new Mac Mini with an M2 Pro chip. It's a professional-grade machine with plenty of GPU cores and coprocessors to work really fast; programming and editing video on this computer is a delight. And yet Whisper is still not compatible with its graphics card. Literally, transcribing one hour of video took 6 or 7 hours; I don't know, I didn't count them either. Humiliating. So if you have an M1 or M2, be patient, because it probably won't work as well. And now I'll switch over to the computer and show you how I got it working, okay?

Very well, here we are on my computer. And in case anyone is wondering: yes, that's i3. Thanks for asking; I'm turning 30 and everyone gets to have their crisis however they please. Here is the OpenAI Whisper repository on GitHub.
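To sum up the requirements I just listed, on a system like mine the installs might look roughly like this; this is only a sketch with Arch Linux package names, so adjust for your own distro:

```sh
# Prerequisites described above (Arch package names; your distro's will differ)
sudo pacman -S python   # Whisper itself is a Python program
sudo pacman -S ffmpeg   # decodes audio/video files such as MP3 or MP4
sudo pacman -S rust     # only needed if pip can't find a prebuilt binary for a dependency
sudo pacman -S cuda     # optional: NVIDIA GPU acceleration
```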
You can download it, install it, read the instructions; its README is quite detailed. And what I'm going to do in this video is install it and show you in real time how you can get a transcription of a video, okay?

I'm going to come over to my page. Sorry about the light theme, but this is my control panel; it's where I normally manage my channel and all that, apart from YouTube. And here I have a list of videos without a transcription, which is really valuable, because a transcription feeds the search engine and makes searches much more useful, returning much richer and clearer results. I mean, if the word boolean is said inside a video, I'd like that video's card to come up in the search engine, because otherwise nobody will find it.

So I'm going to pick one of these. For example this one, or better, this one, for a very simple reason: it contains words in English. The reason I'm doing it this way is that, as you'll see, YouTube actually gives you automatic transcriptions too. However, I want to compare them, because YouTube's transcription is generally not very good, okay? It usually has false positives. For example, here I surely said build, in the sense of construction or compilation, and YouTube transcribed it as something that makes no sense. Or here, where it says "start this heap, install and run", okay? I don't know what I actually said, but that doesn't look right. So let's find out what I really meant, and I'm going to ask Whisper.

So what I'm going to do is download the video, which I can do from the YouTube control panel, and while that's happening, I'm going to install Whisper. What do we need to install Whisper? It's what I told you before: we'll need ffmpeg, and possibly Rust, although we'll try without Rust and see if it holds up. And then one thing the README doesn't mention, but which I think is important, is to install CUDA, because I'm using my NVIDIA card and I don't think CUDA is installed. Just in case, I'm going to run sudo pacman -S cuda to make sure it's there. Indeed, it wasn't installed. I'll let it download and install, and in the meantime I'm going to install Whisper itself.

This is done with pip install. And look, I've been away from day-to-day Python work for too long, so I've forgotten some things. I remember the last time I was here there was a fight between Poetry and Pipenv; it was like you had to choose which was the best package manager. I don't have time to decide which one is most correct or coolest, so what I'm going to do is create a virtualenv, so that later it's easy to uninstall. And meanwhile, CUDA has also finished installing. I'm going to create a directory that I'll call transcription, bring over the Gradle video I downloaded from my downloads folder, and now install Whisper. I'll just copy the installation command and hope it goes as smoothly as possible.

Okay, while this runs, I think I'll take advantage of the time and tell you a bit about how this works. Whisper provides a command-line program to which you pass the path of the file you want to transcribe; in the README's example they use audio.flac and audio.mp3, and you tell it which model you want to use.
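For reference, the installation and basic invocation I've been describing boil down to something like this; a sketch, with the pip command taken from the repo's README, the virtualenv simply being my way of keeping things easy to remove:

```sh
# Create an isolated environment so Whisper is easy to uninstall later
python -m venv whisper-env
source whisper-env/bin/activate

# Install Whisper from PyPI (the README also offers a git+https install)
pip install -U openai-whisper

# Basic usage from the README: pass one or more media files and pick a model
whisper audio.flac audio.mp3 --model medium
```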
There are several models, and I haven't fully understood the difference between them apart from the size, the memory they require and their relative speed. You can also ask it to translate, or tell it which specific language to expect. Whisper understands many languages, and here you have a chart with the error rate for each one. It's funny, because some languages come out better than others. In this chart, smaller is better, and curiously Spanish is the easiest language for the AI to understand, with an error rate of around three. Then we have several like English, Portuguese and Indonesian, and finally some of the most complicated, like Belarusian and Nepali. One of the things that fascinates me is that you can ask Whisper not only to transcribe but also to translate: if you pass it the translate option, after understanding the audio it will translate it and hand you the result in English. In short, just to close this out: if you know Python, you can also import the whisper package directly and do crazy things from source code.

What I'm going to do now is show you in real time how Whisper works. I'm going to call whisper, give it the path to the file I want to transcribe, in this case the Gradle tutorial, tell it to expect Spanish, and ask it to use the medium model. I should probably compare with small to see what results it gives, because I'm still not sure, and large, of course, would be heavier. What you're going to notice now is that once it finishes downloading the model, which is the first thing it has to do, the bitrate and quality of the video will probably drop a lot and the mouse will start lagging, because Whisper is going to use the GPU at the same time as I'm using the GPU to record this.

In fact, so you can see it properly, I'm going to run a program called nvitop, which will show you the graphics card in real time and how its resources are being used. Here you can see OBS is open, because that's what I'm recording with. We also see Firefox, because I have it open, so I may as well close it. We see Alacritty, because these terminals are Alacritty after all, and they use GPU acceleration. I'm going to close Firefox, because right now I don't need it either; that frees up a few resources. As you can hear, the fans are spinning up and my GPU has started making weird noises, so that means it's running. And I can see in OBS that I'm already dropping frames, which means it's working. Let's wait a little, and soon letters will start coming out with what is said in the video. There they are, the letters appearing.

Well, the transcription is finished, my computer is back to normal and OBS is recording as it should, so let's close nvitop. Here I have my video file, from which I've obtained several versions of the transcription. We have the TXT version, which is the one I'm going to use, mainly because it has no timestamps; but if you want to go a little faster, you have SRT and VTT versions. Personally I'm not going to use those, for a very simple reason: in the SRT and VTT files it generates, I think the sentences are too long for subtitles. So I'll use the TXT version and let YouTube automatically decide the timing splits so the subtitles are easier to read. I have a feeling YouTube will do that better than OpenAI.
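And going back to the Python route I mentioned: a minimal sketch, assuming a hypothetical file name, of what the same transcription might look like using the whisper package directly:

```python
import whisper

# Load the same model used in the demo; it is downloaded on first use
model = whisper.load_model("medium")

# Tell Whisper the audio is Spanish; task="translate" would give English output instead
result = model.transcribe("gradle-tutorial.mp4", language="es")

# The plain text, roughly what ends up in the .txt output
print(result["text"])

# Each segment carries timestamps; the SRT/VTT outputs are built from these
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")
```

That segment list is also why the SRT and VTT line lengths follow Whisper's own sentence splits rather than YouTube's timing.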
But I'm going to compare, essentially, the transcription OpenAI generated with the one YouTube generates, to see how the false positives compare. Okay, and here are two things I really like about OpenAI's output compared to YouTube's. To begin with, Whisper not only gets the words right, it also knows where to put the commas and periods, which I consider quite important. YouTube's automatic transcription is fine in that it gets the words down automatically, but it's also true that it doesn't contain a single period or comma, which makes it quite hard to read, and above all very hard for me to clean up, because I have to listen to the video to know where the commas go. With Whisper, on the other hand, that comes out automatically and you know where the commas are supposed to go.

Then we have the more obvious false positives that one and the other catch. For example, in the video we had to run gradle build, look for the JAR and run it. In the transcription YouTube generated, it garbled gradle build, and instead of JAR it understood "hard"; and that's about it, because the rest of the phrase is fine. For example, here at the end it understood "now world" when I wanted to say "hello world". Whisper isn't perfect either, obviously. For example, it has joined words together: we've seen "installApp" and "GradleTasks", which isn't bad, because it did get the words right even if the way it writes them isn't the most correct; I'm surprised it understood what we meant. And then of course there are some plainly wrong cases, such as understanding IDE, as in Integrated Development Environment, as "ideas". Although, well, YouTube also understood "ideas" there, so neither is much better than the other.

In short, you can see this tool has a lot of potential, and it will run on practically any graphics card. I'm really looking forward to working with it; obviously I'll have to clean up the output a bit, but I'm fairly convinced it will save me a lot of time when generating transcriptions. So that's all from me. That's Whisper: if you like these things, there you have the tool, you can download it freely onto your own computers and see what happens. It opens up a world of quite fascinating possibilities. See you in future videos. Take care, and see you later.