OpenAI's Whisper: Revolutionizing Translation and Transcription
OpenAI open sources Whisper, a powerful multilingual AI for translations and transcriptions that's now free for use under an MIT license. Explore its capabilities!
Open AI Whisper - Open Source Translation and Transcription
Added on 01/29/2025

Speaker 1: OpenAI has recently decided to stick to their name and open source their translation and transcription AI, Whisper. So now it is under an MIT license, and that includes both the code that's here as well as the model weights that were used to train the AI. So if you wanted to go and try and make your own speech transcription AI with that data, you are free to do so. So this is an automatic speech recognition system that's been trained on 680,000 hours of multilingual data collected from the web. It's able to transcribe and translate several different languages. In fact, this picture here pretty much just explains it. So, you know, it's able to take something like English audio and then transcribe that to English text. It's able to take Spanish audio and then transcribe that to English text, or it can do that for any number of different languages. And it's able to take the native language, like, I think this is Korean here, and it's able to transcribe that into Korean text. So it's really useful for translating. This is a list of all the dozens of different languages they have that you can translate into English, or you can transcribe it into those native languages. But anyway, reading documentation is boring. You guys can do that on your own time. Let's get to installing it. So installing is very simple. If you have pip, you can simply, well, install with pip. This command here will automatically pull down the latest data from the OpenAI Whisper GitHub. And because it uses Python, scripting with it is also really straightforward. You don't even really need to know how to code. So you can just import the module once you've installed it with pip, and then here we're defining what kind of model we wanted to use. So by default it's going to be using the base model. I'm pretty sure that this installs whenever you install this AI with pip, but you can download the larger ones if you have space.
Of course, this large one, it looks like it takes up a gig and a half of space. So these larger models, they're more accurate, but they use more system resources: of course storage space, but more importantly VRAM. And yes, in case you haven't figured out already, this does require a graphics card to use. Maybe it's possible to run it on a CPU, but trust me, this is not the type of thing you want to run on a CPU. Not really any of these AI things. They really, really do much better when they have a lot of parallelization, and CPUs just can't compete with GPUs when it comes to that. So you'll want a graphics card, and I think ideally you're going to want an NVIDIA graphics card since this uses CUDA. But one small thing that I'll just say: obviously, using the larger models does take a whole lot longer, as you can see from the relative speed here. But I have been using this on a GTX 1080 Ti, which pretty much just barely has the VRAM requirement. I think that card has 11 gigs of VRAM, and it hasn't been terrible. I mean, it is noticeably longer, but it's not even as long as, like, Stable Diffusion is. So if you guys are running more modern graphics cards than a 1080 Ti, especially one of the higher end ones, you should be totally fine with running this and not having to wait long periods of time, even if you're running the large model. But anyway, back to the script. So, you know, you define your model there, and then if you change it, it should just automatically download the bigger model. Audio: here we're specifying the file that we're going to run it on, and that is Fly With Me by Frank Sinatra. So it should be pretty easy for this AI. And this is the command to transcribe that. Translate we're not going to be using, since Frank Sinatra is in English. And then here it's going to print out, just to standard output, the lyrics essentially, since this is a song. So why don't we go ahead and run.
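The script shown on screen is not captured in the transcript. The following is only a minimal sketch of what such a script might look like, assuming the `openai-whisper` Python package (its `load_model` and `transcribe` calls) and PyTorch for the CUDA check. The file name in the usage comment, the `largest_model_that_fits` helper, and the VRAM table (approximate figures from the Whisper README) are illustrative assumptions, not the speaker's code.

```python
# Install (command from the Whisper README; ffmpeg must also be on the PATH):
#   pip install git+https://github.com/openai/whisper.git

# Approximate VRAM needs in GB, as listed in the Whisper README.
# Treat these as rough guidance, not guarantees.
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def largest_model_that_fits(vram_gb: float) -> str:
    """Return the biggest model whose approximate VRAM need fits in vram_gb."""
    fitting = [name for name, need in APPROX_VRAM_GB.items() if need <= vram_gb]
    # Dicts keep insertion order, so the last fitting entry is the largest.
    return fitting[-1] if fitting else "tiny"


def transcribe_file(audio_path: str, model_name: str = "base",
                    translate: bool = False) -> str:
    """Transcribe audio_path; with translate=True, return English text instead."""
    # Imported here so the helper above works even without whisper installed.
    import torch
    import whisper

    # Use the GPU when CUDA is available; CPU works but is much slower.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # load_model downloads the weights on first use and caches them locally.
    model = whisper.load_model(model_name, device=device)
    task = "translate" if translate else "transcribe"
    result = model.transcribe(audio_path, task=task)
    return result["text"]


# Hypothetical usage (the file name is a placeholder):
#   print(transcribe_file("song.mp3", model_name=largest_model_that_fits(11)))
```

With the 11 GB card mentioned above, the helper picks `large`; with 8 GB it falls back to `medium`.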
And of course this is going to have background noise too, instruments and whatnot, since it's a song. All right. And I'm just gonna let it run in real time, because with base it should go pretty quick. All right. There we go. So this is, I mean, you guys can look up the lyrics if you don't know them by heart to this song. But this is pretty much correct. I mean, I'm just kind of skimming through it and... OK, I think like right here is the first mistake. Right. So it's saying "down to a pogo bay," and that's supposed to be like "Acapulco Bay." But anyway, this is like, I'll say, maybe 98 percent correct, which is pretty good. I mean, it's kind of to be expected, because this song, again, if you don't know it, you should definitely go listen to it, because it's a classic. But Frank Sinatra is a really, really clear singer, you know, like the polar opposite of mumble rap. And they actually use his songs in a lot of English as a second language classes because the lyrics are so basic and so easy to understand. So I guess it's not a really huge surprise that, you know, this OpenAI was able to translate, or transcribe, our boy Frank Sinatra. But let's give it something more difficult. So this is an interview with Popcorn Sutton. There's a little bit of background noise in this, but Popcorn Sutton is from Appalachia, which is kind of like the East Coast mountain region of the United States. But it's an area where a lot of people there don't really leave. Like, I think he's one of them, where, you know, he hasn't really left that region, and he has a very unique accent as a result. A lot of people there do. And in fact, their accent is so unique and so difficult for a lot of people to understand that you have not just Americans, but people that live in states like Virginia that Appalachia goes through, just a few hundred miles away from these people, and even they can't understand them. So yeah, we're going to see how... I don't know if I can play audio of this.
I think it might be copyrighted, but we're going to see how good it does and maybe show the closed captions, because I believe this is... yeah. So it's not even auto-generated. Like, whoever made this, like Sucker Punch Pictures, wrote the closed captions. So we'll compare it to that. And let's just edit this real quick. And I believe actually that it is... yeah, popcorn. So we'll just do that. We'll run it again. And again, we're using the base model, so it should run relatively quick. All right. And let's see, maybe I'll just shrink this down a little bit and we'll kind of put it like this so that we can compare the transcriptions that they put in to this one. So there's the regular guy talking, and then there's Popcorn. So he said, "been here all my life." Got that part right. See, so he's talking about Ghost Mountain, on the backside of... on a place called Hemphill. So it kind of got his town name wrong, but I'd say that's close enough. All right. Then here he's saying that it's better known as "the asshole of the world." Here it said "ice hole of the world." You could probably figure that out, right? If a human being was reading this, I think they would know he didn't say "ice." All this is right so far. Yeah, it's right so far. And in case you're wondering, I'm not reading these, I'm listening to him talk, because I actually do have the ability to understand mountain folk. Yeah, so that part, the way he says "tobacco," it really threw him off. You can even see that they took a little bit of... so he's saying "tobaccer." They're actually putting it out phonetically here. The AI has no idea what "tobaccer" is, so it's "a two backer." Of course, he's saying "tobacco." "Corn and taters," it got that part right. Yeah, "stuff like that and sold it." It's doing pretty good. So this AI actually did pretty good at transcribing Popcorn, but we're going to test it a little bit more. So we're going to see how good it does at transcribing him when he's driving his old, very noisy, rickety car.
So there's gonna be a lot of background in this one. And let me actually change this first. So I think it's actually... I forget what I called this audio. "Popcorn and car." Good thing I checked. All right. So we'll change that right quick and we'll go ahead and run this, and actually let you guys see my GPU usage while I run it, just in case you're curious. It doesn't use a whole lot because, you know, it just spikes quickly. All right. And then I can tell it's done there. So let's see how well it translates him. OK. So first part it got right. All right. So that one it didn't get right. What was it? "Worked up here for a woman." So "she had a summer home back there in Paris, Texas," and it has "somewhere home back out in the fire stacks." All right. So it has no idea. It did not get that at all. And it's messing up his town even more. It said "him" instead of "Hemphill." Yes. Let's see. Got it. "All old horse started riding it." It's messing up a lot here. Okay. So here it's getting, I don't know, maybe like 80 percent, or like 70, 75 percent. It's to the point where I think if you were reading this transcript, it would actually be kind of difficult for you to know what this guy's talking about. So, why don't we stress out my poor little graphics card? This might actually crash because I think OBS is using the graphics card as well. Oh well, run her again and we'll see if she does a better job. And I'll just clear all this crap out of my screen and I'll full screen this so you guys can see my poor little graphics card struggle under the weight of this Appalachian man's accent. There we go, there's the 100% usage. And OBS is lagging tremendously. My voice might sound robotic during this point. All right, so the spikes have finally fallen. It must be done.
I guess we'll see in editing how bad that part is, if I have to just like speed the whole thing up or whatever. So anyway, we'll just start from the beginning. See how much better it did. Looks like it got the Paris, Texas part right now. All right. It's all right so far. Yeah, it's right so far. Here, I'll do this. Sucker Punch better not give me any guff for this. Just so you guys can hear what this guy sounds like.

Speaker 2: That happened sometime in 1980, I can't remember when.

Speaker 1: Were you injured?

Speaker 2: Yeah, it tore me all to hell. That four-wheeler hit the ground so hard it flattened all four tires on it when it landed. That's partly what's wrong with my back right now. Hurts all the time. That and old age, that don't help it a damn bit either.

Speaker 1: Yeah, it's getting it all right so far. And you know, that's despite, I mostly did that so you guys could hear how loud his car is. Yeah, and he talks about his slingshots and his cobalt blue bottles. It's getting it all right.

Speaker 2: He was a fortune, one of them cobalt blue bottles. Now, like we shot, the slingshot brings $10 or $12. Coca-Cola bottle brings $10 apiece.

Speaker 1: And I mean, look, maybe some of you guys in the comments have no idea what he's saying either. I'm able to understand him probably because of all the time that I spent living in Southern Virginia. And you know, I've talked to tons of guys like this. Not like necessarily moonshiners, but just guys that speak like he does with the same kind of accent. But I know a lot of people, especially some of my family that have been living in Boston their entire lives, they can't understand this man worth a damn, okay? He might as well be speaking a whole other language, because they could not sit there and tell you hardly any third word of what he's saying. And yet this AI is doing it perfectly. Okay, so now another one that I wanted to try out was Ghost, Square Hammer. So this is another music example where it's fairly easy to understand the lyrics of this, but there's a whole lot more instruments. And I think that the instrumental track is about the same volume as the audio. I mean, it's a little bit quieter of course, but it's gonna be a lot harder to pick out speech from this. And I actually already know that because I tested it with the base one, and this was the results. So in case you guys know the lyrics to Square Hammer, I mean, it's a little bit more of a modern song and also a really good song. This is almost all wrong. So yeah, we're going to test it on Square Hammer with the large model. And let me see, where was it? Here it is. And we'll do it here. Okay, and we'll run this and probably just gonna speed this up in the video editor. All right, so let's see how well this did with Square Hammer. And let me definitely make sure this is muted. All right, so when we run this, we're gonna see that it's a little bit quieter. So I'm gonna go ahead and speed this up a little bit. All right, so it looks like, from my knowledge of the lyrics, it looks pretty much right at the beginning. Yeah, so it gets the very beginning wrong.
That's pretty much right. That's pretty much right. Yeah, so far, all the lyrics are right, despite all the drums and guitar in the background. And it pretty much got the chorus perfect. Second part of the chorus, it gets wrong. It doesn't get the "that," but then again, the "that" is so just kind of quick that it's kind of understandable why the AI would drop that off. So again, this is kind of in like the 90% of right. And it's definitely one of those things where if you just heard the song and you had the transcription, you could easily figure out like what parts are wrong and then, boom, you pretty much know the lyrics to the song. Okay, so now we're going to try some music where there's heavy effects on the vocals. I'm not exactly sure what effects Bergerchild is using here. Maybe he'll be able to tell us in the comments section. And this is actually music that I can play for you guys because Bergerchild is too based to make a copyright claim. So we're going to put his song in there. We're going to do the large boy, and we're just going to run her again and come back in a couple of minutes to compare it with... Looks like we're going to be doing auto-generated lyrics. So we're going to actually do a test of this OpenAI against Google's proprietary bullshit that they use on YouTube. All right, guys, let's do this. Let's compare based AI to YouTube's AI. And here we go. Crank up the music. "You'll get all you need / Cats needs feed and seed / It's a mystery / What this place used to be." All right, so "you'll get all you need / Cats needs feed and seed." Both of them don't really seem to know what Berger is saying here. "It's a mystery / What this place used to be / E-biz store is more than just a funny time / It's a place worth spending your time." Wow, YouTube's AI didn't even try. And yet based AI got most of it right. "The store is more..." It's supposed to be "a funny sign," but it has "funny time." And "it's a place..."
I believe the lyrics are "worth spending your time," not "where I'm spending your time." So it kind of got that wrong. "Get out, you city flicks / This place is run by hicks." And again, YouTube being lazy, not translating this part. Let's see. "City flicks," supposed to be "city slicks." "This place is run by hicks." "To get here, you must go far / And your fancy German car / We don't sell those, yes, you're out of luck / Go, go, shark / At the feast, we just got feed and seeds." So I'm just going to disqualify YouTube for laziness. It's not even trying. So, literally. "At the feast, we just got feed and seeds." Hold on, that music track is loud. Alright, so YouTube is disqualified for not actually trying to translate Burger Child's lyrics. OpenAI? I'd say it got it maybe about 80% right. Okay, pretty good. And, oh yeah, we had the Breaking Bad one. I'm not going to show the video of that, or maybe I will. Let's see. I think it was the one where they were talking about Nacho and Don Eladio. Yeah, so this is in Spanish. And we'll translate that, or we'll put it in here. And we're going to enable the translation. So this is going to print it out both in Spanish and in English. And I'll let one of you guys tell me how accurately it does that. And, yeah, let's go. Okay, so we got our transcriptions. And it actually does have subtitles done by, I don't know, I guess AMC. So these are actually going to be fairly accurate. And this isn't even on an AMC YouTube channel. So I think this will be okay to put in the video. So, let's see. "I'd like to introduce you to Ignacio Varga. And he's our new man in the north." Okay, so, yeah, that seems right. And it says he's not from Salamanca, but Salamanca is like his... or he's asking if he's from the Salamanca family, not like from Salamanca. So that part's a little bit off. Yeah, so that part looks like it got lost a little bit in the translation. He says, "so you're still here." Yeah. It's not doing so great with the translation. Let's see. "Let's talk.
You and I will get to know each other." Okay. Looks like that part was wrong. So it didn't translate that part, but I think it actually got it right. I know he said, "how is Tuco?" And he said something about him being in jail and something about gringos. So it looks like he got that part right. So this part, it looks like it might have messed up. Unless he actually asks him twice how you're going to make money. Yeah, he only asked him once. Varga's answer he got right. Yeah. So, I mean, it's doing okay. Maybe around 70 to 75% accuracy. But again, it's having to also translate this, which, you can just go and try out something like Google Translate. It's always going to make a lot of these grammatical errors like that. So I think, all things considered, this is actually really impressive how good this AI is. Okay, so now we're going to do one final test. Let me find where my audio is. Here we go. This is also an MP4. So I guess a bit of a test on that kind of file type as well. And I will give you guys just a quick warning that the audio in this is a little bit based. So if that's going to offend your sensibilities, you might need to go back to TikTok. And let's do that. Okay. And I'm also going to change it to base, because this is English, but it's like Irish, with an Irish accent. So I don't think it necessarily needs large. We're going to see. We're going to see how good it does with base, because there's no background stuff. So it should be able to perform similarly well to the first Popcorn Sutton file. So we'll just go ahead and run. Transcribe. And that should work more or less instantly. All right. And we're going to test. We're going to compare this audio, and I'll let you guys listen in as well.
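For the Spanish clip, the speaker enables translation and prints both the Spanish and the English text. A minimal sketch of how that might be done, assuming a model object with Whisper's `transcribe(path, language=..., task=...)` interface; the `transcribe_and_translate` helper and the file name in the usage note are hypothetical, and the on-screen script may have done this differently.

```python
from typing import Any, Tuple


def transcribe_and_translate(model: Any, audio_path: str,
                             language: str = "es") -> Tuple[str, str]:
    """Return (native-language transcript, English translation) for one file."""
    # First pass: keep the spoken language as-is.
    native = model.transcribe(audio_path, language=language, task="transcribe")
    # Second pass: ask the model to output English text instead.
    english = model.transcribe(audio_path, language=language, task="translate")
    return native["text"], english["text"]
```

With the real package this would be called along the lines of `transcribe_and_translate(whisper.load_model("large"), "clip.mp4")`. Whisper decodes audio through ffmpeg, so a video container such as MP4 would generally be expected to work as input.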

Speaker 3: Where's your proof that you don't have the virus? Where's my proof?

Speaker 1: Yes.

Speaker 3: That I don't have it?

Speaker 1: Yes.

Speaker 3: Where's your proof that you don't have it? I don't have to prove to you. You're the one that stopped me. If you want to go along with this scam, damn it. Scam. Damn it. I haven't stopped you.

Speaker 1: You have driven into a Garda checkpoint.

Speaker 3: You have no right to be setting up checkpoints. Okay.

Speaker 1: So it looks like it said "a regarded checkpoint," but they're calling it Garda, which is like, I don't know, it's what people in the UK call cops. Where are you going, sir?

Speaker 3: None of your business now.

Speaker 1: Well, actually, you're required.

Speaker 3: I'm not required.

Speaker 1: That part got messed up, too. He's not "a courier." They're telling him he's required to do some stuff. "Under our Constitution." And then, "I'm never quiet under our Constitution." Yeah. So as the Irishman gets pissed, the AI becomes more confused.

Speaker 3: Anything you say. Where are you traveling to today, sir? I am going to the hardware store to get supplies for my farm. Okay. And where are you traveling from? My farm. Where would that be, sir? None of your business. No, it's fucking not. We did not fight 800 years for you to start treating us worse than the British fucking army did.

Speaker 1: Did we? Our Irishman gets pissed, AI gets confused.

Speaker 3: I watch the news. The fucking news? Are you having a laugh? No. RTE? What fucking news? There's other news. I do not watch the BBC news. Shit. The BBC? Yeah. Really? Yeah. So the Crown News?

Speaker 1: Oh, yeah. It's supposed to be Crown News. You want me to watch the Crown News? I mean, it's messing up like... It's still like 90% accurate. Again, you can take this transcription and figure out very quickly what this is about. It's a very based man having an interaction with some of our... Well, not our feds, UK feds. So, yeah. This OpenAI, I'm very impressed with it. I think it's amazing. And it's open. This is all running locally on my computer with a graphics card. You don't even really need an internet connection to do this. So, obviously, all the privacy concerns that come with things like using Google Voice or Google Translate, they're kind of null and void now, right? I mean, you can use this. It's pretty much just as good. And it's not going to send any data about your translations over to Google. So, try it out for yourself. Leave a like and a comment on this video to hack the algorithm. And have a great rest of your day.
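The accuracy figures quoted throughout (98 percent, 90 percent, 70 to 75 percent) are eyeballed against captions or known lyrics. One common way to put a number on such a comparison is word error rate: the word-level edit distance between a reference text and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration, not something used in the video; the `words` and `word_error_rate` helpers and the simple normalization are assumptions.

```python
import re
from typing import List


def words(text: str) -> List[str]:
    """Lowercase the text and keep only letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)


# The Sinatra slip from earlier: one substitution plus one insertion
# against four reference words.
print(word_error_rate("down to Acapulco Bay", "down to a pogo bay"))  # 0.5
```

Run over a whole transcript rather than a single phrase, one minus this figure gives a rough counterpart to the percentages the speaker estimates by eye.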
