Exploring OpenAI's Whisper Speech-to-Text Model
Join us as we test OpenAI's Whisper, a new multilingual speech-to-text model, and explore its features, setup, and performance across its various model sizes.
Let's Play with OpenAI Whisper AI Dev Stream
Added on 01/29/2025

Speaker 1: All right, it says we're live, but I think it's lying. Oh, okay, here we go. Let's get the link to this stream. I don't even know where that lives. There we go. Share. Copy link. All right, folks, hello. I'm just going to throw that up on Twitter and tag the OpenAI Whisper model. All right. Well, hello, friends. Welcome back to the stream. Today we are going to have a little bit of a look at this brand new model that's just come out from OpenAI called Whisper. Some of you may have seen this already. We'll have a play with it. I wrote a little GUI for it just to make messing around with it a little bit easier. But yeah, otherwise we will get into that. We'll have a play. We'll see how well it works and dive a little bit into the model itself and all of the bits and pieces. So what is Whisper? Whisper is a speech-to-text model released by the team at OpenAI. For those of you like me who normally have dark mode on, I apologize for the blinding light of this tab. But anyway, it takes speech and converts it into text. It works on multiple languages, which is very cool. I don't know how to speak multiple languages, so that's really not going to help in testing. But it uses attention in a typical Transformer-esque fashion to take what they call a log-mel spectrogram, pass it through a number of encoder layers and decoder layers, and output text on the other side. It's very quick if you use one of the smaller models; I've been playing around with it for about 20 minutes prior to this. We'll see how well things work, given that multiple things are going to need to use my microphone. I don't know if that's going to work as planned, but we'll get it to work one way or another. But anyway, let's talk about the model for a second. You can see the diagram of the model on the screen here. It takes this spectrogram in, which is a representation of how the frequencies in the audio are changing over time. From left to right is how time passes, from bottom to top is different frequencies, and the color indicates how much of each frequency is present in the waveform. So if you think about just scrolling up here, I don't know if you can see it well, but this example of a traditional audio waveform can get converted into a spectrogram. We then have a couple of convolution layers, positional encoding, as is very common in these transformer setups, and then a number of encoder layers fed into a number of decoder layers, along with some additional tokens on the decoder side: a marker to say this is the start of the transcript, a tag saying, look, this is English (there is some language detection in here, which is quite cool), and an instruction to please transcribe the following, and then the output becomes the transcription. As it says here, it processes the audio in 30-second chunks. I've got it set up so that we can actually just speak into the microphone, do a little live recording in the browser, and then it will send it off to my AI machine, which you can see the outputs from here, but we'll go into that in a second. It's a very cool model. One of the things that I like about this setup is that it's open. You can download the code, and it's very easy to install. You'll notice in the description of this, I've put all the code that I'm using today, or at least how it is right now, up on GitHub, so you can install this, follow along, and do all that good stuff if you want to. It's very easy to install.
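For anyone following along at home, here is a minimal sketch (not the stream's code) of how audio becomes the log-mel spectrogram described above, using helpers that ship with the openai-whisper package; "speech.wav" is a hypothetical file name, and the plot is just an optional way to see the time-by-frequency picture.

```python
# Minimal sketch: audio -> log-mel spectrogram, as consumed by the Whisper encoder.
# "speech.wav" is a placeholder file name.
import matplotlib.pyplot as plt
import whisper

audio = whisper.load_audio("speech.wav")    # decode to 16 kHz mono samples
audio = whisper.pad_or_trim(audio)          # pad/trim to the 30-second window
mel = whisper.log_mel_spectrogram(audio)    # mel frequency bins x time frames

# time runs left to right, frequency bottom to top, color = energy
plt.imshow(mel.numpy(), aspect="auto", origin="lower")
plt.xlabel("time frames")
plt.ylabel("mel frequency bins")
plt.show()
```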
The weights are downloaded as you hit run, and you can get going pretty much straight away, which is great. Off the back of Stable Diffusion, which we had a look at last time, it's really cool to see all these models just being made available. I'm sure cool stuff is going to be built on top of this. Now, I will say, compared to Stable Diffusion, which was very new and exciting, this whole image generation space is just amazing, the amount of stuff that's coming out there. Speech-to-text has been around for a while. It's been pretty good for a while, but look, more open source models are always going to be good with me. Now, let me jump in here and show you what we built. I'll have a look through and show you the code. It's so simple thanks to the Gradio library, which is maintained by the team at Hugging Face. Let's have a look. This is the small GUI that I put together. You can see it takes some audio. In this case, it was just a short recording, and we can look through the different model sizes. We hit submit, and then the output here is the text that comes out on the other side. I'll just give this a quick check to make sure that it can use my microphone, and we'll see. We'll see if it's happy to use the microphone while OBS is using it as well. I don't know what's going to happen. Hopefully the stream doesn't crash. Who knows? We'll just reload the page, make sure that it's happy, and let's do a little test. We'll just do a boring test to begin with. We'll pick the base model size. Actually, I'll show you this on GitHub before we get into this. This is the OpenAI GitHub. What I want to show you here is the different model sizes that are available. We have Tiny, which is a 39 million parameter model. It takes about one gigabyte of VRAM. In the machine that I've got at the moment, I've got a 1080 Ti, which has about 12 gig. We should be able to test everything up to maybe Large. Depending on what else is being put on the GPU, it may not like that. We've got Tiny, Base, Small, and Medium. Here, you're going to have a trade-off of speed versus quality, as you might expect. The smaller you go, the faster it is. Tiny is 32 times faster than Large. I haven't tried to test any real-time transcription here, but we'll see. Naturally, you're going to have a trade-off in accuracy. Speaking of accuracy, there's a difference in accuracy depending on the model size. You can see here, this is from the paper, but it's showing various scores. Below is a WER (word error rate) breakdown by language on the FLEURS dataset. This is using Large. We won't go through the paper right now, but the cool thing here is you can see just how many different languages are available. Everything from Spanish, Italian, English, Portuguese, all the way down to languages that I do not even recognize. It's cool. This is open, ready for you to hack around with. In terms of the API, there is a command-line interface where you can just pass audio files in, choose the model, optionally choose the language, and it'll spit out the text on the other side. We're using it as a Python library. I'll show you that in a minute. This four-line example here is pretty much as complex as it got setting this up. You import Whisper, you tell it which model to use, you tell it to transcribe a file name, and it'll give you the results back as text. It's very, very cool. There are more complicated examples below which use language detection and some of the lower-level API.
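For reference, the four-line pattern being described is essentially the quickstart from the Whisper README; "audio.mp3" here is just a placeholder file name.

```python
import whisper

model = whisper.load_model("base")       # downloads the weights on first run
result = model.transcribe("audio.mp3")   # "audio.mp3" is a placeholder
print(result["text"])
```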
I think you can even get it to create the spectrogram, which would be a fun thing to visualize and we might have a look at in the future. You can run the language detection model and so on. It's MIT-licensed, which is very cool. Without further ado, let's hack around a little bit. Back to this. This is a UI that I built for this example. You can see the code up here in my terminal. Hopefully, that is readable. Let me just zoom in a tiny bit there. Hopefully, that's good for you guys. You can see here I'm using this Gradio library, which is maintained by the team at Hugging Face and allows us to create this GUI that we're seeing here on the left. I imported Whisper itself. The installation instructions on the Whisper GitHub are pretty clear and pretty simple. Make sure you have ffmpeg installed, which again is a one-line install. Then Gradio as well is just a pip install. You can see, we'll jump in here, this section here is the guts of the app. It's loading the model depending on what model size we choose, running the transcription, and returning the text. Then I put that in, honestly, the world's simplest Gradio interface. It takes audio from the microphone in this case. You can use a different source if you want to give it audio files. Let's say you have a recording of a podcast or whatever else and you want the transcription. I figured microphone was better for mucking around. You can pick between the different model sizes. We'll test a couple of those today. We'll try to come up with something somewhat tricky for it to understand. We'll see how it does. Without further ado, let's give it a go. There is a small risk that trying to grab my microphone from two sources will make it sad, but let's try it anyway. Otherwise, I'll just yell at my laptop and use the built-in mic. Okay, it seems to be pretty happy. What it should be doing now is recording all of this audio locally in my browser. When I hit stop recording, that'll save it. We've picked the base model size. I'll hit submit. That'll send it over to the AI box, which is sitting in the corner over there, so bandwidth shouldn't be a problem. We'll see what text it comes back with. Let's give it a go, shall we? We'll stop the recording. It seems to have recorded fine. We've picked the base model and we'll hit submit. If you can hear that in the background, the fan in my machine is going off. You can see here it detected that the language was English. It did some conversion behind the scenes. That's all good. Here is the output: "Okay, it seems to be pretty happy. What it should be doing now..." Look, I would say that was a pretty good interpretation. I'd say that's pretty spot on. I can't even remember exactly what I said. We'll pull up some tongue twisters, maybe. Give it a little run for its money. The other thing you should notice is just how quick that was. Between pressing submit, it sending it over and running and bringing it back was really, really fast. In this case, I'll just mute this and press play so we can get an idea of how long it was. About 7 or 8 seconds of audio. That was definitely transcribed in less than 7 or 8 seconds, I would say. We'll do another test here where we just record a short bit of audio, aim for 5 seconds or so, and we'll try and get an idea about how long it takes to compute. There's probably a much more scientific way to do this. I'm just going to use the timer on my phone, which you won't be able to see, but you'll just have to believe that I'm not lying.
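The exact GUI code being shown lives on the GitHub repo linked in the description; as a rough sketch of the same idea (microphone in, model-size dropdown, text out), something along these lines works with a recent Gradio release, bearing in mind that the gr.Audio arguments have changed between Gradio versions.

```python
# Rough sketch of a Whisper + Gradio GUI, not the streamer's exact code.
# Assumes a recent Gradio 4.x release; older versions use source="microphone".
import gradio as gr
import whisper

_models = {}  # cache loaded models so switching sizes doesn't reload every time

def transcribe(audio_path, model_size):
    if audio_path is None:
        return "No audio recorded."
    if model_size not in _models:
        _models[model_size] = whisper.load_model(model_size)
    result = _models[model_size].transcribe(audio_path)
    return result["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath"),
        gr.Dropdown(["tiny", "base", "small", "medium", "large"],
                    value="base", label="Model size"),
    ],
    outputs="text",
)

demo.launch()  # add server_name="0.0.0.0" to reach it from another machine on the LAN
```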
Alright, let's do another test with the base model and we'll do about a 5 second recording. 1, 2, skip to my loo, pineapples, pinecones, tomatoes, tulips. Okay, that was 6.65 seconds, which you can hopefully see here. I will try my best to synchronize hitting these two buttons. 3, 2, 1, go. Okay, that was extremely short. Less than 3 seconds, which is awesome. 1, 2, skip to my loo, pineapples, pinecones, tomatoes, tulips. Did a pretty good job. It's also interesting that it put the punctuation in there. The commas, I think, are a nice touch. I've seen a lot of audio models... Oh, hello chat. I did not notice you folks there. Thanks for coming along. Yeah, this is pretty cool. So, let's try one of the larger models, shall we? And what I might do here is actually pull up nvidia-smi. If you're not familiar with this, this is NVIDIA's tool to show you what your GPU is doing. And what we want to be paying attention to here is either this number here or this one down here, which shows us how much GPU memory is being used right now. And we'll test that with a few different model sizes. It should hang on to the model after it's run, so we should get a good idea about how the amount of RAM changes as we try different models. So let's start. We'll use this same sentence. We'll see if it gets any errors with this tiny model. So we'll hit submit again. We'll see if that changes at all. So you can see this one takes a little bit longer, which is interesting. There is a chance, and we'll just double check... Yeah, so the reason that took a little bit longer is that the first time you run any of these model sizes, it actually goes and grabs the weights for you, which is super convenient. I remember grabbing the weights for Stable Diffusion was a bit of a pain. So it's done this download here and then done the transcription. We'll just run that one more time just to get a better feeling for the speed now that it's done that download. Okay, so that was super quick. You can see here though, the text output, or the quality of the transcription, has changed a little bit. We've got "one to skip to my loop" instead of "loo". Pineapples, pinecones, yeah, it got those again. Tomatoes has become "two mottoes". And then tulips it got as well. So interesting to see that even between that tiny model, which if you remember is 39 million parameters, and the base model, which is 74 million parameters, we see a difference in the quality of that output. I would say that in our case, the base model, which we ran before, really got it perfectly transcribed in this fairly simple, contrived example. And let's just have a look at nvidia-smi. So it doesn't look like there's been a big change in the amount of RAM used there. It's possible it will make a bigger difference with the larger models. Let's try some of those. Again, I'll show you now live how it downloads the weights, and we'll see if this makes much difference. So you can see here the model weights for tiny were about 72 megabytes, which is pretty fantastic. I mean theoretically you could put that on a phone. You could put that on a little Raspberry Pi and potentially run this somewhat live. Now obviously this is running on the GPU, so it's going to be quick. But if you wanted to monitor something over a long period of time or do some basic transcription or whatever else, maybe a text-based baby monitor, well I guess babies don't speak, do they? Yeah, whatever. You could run that on a pretty small device, given the amount of memory that it needs.
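If you want a slightly more scientific version of the phone-stopwatch-plus-nvidia-smi test being done here, a rough benchmarking loop like the following reports elapsed time and peak VRAM per model size. This is a sketch, not code from the stream; it assumes a CUDA GPU and a hypothetical "clip.wav" recording.

```python
# Rough benchmark sketch: transcription time and peak VRAM per model size.
# "clip.wav" is a placeholder recording; requires a CUDA-capable GPU.
import time
import torch
import whisper

for size in ["tiny", "base", "small", "medium"]:
    model = whisper.load_model(size)            # downloads weights on first use
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    result = model.transcribe("clip.wav")
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"{size:>6}: {elapsed:5.1f}s, peak {peak_gb:.1f} GB -> {result['text']!r}")
    del model
    torch.cuda.empty_cache()                    # free VRAM before the next size
```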
I don't know how quickly it runs on CPU, but we could work it out. Anyway, so small is now downloaded. That's 461 megabytes compared to the 72. And you can see the results here. We'll run it again for a bit of a speed test. If we take that to be the correct answer, about 4 seconds. What I find fascinating here, though, is this one has got the text right again, but it has dropped a lot of the commas and a lot of punctuation, despite being a slightly bigger model. Just jumping to chat for a second, we've got a couple of questions here. Have you tested smaller or better yet? So hopefully you'll see here that model size isn't directly correlated with quality. I would say personally that base compared to small actually did a better job in this example. And WTalkie, I've got a GPU here, but you can see that it's now using about 2GB of GPU RAM. That should be possible to run on a CPU machine, naturally. I mean, you'll have the RAM, but it'll be slower. How much slower, I'm not sure. I might run a couple of tests offline and come back and see what we get there. But yeah, really fascinating to me that the base model, which we'll run again, unless I've got the order wrong, did a little bit of a better job there, putting the commas in and constructing a full and complete sentence. Another thing which is pretty interesting, I reckon, is you can see here the base model has used the numbers 1 and 2, whereas the small one, which we just ran before, seems to replace the numerical digits with the full words. And coming back here, yeah, small is bigger. It's a little bit slower. It's 244 million parameters compared to 74, but produces quite a different set of results. We'll try medium here. I mean, we could try large, but my guess is that it's going to run out of GPU memory. But anyway, let's try medium. Let's see how big that one is, see how much it takes up on my GPU. You can see it's 1.5 gig of weights. This is going to take a little while to download. But yeah, look, if you haven't used the Gradio library before (I hadn't before today; I'd seen it but never made use of it), it's super easy to get started. And real props to OpenAI here for making their Python interface so simple. This took me five minutes or so, and that was mainly working out how to set up the microphone recording. I put a note in the readme, but if you're running something on your local network, it's not HTTPS by default, which means that Chrome won't let you access the microphone. I had to follow these instructions to treat my origin as secure, even though it's not. Definitely don't do this on other people's websites, but if it's your own stuff, that gets around it. Actually, that's fairly decent internet speed. But yeah, you can see a big step up in the size of the weights there, about three times as big as the last one, which I guess tracks with the roughly five gig of VRAM listed here. And this is around what they call their one-to-two-times relative speed, up at these larger model sizes. And you can see that we are paying a price here for these increased weights, despite the fact that, as we've seen already, the results aren't necessarily better than the smaller models, which surprises me. But we'll see how it goes and then we'll try something a little bit more tricky. Let's have a little bit of a Google in my other browser, which you cannot see on the screen so I don't dox myself accidentally, for something to read out. Let's have a little bit of a tongue twister. See how well it does.
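On the "how much slower is it on CPU" question, the library does let you force CPU inference. This is a small sketch rather than anything run on the stream, with "clip.wav" again a placeholder; passing fp16=False avoids the half-precision warning on CPU.

```python
# Sketch: forcing CPU inference with the smallest model.
# "clip.wav" is a placeholder file name.
import whisper

model = whisper.load_model("tiny", device="cpu")
result = model.transcribe("clip.wav", fp16=False)  # fp16 isn't supported on CPU
print(result["text"])
```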
And then you can watch me butcher some Portuguese and we'll see how well it does with that. Okay, so the results from the medium model are in. Let's run that again to get a bit of a speed test. One second to... Wow, this is... So that was about 12 seconds to get the medium-sized one. And it's interesting. Okay, so some of the punctuation has come back. It's split this into two separate sentences. It has used the word "loo" as in toilet here, rather than "Lou" as in Skip to My Lou. I mean, maybe that's the correct way to spell "skip to my Lou", but I didn't think it was. And it's gone back to spelling out the numbers, which is interesting. Alright, let's try a little bit more of a complex example. Well, let's try some very crappy Portuguese to begin with and let's see how well it picks up my Portuguese, which is very out of date. So if anybody watching this is a native or better Portuguese speaker, forgive me for the atrocities that I'm about to commit. You speak a little bit of Portuguese, but you don't have time to study. So what that means roughly is I speak a little bit of Portuguese, but I don't have a lot of time to study, which is true. Alright, let's have a look at how that picks that up. And all props to OpenAI here if it picks this up. There you go. That actually did a pretty good job. So you can see here in the terminal that it has detected the language as Portuguese, which says probably more about OpenAI than my pronunciation. Yeah, it's done a pretty good job there. It's a well-structured sentence that matches what I was trying to say. My Portuguese spelling is probably not good enough for me to pick up any errors there, but that did a good job. I was on the base model again, which was the winner last time, so we'll stick with that for now. Alright, one last test. We won't run this stream too long. We'll do a bit of a longer version here, and I've got a couple of tongue twisters which I just picked up on the internet. What I'll do is we'll jump out of here. We'll go in here. I'll paste them into the terminal so we can do a little bit of a side-by-side with some of these classic tongue twisters. Alrighty, let's give this a red-hot go. We'll switch back to English. We'll do these. This should be a little bit longer, I would imagine. Hopefully running over 30 seconds so we can see if that 30-second chunking has an effect. Let's go. We'll start with... Let's go through all the different models. We'll start with Tiny. So here we go. Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, where's the peck of pickled peppers Peter Piper picked? Betty Botter bought some butter, but she said the butter's bitter. If I put it in my batter, it will make the batter bitter. But a bit of better butter will make my batter better. So 'twas better Betty Botter bought a bit of better butter. Now, as I rambled through that (and we'll hit submit; we'll get to something a little bit more technical in a minute, and I probably should keep that there so we can compare), I realized that one thing that may bias this a little bit is that there's every chance that, in training this model, other people recorded themselves saying tongue twisters and it was used in the training data. So this may be a bit of a dodgy test, but I wanted to test a little bit of a long one here. So Peter Piper picked a... Okay, so a couple of little changes here. So you can see it says peck of pickled peppers.
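The language detection that flagged the Portuguese test is also exposed through Whisper's lower-level Python API. This sketch mirrors the example in the Whisper README, with "fala.wav" a hypothetical recording, and also shows the language hint you can pass to the high-level transcribe call.

```python
# Sketch of language detection via the lower-level API (mirrors the README example).
# "fala.wav" is a placeholder recording.
import whisper

model = whisper.load_model("base")

# prepare a 30-second log-mel spectrogram on the model's device
audio = whisper.pad_or_trim(whisper.load_audio("fala.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language from that 30-second window
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode that window with the default options
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)

# the high-level call also accepts a language hint instead of auto-detection
print(model.transcribe("fala.wav", language="en")["text"])
```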
It has strung a lot of this into one big long sentence, which is interesting. I don't know, I guess there's not a lot of punctuation in the original, but I'd say overall it did a fairly good job. It's gotten bitter and better mixed up a couple of times in here, if you'll notice, and that was on the tiny model. So let's try the next size up and see if we get any differences there. And then we'll try a bit of technical jargon and see if we can trip it up at all. "Bit of better." Yeah, you can see there's still a couple of changes there between bitter and better. Again, it is a tongue twister, so it could be my fault for... Yeah, as number 16 said, I did definitely stutter as I tried to read this. Probably should have practiced. And let's just dive straight to medium, the big one, and then we'll try some technical jargon. I'll find a tab that I've got open here and we'll just read some stuff out of that and we'll see how it does with some terminology. I should have an AI paper open somewhere in this list. Okay. Yeah, I'd say that's slightly better. Not bad. Not bad, OpenAI. Alright, so I'm going to pull some stuff from one of the papers from... I'll throw it in the chat if you are interested in having a look at what I am rambling on about. The goal here is really just to do a bit of technical speaking, see how it picks up some terminology that's a little bit more advanced than tongue twisters. And what I'll do is I will jump to the algorithm section and probably read the first paragraph there, something with a bit more jargon. Okay, so I'm reading from section 2.2, Reverse Trajectory, from this paper, so if you follow that link you can have a look. I'm not going to read the math, I'll just read the words on the page and we'll see how well that does. The generative distribution will be trained to describe the same trajectory but in reverse. For both Gaussian and binomial diffusion, for continuous diffusion, limit of small step size beta, the reversal of the diffusion process has the identical functional form as the forward process. Alright, that's enough of that. Submit this one, have a look at how it's going. "The generative..." hmm. Look, I would say that did pretty well. Overall, it's not highly technical, but it is more technical than general speech, and it's done a fairly good job there. Some of the punctuation is okay. I mean, in the original paper, for example, "limit of small step size beta" was in parentheses, which it's not here, but I don't know how the model would be able to pick that up just from the changes in my voice; that does seem like a big challenge for it. But again, this base model has done a pretty good job; it's pretty quick and gets most of that text correct. And I think if you were using this for voice notes or dictating some text that you wanted typed out, it would do a pretty damn good job. Having a look here, not a huge difference in the jargon there. Yeah. Look, overall I'd say this model is functioning pretty well. I'm just trying to think if there's anything I could run through that would be too technical for it. I mean, maybe if you get deep into some fields, let's say medical. I spend a lot of time in the medical AI world, and one of the big problems in that space is transcription of voice notes from doctors about what they've worked through with the patient: symptoms, things they've diagnosed. There's a lot of technical jargon in there.
I don't have enough of that jargon myself to be able to test that here, but I can imagine that might start to be a problem. But with many of those types of problems, what you'll often do is an initial automatic transcription, and then you'll pay somebody to listen to the audio while reading that transcription and make any corrections. They can often listen at 1.5 times speed, and it's faster overall than them typing it up from scratch, because they're highly trained transcription experts. Just to wrap this up, let's take a quick look at the medium model. So you can see here that medium is using about 4GB of RAM on my GPU, which is pretty close to what they estimated. Again, that's a 1080 Ti just sitting in a fairly old machine that I bolted together. But yeah, look, I think that's about all I wanted to get through today. As I said, on my GitHub there is all of the code that we ran here. It is literally just two files: the run file and the readme. Have a go, have a play around. Big props to OpenAI for making this model available for people to play around with, and much like Stable Diffusion, I'm really looking forward to seeing what people put together. To anyone that has been watching along live, thank you so much for watching. I hope you enjoyed this. I may do more of these, so if you haven't already, subscribe and we'll continue to play with some AI now and in the future. But look, thanks so much, everybody. Have a great day.
