Speaker 1: While the machine learning world is still very busy partying with diffusion models, there's a new transformer model on the block, released late this September, called Whisper. And instead of being yet another transformer text generation model, it's actually an automatic speech recognition, or speech-to-text, model. There are many ways that we can actually check out this model. It's fully open-sourced for inference, and you can just download the model and use it. The team behind Whisper is... oh, I'm sorry, there's got to be some sort of technical difficulties going on here. There must be a mistake. Does that say OpenAI? Yeah. Okay. Yeah. It's an actually open-ish model from OpenAI that you can just download and use. Incredible.

So let's see some quick examples of this model's performance, and then we'll jump into the paper, which actually has a surprising amount of general AI insights beyond just telling us more about the model and project itself. I encourage you to at least try the HuggingFace web app implementation if you have any sort of mic available. I think theirs is actually running on the CPU, so inference time is about nine seconds or so when I tried it, for a few seconds of audio. On a GPU, you should be seeing inference times of maybe 500 milliseconds to one or two seconds for multiple seconds of audio. Also, the HuggingFace implementation is using the small model. There's a variety of model sizes, and each of those is going to have varying performance in terms of accuracy as well as inference speed. So anyway, let's check those out now.

Here's a simple example and implementation using a sound sample that I have recorded: "So just how good is this OpenAI Whisper? There's a lot of background noise in this recording. I haven't turned off my air conditioning." To transcribe this, we can use the sample code from the GitHub, and again, we have to respect how exceptionally simple this all is to use. It was one line to install everything and then just a couple of lines of code, essentially, to get the transcription. On first run, you might find the time it takes is a little bit more than, say, 500 milliseconds, but this is mostly because the model has to load onto the GPU; all your subsequent inferences will be much faster. For the purposes of testing these models, I've made a range of audio samples of the exact same sentence with varying quality, from that baseline decent-quality example with some basic background noise that you've heard already, all the way up to this one.
Speaker 2: So just how good is this OpenAI Whisper? There's a lot of background noise in this recording. I haven't turned off my air conditioning.
Speaker 1: these is just by recording the previous variant just out from my speakers into my microphone. Each time this re-recording just sounds worse and worse, but the microphone was like three feet from the speakers, the air conditioner is still running, and just yeah, everything just slowly kind of degrades over time. I stopped at four just because I feel like people at this point, at least myself, would start getting some, if not many, of the words wrong. So I didn't really see a point in continuing that process to just complete gibberish. From here, I wrote a quick script to iterate through each of the model sizes and then each of the audio recording qualities to see both how quick inference times can be between the various model sizes and to get a very kind of basic general look at how well can they perform from decent quality to very subpar quality data. Again, this is just like one sentence, one sample. There's just a tiny little amount of words compared to the entire data set of possible words. So this is obviously a very, very basic test, but just a general vague idea. I found the results to be shocking. On the worst quality sample, I found both medium and large to do a pretty good job, really just confusing the have to versus the have not turned off my air conditioning and so on. But the other stuff, it actually got quite right. Definitely very impressive results, and all models perform inference faster than real time, at least on a 3090 GPU. Obviously, this is going to vary depending on where you run the models, but even that tiny model is pretty good if the quality of audio is also pretty good too. And both the tiny and base models only require about a gig in memory to run, which is pretty awesome. And even the largest model so far that's available is only 10 gigs of memory, which is very comfortable for today's day and age. So let's check out the associated paper, which I find most interesting for, it's not just about the model, but really probably most of the golden nuggets in here are just the more generic implications and findings for machine learning and AI models in general. First off, the whisper model is what they're calling weakly supervised, which just means it's trained on not perfect gold standard training data. It's data with imperfect audio recordings and background noise and all of that. Notably, there is way more, like orders of magnitude more weak data than gold standard quality data for audio training. And when it comes to speech to text, I would say the reality of using speech to text is not gold standard many times, if not the majority of times. I don't really know, but all the times I could think of where people are using speech to text, most of the time, I think it's from a poor microphone in a phone or maybe a lack, I don't want to say the name and trigger people's, but anyway, those devices. And those are all going to be very imperfect settings. So you really do want speech to text to work in an imperfect setting. I think that's the most common scenario when you're using speech to text. Now we have gold standard training data that is about a thousand hours. And just a few years ago, the gold standard training data was only about 50 hours. So as time goes on, we do have a very quickly growing gold standard data set. And with enough time and grad students, this could eventually be hundreds of thousands of hours and maybe even millions of hours. 
That said, the amount of audio available online that isn't gold standard is always going to outnumber the gold-standard quality audio, right? And beyond this, there's a slight question, as we'll get to in this paper: do we really want only gold-standard data? Is that the ideal for training, especially with speech to text? So the question is, can we use this data to make a good audio-based model? And if so, how might it compare to the current state-of-the-art models? In this case, that's the state of the art for speech to text, but there are many questions and implications from the findings here, arguably very comparable to findings in, say, some of the latest image models.

So the first insight here concerns a current issue with fine-tuning models, especially true with speech models. I think it's just very evident with speech models versus some of these other kinds of models, because with a speech model there is a very right and a very wrong answer, whereas with typical GPT-style generation models there are many outputs that could have been the case. Same thing with image generation models: there are many images that would match a prompt you might put in. So it can be tough to validate those models, basically, beyond whether they produced grammatically correct output and all that. So the first insight here is that when you're fine-tuning these, let's say, speech-to-text models, you might have an exceptionally robust model that can differentiate between phonemes and different words and characters and do that super well. With speech to text, there are many words that sound exactly the same, right? But based on the context of the preceding words, we know what that next word should be. So you might have a model that performs exceptionally well, and then you go and fine-tune it on, say, a new speaker, and very quickly that model loses a lot of its robustness; it's highly likely to very quickly overfit to that new speaker. One tactic that I've personally seen to handle this situation is from NVIDIA, and I don't know if they're the ones that really came up with it, it's just the first place I saw it: basically, you fine-tune on new data mixed in with original data. This might be original data that was held out specifically for this purpose. Even though you're trying to fine-tune to some new speaker, you're mixing in the old data as well, and this seems to help with the overfitting to the new speaker you're fine-tuning on.

OpenAI here wonders whether the gains seen with image models, where datasets are going far beyond highly curated and perfected datasets like ImageNet to sources more like DeviantArt and such, can carry over to audio models as well. And again, I think there's an important differentiator here between speech to text versus text to speech. So with text to speech, I think it'll be interesting to see whether someone can use this kind of weakly supervised model to go from text to speech, because with text to speech, you really want the produced speech to sound amazing, right? You want it to sound as high quality as possible. Whereas with speech to text, you don't have that problem; you're just trying to take speech and get the correct words that were said. So I'm very curious to see if anything comes of that. Maybe you primarily train the model on imperfect-sound-quality data, and then you fine-tune it on the gold standard, maybe mixed in. I don't know. I have no idea.
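(A minimal, generic sketch of the data-mixing idea described above: blending held-out original data into fine-tuning batches for a new speaker. This is not NVIDIA's or OpenAI's actual recipe, and the dataset variables are placeholders, purely to illustrate the mixing.)

    import random

    def mixed_batches(new_speaker_data, held_out_original_data, mix_ratio=0.3, batch_size=8):
        # Yield fine-tuning batches where roughly `mix_ratio` of each batch is drawn
        # from the held-out original data and the rest from the new speaker's data,
        # which is intended to reduce overfitting to the new speaker.
        n_orig = max(1, int(batch_size * mix_ratio))
        n_new = batch_size - n_orig
        while True:
            batch = random.sample(held_out_original_data, n_orig) + random.sample(new_speaker_data, n_new)
            random.shuffle(batch)
            yield batch

    # Toy usage with placeholder (audio path, transcript) pairs
    new_speaker = [(f"new_clip_{i}.wav", f"transcript {i}") for i in range(100)]
    held_out = [(f"orig_clip_{i}.wav", f"transcript {i}") for i in range(100)]
    batches = mixed_batches(new_speaker, held_out)
    print(next(batches))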
But just take note that this direction of using imperfect data, I feel, can only go one way, right? With dirty speech to text, you can't really train on text to dirty speech and then expect that to sound good in practice. Anyway, the training dataset itself for this model is 680,000 hours of audio, 117,000 hours of which cover 96 languages other than English, and 125,000 hours of which are other-language audio to English text translations. It's interesting and cool to fold in those capabilities, but it's also fairly interesting that they're finding that for large models, and I would argue with enough data, there seem to be no drawbacks and only gains to doing things like supporting multiple languages, so doing the transcriptions for many different languages all in the same model, and doing different types of tasks, so multitasking all in the same model. In this case, we're transcribing many different languages, but then we're also capable of translating many different languages.

Historical convention was to always keep things as narrow as possible for success, but with larger models, and arguably larger datasets for those models, we seem to be finding that mixing tasks, and the training data to support those tasks, adds a sort of generalization and robustness. As we'll see later, models with mixed tasks and the data to support them tend to simply perform holistically better, even on those single narrow tasks, than models that were trained to do just those narrow tasks. So rather than confusing the model by having multiple tasks and multiple languages, it seems as though it actually helps to do a little more generalization. So even on the task of transcribing just English, the models that were trained purely to transcribe English did worse than the models that were trained to transcribe English, transcribe other languages, and then also translate those other languages back to English. I'm probably going to mix up translate and transcribe somewhere in this video; I apologize in advance.

As for the dataset and the training data in general, OpenAI did not curate the datasets very much. The audio quality varies a ton, and the speakers obviously vary a ton. The main focus seems to have been on making sure transcript quality was good and trustworthy, so they tried to remove cases where they could detect that yet another speech-to-text system was being used to generate the transcripts. Somewhat ironically, I think this is the type of issue that we're going to be facing more and more as time goes on, especially, for example, with image-based models. With the influx of all these new AI-generated images, these models are, if we're not careful, going to very quickly be training on themselves, and as that happens, the quality and the diversity and the, I guess, creativity of those images is highly likely to just devolve over time. So I think we're going to start seeing the rise of models that can detect things like: was this transcript generated by an AI? Was this image generated by an AI? Was this audio generated by an AI? And so on. If not for many other reasons, then at least for continuing to train those AIs. It's quite the ironic problem to be having. To actually train this model, the training data was broken into segments of 30 seconds in length.
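(To make the multitask point above concrete: with the released whisper package, the same loaded model handles both multilingual transcription and X-to-English translation just by switching the task option. The audio filename is a made-up placeholder; treat this as a sketch of the interface, not official documentation.)

    import whisper

    model = whisper.load_model("medium")  # one of the multilingual checkpoints

    # Transcribe non-English speech in its original language
    in_spanish = model.transcribe("spanish_sample.wav", task="transcribe", language="es")

    # Same model, same audio, but ask for an English translation instead
    in_english = model.transcribe("spanish_sample.wav", task="translate")

    print(in_spanish["text"])
    print(in_english["text"])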
So then later, when you're actually doing inference, if you have audio that's longer than 30 seconds, it gets chunked out and then kind of built back together in sections. The model itself is an encoder-decoder transformer, and the tokenizer for the text is a byte-level byte-pair encoder, the same one that we've been seeing now for quite some time. The single model does the entire job of detecting the language, detecting the target task, like translate or transcribe, and so on. To control this model behavior, detecting the language, and whether we want to translate or transcribe, they're actually just using special text tags like translate or transcribe, which is very interesting to see for this sort of task. I wonder if we'll start to see this addition of task-type tags (that's a hard one to say) in future GPT-style models. What they found here is that, yes, mixing in these different tasks actually added to robustness and generalization. So will future GPT models have these intended task types beyond just generating text? I think that'll be very interesting. And if they did, what would those tasks be? Because it's also seemingly questionable whether the task itself actually matters: as long as the model does the task correctly, does it matter which ones you throw into the model, or do you just want to have some different tasks purely for generalization's sake? I have no idea. I look forward to seeing, if they do have task tags, what they would be, and how that would work, and all that. Anyway, it's very interesting to think about as we go forward.

So here we have the pipeline illustration for the model, to give you possibly a better understanding of how the model functions. But again, I think the biggest takeaway here is the successful implementation of these tokens that denote things like: what do we want the rest of this generation to be doing? Do we want it to transcribe, translate, all that? And what language is it? All that stuff. Because that is the sequence the model has always seen, the fact that it can even just detect language this way is very interesting. So I'm not totally shocked that this works, but mixing especially translation into a model that primarily just transcribes, as well as tossing in 96 languages other than English, it's very curious that it works so well. The model sizes vary from 4 to 32 layers and from 39 million up to 1.5 billion parameters; these are the models that I tested earlier for quality and inference times.

Section 3.3 is yet another important insight into this sort of generalization for AI models. I think the point here is that you've got generalization when it comes to the entire training dataset that you've used versus generalization to the actual task itself, and this is where the in-distribution and out-of-distribution terms come from, and arguably an entire new field of research to get more models to be more successful out of distribution on tasks as well. The point being that models trained on ImageNet, for example, a highly curated, gold-standard type of dataset, may outperform humans in various classification tasks even on data held out from training but still part of that ImageNet dataset. So a model might have superhuman performance in this case, but then if you go and grab random images from the internet as true out-of-distribution samples, you find that, again for the same classification task, humans tend to do better.
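(Going back to the 30-second window, the language detection, and the translate/transcribe tokens from the pipeline discussion above, here is the lower-level flow, roughly following the sample code in the Whisper README at the time; the filename is a placeholder and details of the API may have shifted since.)

    import whisper

    model = whisper.load_model("base")

    # Load the audio (requires ffmpeg on the system) and pad or trim it
    # to the fixed 30-second window the encoder expects
    audio = whisper.load_audio("audio_sample.wav")
    audio = whisper.pad_or_trim(audio)

    # Compute the log-Mel spectrogram and move it to the model's device
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # The model itself predicts the language token
    _, probs = model.detect_language(mel)
    print("Detected language:", max(probs, key=probs.get))

    # The task is controlled by a special token too: "transcribe" or "translate"
    options = whisper.DecodingOptions(task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result.text)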
And this is an open question, but there seems to be this notion of fitting to that sort of gold standard, or some other attribute of that dataset, such that as soon as you go to real-world applications, suddenly the AI is underperforming, and this seems to be plausibly why. What I think OpenAI, and probably others, are proposing here is that it's possible this is due to models sort of overfitting to the dataset's style or quality or other factors, and that possibly it's really just as simple as being overfit to a sort of gold standard and simply not being able to handle the noise of reality, whereas humans suddenly do a much better job because humans have a more truly generalized approach to solving the problem of, say, image classification, or even speech to text for that matter. One thing I can also verify here is on page eight, where OpenAI notes that NVIDIA's speech to text outperforms on the gold-standard, high-quality datasets, but then underperforms compared to Whisper on datasets with more noise, which I definitely agree with and have experienced myself, and which I think is exemplified at the beginning of this video, showing how much we can distort the audio and Whisper still performs at what I would call a superhuman level. At minimum, it's super-Sentdex level. It did get some words wrong, but I definitely would have gotten those words wrong as well if I didn't know the transcript already.

An obvious question is what might happen if we continue to increase the model size. The largest Whisper is 1.5 billion parameters, versus something like GPT-3 with 175 billion parameters. I think the main concern at the moment is that too large a model will be able to overfit the data that's available. We can already see that English speech recognition performance does not really vary much from 768 million parameters to 1.5 billion, so it's not really looking like much is going to change there if we just increase the model size. This would begin to suggest, at least to me, that a larger model might improve maybe the multilingual speech recognition and probably the translation, but it seems like mainly it's going to come down to dataset size, which I think we can already see in Table 6. And this is also curious because there's actually way less data for those tasks, and yet it seems like those continue to improve as the model size increases.

And again, getting back to these tags, I start to wonder: if you wanted to do text to speech, could you use this weakly supervised dataset to get a good baseline audio model, and then, if you want to produce audio, maybe have something like a clean-audio tag, where that tag is attached only to gold-standard audio? So far you just have an audio model that can do speech to text, and I'm not sure you can go the other way, I don't think. Anyway, I'm kind of thinking these things through live here, but I think you could possibly have at least a text-to-speech model that could still do the translation, so, you know, maybe input English text and then output Spanish audio. I think you could still probably have that, and then to clean up the audio, maybe you just have something like a clean-audio tag. So the question is, would that produce a better model?
Because based on what I'm reading in this paper so far, I think that logic holds: you could still train that baseline model, and then for the task of "clean audio only, please," you could have a much smaller subset of data, and that model should probably perform better. But I am curious: would the audio actually sound clean? Anyway, I don't know. So finally, OpenAI does go that step further with questions about model size and multitask performance, where they note that for small models, the incorporation of multiple tasks and multiple languages does seem to cause degradation when compared to English only, so, again, compared to models that are trained just to transcribe English. The smaller models that just transcribe English do benefit from having their task be very narrow: only training on English transcriptions, only doing English transcriptions, the smaller models tend to do well. But interestingly, and most importantly, for the larger experiments, the joint models, the ones that do transcriptions and translations and all that, outperform the English-only models. This is quite the insight, and I think an overall shift in trend that we're going to be seeing for years to come with data quality, quantity, model size, and model scope; sometimes model scope purely for the purposes of generalization and nothing more. In the end, I think we're going to wind up with far more out-of-distribution, generalized AI from this, and then also more powerful narrow AI, due to this seemingly interesting behavior of somehow doing better when you just toss more tasks and more types of data at the model.

And again, I'm curious to hear what you guys think if you've gone through this paper and really thought about this, or maybe you know a little bit more about mixing these tasks in. What do you think about a text-to-speech model that uses this weakly supervised data to get a general text-to-speech capability, and then maybe one of those task tags? Because in this case, the thing they're verifying and validating is how well the English transcription works; they did not, unless I'd have to go back and check, compare against a model that did just, let's say, translation. Did the translation model also do better, or would it have been better off being a narrow AI, because it had way less data? That I don't know, and I wonder. I wonder because, in my example of doing speech to text, or rather text to speech, and then having a gold-standard tag, would that make clean audio then? Because that's super important, at least for text to speech. Yeah, I don't know. Interesting questions.

Anyway, thank you to OpenAI for sharing your model and your insights and all of that with us. It's very cool to see from OpenAI, and I'll at least be using Whisper, I'm pretty sure, for all of my current speech-to-text needs, and then possibly translation. There's a lot of really cool apps that can be made from these models, and again, these are pretty powerful, yet pretty lightweight, models. So yeah, really cool. That's all for now. I will see you guys in another video.
Speaker 2: Bye.