Master Speech to Text with OpenAI Whisper in 3 Python Lines Using Hugging Face
Learn to convert speech to text effortlessly using OpenAI Whisper and Hugging Face Transformers in just three lines of Python code. Perfect for NLP tasks!
Audio to Text Converter in Python Tutorial with OpenAI Whisper from Hugging Face Pipeline
Added on 09/06/2024

Speaker 1: Welcome to One Little Coder. In this Python tutorial, we're going to learn how to do speech to text in just three lines of Python code, or even fewer, using the Hugging Face Transformers library, especially its pipeline feature. If you have seen our channel, you know that I've used pipeline multiple times in the past; it makes it really easy to do a lot of NLP tasks, machine learning tasks, and computer vision tasks. Today we are going to use pipeline with OpenAI Whisper to do speech to text. Automatic speech recognition is not a simple task, but since OpenAI Whisper came out, a lot of applications have been built with it, primarily because Whisper is really good at what it does, and it is also multilingual. So what I'm going to show you in this tutorial is how you can use the Hugging Face Transformers pipeline feature to download a model from the Hugging Face Model Hub, specifically the OpenAI Whisper medium model, and then do speech to text. While I'm talking it might all sound like a lot, but ultimately it's a very simple Google Colab notebook with a few lines of Python code, and you get state-of-the-art machine learning for automatic speech recognition using Hugging Face Transformers and OpenAI Whisper. Let's get started.

This is the announcement that took me to this particular tutorial: Arthur Zucker posted that OpenAI Whisper has arrived in Transformers on the Hugging Face Model Hub. If you want to know a little bit about OpenAI Whisper, it is a speech recognition transformer model trained on almost 680,000 hours of audio. Can you even believe that number of hours? So I have put together a quick tutorial on how you can use it.

First, you need a Google Colab notebook. Make sure that you have a GPU runtime. If you only have a CPU, it still works completely fine, but with a GPU the inference is going to be faster, and I'll also tell you what change to make if you do not have a GPU. The first line, nvidia-smi, checks whether we have got a GPU. Yes, we have: a Tesla T4 machine, which is well and good. The next thing we're going to do is install the Hugging Face Transformers library directly from their GitHub repository. At this point, we are set: if you later want to deploy this as a web application or anywhere else, you don't have to repeat this step, because you'll specify the library requirement in your requirements.txt or config.yaml, something along those lines. The rest of the code you are going to see is the exact code to do speech to text, or automatic speech recognition (ASR).

The next step is: from transformers import pipeline. Once you import pipeline, you can use it to specify a machine learning task. For example, you can give sentiment analysis, text classification, or summarization; these are all different kinds of tasks. Here we are specifically giving automatic speech recognition.
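In notebook form, those first steps look roughly like this (a minimal sketch, assuming a Colab-style environment where lines prefixed with ! run as shell commands):

```python
# Check whether a GPU is attached (optional; everything also runs on CPU, just slower).
!nvidia-smi

# Install the Transformers library straight from GitHub, as in the video.
!pip install git+https://github.com/huggingface/transformers

# The pipeline feature is the only import the rest of the code needs.
from transformers import pipeline
```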
A point to note is that the Hugging Face pipeline already supported automatic speech recognition, but it was not using OpenAI Whisper previously because the model was not available. Now that the OpenAI Whisper medium model, and even the large model, is available on the Hugging Face Model Hub, after specifying the task type, automatic speech recognition, we can say which model we want to use: OpenAI Whisper medium. I have found Whisper medium to be really good; it's a good trade-off over the tiny and base models, and you don't need the large model. That's how we are striking the balance here, and personally, whenever I've used the medium model, I've found it to do well. The next thing is that we are specifying device equal to zero because we have got the GPU. If you do not have a GPU, or you would rather use the CPU, you can simply leave it out; in that case, device defaults to minus one. But because I've got a GPU here, I'm specifying device equals zero. Once we have done this, that is technically just one or two lines of Python code for us to do the speech recognition.

Once that is done, we need an input audio file to do speech recognition on: an MP3 file where somebody is talking. So I'm going to go to a website that has a lot of movie dialogues, and I'm taking one where the Joker talks. It says, "Starting tonight, people die. I'm a man of my word." So this is what I'm copying: right-click it, copy the audio address, come back to my Google Colab notebook, paste it here, and run this. This downloads the audio clip and saves it as a file called audio.mp3. Once the audio clip is downloaded and saved, I'm displaying it here for you to play and check that the audio works fine. Once that is ready, I'm going to call whisper("audio.mp3"), whisper being the name we gave the pipeline, and audio.mp3 the name under which we saved the file. Run this, and we have got the text output. I'm going to print it, and it says, "Starting tonight, people will die. I'm a man of my word," which is exactly what the clip says. It's from the movie The Dark Knight, where the Joker says it over a lot of noise. If you play this, you would actually hear that noise, and you can feel that it's not easy for anybody to transcribe it.

Let's pick one more. This one says, "This town deserves a better class of criminal, and I'm going to give it to them." Copy the audio address, come back to the Google Colab notebook, and paste it here so that I can download it using wget, which is a bash command, a Linux command. Once downloaded, it is saved to the same name, audio.mp3. I'm going to play it and see if the transcription is correct. I think it'll be a little difficult given that there is a lot of noise, but let's see. It transcribed "This town deserves a better class of crew"; it didn't catch the word "criminal" and caught it as "crew," followed by "And I'm going to give it to them."
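Putting the whole workflow together, here is a rough sketch of the notebook cells (the clip URL below is a placeholder, not the actual movie-dialogue link; audio.mp3 is the filename used in the video):

```python
from transformers import pipeline
from IPython.display import Audio

# Build the ASR pipeline on the Whisper medium checkpoint.
# device=0 targets the first GPU; leave it out (default -1) for CPU.
whisper = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    device=0,
)

# Download the clip and save it as audio.mp3 (placeholder URL; paste
# the audio address you copied from the movie-dialogue site here).
!wget -O audio.mp3 https://example.com/dialogue-clip.mp3

# Display an inline player to check the clip, as in the video.
Audio("audio.mp3")

# Transcribe and print the recognized text.
result = whisper("audio.mp3")
print(result["text"])
```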
So you can see that it has made mistakes with two words here. But again, the audio clip is quite noisy, and that's quite understandable because we have not done any pre-processing before sending the audio into the model. Ultimately, the point is that you can use Hugging Face Transformers, which is like a one-stop solution for a lot of machine learning problems these days, to do automatic speech recognition, especially using OpenAI Whisper, which is nothing short of a state-of-the-art model these days. And it is also multilingual: you can transcribe audio in not just English but any language. If you want to change the model, you can use a tiny model, a base model, or a large model; if you want the large model, you can specify large here, and if you want the base model, you can specify base, depending on your use case, where you're going to deploy this, and whether you're going to run on CPU. Based on these things, you can play around with it.

But overall, like I said at the start of the video, it's just three lines of Python code, as recapped in the sketch below. One: from transformers import pipeline. Two: create a pipeline that downloads the model and sets up a machine learning pipeline that will do the task for you, in this case automatic speech recognition. Three: take the object you created, whisper or pipe or whatever you named it, and give it the audio file; you have the text ready. So literally in three lines of Python code, you have a state-of-the-art machine learning model for automatic speech recognition, and not from some random library: if you have got a pipeline set up with Hugging Face Transformers, you can literally use it to do automatic speech recognition and speech to text with OpenAI Whisper. All the models are available on the Hugging Face Model Hub. I hope this tutorial was helpful to you. The Google Colab notebook will be linked in the YouTube description. If you have any questions, let me know in the comment section. Stay safe. Peace.
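A minimal sketch of that three-line recap (the checkpoint name is swappable, e.g. openai/whisper-tiny, openai/whisper-base, openai/whisper-small, or openai/whisper-large, and omitting device, which defaults to -1, keeps everything on CPU):

```python
from transformers import pipeline  # line one: the import

# Line two: build the ASR pipeline; swap the checkpoint for a different size.
whisper = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",  # or openai/whisper-tiny / -base / -small / -large
    device=0,                       # omit (defaults to -1) to run on CPU
)

# Line three: transcribe and print the text.
print(whisper("audio.mp3")["text"])
```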
