Speaker 1: Hey YouTube, in this video I'm going to show you how you can quickly convert any audio into text using Whisper, a free, open-source Python package. I'll show how I installed it, walk through an example of how I ran it, and compare it to an existing library. Starting off, you'll probably want to go to the Whisper GitHub repository that we're looking at here, where they give instructions on how to install it. One thing to keep in mind: if you pip install just the name "whisper", it's not going to install the right package. We want to install from the Git repository, so take that pip install command and run it in the environment where you're running Python. They also mention that you need FFmpeg installed; there are instructions for that, but I already had it on my computer. Now that I have Whisper installed, let's make some audio I can test it on. I'm going to say some idioms, since idioms are usually hard for models to understand, and even though this is just speech-to-text, this will be kind of fun: "I would love to be on cloud nine as a one-trick pony that wouldn't hurt a fly. I'd be like a fish out of water and as fit as a fiddle to be under the weather." Let's save this off as a WAV file. They do have instructions for running Whisper straight from the command line once it's installed, but I'm going to show you the Python API, which they also document. It's really simple: we import whisper, load the model called "base", and then call transcribe on our audio file using that model object. I named the file "idioms", so let's use the WAV version and have it return the result. Now, I noticed when I ran this before that I got an error about a mismatch between CUDA's HalfTensor and FloatTensor types, which I was able to solve, so that's something to keep in mind.
If it doesn't work for you, you might need to set fp16 to False. You can see after it's run that it already detected the language as English. The result object has a few different fields in it, but what we want is just the text, and we can see the result looks good: "I would love to be on cloud nine as a one-trick pony that wouldn't hurt a fly. I'd be like a fish out of water..." It did mess up a little bit on "fish out of water and as fit as a fiddle"; maybe I didn't say it clearly enough. Another thing to know is that when you first run this, it has to download the base model, so you might see a progress bar going across while that downloads. And the docs say that when you run transcribe, it's actually taking 30-second chunks of your audio file and running predictions on each one. There's also another, lower-level approach you can take, where you create the model yourself, create the audio object, and pad or trim it. What that does is make sure the audio chunk is exactly 30 seconds, either trimming it down or padding it out, since that's the input length the model expects. Then it makes a log-Mel spectrogram, detects the language, and decodes, and at the decoding step we can provide a lot more options if we want to. If I run this cell, I again get that error, which I can now fix by setting fp16=False in the DecodingOptions. And this time it actually looks like it got everything correct: "I'd be like a fish out of water and as fit as a fiddle." So that's it for Whisper. I just want to compare it to an existing kind of model, and a popular library for doing this is the SpeechRecognition library. The way we run it is we import it and create a Recognizer object, which we then load our audio file with. After that, the recognizer object has a few different recognize methods you can call.
And we're going to use recognize_google, so let's see what the result is. It looks like it didn't add any punctuation, and the "cloud nine" part came out differently: "I would love to be on cloud nine as a one trick pony that wouldn't hurt a fly." The one thing to keep in mind is that this is actually calling out to the Google speech recognition API, whereas with the Whisper library you have the model downloaded locally and it's yours to use. I also recommend you take a look at the Whisper paper, which was released with this code; it goes into detail about how the model was trained and the architecture it uses. Whisper works on a bunch of different languages, though they say the performance varies by language. You can go to the GitHub repo, where they have a plot showing which languages it performs best on; smaller bars are better and larger bars mean worse performance. So it's still pretty impressive how many languages this model works on.