Speaker 1: Hello and welcome to the channel. Even in this age of foundation models that can produce diverse content, advanced speech recognition remains highly relevant. This technology is not only driving key functions across sectors like healthcare and fintech, helping with tasks like transcription, but it is also powering very capable multimodal AI systems. Last year, OpenAI really rocked this field with its Whisper model, and since then Whisper has become a de facto standard in heaps of technologies and AI-powered applications across the globe. The Whisper model converts user audio into text, enabling an LLM to process the query and provide the answer, which is then converted back to speech. Due to its ability to process complex speech with different languages and accents in almost real time, Whisper has emerged as the gold standard in speech recognition, and it is already downloaded more than 5 million times every month. Just let that sink in. But what if a model could recognize and transcribe speech even faster than Whisper? That is exactly what this new model, Whisper Medusa, claims to do. Interestingly enough, Medusa is a figure from Greek mythology: she was a monster, one of the three Gorgon sisters, with snakes for hair and the ability to turn people to stone with a single glance. Medusa stands for strength, power, and protection, and that is what Whisper Medusa is trying to build upon to improve the performance of the existing Whisper model. Whisper Medusa is an advanced encoder-decoder model for speech transcription and translation, processing audio through encoding and decoding stages.
Given its large size and slow inference speeds, various optimization strategies like Faster-Whisper and speculative decoding have been proposed to enhance the performance of OpenAI's Whisper model, and that is where this new Whisper Medusa comes in: it builds on Whisper by predicting multiple tokens per iteration, which significantly improves speed with only a small degradation in word error rate (WER). They have trained and evaluated their model on the LibriSpeech dataset, demonstrating strong speed performance. In this video, we are going to install it locally, and then we will use a simple audio file to do the transcription. Before I show you the installation, let me give a huge shoutout to MastCompute, who are sponsoring the VM and GPU for this video. If you are looking to rent a GPU at affordable prices, I will drop the link to their website in the video's description, plus you will also get a coupon code for a 50% discount on a range of GPUs. Another thing I wanted to share is that the speech recognition in Whisper Medusa comes from a very high-quality training approach. When training Whisper Medusa, they employed a machine learning technique called weak supervision. As part of this, they froze the main components of Whisper and used audio transcriptions generated by the model itself as labels to train additional token-prediction modules. This is really unique and amazing: they chose to train their model to predict 10 tokens on each pass, achieving a substantial speedup while retaining accuracy, but the same approach can be used to predict any arbitrary number of tokens in each step. Since the Whisper model's decoder processes the entire speech audio at once, rather than segment by segment, this new method reduces the need for multiple passes through the data and efficiently speeds things up. And that is why it is faster than Whisper. Ok, enough talk, let's go and install it. This is my Ubuntu system, where I am running Ubuntu 22.04, and I will be using an NVIDIA RTX A6000 GPU with 48GB of VRAM.
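To make the multi-token idea concrete, here is a minimal pure-Python sketch of Medusa-style decoding. To be clear, this is not the actual Whisper Medusa code: the "base model" is a toy deterministic function, the extra heads are simulated, and all the names (`base_next_token`, `draft_tokens`, `medusa_generate`) are made up for illustration. The point it demonstrates is the mechanism described above: draft several tokens cheaply, verify them against the base model, and accept the longest agreeing run, so each step can emit more than one token without changing the final output.

```python
# A minimal pure-Python sketch of Medusa-style multi-token decoding.
# NOT the real Whisper Medusa implementation; the "model" here is a toy
# deterministic function and the extra heads are simulated.

def base_next_token(prefix):
    # Toy stand-in for a full decoder pass: next token from the prefix.
    return (sum(prefix) * 3 + len(prefix)) % 10

def draft_tokens(prefix, k):
    # Simulated Medusa heads: cheap guesses for the next k tokens.
    # To mimic imperfect heads, the guess is deliberately wrong whenever
    # the running length hits a multiple of 5.
    drafted, cur = [], list(prefix)
    for d in range(k):
        guess = base_next_token(cur)
        if d > 0 and len(cur) % 5 == 0:
            guess = (guess + 1) % 10  # inject a wrong guess
        drafted.append(guess)
        cur.append(guess)
    return drafted

def medusa_generate(prefix, n_tokens, k=4):
    # Draft k tokens per step, verify them against the base model, and
    # accept the longest agreeing prefix plus one corrected token.
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        cur = list(out)
        for guess in draft_tokens(out, k):
            true_tok = base_next_token(cur)
            cur.append(true_tok)       # always keep the verified token
            if guess != true_tok:
                break                  # first mismatch ends this step
        out, steps = cur, steps + 1
    return out[len(prefix):len(prefix) + n_tokens], steps

def greedy_generate(prefix, n_tokens):
    # Plain one-token-at-a-time decoding, for comparison.
    cur = list(prefix)
    for _ in range(n_tokens):
        cur.append(base_next_token(cur))
    return cur[len(prefix):]

tokens, steps = medusa_generate([1, 2], 12, k=4)
print(tokens, steps)              # same tokens as greedy, in only 3 steps
print(greedy_generate([1, 2], 12))
```

Because every accepted token is re-checked by the base model, the output matches ordinary one-at-a-time decoding exactly; the speedup comes purely from accepting several verified tokens per step, which is why the real model can trade almost no WER for a large latency win.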
First up, let me create a Conda environment so I can keep everything separate. I am calling it medusa, with Python 3.11. Let's wait for it to finish installing. Now let's install some of the prerequisites, which include Torch, TorchAudio, and TorchVision. Let's wait for them to get installed. And that is all done. Now let's git clone the repo of this Whisper Medusa. Sorry, not this one, let me put the proper command here. Let me git clone Medusa; that is done, and I have also changed into that directory. Let's install all the prerequisites from here. It is going to take a bit of time. And that is all done. Now let me install and launch my Jupyter notebook, where we will download this model and then play around with it. The Jupyter notebook has been launched in a browser. Now let me show you the code which is going to download the model, and then we will also give it a sample audio file. Before that, let me try to explain how this code works. First, we are simply importing these libraries. Then I am getting the Whisper Medusa model and its packages from the repo which we have just cloned. We specify the model name and its repo on HuggingFace, and then I give it the path to an audio file on my local system and specify the sampling rate as 16000. The sampling rate is the number of times per second that an audio signal is measured and recorded. It is a fundamental concept in digital audio processing, by the way: it determines how often the audio signal is sampled, or captured, and it is measured in hertz. A higher sampling rate means more frequent samples, resulting in a more accurate representation of the audio signal.
Okay, so here we are setting it to 16000 hertz, or 16 kilohertz, which is a relatively low sampling rate, but we are just doing speech recognition, and this value focuses on capturing the frequency range of human speech, which is roughly between 100 hertz and 8 kilohertz, so this value is more than okay. The language here is set to English. Because I am using an NVIDIA GPU, I am specifying CUDA, but you can even use the CPU with it; it might be slower, but it will work. Then we simply load the audio with torchaudio from the path, specifying the sampling rate, pass it as an input to the model, receive the output generated by the model, and then print it. So let me run this code and see how it goes, and you can see that it has started downloading the model. The first shard is being downloaded, so let us wait for it to complete. The model is not that huge, around 6.5 gigabytes, that sort of figure, and the model is downloaded, as you can see, along with the tokenizer and all the special tokens. And this is the transcription which it has done from the audio, perfectly fine, because this is what I said in that audio, which is a very small wav file, and the speed is simply awesome. As another example, I have now given it an mp3 file. This is just a small paragraph from this book, so let us see how it goes. I am just running it. It is going to load the model on the GPU, all the shards, and then it is going to do the inference on that mp3 file, and I will let it run just to show you the speed of it. There you go. So it has given us this whole chapter from the book; I am only printing out a limited number of tokens, and if you want you can print more, but you can see that it was so quick that it got the transcription in a jiffy. So, all in all, real good stuff. I think it lives up to its name, Medusa.
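The sampling-rate reasoning above can be checked with quick back-of-the-envelope numbers. This is a small illustrative sketch (the 10-second clip length and the 44.1 kHz mp3 source rate are assumed example values, not from the video): by the Nyquist criterion, sampling at 16 kHz can represent frequencies up to 8 kHz, which covers the speech band, and any higher-rate audio must be resampled down before it goes into the model.

```python
# Quick numbers behind the 16 kHz choice (assumed example values).

SAMPLING_RATE = 16_000           # samples per second (Hz)

# Nyquist: a signal sampled at rate R can represent frequencies up to R / 2.
nyquist_hz = SAMPLING_RATE / 2   # 8000.0 Hz

# Human speech mostly lives between ~100 Hz and ~8 kHz, so 16 kHz is enough.
SPEECH_MAX_HZ = 8_000
assert nyquist_hz >= SPEECH_MAX_HZ

# For a 10-second clip, that is this many samples:
duration_s = 10
n_samples = SAMPLING_RATE * duration_s      # 160000 samples

# If an mp3 arrives at 44.1 kHz, it must be resampled down to 16 kHz;
# the number of samples shrinks by the ratio of the rates:
src_rate = 44_100
src_samples = src_rate * duration_s
resampled = src_samples * SAMPLING_RATE // src_rate   # back to 160000

print(nyquist_hz, n_samples, resampled)
```

This also shows why 16 kHz keeps things fast: a 10-second clip is only 160,000 samples instead of the 441,000 you would get at CD-quality 44.1 kHz, with no loss in the frequency range that matters for speech.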
I will drop the link to it in the video's description; play around with it and let me know what you think. If you like the content, please consider subscribing to the channel. If you are already subscribed, please share it among your network, as that helps a lot. The code which I have used, I am going to put in my blog, and I will drop the link in the video's description. So enjoy, and thanks for watching.