Explore Whisper Jax: 70x Faster Speech Recognition
Whisper-JAX 70x faster Speech to Text AI Transcribe YouTube Video Faster
Added on 01/29/2025

Speaker 1: This is Hrithesh Srinivasan and welcome to my channel. In this video, let's look at Whisper JAX. Whisper JAX is a highly optimized Whisper implementation for both GPU and TPU. I saw a tweet from Sanchit Gandhi at Hugging Face saying they have made Whisper 70x faster. So what is Whisper? Whisper is an automatic speech recognition system from OpenAI. It was trained on a huge dataset and has exceptional performance. They have taken that model and done a JAX implementation, which is 70x faster than the PyTorch code. So what is JAX? JAX is a machine learning framework from Google for transforming numerical functions. They have a demo, which I couldn't test because I kept getting a gateway timeout, but they also have a GitHub page with a Kaggle notebook. In that notebook, they demonstrate how they can transcribe 30 minutes of audio in approximately 30 seconds. So let's open this notebook and try it out. I'm not going to try out that 30-minute audio; what I want to try instead is transcribing a YouTube video. There is an explanation of Whisper JAX over here: Whisper JAX is a highly optimized JAX implementation of the Whisper model by OpenAI, built on the Hugging Face Transformers Whisper implementation. Compared to OpenAI's PyTorch code, Whisper JAX runs 70x faster, making it the fastest Whisper implementation. To get started, this is run on TPUs. So what is a TPU? You can check that out over here. TPUs, or tensor processing units, are hardware accelerators specialized for deep learning tasks, created by Google. In Kaggle, you can launch notebooks with TPU accelerators. The TPU v3-8 is specialized hardware with four dual-core TPU chips, for a total of eight TPU cores.
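On such a runtime you can confirm the accelerator from JAX itself. A minimal sketch (on a machine without a TPU, JAX simply falls back to the CPU, so the count will be 1 rather than 8):

```python
import jax

# On a Kaggle TPU v3-8 runtime this reports eight TPU cores;
# without a TPU, JAX falls back to a single CPU device.
devices = jax.devices()
print(jax.device_count(), devices)
```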
This board provides significantly more computational power for mixed-precision operations and matrix multiplications; basically, it is hardware optimized for deep learning tasks. So what you require over here is a TPU. In the notebook, under the accelerator setting, you have to select TPU v3-8. Once you select that and run `import jax` followed by `jax.devices()`, it will show you the eight TPU devices packaged into one accelerator. Then you need to install the whisper-jax package, which is what is done over here. Then you can load a pipeline. To load a pipeline, you use the FlaxWhisperPipeline class. The model is the large-v2 model in bfloat16 half precision; half precision speeds up the computation considerably. They also make use of batching: for a single audio input, the audio is first chunked into 30-second segments, and then the chunks are dispatched to the model to be transcribed in parallel. By batching and transcribing in parallel, they get a 10x speed-up compared to transcribing the audio sequentially. So this is how the pipeline is instantiated: from whisper_jax you import FlaxWhisperPipeline, and the model is openai/whisper-large-v2. Then they create a compilation cache, which speeds up compilation if we close our kernel and want to compile the model again. What I have added over here is how you can download a video using pytube. For that, I use the pytube library and import the YouTube class from it. This is a link to one of my videos, and then I can actually download the video over here.
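The 30-second chunking idea described above can be sketched in plain Python. This is a simplified illustration of the batching strategy, not the library's actual code; the helper name `chunk_boundaries` is my own:

```python
def chunk_boundaries(duration_s, chunk_s=30.0):
    """Split an audio duration into fixed-length (start, end) pairs.

    Mirrors the idea described above: long audio is cut into
    30-second segments that can then be transcribed in parallel
    as one batch instead of sequentially.
    """
    boundaries = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        boundaries.append((start, end))
        start = end
    return boundaries

# A 95-second clip becomes four chunks, the last one shorter:
print(chunk_boundaries(95.0))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 90.0), (90.0, 95.0)]
```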
That is what I do over here: I create a YouTube class object, get the streams for that video, and filter down to the audio stream with tag 139. If you look over here, you will see audio tags like 133, 134, and 139. So I download the audio stream of that particular video, and then I point a variable called audio to the downloaded file name. Now, they say over here that the pipeline accepts not just datasets but also audio files directly: you can pass mp3, WAV, or FLAC files. But when I first tried it, there was an issue and I couldn't do it. To fix that, you need to install the ffmpeg library on this particular system. For that, I run the commands apt-get update and apt install ffmpeg. Once I do this, I can send the path of an audio file directly to the pipeline; before, it was not working. That's a small change which I've made. The notebook itself demonstrates loading a dataset and transcribing it, but I'm not doing that: I've just set the audio path to my audio file, and then I pass that path to the pipeline. This is the first-time compilation of the pipeline with this audio, so it took close to two minutes four seconds when I ran it. Maybe I can run it again and show you it takes that much time.
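Before passing an mp3 or mp4 path to the pipeline, you can check whether the ffmpeg prerequisite is actually on the PATH. A small sketch (the helper name `ensure_ffmpeg` is my own, not from the notebook):

```python
import shutil

def ensure_ffmpeg():
    """Return True if ffmpeg is on PATH.

    The pipeline relies on ffmpeg to decode mp3/mp4 inputs, which is
    why the video installs it first on the Kaggle machine, e.g.:
        apt-get update && apt-get install -y ffmpeg
    """
    return shutil.which("ffmpeg") is not None

print(ensure_ffmpeg())
```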
But once you run it again, it takes only 6.76 seconds. Because I've run it previously, it is only taking 6.81 seconds now; the first time, it took close to two minutes. That is because of the cache function: we are caching the compiled model and reusing it. Then we can look at the transcription. The transcription is quite good: up to here, this was the transcription of my video, and it has done a very good job. You can also get timestamps for the same audio, and it does that really fast. So for this roughly three-minute video, it took close to 6.76 seconds. There is also an example of a 30-minute video from the dataset, which runs really fast; I'm not running that, but you can try it out. They say it takes only 35 seconds to transcribe those 30 minutes, timestamps included. To get timestamps, you pass the file name to the pipeline and set return_timestamps equal to true; you get the text, which is your transcription, and you can also get the chunks. For example, when I run it, it takes close to 3.77 seconds, and here you have your timestamps: if you print the chunks, you get each timestamp along with its text. So I've just made a slight modification to this Whisper JAX TPU demo to show how you can download a YouTube video, take out the audio stream, and transcribe it using Whisper JAX. This is really fast compared to the OpenAI Whisper model. You can try out this notebook; I will put a link to it, as well as the original notebook, in the description of the video. I hope this short video on Whisper JAX, the faster optimized implementation of Whisper, is useful for you.
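Assuming the pipeline returns chunks in the Hugging Face style, each with a `timestamp` (start, end) pair and a `text` field, they can be rendered as readable lines roughly like this (a sketch; `format_chunks` is a hypothetical helper and the sample texts are made up):

```python
def format_chunks(chunks):
    """Render pipeline chunks ({'timestamp': (start, end), 'text': ...})
    as '[MM:SS -> MM:SS] text' lines."""
    def mmss(t):
        m, s = divmod(int(t), 60)
        return f"{m:02d}:{s:02d}"
    return [f"[{mmss(s)} -> {mmss(e)}] {c['text'].strip()}"
            for c in chunks
            for s, e in [c["timestamp"]]]

chunks = [
    {"timestamp": (0.0, 30.0), "text": " This is Hrithesh Srinivasan..."},
    {"timestamp": (30.0, 61.5), "text": " Whisper JAX is a highly optimized..."},
]
for line in format_chunks(chunks):
    print(line)
# [00:00 -> 00:30] This is Hrithesh Srinivasan...
# [00:30 -> 01:01] Whisper JAX is a highly optimized...
```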
If you like the video, please like, share, and subscribe to the channel. See you in another video.
