WhisperX: Fast Whisper Transcription With Word Timestamps (Full Transcript)

A walkthrough of WhisperX features, its VAD+batching+alignment pipeline, Windows installation steps, benchmarking, and common troubleshooting tips.

[00:00:00] Speaker 1: Hello everyone. Today we'll be talking about WhisperX, which is another variant of the OpenAI Whisper models for speech recognition, or speech-to-text. The idea here is that we have a new library that provides fast automatic speech recognition with word-level timestamps and speaker diarization. My name is Rami Koshaba and I'll be your presenter for today. Large-scale, weakly supervised automatic speech recognition models like the OpenAI Whisper models have been around for a couple of years, and they have shown impressive results on speech recognition across domains and languages. However, there are still some limitations with the Whisper model. First, for example, the predicted timestamps that correspond to each utterance are prone to inaccuracies. In addition to that, word-level timestamps are not available, so you have to process the audio further to get them. And then the application to long audio via buffered transcription prohibits batch inference; we previously discussed in the other videos the importance of batch inference and how it can speed up the overall ASR task. So today we look at WhisperX, which has been proposed as a system for efficient transcription of long-form audio with accurate word-level timestamps. The main components, in addition to the Whisper model itself, include the pre-segmentation of the input audio: they utilize a voice activity detection (VAD) model to separate the input audio into silence and speech segments, ignore the silence, and focus on the speech only. Then they do a sort of cut and merge of the resulting VAD segments into chunks of approximately 30 seconds. As you may know, the original OpenAI Whisper model actually utilizes a sliding-window approach, where it processes consecutive 30-second windows of audio and feeds each of these through a short-time Fourier transform to get the time-frequency energy map.
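The cut-and-merge step described above is straightforward to sketch in plain Python. This is an illustrative toy, not WhisperX's actual implementation: the `cut_and_merge` name is my own, and the input is assumed to be a list of (start, end) speech spans in seconds produced by a VAD model.

```python
def cut_and_merge(speech_segments, max_len=30.0):
    """Merge adjacent VAD speech spans into chunks no longer than
    max_len seconds (Whisper's 30 s input window), splitting any
    span that is longer than the window on its own."""
    chunks = []
    cur_start, cur_end = None, None
    for start, end in speech_segments:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            cur_end = end  # merge: the whole region still fits one window
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
        # cut: split an open region that exceeds the window by itself
        while cur_end - cur_start > max_len:
            chunks.append((cur_start, cur_start + max_len))
            cur_start += max_len
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks
```

The resulting fixed-size chunks contain speech only, which is what makes the batched 30-second inference described next possible.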
And the third part is the post-alignment with an external phoneme model to provide accurate word-level timestamps. The unique features, as advertised on the GitHub repository of WhisperX, include batch inference for 70x real-time transcription using Whisper large-v2. So they specifically use large-v2, and they claim that with WhisperX you can get speeds up to 70 times the original OpenAI model. The second point is the faster-whisper backend. So WhisperX is built on the faster-whisper backend, which is a model that we have considered in a previous video. You also get accurate word-level timestamps using wav2vec 2.0 alignment; this is the phoneme model that we mentioned in the previous slide. And then you also have multi-speaker automatic speech recognition using speaker diarization from the pyannote-audio library. This is where the output of the model not only tells you that in this segment of time, starting at x1 and finishing at x2, this is the text that has been said, but it also tells you it has been said by speaker 1 or speaker 2 or speaker 3 and so on. And then you have the VAD processing, the voice activity detection, which reduces hallucination, and batching with no word error rate degradation. So in the overall diagram, as you can see, the input audio is submitted to voice activity detection. Silence segments are identified and removed. So you cut and merge, and then you batch-process the remaining pieces of audio into 30-second windows, submit these to Whisper to do the transcription and to the phoneme model, and then you force an alignment between the outputs of the two models, and that's how you get the word-level timestamps. It will tell you the timestamp of every single word in the audio recording. Now, a set of requirements is needed for installing the WhisperX library, and these are pretty much the same requirements that I've mentioned in the previous videos with WhisperS2T and Faster-Whisper.
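To make the idea of word-level timestamps concrete before the demo, here is a deliberately naive sketch that just spreads a segment's words across its time span in proportion to character length. WhisperX does something much better, forced alignment against wav2vec 2.0 phoneme probabilities; this only illustrates the shape of the output, and `naive_word_timestamps` with its dictionary keys is my own hypothetical naming.

```python
def naive_word_timestamps(segment):
    """Toy stand-in for forced alignment: distribute a segment's words
    over [start, end] proportionally to their character length.
    segment is a dict with "start", "end", and "text" keys."""
    words = segment["text"].split()
    total_chars = sum(len(w) for w in words) or 1
    span = segment["end"] - segment["start"]
    out, t = [], segment["start"]
    for w in words:
        dur = span * len(w) / total_chars
        out.append({"word": w, "start": round(t, 2), "end": round(t + dur, 2)})
        t += dur
    return out
```

A real aligner places each boundary where the phoneme model's evidence says the word actually starts and ends, rather than interpolating; the data structure it returns is essentially the same.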
They include FFmpeg, NVIDIA CUDA, and PyTorch. FFmpeg is a complete cross-platform solution to record, convert, and stream audio and video. This is the tool that the WhisperX models and variants utilize to downsample the input audio signal to 16,000 Hz, that is, 16,000 samples per second, and then process the downsampled version. Then you have NVIDIA CUDA, which is the parallel computing platform that allows you to run your code on the GPU, and PyTorch is the machine learning library with which you write your own code and then utilize NVIDIA CUDA to run it on the GPU, basically. So these are the main components, but there could be other prerequisites, usually listed in a requirements.txt file in Python, that you need to install to get WhisperX up and running. So we'll stop at this section, and then we'll head to the Anaconda Prompt and start installing the tool itself. Now, as we mentioned, we start by installing the prerequisites, beginning with the first one, which is FFmpeg. We head to ffmpeg.org; that's the first address I got. Click on the download button. In my case, because I'm running on Windows, I'm selecting the Windows operating system, Windows builds by BtbN. Then, heading down the list of files to the one with the biggest size, which was created 10 hours ago, we download this specific file. The download has started. Okay, open the Downloads folder, because what I'm going to do is extract this zip file. So extract it here in the same location. Okay. Rename the folder to just FFmpeg, then cut this folder and put it somewhere else. In my case, I've already done that and put it into FFmpeg on my C drive. So you click into it, you click into the bin folder, you copy the bin folder address, and then you head to your Windows system settings.
So basically, all you are doing is downloading an exe and then letting Windows know about its location. That's what we are doing. So, in the environment variables, I head to the Path variable, I edit it, add a new entry, and then paste the address there. However, I've done that yesterday, so you can see the address over here, FFmpeg slash bin, is already there. So my system should now recognize FFmpeg from anywhere, from any environment, and here is the proof: I head to the Anaconda Prompt, and whether I'm in a virtual environment or in the base, if I type ffmpeg, it knows the command and expects me to give it some parameters or arguments. As you can see, it identified the command successfully. So going back to the list of requirements, FFmpeg is a tick. Now we need NVIDIA CUDA and PyTorch, and then we will be able to install WhisperX. I'll close this one and open it again later when we need it. Let's head to WhisperX and first see the GitHub repo. So that's the GitHub repo; let me have a look at it first before doing anything else. These are the points that we mentioned in the presentation. And the setup of WhisperX is pretty simple. They say what you need here is PyTorch 2: it's tested for PyTorch 2.0 and Python 3.10, other versions at your own risk. So from this, it's clear that the version of Python we have to install is 3.10. And then all you need to do is create an environment, install PyTorch with CUDA support, and then install the code itself. So now we go back to the Anaconda Prompt; we know what to do.
The first thing is to create a virtual environment because, as we did in the previous videos, you create a virtual environment and chuck everything inside it so it will be safe and you avoid cluttering your system with different library versions. So conda create, I will call it whisperx, and the Python version, as indicated in the GitHub repo, should be 3.10. And I'm going to accept the yes/no question. The creation of the virtual environment should be pretty fast, and it's done already. Now we are still in the base environment; we need to activate the whisperx environment. Think of it like building a house: you build the house, and then you go inside it and make your changes, bring furniture and everything. It's a similar process here. You create a virtual environment, then you need to go inside that virtual environment. So you say conda activate whisperx, and now we are inside the virtual environment. If I head back to the GitHub repo, it says create the virtual environment and activate it; you can follow the commands over here. Then you need to install PyTorch, for Linux and Windows, with CUDA 11.8. So copy this command, and now that you are inside whisperx, paste it over here, hit Enter, and let it install. Now that it has installed, we have FFmpeg, NVIDIA CUDA, and the required PyTorch version. What's left for us to do is to install the actual library. So you install this repo, and this is how you get WhisperX installed, basically. Now that WhisperX has been installed, we are ready to run it. But hang on, where are we going to write our code? That's where we start Anaconda, and this is what we have done in the previous videos as well. You start Anaconda and then install Spyder, which is your IDE. That's where you write your code, debug it, and check everything.
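Before installing WhisperX on top of the fresh environment, a quick sanity check can confirm that the CUDA-enabled PyTorch build actually landed. This small sketch assumes nothing beyond PyTorch's public `torch.cuda.is_available()` call, and it degrades gracefully if torch is absent:

```python
import importlib.util

def cuda_ready():
    """Report whether PyTorch is installed and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check itself never crashes
    return "cuda available" if torch.cuda.is_available() else "cpu only"

print(cuda_ready())
```

If this prints "cpu only" after installing the CUDA 11.8 wheel, the wrong (CPU-only) PyTorch build is in the environment and WhisperX will fall back to the CPU or fail when asked for the "cuda" device.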
Or you can do it using Notepad or something and then the command line, but let's do it through the Anaconda approach. So just like we activated the environment in the prompt, where we said conda activate whisperx, we need to do something similar in Anaconda using the graphical interface. Here it's much faster: you just drop down the list, see which environments you have, and select the one you want, which is whisperx for this specific video. Then we head down and install Spyder from here. So this is installing now. Spyder has installed, and we are ready to launch it. So we launch Spyder and we are ready to start writing our code. We head back to the WhisperX repository, where you will actually find an example of how to run WhisperX. I'm just going to copy it from there, bring the code here, Ctrl+A to select everything, Ctrl+V to paste the code, and let's have a look at it. So what we are doing here: we import whisperx, we import gc, which is the garbage collector, and then we start specifying the parameters. We are going to run the code on CUDA. For the audio file name, in this case, I have to bring my own audio. I have the same folder that I've utilized in the previous videos, so go to C, to temp, and I have a bunch of files over here. I'm going to take this one, for example, rocket versus mini rocket, which is one of my own audios. So I'll say C slash temp slash rocket versus mini rocket. Then there's the batch size, which you can reduce if you are low on GPU memory, and the compute type, which they defaulted to float16; change it to int8 if low on GPU memory. So I can change it to int8, run it, and see what happens. Now the remaining part is loading the WhisperX model of interest. Over here they have, by default, selected large-v2, and that's where they say they achieve a lot of speed in comparison to the original Whisper.
However, because I'm running on a laptop, I start by checking the small.en model, as we did in the previous videos, because the large version 2 and version 3 are each a few gigabytes in size, and I'm not going to wait for that to download here. Take out the commented code; I'm not interested in that. And then the remaining part is doing something with the audio: it says load the audio file and then start transcribing. So this is the part where we import time, and let's benchmark how much time it takes us to actually transcribe the file. Okay, so start equals time.time, and after that I put end equals time.time again. Okay, and then let's modify this print statement. I don't want to print the results; I can print an f-string on a new line, which says transcription time equals end minus start. And this is how we get to see the code running. I know we have some issue over here, but before looking into the issue, let's comment out that part of the code. Let's first just make sure that the code runs successfully and we don't have any troubles. So click Run and let's see what happens. It's going through, looks like it's going to continue. However, we had an error. So what happened over here? The first problem that we suffered from was "failed to load audio, FFmpeg" something. So it looks like it's something related to FFmpeg. However, we verified before that it wasn't really FFmpeg, because we can run FFmpeg from the command line. So what is the problem? And the problem, I believe, was that we didn't put the raw string prefix on the address so Python understands where the file is exactly. So let me save this; let me call it call_whisperx, just like we did with the previous audio files. Run it again. Okay, let's see how it goes this time. It looks like it's progressing. It didn't fail, but it did give us a warning message that we will look into very soon.
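The timing instrumentation added here boils down to the usual `time.time()` sandwich around the call being measured. Sketched as a reusable helper, where the `timed` name and the print format are my own; in the actual script the wrapped call would be the model's transcribe method:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs), print the elapsed wall-clock time,
    and return fn's result unchanged."""
    start = time.time()
    result = fn(*args, **kwargs)
    end = time.time()
    print(f"\nTranscription time = {end - start:.2f} s")
    return result

# Hypothetical usage in the demo script:
#   result = timed(model.transcribe, audio, batch_size=batch_size)
```

Wall-clock timing like this includes model warm-up and disk I/O, which is fine for the rough before/after comparisons done in this video.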
However, first of all, let's just make sure it's actually running. We can also look at the Task Manager and monitor the GPU. The GPU is being heavily utilized for a while, then it's released. So let's run it again and see. 22 seconds. So 22 seconds to transcribe the rocket versus mini rocket audio file that I have. And if I go back here to the C temp folder, the rocket versus mini rocket file is actually around 42 minutes and 55 seconds, so almost 43 minutes. And we are transcribing that in less than half a minute, in 22 seconds only. So now we know it's working, and we can see the results as well. The results are put in a list, and the list is made of a bunch of dictionaries. Each dictionary has a number of keys and items: the keys are the starting time, the ending time, and the actual text being said. If you want to verify whether this is really correct, which you can do, you can actually go back here and start the audio file. "Hello, everyone. Today we will be talking about random convolutional kernels for time series classification." So it looks like it's doing a good job; you can verify that by checking more segments. If you look over here, the dictionaries represent the different audio segments, the different portions of the audio file: each one of them, where it starts and where it finishes, in seconds, and what's the text being said in that portion of the audio file. And that's the result you get from the model when you transcribe. So now we know it's running. The other thing we can look at is the memory. As with the previous models, the NVIDIA memory wasn't released for some reason, and you should know that reason by now from the previous videos: it's CUDA not releasing the memory because of context retention.
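The speed measured here is easy to check: a 42 minute 55 second file transcribed in 22 seconds (with the small.en model on this laptop) works out to roughly 117x real time:

```python
# 42 min 55 s of audio transcribed in 22 s of wall-clock time
audio_len_s = 42 * 60 + 55   # 2575 seconds of audio
transcribe_s = 22            # measured transcription time
print(f"real-time factor ≈ {audio_len_s / transcribe_s:.0f}x")
```

That figure is for a small model on laptop hardware, so it is not directly comparable to the 70x claim the repo makes for large-v2, but it shows how much headroom the VAD-plus-batching pipeline gives.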
So what you need to do is use the garbage collector to clean the memory and the cache, basically, and then put the whole thing inside a separate process. So we can do that as well. First of all, here are the commands being utilized there: they delete the model, call gc.collect, and then torch.cuda.empty_cache. The first error we are getting here says undefined torch, of course, because we haven't imported torch. So import torch; that should make the error go away. And then there's another problem over here, which is not an error; it's just telling you "redefinition of unused gc". You're importing gc again. Why do you need that? You imported gc at the beginning of the code, so you only need to import it one time, which is there. So now our code looks good. I'll clear this and start my Task Manager again to monitor the, oops, sorry, monitor the GPU again. Some base level of memory is utilized in the GPU for some reason; it could be Windows doing something in the background. I'll start the code and let it run again. I know we still have a warning that we haven't looked at yet, but let's just monitor this and see if this is enough or whether we need to put it in a separate process, basically. So that's the warning that we got earlier. The GPU is being utilized heavily, the process will finish soon, and we need to see some memory released. If not, then we can always follow the same approach as we did in the previous videos. So 20 seconds this time, even faster. It released some memory, but not all of it. So we need to put it back in the same format. Before we focus on that part, we can look at one more thing. We mentioned previously that the Whisper model is also capable of giving you word-level timing. So this is where they do "align whisper output". Let me bring this code portion in now.
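The cleanup discussed here, in the order the repo example uses it, can be sketched as a standalone helper. The `free_model` wrapper is my own naming, and the torch import is guarded so the snippet runs even on a machine without the GPU stack installed:

```python
import gc

def free_model(model):
    """Drop this reference to the model, force a garbage-collection
    pass, and ask CUDA to release its cached allocations."""
    del model          # removes this function's reference to the model
    gc.collect()       # reclaim the now-unreferenced Python objects
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached GPU memory to the driver
    except ImportError:
        pass  # torch not installed; nothing GPU-side to clean
```

Note that `del` inside a function only drops that scope's reference; in the actual script the `del model` has to happen in the scope that owns the model, and even then the CUDA context itself lingers, which is why the video moves to a separate process next.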
So what's happening over here: they load the WhisperX alignment model, align, and then give back the results. You saw the results; they were actually section-based, not really word-level. So now, because we had the code already running up to there, I'm just going to run these couple of lines and observe what happens to the result in this case. So now this is ready; let's look at the new result. What you have now is the segments again: starting time, ending time, and the text being said. But if you go inside each of these, you now have it at the word level. So "hello", okay, then "everyone", word by word. This is insane, too much detail; I don't really need it for my own purposes. For a typical transcription application, sections should be enough. But if you want to attach this transcription to a video file, so that the video plays with the actual word being said at that specific time highlighted, then you need the word-level transcription; it depends on your application. But as you can see, it's working: it's giving you the word level with starting and ending times, and so on. So it is actually doing what it was advertised to do. Beautiful. For the remaining parts over here, what you can additionally do is assign speaker labels, and for that you need the diarization pipeline. For this specific purpose, you will need a token that you can acquire from Hugging Face. So you need to go to Hugging Face, register, create an account, get a token, and put the token back here, so that the next time it tells you that this segment, this text, has been said, it will also tell you who said it: is it speaker one, speaker two, speaker three, and so on.
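The word-level structure being browsed here can be pictured with a hand-made example. The values below are illustrative, not real model output, the exact keys can vary between WhisperX versions, and `iter_words` is my own helper for flattening it:

```python
# Illustrative shape of an aligned result: segments, each carrying
# its own per-word timings (made-up numbers, not real output).
aligned = {
    "segments": [
        {"start": 0.0, "end": 2.1, "text": "Hello everyone.",
         "words": [
             {"word": "Hello", "start": 0.0, "end": 0.6},
             {"word": "everyone.", "start": 0.7, "end": 2.1},
         ]},
    ]
}

def iter_words(result):
    """Yield (word, start, end) triples across all segments."""
    for seg in result["segments"]:
        for w in seg.get("words", []):
            yield w["word"], w["start"], w["end"]

for word, start, end in iter_words(aligned):
    print(f"{start:6.2f}-{end:6.2f}  {word}")
```

Flattening like this is exactly what you would feed a karaoke-style subtitle renderer, the video-highlighting use case mentioned above.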
For today's video, we don't really need that part, so I can take it out from here, because I have another interesting thing to show you, which is a benchmark process for WhisperX against the other libraries. So I'm going to make sure that our code here for calling WhisperX is placed in the same format as the other codes. What I'm going to do, to save your time, is copy in a main function: basically a process that runs a function, starts the process, and joins it. So this will be my function; I'm going to define this as my function, and everything will go inside it. In this way, when we check the memory as the code runs, we guarantee that the memory will be released to the same level it started with. We do have a small error over here, which we need to look at and see what it is about. It doesn't know where Process is coming from, so of course we have to say from multiprocessing import Process. Okay, save the code. "Local variable result is assigned but never used": yeah, I'm not going to do anything with that right now; I just need to time it, benchmark it. So we can run the code again now and observe what happens in terms of memory, whether the memory is going to be released or not. So focus on the memory right now; we still have one small issue with that warning we talked about and haven't tackled yet. Let's just look at the memory and make sure it's going to be released to the same level that we started with before running WhisperX. Okay, and this is what we have done for the previous Whisper variants, to make sure they all run in the same way. There you go. Now it took 19 seconds, and it released the memory to the same level it started with. Excellent; this is what we wanted. Now close that memory window and come back here. Let's see what the problem is. It says TF32 (TensorFloat-32) has been disabled, as it might lead to reproducibility issues and lower accuracy.
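The restructuring described here, wrapping everything in a function and running it in a child process so the CUDA context dies with the process, follows the standard `multiprocessing` pattern. In this sketch `my_function` is an empty stand-in for the code that loads and runs WhisperX:

```python
from multiprocessing import Process

def my_function():
    # In the real script: load the model, transcribe, print timings.
    # When this function returns and the child process exits, the OS
    # reclaims everything the process held, including GPU memory.
    pass

def main():
    p = Process(target=my_function)
    p.start()
    p.join()  # wait for the child; its CUDA context dies with it

if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__":` guard is required on Windows, where child processes are spawned by re-importing the main module; without it the script would recursively launch itself.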
If you really want to re-enable it, you can follow these steps, and you can read more about it at this address. So let's bring up this address and see what's going on over here. I've actually visited this link before. You can read what people have been saying over here: diarization pipeline output variability. So there's variability in the output when you actually use TF32. There is an issue there, and the solution right now would be to copy these two lines, which you can put somewhere over here: torch.backends.cuda.matmul.allow_tf32 set to False, and the same thing for cuDNN. Close this and rerun the code again; you shouldn't see that warning message anymore. So this should get you up and running with WhisperX. You can now bring your own files, benchmark them against the other libraries, and do whatever you want. But for me, the main thing that I wanted to show you is how to bring WhisperX up and running, and then we can start really doing some benchmarking against the other libraries ourselves. So let's just wait for this to finish, and then we will proceed with the benchmark process. There you go: 19 seconds for the rocket versus mini rocket file. Now that we have everything ready, let's go to the next step of the video.
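The two settings from that issue thread can be applied defensively like this. The `disable_tf32` wrapper is my own; the attribute paths `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32` are PyTorch's real TF32 switches:

```python
def disable_tf32():
    """Force full-precision float32 matmuls/convolutions for
    reproducibility, at some speed cost on Ampere-class GPUs.
    Returns True if torch was found and configured, else False."""
    try:
        import torch
    except ImportError:
        return False  # torch not installed; nothing to configure
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    return True

disable_tf32()
```

Placing this near the top of the script, before any model is loaded, keeps the diarization and alignment outputs deterministic across runs and silences the warning.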

AI Insights
Summary
The speaker introduces WhisperX, a Whisper-based ASR system designed for long-form transcription with faster batch inference, accurate word-level timestamps via forced alignment, and optional speaker diarization. They explain Whisper’s limitations (imprecise segment timestamps, no word-level timings by default, buffered long-audio constraints) and outline WhisperX’s pipeline: VAD-based pre-segmentation, cut/merge into ~30s windows for batching, Whisper transcription, and Wave2Vec2-based alignment to produce word timestamps; diarization is available via pyannote. The demo covers Windows installation prerequisites (FFmpeg, CUDA, PyTorch), setting up a Conda environment with Python 3.10, installing PyTorch with CUDA 11.8 and WhisperX, then running an example script in Spyder. They troubleshoot an FFmpeg/audio path issue by using a raw string path, benchmark transcription speed (~19–22s for a ~43-minute audio), inspect segment outputs vs word-level outputs after alignment, address GPU memory not releasing by wrapping execution in a separate process and using gc/torch cleanup, and remove a TF32 warning by disabling TF32 for reproducibility.
Title
WhisperX Overview, Installation, and Benchmark Demo
Keywords
WhisperX, OpenAI Whisper, ASR, speech-to-text, word-level timestamps, forced alignment, Wave2Vec2, voice activity detection, VAD, batch inference, long-form audio, diarization, pyannote, FFmpeg, CUDA, PyTorch, Conda, Spyder, Windows setup, TF32, GPU memory management, multiprocessing
Key Takeaways
  • WhisperX improves Whisper for long-form audio via VAD segmentation, batching, and alignment for accurate word-level timestamps.
  • It can achieve very fast transcription (claimed up to 70x realtime with Whisper Large v2) by enabling batch inference and using a faster backend.
  • Word-level timestamps are obtained by aligning Whisper output with an external phoneme/CTC model (Wave2Vec2).
  • Speaker diarization is supported through pyannote but requires a Hugging Face token.
  • Windows setup requires FFmpeg on PATH plus a CUDA-enabled PyTorch install; WhisperX suggests Python 3.10 and PyTorch 2.
  • If audio loading fails on Windows, ensure correct file paths (often use raw strings) and FFmpeg availability.
  • GPU memory may not fully release due to CUDA context caching; wrapping inference in a separate process plus gc/torch cleanup helps.
  • Disabling TF32 can remove reproducibility/accuracy warnings in the diarization/alignment pipeline.
Sentiments
Neutral: Informative, instructional tone focused on explaining features, installation steps, troubleshooting, and benchmarking; minimal emotional language aside from brief positive remarks when things work.