Speaker 1: If you have been fascinated by Whisper for transcribing audio and video files and you are missing one feature, which is word-level timestamping, look no further. There is a new library called WhisperX. WhisperX helps you use Whisper and produces word-level timestamps for you, which means every single word will be aligned with the right timestamp. Why it is important, how they are doing it, and how to do it using a Google Colab notebook is exactly what we are going to see in this video. Welcome to One Little Coder. In this Whisper tutorial we are going to learn how to use WhisperX, a new library by Max Bain, to do word-level timestamping of the transcriptions that we produce with Whisper. So what is it? This repository refines the timestamps of OpenAI's Whisper model via something called forced alignment, using phoneme-based ASR models like wav2vec2. Whisper is overall a much better model than something like wav2vec2, but what Whisper does not do well is word-level timestamping, which means its timestamps can be off by several seconds. That can be avoided with forced alignment, a process that aligns the transcription to the audio recording to automatically generate phone-level segmentation, using another model, in this case a phoneme-based ASR model like wav2vec2. We are not going to get into the theoretical details of how this is done; we are going to get hands-on experience of how to do it for your own use case. I have prepared a Google Colab notebook for you. It will be linked in the YouTube description so you can get started straight away. The first step is to check whether you have got a GPU. If you do not, click Runtime, then Change runtime type, and select GPU as the hardware accelerator. Once you know you have a GPU, the next step is to install WhisperX.
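As a sketch of those setup cells (in Colab, prefix each shell command with `!`; the install source is the WhisperX GitHub repository, and the commands below are assumptions about your environment rather than the notebook's exact contents):

```shell
# Confirm a GPU is attached (in Colab: Runtime > Change runtime type > GPU);
# fall back to a reminder message instead of failing on CPU-only machines
nvidia-smi || echo "No GPU visible - switch the Colab runtime to a GPU accelerator"

# Install WhisperX straight from the repository (pulls in Whisper as a dependency)
pip install git+https://github.com/m-bain/whisperx.git
```

On a local machine you would also install FFmpeg yourself, e.g. `sudo apt-get install ffmpeg` on Debian/Ubuntu; Colab already ships with it.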
The next thing is to install Whisper. At this point you are good to go, but make one final check, which is to see if you have got FFmpeg. If you are using a Google Colab notebook, you most likely have FFmpeg installed already, but if you are doing this on your local machine, make sure that you also install FFmpeg. It is required for certain audio file format conversions. Next, when you run the cell, it is going to ask you to upload a file. I have got a file that I have uploaded. I am going to play this file first so you understand what it is. Let me play it. As you can see, this is a clip from the Joe Rogan podcast where another very popular podcaster, Lex Fridman, appeared, so this is a conversation between Joe Rogan and Lex Fridman. Just a warning: if you are below 18, this clip has got some swearing, so maybe skip it; but if you are a grown-up, just go ahead. As you can see, even the caption that somebody added for this YouTube Short is only a sentence-level caption. It does not highlight word by word; it shows the entire sentence at once. What we are going to do is take this clip and burn on a completely new set of subtitles that are highlighted word by word, or two or three words together, not just sentence by sentence. Word-level subtitle timestamping is what we are going to do. I want to show you the output first, how it looks, before we move forward with the code. So let me get the output for you, which is just coming right up. As you can see, the caption is placed on top of the video and the current word gets highlighted in green.
Speaker 2: It's set in the position where the moon is currently. That's awesome. Fuck yeah. Take that piece of shit stupid fucking frisbee you got on your wrist. You can't you can't get this back because you just did it on record. It's yours man. It's such a resistance. The ping pong watch. The fucking stupid thing that writes it and erases it every minute. So dumb. But that watch. That watch is my favorite watch. Yeah. My pleasure.
Speaker 1: As you can see, it has done a tremendous job of mapping every single word to the right moment, down to the millisecond. And you are going to learn how to do the same thing right now. It is a very simple CLI command. Invoke WhisperX and add the input file. After you choose the file and upload it, you will see the same file appearing here; copy the name and keep it handy, then add it as the input file name. Once you have the input file name, the next thing is the model parameter, which says which Whisper model you want to use. In my case the input is English, which means I am using the medium English model; but if you have got multilingual input, use the respective multilingual Whisper model. Next you define the output directory where you want the output to be saved. And then finally the alignment parameter, where you specify which alignment model to use. Once again, if it is an English video or audio clip, you can simply use the default one; but if it is a different language, use the respective wav2vec2 ASR model, which will do the alignment, the forced alignment, for that language. Then just run this. The first time you run it, it is going to download the model, but after that it is going to be fast. I am going to show you the files at the end of this process. It is going to save two files: one is the SRT, which is a normal subtitle file; the second one is the ASS file, which enables this colorful subtitling. If you have ever seen a movie in an MKV file, this is how subtitles are added there. So you can see how the subtitle works at the timestamp level: you can see the word "take" highlighted here, then the next word highlighted, then "that" highlighted, and so on. This is how the highlighting works.
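Putting those pieces together, the CLI call might look like the sketch below. The flag names are taken from the early WhisperX README, and the input file name and alignment model name are placeholders for illustration, so check `whisperx --help` in your install if anything has changed:

```shell
# Transcribe with the medium English Whisper model and force-align every word
# with an English wav2vec2 alignment model; writes the .srt and .ass files
# into the chosen output directory.
whisperx your_clip.mp4 \
    --model medium.en \
    --output_dir out \
    --align_model WAV2VEC2_ASR_LARGE_LV60K_960H
```

For a non-English clip, you would swap `medium.en` for a multilingual Whisper model and point `--align_model` at a wav2vec2 model trained on that language.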
And it is pretty good, at least for English, whenever I have tried it. So at this point we have successfully got the subtitles ready. If you want to stop the video here and go produce more subtitles, you are very welcome to do it. But if you want to burn the subtitles into the actual video file itself, say you are a creator and you want to add subtitles to your YouTube Shorts, Instagram Reels, TikTok, or whatever social media you upload to, then this is the section you need. You do not see a lot of text customization in this part; I am leaving that to you if you want to customize it. FFmpeg is a very popular library, so you will find a lot of help on customizing the text that we are going to burn onto the video. But if you just want a simple solution: you have got the ASS file ready, so invoke FFmpeg, which is a shell command, not a Python command (even in the case of WhisperX it is a shell command). Then pass -i and the input file, which is the .mp4, then -vf, where you give your ASS file, and then you give the output, in this case an .avi. Once you have this, run it. It is going to generate the output file, and you can download it and start using it in VLC or any media player that can play a video file with an ASS subtitle burned into it. I am going to show you this in real time for another video, so you can see how to do it. So let us go ahead and run the cell again. This cell is going to ask me to upload the file, so I am going to go here and pick the file. The file name, as you can see, is new_video.mp4. Right now in this Colab you need to hard-code the file name. I did not want to delay the video, so that you can start using it, but I might improve this Colab in the coming days to make it very easy for anybody to use. For now, you need the video file name.
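As a concrete sketch of that burn step (the file names are placeholders, and deriving the output name from the input name is my own convention to keep the extensions consistent, not something the notebook enforces):

```shell
in="new_video.mp4"
sub="new_video.ass"        # the ASS subtitle file WhisperX produced
out="${in%.*}_out.avi"     # strip the extension, append _out, save as .avi

# Burn the word-highlighted ASS subtitles into the video itself.
# Guarded so the command only runs when the input file actually exists.
if [ -f "$in" ]; then
    ffmpeg -i "$in" -vf "ass=$sub" "$out"
fi
```

The `-vf "ass=..."` video filter renders the subtitles into the frames, so the styling (including the green word highlighting) survives in any player, not just ones that support soft ASS tracks.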
Copy the video file name, which is new_video.mp4, and paste it here. You will also need to paste it again later where the output file name is mentioned. Now run this. At the end of this process, it is going to save two files: one is the SRT file, the second one is the ASS file. I am going to show you the clip first so that you know what it is.
Speaker 3: What do you think happens when we die Keanu Reeves? I know that the ones who love us will miss us.
Speaker 1: As you can see, this is the clip that we are going to transcribe, and it is doing a great job. "What do you think happens when we die, Keanu Reeves? I know the ones who love us will miss us." And now the alignment is being performed. Let's see how well it does. Once again you give the input video name, then you give the ASS file name, and then you give the output. Just make sure when you give the output that you are not mismatching the extensions. Now run this. It is going to create the output file for us, which in this case is called new_video_out.avi. Let me refresh, and we have already got the output file. So I am going to download the file and save it. Once I get it to my local machine, I am going to right-click the file and open it with the VLC player. What do you think happens when we die, Keanu Reeves?
Speaker 3: I know that the ones who love us will miss us.
Speaker 1: And as you can see, it was quite accurate at the word level. You can even watch Keanu Reeves's lip movement and see how the words line up. This is not quite possible with Whisper out of the box, but thanks to Max Bain and the new WhisperX library, it is. I hope this video tutorial was helpful to you in getting word-level timestamping for your subtitling and transcription using Whisper and WhisperX. I will link this Google Colab notebook in the YouTube description. If you want any extension of this project, please let me know in the comment section. Otherwise, give a shout-out to Max Bain for this amazing repository, which might look simple but is doing a tremendous job, and a lot of people have been asking about it. So thanks to Max, and thanks for watching this video. See you in another video. Take care. Stay safe.