Speaker 1: Welcome back to the channel. In a previous video, I showed you how to use Whisper to transcribe speech. You provide an audio file as input and, as a result, you get the transcribed text. A common request I saw in the comment section of that video was how to do multi-speaker identification. That process is called speaker diarization, and according to Wikipedia, speaker diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. This is exactly what I'm going to show you how to do in this video. On our Discord server, there were a number of recommendations for different tools to use for this process. Specifically, a user who goes by the name wik49 suggested quite a number of different tools which use Whisper along with other speech-processing techniques to identify speakers, and I personally liked the WhisperX project, so that's the one I'm going to be using in this video. I'll walk you through a step-by-step process of how to do speaker identification and speech transcription. But before that, you might want to check out the Prompt Engineering Discord server; the link is going to be in the description.

Here's the official GitHub repo of the project. It's called WhisperX and it already has around six and a half thousand stars. Now, how is WhisperX different from OpenAI's Whisper? It does quite a few more things than simple transcription, and one of them is speaker identification, which it does by using another awesome Python project called pyannote.audio, built specifically for speaker diarization. In this video, I'll show you how to use that in conjunction with WhisperX. WhisperX also does some additional steps on top of the normal transcription with Whisper, so let's have a quick look at how this works. First, you have your input audio. Then WhisperX runs voice activity detection on top of your original speech signal: it looks for silent segments within your audio and removes them, so you get different segments within your audio. These segments are then batched together; you want to batch multiple audio segments together for processing so that you can fully utilize the GPU. Then, for each of the segments created in the previous step, it uses Whisper to transcribe that segment. There's another model that aligns the start and end of each word, so you get better alignment, or better annotation of the timestamps, for each word. I'll put a link to the repo.

Now let's go through the installation process, and after that I'll show you how to use this in your own projects. In order to use this, you will need PyTorch 2.0 and Python 3.10, and according to the authors, use other versions at your own risk. For this to work, you will also need an NVIDIA GPU. I'm going to show you how to use this on a T4 GPU from Google Colab, but if you have a local GPU, you can run it locally. If you're doing the installation on a local machine, first you need to create a Conda virtual environment: use conda create, provide the name of the virtual environment, and specify that you want Python 3.10.0. Then you activate the virtual environment and install the corresponding PyTorch package, torchaudio, and the matching CUDA dependencies in that specific virtual environment.
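For reference, a local setup along the lines described above would look roughly like the following. The versions follow the repo's instructions at the time of the video, so treat them as an example rather than a pinned requirement:

```bash
# Create and activate a Python 3.10 environment (per the WhisperX requirements)
conda create --name whisperx python=3.10
conda activate whisperx

# Install PyTorch 2.0, torchaudio and a matching CUDA build
conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

# Install WhisperX directly from the GitHub repo
pip install git+https://github.com/m-bain/whisperX.git
```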
Now, since I'm going to be running this within a Google Colab notebook, I'm going to be using the pip install method. Basically, you have the pip install command and then you provide the GitHub repo ID. So let me show you how to do that. Here I'm working in a Google Colab notebook, so I'm using pip install with the -q flag. You can ignore this part and then provide the repo ID, and this will download and install all the required packages for you. After that, we need to import the whisperx package as well as the gc package. Next, we define which device we're going to be using. I'm going to be running this on an NVIDIA GPU; this Google Colab has a T4 GPU, which is why we're going to be using CUDA. You can also use the CPU if you don't have a GPU. For batch size, I'm going to use 16, which is the default batch size they recommend, and we're going to be using float16 as the compute type. First I'll show you how to do speech-to-text transcription, and after that I'll show you how to use the pyannote package to do speaker diarization as well.

First, we need an audio file. What I did here was go to the Files section of the Google Colab notebook, right-click, then click on Upload, and I uploaded this neural.wav file. This is the audio of one of my recent videos, and I'm going to be transcribing that first. Next, we need to load the model. I'm using the whisperx object, then calling the load_model function on it. I'm using the large-v2 version of Whisper. You can use large-v3 as well, but if you don't specify the language of the audio, it's better to use the large-v2 version. Then we provide the device on which we're going to be running this, in this case the GPU, and the compute type that we need. This will download the model files for us. Now, you will see these warnings, which say the model was trained with pyannote.audio 0.0.1, yours is 3.1.0, and bad things might happen unless you revert pyannote.audio to 0.x. In my experience, I didn't really see anything bad happening, so I'm just sticking with it. If it's not performing well, just downgrade the version to the one specified here. Next we need to load our audio. The way you do it is to call the load_audio function on the whisperx object; in this case we are providing the audio file name, so you get the audio loaded as a waveform. Then you can use the transcribe function. Again, we provide the audio that was just loaded and then the batch size. The batch size I'm using is 16; you could try a much bigger batch size depending on how much VRAM you have. As a result, you will get different segments. So if you see here, I have this transcription, which says we have a new leader on the 7 billion parameter models on the Hugging Face leaderboard, and this one is from a surprising company. The transcription that I'm getting out of it is actually pretty accurate. Here I'm looking at the actual results. It has a text key, which contains the whole text of the audio, then it has the start and end timestamps, and it automatically identified the language; in this case, the language is English. Now, these are chunk-level transcriptions. Sometimes you're interested in getting the word-level transcription, and you can do that here as well. That is basically aligning the words, and there is a special model that WhisperX uses for this. So let me show you how to do that.
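Put together, the transcription part of the notebook looks roughly like the sketch below. It follows the WhisperX README-style API shown in the video (whisperx.load_model, whisperx.load_audio, model.transcribe); exact signatures may vary between versions, and neural.wav is just the example file from this notebook:

```python
# In Colab, install first with: !pip install -q git+https://github.com/m-bain/whisperX.git
import gc           # handy for freeing GPU memory between models
import whisperx

device = "cuda"            # "cpu" also works, just slower
batch_size = 16            # lower this if you run out of VRAM
compute_type = "float16"   # e.g. "int8" for smaller GPUs or CPU

audio_file = "neural.wav"  # file uploaded to the Colab Files section

# Load the Whisper large-v2 model through WhisperX
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# Load the audio as a waveform and run batched transcription
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

print(result["language"])  # detected language, e.g. "en"
print(result["segments"])  # chunk-level segments with text, start and end times
```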
Now, this is a two-step process. First you get the initial transcription from the WhisperX model, then we pass that through another model, which is called the align model. The align model accepts the language as an input, so from the results we previously got, we provide the detected language code. We want to run this on a GPU, so we provide the device type here, and this returns two things: one is the alignment model itself, and the second is the corresponding metadata. Then we need to use the align function from WhisperX, and the inputs are the initial segmented results, which is basically the transcription; then the alignment model; then the metadata corresponding to that specific model; the original audio, which is basically the speech; then the device, since we want to run this on our GPU; and finally we don't want alignment for each character, but alignment for each word. That will give us the result, which is basically two things: the sentence-level transcription and the word-level transcription as well. In more detail, here's how it looks. You get a transcription for each sentence that the model is able to identify, so you have the start and end time for each sentence, and then for each of the words in the sentence, you have the corresponding start and end timestamps. This is pretty amazing.

Okay, now in this next section I'll show you how to identify multiple speakers in an audio file and transcribe the audio in such a way that you assign each speaker a different speaker ID. Again, we're doing exactly the same things: defining the device type, the batch size, as well as the compute type we're going to be using. For audio, I'm using the Lex Fridman podcast with Sam Altman. This is, I think, more than two hours of audio with two speakers, so I want to use it and see how good the output is going to be. We load that audio using the load_audio function from WhisperX. Now, we're going to be using the diarization pipeline within WhisperX. In the background, this is going to be using the pyannote package, but for this we need to provide our HuggingFace token. So let me show you where to get that. Go to your HuggingFace account, click on Settings, then from the left-hand side click on Access Tokens, and then you can either create a new token or use an existing one. I'm going to simply copy this. So I pasted my token in here and it's downloading the model, but let me show you a couple of issues that you can run into. When you run this for the first time, you might run into something like this, which will throw an error. In this case, I'm running into an issue with pyannote/segmentation-3.0: you need to agree to share your contact information to access this model. So simply log into your account and request access to the model, and after that you will be able to use your HuggingFace API access token. Okay, so I had to make a couple of changes to the code. First, I reduced the batch size to four because of the GPU RAM usage. Second, the original file I was using is about two hours long, so that was going to take a while; that's why I replaced it with a shorter version, taking just the first seven minutes. Then I loaded that audio again, loaded the transcription model again (we are using large-v2), and after that I transcribed it using exactly the same code that I showed you before, with word alignment.
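Here is a minimal sketch of the alignment step just described, continuing from the result and audio variables above. The function names follow the WhisperX README (whisperx.load_align_model and whisperx.align), though newer releases may differ slightly:

```python
# Load the alignment model for the language detected during transcription
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)

# Align the transcription to get per-word start/end timestamps
result = whisperx.align(
    result["segments"],            # initial chunk-level transcription
    model_a,                       # alignment model
    metadata,                      # metadata for that model
    audio,                         # the original waveform
    device,
    return_char_alignments=False,  # word-level, not character-level, alignment
)

# Each segment now carries sentence-level and word-level timestamps
print(result["segments"][0]["words"])
```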
So here is the result. For example, here's the first sentence; this is spoken by Sam Altman. And then there is the word-by-word transcription as well. Okay, so this is the result that we are going to feed into the second-stage model, which is going to do speaker identification based on the original audio. Now, I already showed you the steps. We are creating a diarization pipeline, and in this case we are providing our Hugging Face API token. So this is the diarization model. The next step is speaker identification, which is based solely on the audio signal; it doesn't have anything to do with the transcription, and that's why, when we call this diarization model, we provide the original audio. Then you can provide two optional parameters: one is the minimum number of speakers, the other is the maximum number of speakers. Since it's a conversation between two people, I'll just set both to two, but you can skip this. If you don't provide these optional arguments, the model will have to figure out the number of speakers based on the audio characteristics that are present. So here are the timestamps of the different segments it found, and if I look for the number of speakers, there are actually two speakers that it identified within this audio. Okay, so the last step is to simply put together the output from the speaker identification process and your speech-to-text transcription. We have the corresponding timestamps in both of these outputs, so they are combined using the assign_word_speakers function, and you get timestamps with the corresponding speaker ID. Here is how it looks for each of the segments: you have the start and end timestamp of the segment and then the corresponding speaker ID. And here's how the final output looks: you have the sentences, the corresponding start and end timestamps, and then the word-level transcription as well for each of the sentences. Here's another segment in which speaker 01 is talking. I hope this was helpful. I'll put a link to both of these notebooks, so you can either do just speech-to-text transcription or, if you want to do speaker identification, use the second notebook. Let me know in the comment section below if you have any questions or if there are any specific topics you want me to cover. Thanks for watching, and as always, see you in the next one.
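And here is a hedged sketch of the diarization and merging step, again following the WhisperX API as presented in the video (whisperx.DiarizationPipeline wrapping pyannote.audio, and whisperx.assign_word_speakers); YOUR_HF_TOKEN is a placeholder for your own access token, and the pipeline's location in the package may differ in newer releases:

```python
# Build the diarization pipeline (uses pyannote.audio under the hood,
# which is why a HuggingFace access token is required)
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="YOUR_HF_TOKEN",  # placeholder, use your own token
    device=device,
)

# Run diarization on the raw audio; min/max speakers are optional hints
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=2)

# Merge diarization output with the aligned transcript via the timestamps
result = whisperx.assign_word_speakers(diarize_segments, result)

# Each segment (and word) now carries a speaker label such as SPEAKER_00 / SPEAKER_01
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment.get("speaker"), segment["text"])
```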