Speaker 1: Hey, welcome to One Little Coder. OpenAI Whisper is really good at transcribing audio: you can give it an English audio file and it will transcribe it to English, or you can give it audio in another language and it will transcribe and translate it into English. But what OpenAI Whisper cannot do out of the box is speaker diarization. What is speaker diarization? If you have got a conversation where two people are talking, you want to label those two people individually. That is often important when you are transcribing a podcast or something like that, and it is not currently possible out of the box with OpenAI Whisper, which stops a lot of people from using it. Fortunately, we have got some open source contributions: an amazing person called Dwarkesh Patel has created a Google Colab notebook and also a Gradio demo that let us do speaker diarization using OpenAI Whisper, together with an additional model from SpeechBrain. So in this video we are going to learn how to use Dwarkesh Patel's Google Colab notebook to do speaker diarization, and also how to use the Gradio application so that we don't have to code everything. The first thing I'm going to show you is how to leverage the Gradio application, which means you just have to drag and drop and you have speaker diarization in place. After that, I'm going to take you through the code so that you understand what is happening, which gives you the flexibility to tweak it if you need to change anything.

The first thing, as you can see, is this Gradio application currently hosted on Hugging Face Spaces, where you can upload an audio file. I just uploaded an audio clip from Batman Begins; I literally downloaded it from this website. You can see that this is a conversation between Lucius Fox and Bruce Wayne, where Bruce Wayne wants to borrow something from Lucius Fox. And you can see that it has done a good job of identifying Bruce Wayne, who is speaker 2 here, and Lucius Fox, played by Morgan Freeman, who is speaker 1. All you have to do is upload an audio file, select the number of speakers, and it is going to transcribe the audio and also label the speakers for you.

Now, having seen how easy it is to do this with Gradio, let's jump into the Google Colab notebook. I'll link both the Gradio application and the Google Colab notebook in the YouTube description, which makes it easier for you to get started. This Google Colab notebook was created entirely by Dwarkesh Patel; all credit goes to him, and I'll link it in the description. Once you open it, first make sure that you have got the GPU runtime: go to Runtime, select Change runtime type, and check that GPU is selected. After that, connect to the runtime, and you can see some notes and a high-level overview of what is happening. Next, you can run this widget, which opens a file browser, and you can upload the mp3 file that you want to transcribe. Again, you can modify this code based on what you want; I'm just taking you through the existing code. After you upload the mp3 file, you need to select three things: how many speakers there are, what language the speakers are using, and which model size you want to use.
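To make that parameter step concrete, here is a minimal sketch of roughly what the upload-and-parameters cell looks like when run inside Google Colab. The file name and the variable names are my own placeholders, not necessarily the ones used in Dwarkesh Patel's notebook.

```python
# Minimal sketch of the upload-and-parameters step (names are illustrative).
from google.colab import files

uploaded = files.upload()        # opens a file browser widget in the notebook
path = next(iter(uploaded))      # name of the uploaded mp3 file

num_speakers = 2                 # how many people speak in the audio
language = "English"             # "English" lets Whisper skip language detection
model_size = "medium"            # tiny / base / small / medium / large
```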
Why are these three things critical? First, the speaker identification here is done with a clustering technique, so you need to specify the number of speakers, which is very important. For example, if you have got only two speakers and you say three, that confuses the model: it will try to label three people, and that can create chaos. So make sure you enter the right number of speakers in the dialogue, podcast, or session, because that is what lets it correctly tag all the speakers involved in the audio. Next, if the speakers are speaking only in English, then selecting English gives you an advantage in processing time. If they are speaking in another language, or in multiple languages including English, then select the other option; but if your audio clip is already in English, selecting English here speeds up the processing. Finally, you need to select the model size. By default the medium model is used, but this is a trade-off between accuracy and processing time. The tiny model takes very little time to process, but the error rate could be higher. The large model could be the most accurate, but it requires more computation and takes a lot longer to process. So based on your use case, the time you have, and the GPU resources you have got, select the appropriate model. I have selected medium.

The next thing we need to do is install Whisper and pyannote.audio. After you install these two libraries, the code comes in where we import all the required libraries. While we are importing them, you can see the speaker embedding model is also loaded, which is what I was just showing you: it is from SpeechBrain, and it does speaker verification with ECAPA-TDNN embeddings trained on VoxCeleb. I believe that is celebrity speech data; I don't know the exact details, but it is the data used to train the speaker verification model, and we are going to use that model to identify the speakers. Then we need to work with the audio segments, which is why we have got pyannote, and we need to do clustering as well, which is why we have got the AgglomerativeClustering import from sklearn.cluster. After you import all of these, you check the file format of your audio data: if it is not in .wav format, it gets converted into .wav, and that is what is happening here using FFmpeg. Then we load the Whisper model for transcription; here the English medium model that we selected at the top gets loaded. Once the model is loaded, we give it the path of the audio file, it transcribes it, and the result is presented to us as individual segments, each with timestamps. Then we create the embeddings for each segment using the embedding model we have already got, which is the SpeechBrain model. One thing that Dwarkesh has noted here is that Whisper overshoots the end timestamp in the last segment. Keep that in mind.
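Here is a rough sketch of this middle stretch of the pipeline, assuming the `path`, `language`, and `model_size` variables from the parameter cell above and a pyannote.audio 2.x / SpeechBrain setup. Treat it as an illustration of the steps described in the video, not a copy of Dwarkesh Patel's exact code.

```python
# Rough sketch: convert to .wav, transcribe with Whisper, then embed each
# segment with the SpeechBrain ECAPA-TDNN model (via pyannote.audio).
import contextlib
import subprocess
import wave

import numpy as np
import torch
import whisper
from pyannote.audio import Audio
from pyannote.core import Segment
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

# Speaker embedding model: speechbrain/spkrec-ecapa-voxceleb.
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb", device=torch.device("cuda"))

# Re-encode the upload as mono 16 kHz .wav if needed (this is the FFmpeg step).
if not path.endswith(".wav"):
    subprocess.call(["ffmpeg", "-y", "-i", path, "-ac", "1", "-ar", "16000", "audio.wav"])
    path = "audio.wav"

# English-only checkpoints exist for tiny/base/small/medium (not large),
# so "medium" plus English maps to "medium.en".
model_name = model_size + ".en" if language == "English" and model_size != "large" else model_size
model = whisper.load_model(model_name)
segments = model.transcribe(path)["segments"]   # each has "start", "end", "text"

# Total duration, used to clamp the last end timestamp, which Whisper overshoots.
with contextlib.closing(wave.open(path, "r")) as f:
    duration = f.getnframes() / float(f.getframerate())

audio = Audio()

def segment_embedding(segment):
    end = min(duration, segment["end"])                    # clamp the overshoot
    waveform, _ = audio.crop(path, Segment(segment["start"], end))
    return embedding_model(waveform[None])                 # shape (1, embedding_dim)

embeddings = np.nan_to_num(np.vstack([segment_embedding(s) for s in segments]))
```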
Sometimes that could be a problem for you, which means you may need to do a little bit of extra data processing if you are working with this manually. After we create the embeddings, the next thing we do is the clustering. You already know how many speakers you selected, in this case two, so it fits the embeddings with that number of clusters and labels the speakers as speaker 1 and speaker 2. Then, finally, you write the labels into a transcript.txt file, with the relevant timestamps and the relevant English transcription, and that gets saved as transcript.txt. At the end you display the transcript.txt file to see what is in it, which is exactly what we saw in the Gradio demo. So that finishes the code explanation; I think this is quite enough for you to understand what is happening. If you want to make some improvements, you can: for example, could you make this algorithm automatically detect how many speakers there are? There is a lot you could improve here, and that doesn't take away from the good work Dwarkesh Patel has given us, but you can build on it.

To quickly show you a demo from the start, I'm going to download another file. Let's download this audio, which is from The Dark Knight, a clip called "like it". So I've downloaded this audio, and now I'm going back to my Google Colab notebook. After I've run this, I upload the file here, and it gets copied into the local Google Colab session. After the local Google Colab session has the mp3, just make sure that you select the right number of speakers; as you know, we have got two speakers here, one is Batman and the other is Lucius Fox. Once you have that, everything else stays the same, so you don't have to run it all again; you just run this so that the audio file gets converted into the right format. The Whisper model is already loaded, so you just do the transcription, and then we run everything else, which opens the .wav file, creates the embeddings, does the clustering for speaker identification, creates the transcription, stores it in a transcript.txt file, and then displays transcript.txt. So here it says speaker 2: "Now, for high altitude jumps," and then, "Okay, what about getting back into the plane?" "I would recommend a good agent." It has not done a very good job in this case: it has identified only two speakers, and while there are indeed only two speakers, it has not separated the full conversation correctly, as you can see. But then again, that could be a limitation of the models we have got; it is always good to test these models, and you can now try the large model and see how it works. Overall, this is a very interesting application. So again, thanks to Dwarkesh Patel for making this open source. You don't even have to go through the Google Colab notebook: there is already a nice Whisper demo with speaker diarization available on Hugging Face Spaces, and you can start using it straight away.
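Here is a minimal sketch of that final step, assuming the `segments` and `embeddings` variables from the previous snippet and the `num_speakers` value from the parameter cell. The exact transcript.txt layout shown is just one reasonable choice, not necessarily the notebook's.

```python
# Cluster the segment embeddings into num_speakers groups, label each segment,
# and write a speaker-tagged transcript.txt with timestamps.
import datetime

from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
for i, segment in enumerate(segments):
    segment["speaker"] = f"SPEAKER {clustering.labels_[i] + 1}"

def timestamp(seconds):
    # e.g. 83.2 -> "0:01:23"
    return str(datetime.timedelta(seconds=round(seconds)))

with open("transcript.txt", "w") as f:
    for i, segment in enumerate(segments):
        # Print the speaker label and start time only when the speaker changes.
        if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
            f.write(f"\n{segment['speaker']} {timestamp(segment['start'])}\n")
        f.write(segment["text"].strip() + " ")

print(open("transcript.txt").read())   # display the labeled transcript
```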
And you can also download it to your local machine and start using it there if you want speaker diarization, that is, labeling each part of the conversation with a particular speaker while the transcription is done by OpenAI Whisper. I hope this tutorial was helpful to you in doing speaker diarization with OpenAI Whisper. If you have got any questions, let me know in the comment section. Otherwise, all the required links will be in the YouTube description. Please check it out. See you in the next video. Happy prompting.