Master Speaker Diarization with OpenAI Whisper & Gradio
Learn how to use OpenAI Whisper and Gradio for speaker diarization in audio transcriptions, thanks to Dwarkesh Patel's open-source tools.
OpenAI Whisper Speaker Diarization - Transcription with Speaker Names
Added on 01/29/2025

Speaker 1: Hey, welcome to One Little Coder. OpenAI Whisper is really good at transcription: you can give it English audio and it will transcribe it to English, or you can give it audio in any other language and it will transcribe and translate it into English. But what OpenAI Whisper cannot do out of the box is speaker diarization. What is speaker diarization? If you have a conversation where two people are talking, you may want to label those two people individually. That is often important when you are transcribing a podcast or something similar, and it is not currently possible out of the box with OpenAI Whisper, which stops a lot of people from using it. Fortunately, there are open-source contributions: an amazing person called Dwarkesh Patel has created a Google Colab notebook and a Gradio demo that let us do speaker diarization using OpenAI Whisper together with an additional model from SpeechBrain.

In this video we are going to learn how to use Dwarkesh Patel's Google Colab notebook to do speaker diarization, and also how to use the Gradio application so that we don't have to code everything ourselves. First I'll show you how to leverage the Gradio application, which means you just drag and drop and you have speaker diarization in place. Then I'll take you through the code so you understand what is happening, which gives you the flexibility to tweak it if you need to change anything.

The Gradio application is currently hosted on Hugging Face Spaces, and you can simply upload an audio file to it. I just uploaded one: "Yeah, you know, cave diving? Expecting to run into much gunfire in these caves?" This is from Batman Begins; I literally downloaded it from this website. It is a conversation between Lucius Fox and Bruce Wayne, where Bruce Wayne wants to borrow something from Lucius Fox, and you can see it has done a good job of identifying Bruce Wayne, who is Speaker 2 here, and Lucius Fox, played by Morgan Freeman, who is Speaker 1. All you have to do is upload an audio file and select the number of speakers, and it will transcribe the audio and label the speakers for you.

Now that you have seen how easy it is on Gradio, let's jump into the Google Colab notebook. I'll link both the Gradio application and the Colab notebook in the YouTube description, which makes it easier for you to get started. The notebook was created entirely by Dwarkesh Patel; all credit goes to him. Once you open it, first make sure you have a GPU runtime: go to Runtime, select "Change runtime type", and check that GPU is selected, then connect to the runtime. You will see some notes and a high-level overview of what is happening. Next, run the widget that opens a file browser and upload the MP3 file you want to process. Again, you can modify this code to suit your needs; I'm just taking you through the existing code. After you upload the MP3 file, you need to select three things: how many speakers there are, what language the audio is in, and which model size you want to use.
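For reference, here is a minimal sketch of those first Colab steps, assuming the standard `google.colab` file-upload helper; the variable names (`num_speakers`, `language`, `model_size`) are illustrative and not necessarily the exact ones used in Dwarkesh Patel's notebook.

```python
# Upload an audio file into the Colab session and set the three choices
# the notebook asks for. Names below are illustrative placeholders.
from google.colab import files

uploaded = files.upload()       # opens a file picker; choose your .mp3
path = next(iter(uploaded))     # filename of the uploaded audio

num_speakers = 2                # how many distinct speakers are in the audio
language = "English"            # "English" lets Whisper use its faster English-only models
model_size = "medium"           # tiny / base / small / medium / large: speed vs. accuracy
```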
Why are these three things critical? First, the speaker identification here uses a clustering technique, and for clustering you need to specify the number of clusters, so this value really matters. For example, if there are only two speakers but you enter three, that will confuse the model: it will try to label three people, and that can create chaos. So make sure you enter the right number of speakers for the dialogue, podcast, or session, and it can correctly tag everyone involved in the audio.

Second, the language. If the speakers are speaking only English, selecting English gives you an advantage in processing time. If they are speaking another language, or multiple languages including English, select one of the other options; but if your clip is entirely in English, choosing English here speeds things up.

Third, the model size. By default the medium model is used, but this is a trade-off between accuracy and processing time. The tiny model takes very little time to process, but that also means the error rate could be high; the large model could be the most accurate, but it requires more computation and takes much longer. Based on your use case, your time, and the GPU resources you have, select the appropriate model. I have selected medium.

The next thing we need to do is install Whisper and pyannote. After these two libraries are installed, the code imports all the required libraries, and as part of that you can see the speaker embedding model being loaded. It comes from SpeechBrain: a speaker verification model with ECAPA-TDNN embeddings trained on VoxCeleb. I think it uses celebrity voice data; I don't know the exact details, but it uses that data to do speaker verification, and we are going to use it to identify the speakers. We also need to work with audio segments, which is why the pyannote imports are there, and we need to do clustering, which is why AgglomerativeClustering is imported from scikit-learn.

After the imports, the notebook checks the file format of your audio: if it is not already a .wav file, it converts it to .wav, and that is what happens here using FFmpeg. Then we load the Whisper model for transcription, and the model we selected at the top, English medium, gets loaded.
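Here is a hedged sketch of that setup stage, modelled on what the video describes: install the two libraries, import what is needed, and load the SpeechBrain ECAPA-TDNN speaker-embedding model through pyannote.audio. The exact cells in Dwarkesh Patel's notebook may differ slightly.

```python
# In a Colab cell (the leading "!" is Colab shell syntax):
# !pip install -q openai-whisper pyannote.audio

import subprocess

import numpy as np
import torch
import whisper
from pyannote.audio import Audio
from pyannote.core import Segment
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from sklearn.cluster import AgglomerativeClustering

# Speaker-verification embeddings trained on VoxCeleb.
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)
```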
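And a small sketch of the two steps that follow: converting the upload to WAV with FFmpeg and loading the chosen Whisper model. The ".en" suffix mirrors what the video describes, since the English-only checkpoints are faster when the audio is entirely in English (there is no English-only variant of the large model).

```python
# Convert to WAV so the wave module and pyannote's audio loader can read it.
if not path.lower().endswith(".wav"):
    subprocess.call(["ffmpeg", "-i", path, "audio.wav", "-y"])
    path = "audio.wav"

# Pick the English-only checkpoint when the audio is English, e.g. "medium.en".
model_name = model_size
if language == "English" and model_size != "large":
    model_name += ".en"

model = whisper.load_model(model_name)
```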
Once the model is loaded, the next step is to give it the path of the audio file; it transcribes the audio and presents the result as individual segments, each with a timestamp. Then we create the embeddings using the embedding model we already loaded, the SpeechBrain model. One thing Dwarkesh has noted here is that Whisper overshoots the end timestamp in the last segment. Keep that in mind; it can be a problem, which means you may need a little extra data processing if you are doing this manually.

After we create the embeddings, we do the clustering. Based on the number of speakers you selected, in this case two, it fits the embeddings into that many clusters and labels the speakers as Speaker 1, Speaker 2, and so on. Finally, the labels are written to a transcript.txt file along with the relevant timestamps and the English transcription, and that transcript.txt file shows you the result, which is exactly what we saw earlier.

That finishes the code explanation; I think it is enough for you to understand what is happening. If you want, you can make improvements here. For example, could you ask the algorithm to automatically detect how many speakers there are? There is a lot you could improve, and that doesn't take away from the good work Dwarkesh Patel has given us.

To quickly run the demo from the start, I'm going to download another file. Let's pick this audio from The Dark Knight, the clip called "Like It". I've downloaded it, and now I'm going back to my Google Colab notebook. After running the upload cell, I upload the file here, and it gets copied into the local Colab session. Once the MP3 is in the session, just make sure you select the right number of speakers; as you know, we have two speakers here, Batman and Lucius Fox. Everything else stays the same, so you don't have to rerun it all; you just run the cell that converts the audio file into the right format. The Whisper model is already loaded, so you simply run the transcription, and then we run everything else, which opens the WAV file and creates the embeddings.
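For reference, here is a hedged sketch of the transcription-and-embedding stage described above. The `min(duration, ...)` clamp reflects Dwarkesh's note that Whisper can overshoot the end timestamp of the last segment, and the 192 dimensions match the ECAPA-TDNN embeddings; treat the details as an approximation of the notebook rather than its exact code.

```python
import contextlib
import wave

# Transcribe: each segment carries "start", "end" and "text".
result = model.transcribe(path)
segments = result["segments"]

# Real audio length, used to clamp the last segment's end timestamp.
with contextlib.closing(wave.open(path, "r")) as f:
    duration = f.getnframes() / float(f.getframerate())

audio = Audio()

def segment_embedding(segment):
    start = segment["start"]
    end = min(duration, segment["end"])        # Whisper may overshoot here
    waveform, sample_rate = audio.crop(path, Segment(start, end))
    return embedding_model(waveform[None])     # shape (1, 192)

embeddings = np.zeros((len(segments), 192))
for i, segment in enumerate(segments):
    embeddings[i] = segment_embedding(segment)
embeddings = np.nan_to_num(embeddings)
```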
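And a sketch of the final two steps: cluster the segment embeddings into the chosen number of speakers, then write the labelled transcript to transcript.txt. The formatting below (a new block whenever the speaker changes) is one reasonable layout, not necessarily the notebook's exact output format.

```python
import datetime

# Agglomerative clustering with n_clusters = number of speakers selected earlier.
clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
for i, label in enumerate(clustering.labels_):
    segments[i]["speaker"] = f"SPEAKER {label + 1}"

def fmt(seconds):
    return str(datetime.timedelta(seconds=round(seconds)))

with open("transcript.txt", "w") as f:
    for i, segment in enumerate(segments):
        # Start a new block whenever the speaker changes.
        if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
            f.write(f"\n{segment['speaker']} {fmt(segment['start'])}\n")
        f.write(segment["text"].strip() + " ")
```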
It then does the clustering for speaker identification, creates the transcription, stores it in a transcript.txt file, and displays the file. It says, Speaker 2: "Now, for high-altitude jumps, this is..." and then, "Okay, what about getting back into the plane?" "I would recommend a good agent." So it has not done a very good job in this case: it has labelled only two speakers, and while there are indeed only two speakers, it has not separated the individual turns of the conversation correctly. Then again, that could be a limitation of the model we chose; it is always good to test these models, so you could try the large model and see how it does. Overall, this is a very interesting application, so again, thanks to Dwarkesh Patel for making it open source.

You don't even have to go through the Google Colab notebook: there is already a nice Whisper demo with speaker diarization available on Hugging Face Spaces. You can start using it straight away, or download it to your local machine, if you want speaker diarization, that is, labelling each part of the conversation with a particular speaker while OpenAI Whisper does the transcription.

I hope this tutorial was helpful for doing speaker diarization with OpenAI Whisper. If you have any questions, let me know in the comment section; otherwise, all the required links are in the YouTube description, so please check it out. See you in the next video. Happy prompting!
