Train a Voice Recognition Model Easily
Learn to train voice recognition models using Hugging Face datasets, convert data for training, and leverage the HuggingSound library for effective model training and inference.
Train your custom Speech Recognition Model with Hugging Face models
Added on 01/29/2025

Speaker 1: Hello everyone, welcome back to my YouTube channel. Today we are going to do a simple tutorial on how to train a voice recognition model. First of all, I'd like to explain how we are going to train the model and how we can run inference with it for our own custom applications. For today's tutorial, I'm taking a simple dataset from the Hugging Face Hub called USP-CVRT-HIGH, available at this particular link. Let's open it and see what exactly this dataset is. It comes from this particular author, and there are some other related datasets available too, like CVRT and USP-CVRT; for this demo we're using the CVRT-HIGH dataset. Here we can see the format, which is pretty straightforward. We have the ID column, which identifies the speaker, with values like F02, F03, F04. Then we have the target column, which holds the words, that is, what is actually being said in the sound signal. Next is the path column, which is the file name along with the drive path where the file was saved on the author's machine. And then we have the speech column, which is an array of numbers representing the sound signal. For this demo, I'll show how we can convert that array back into a speech signal. If you have the sound files directly, well and good; if you have the signal in the form of an array like this, then we have to convert it into an audio file before training the model. Either way, we will see examples. So that's the dataset; now let's load it. First of all, I'm setting the path here so things stay straightforward for the tutorial. We load the dataset, and since I downloaded it earlier, it loads very fast. What we get back is a Hugging Face DatasetDict. It has a train split, and inside that is the dataset itself; this one doesn't have a test split. There are cases where you get a train split, a test split, and maybe a validation split as well, and you would see them in the same format. Since I'm comfortable with pandas, I'm converting the train split into a pandas DataFrame. That DataFrame holds the entire dataset, which is around 22,524 records. For this tutorial I'm only taking a sample of about 20 records; we don't have to worry too much about that since it's just for demonstration, but in your case you should definitely take the full dataset, or you won't get a good model. As we saw earlier, we have the target column, the speech column, the ID, and the path.
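A minimal sketch of this loading step might look like the following. The exact dataset id on the Hub isn't spelled out precisely in the recording, so treat the name below as a placeholder.

```python
# Minimal sketch of the loading step. The dataset id is a placeholder --
# substitute the actual Hugging Face Hub id shown in the video.
from datasets import load_dataset
import pandas as pd

ds = load_dataset("author/usp-cvrt-high")  # hypothetical Hub id
print(ds)                                  # DatasetDict with only a "train" split

# Convert the train split to a pandas DataFrame for easy inspection
df = ds["train"].to_pandas()
print(len(df))                             # ~22,524 records in the full dataset

# For the demo, keep only a tiny sample; use the full data for a real model
df = df.head(20).reset_index(drop=True)
```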
The next step is to split the file name out of the path, since right now the path column still contains the author's full drive path. For the sake of cleaning the data, we can simply split it. The path is stored in a format like this, so we split it on the slash delimiter, which gives us a list, and from that list we take the last item, which is index -1. If you needed the first item it would be index 0, and so on; for the last item the index is simply -1. I'm just wrapping this in an apply function, nothing much. When we run this, we get the file name, which is basically the same information, just the last segment of the path. As a best practice, we can save this DataFrame so it's easy to refer to while training; that way we don't have to run this preparation script every single time. So what we have now is a DataFrame containing the file name, the target word, and the array of numbers. What we still need is the actual speech output, the sound or the voice itself. For that, there is a function which writes the data out at a sample rate of 16,000 Hz and creates a sound file under that exact file name; that's why I passed the file name through directly, so we get each sound signal under its original name. Nothing much to it: we define the sampling rate, set the number of channels to one, create the wave object, and write the samples, where the samples are simply the speech array. Let's take one example and see what that array actually is: it's an array of floating-point numbers, and that is the file, basically; this representation is what we are converting into an actual speech signal in WAV format. So we have the function; now let's convert everything. I defined the output directory as a folder called audio, so all our audio files get saved into that particular folder. I run it and... we got an error. I think I haven't run the cell that defines this function yet. Sorry about that; let me run it once again. Now it ran and created the sound files. If you open one of them, you'll hear that the speech is disordered, which is why it's not very intelligible; that's exactly why we are training the model, since it's for an elderly and disordered speech recognition purpose. So that's the data prepared. Now we have the data with the actual file names, and for each file name we have its reference: this file name corresponds to this word.
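A sketch of that preparation step, continuing from the DataFrame above. The column names and the float-to-PCM scaling are assumptions based on the narration:

```python
# Continuing from the df defined above. Column names are assumptions.
import os
import wave
import numpy as np

# Keep only the file name from the full drive path,
# e.g. "/some/drive/path/clip_001.wav" -> "clip_001.wav" (hypothetical path)
df["file_name"] = df["path"].apply(lambda p: p.split("/")[-1])

OUT_DIR = "audio"
os.makedirs(OUT_DIR, exist_ok=True)

def save_wav(file_name, speech, sample_rate=16000):
    """Write a float array back out as a 16-bit mono WAV file."""
    # Scale floats in [-1, 1] to 16-bit integer PCM (assumed value range)
    samples = (np.asarray(speech) * 32767).astype(np.int16)
    with wave.open(os.path.join(OUT_DIR, file_name), "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit audio
        wf.setframerate(sample_rate)  # 16 kHz, as used in the video
        wf.writeframes(samples.tobytes())

# Write one WAV file per record, named after its original file name
for _, row in df.iterrows():
    save_wav(row["file_name"], row["speech"])
```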
So basically each file name represents a word, and now we have the sound, the audio file, as well. If you directly have the audio, you don't have to go through any of this: you can simply skip the data preparation script, since you already have the data. We didn't have the audio, which is why I had to convert the arrays into files. Okay, now let's see how we can train the model. For that we are using a library called HuggingSound. HuggingSound is a very interesting library, built on top of Hugging Face models, and it supports inference, evaluation, and fine-tuning, so it's really great; it helps us train the model without much hassle. Let's import some of the items from HuggingSound. Before that, you have to pip install huggingsound, and while installing it may downgrade your torch library; I'm not sure it happens in every case, but it did in mine. If you want to protect your existing installation, you can create a new environment or a separate virtual environment, or simply reinstall torch afterwards; it won't create much of a problem, but you do have to install huggingsound and its dependencies. Then we import pandas and os, and I'm pointing the path at the audio folder. If you have a CUDA instance, I would recommend CUDA, but this fine-tuning is a little heavy: in my case I have a CUDA device, and even so I'm hitting memory errors while training. So let's not take the risk here and go with CPU; if you have a good GPU, I would suggest you follow the GPU route. Then we load the model, which is Facebook's Wav2Vec2 model, the pretty famous one that came before the Whisper model. With the model ready, I'm just emptying the CUDA cache. The next step is to create the vocabulary file. You can create a vocabulary of your choice; I'm just creating a dictionary with the basic letters, and I also need the additional unknown and padding tokens, so I'm creating those. When I created the token set, it automatically added the pad, unknown, and word-boundary tokens; maybe the way I defined them left a gap, so it filled them in itself. You don't have to worry much about this: simply define your vocab dictionary, and the missing special tokens are created automatically when the token set is built. If you go and check the tokens, you can see everything is there: the unknown token, the padding token, all of it.
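A minimal sketch of that setup, assuming the huggingsound library (pip install huggingsound). The video doesn't name the exact base checkpoint, so the Wav2Vec2 one below is an assumption:

```python
# Setup sketch: load a Wav2Vec2 base model and build the token set.
import torch
from huggingsound import SpeechRecognitionModel, TokenSet

# The video falls back to CPU to avoid GPU out-of-memory errors
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53", device=device)
torch.cuda.empty_cache()  # free any cached GPU memory (no-op on CPU)

# Basic letters; TokenSet fills in missing special tokens
# (pad, unknown, word delimiter) automatically
tokens = list("abcdefghijklmnopqrstuvwxyz'")
token_set = TokenSet(tokens)
print(token_set.tokens)  # inspect all tokens, including the auto-added ones
```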
Now let's define the training arguments, the parameters for how we train the model. Learning rate is definitely an important one, and then maximum steps, which controls how long we train. I would suggest you calculate the maximum steps from the number of epochs you need, so you can tweak it: whether you want a 3-epoch training, or 10, or 15, play with that number; there is a calculation you can use to derive the steps from the epoch count. When I was training for real, I gave a very large number to train the model thoroughly, but to keep this demo safe I'm just giving 1, and the same for the evaluation steps. Then we have the model arguments, dropout and hidden dropout, where I'm just going with the default values. Next, I load the DataFrame we created earlier, since my data is already prepared, and I move into the audio path. Now comes the crucial part: we need the training data in a particular format to train this model. The training data must be a list, and each item in the list must be a dictionary containing a path and a transcription. The transcription is the target word, what exactly is said in that sound file, so each sound signal gets mapped to its word. If you already have the sound files, well and good, but you still need each exact file name matched to the correct word; that's where these DataFrames and the mapping will definitely help you, though it's your choice, and you can use any data structure you prefer. Ultimately, we need the data in this format: a path and a transcription. That's what matters. Similarly, you can prepare the evaluation dataset; here I'm just preparing it simply. You don't need to worry about this part: I had some code to remove duplicate speakers when I was training for another application, but in this case you can simply supply the data of your choice. Next, we can train the model. I'm pointing the output at this folder; you can give whatever directory you want the model saved to. We pass the output directory, the training data, the evaluation data, the token set (which is important), the training arguments, and the model arguments; based on the training arguments, it decides how many epochs it will run. Then we simply call the fine-tuning and it automatically starts training. It will take a while, and if you don't have enough memory you will definitely hit a memory error when using a GPU; if you're using a CPU, the training will be slow, but it will run, so make sure you have all the necessary settings ready. After a while you will see the checkpoints coming out. And that's how we fine-tune the model on our custom data.
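A sketch of the fine-tuning call, continuing from the earlier sketches (df, model, token_set, OUT_DIR). The hyperparameter values mirror the throwaway demo settings described in the video, not recommendations, and the learning rate and output folder name are assumptions:

```python
# Fine-tuning sketch. Assumes df, model, token_set, OUT_DIR from above.
import os
from huggingsound import TrainingArguments, ModelArguments

training_args = TrainingArguments(
    learning_rate=3e-4,  # assumed value; the video doesn't state it exactly
    max_steps=1,         # demo only -- derive this from your target epoch count
    eval_steps=1,
)
model_args = ModelArguments(
    activation_dropout=0.1,  # default-ish dropout values, as in the video
    hidden_dropout=0.1,
)

# The trainer expects a list of dicts, each mapping an audio path to its text
train_data = [
    {"path": os.path.join(OUT_DIR, row["file_name"]), "transcription": row["target"]}
    for _, row in df.iterrows()
]
eval_data = train_data  # demo shortcut; use a proper held-out split in practice

model.finetune(
    "sample_out",        # output directory for checkpoints, config, and vocab
    train_data=train_data,
    eval_data=eval_data,
    token_set=token_set,
    training_args=training_args,
    model_args=model_args,
)
```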
Now that the model is trained, we need to run inference with it. After training, you will get an output like this: the saved PyTorch model, the vocabulary file, the config file, and all the other necessary files; all your model details will be available in that folder. If you want to upload it to Hugging Face you can, or you can host it on whatever cloud service you like. So how do we call this model? That's what the inference script is for, and it's very straightforward. Okay, let me stop the training first; we're not actually training right now, as I've already trained it. To load the model, we import from HuggingSound again and point the model directory at wherever the model files are; in my case the output is in the sample_out folder, so I just target that and instantiate it. Then the model is ready, and you can see it's a HuggingSound SpeechRecognitionModel; if you want to check the parameters, you can inspect it, and it's all pretty straightforward. Now, to transcribe the data: I point back at the audio directory, take one file as an example, and call model.transcribe with the audio path, and the transcription variable will hold the output.
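A minimal sketch of the inference step, loading the fine-tuned model back from the output directory used above (the audio file name is hypothetical):

```python
# Inference sketch: load the fine-tuned model from its output directory.
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("sample_out")  # folder with the saved files

# transcribe() takes a list of audio paths and returns one dict per file,
# including the predicted text, probabilities, and character timestamps
audio_paths = ["audio/clip_001.wav"]          # hypothetical file name
transcriptions = model.transcribe(audio_paths)
print(transcriptions[0]["transcription"])
```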
Right now the output is poor because I trained on only about 20 samples; that's why I didn't train it fully. But when you train with a large dataset, you will definitely get good output: you will get the probabilities, the transcript, everything. So train with a good dataset and a good number of epochs, and you will have a good model ready. That's it, and thank you guys for watching this video. Have a great day.