Speaker 1: Hello everyone and welcome back to another video. This video will be a little different from the previous ones. Recently I was investigating the Wav2Vec2 framework for self-supervised learning of speech representations. That's a paper from 2020, and if you are doing anything in speech recognition you have probably already heard of it. This is the second version of that line of work; there was, of course, a first version, simply called Wav2Vec, and this paper was released by Facebook. Basically, the paper describes the model they released, which is still one of the best speech recognition models overall. It's already three years old, but it remains one of the best models we can use.

So in this tutorial I'll demonstrate how easily we can fine-tune this model on our own data for a speech recognition task. I'm not going to cover the theory or the model architecture in detail. I'm mainly going to give you the code that I used to fine-tune the model on my own data, with my own tokenizer and so on. This will be pretty simple, and we are not going deep into the theory or the implementation details; I'll only cover them briefly.

What makes this model so nice is that we simply read our audio data, take the raw waveform, and feed it straight into the model. Inside, it has convolutional neural network layers followed by a transformer encoder, and it gives us the output. The model is trained with CTC loss, and that's what I'm going to use in this tutorial to show how simple it is. We don't need to train the model from scratch; we can take the pre-trained model and fine-tune it. The paper explains what they did and why they got such great results, and that's what we're going to build on. Of course, there are also comparisons for different amounts of labeled and unlabeled data, but I don't think you are that interested in those. The model is published on Hugging Face, so we can use it from there. This is the base model, you can see it's from Facebook, and on Hugging Face you can find explanations of how they trained it and so on. So I'll give you a video tutorial on how to fine-tune this model and how to run inference when you want to deploy it, even if you don't want to install PyTorch itself on the target device.

So let's jump into the code in my mltu package that I have been working on recently. You might see that I have a wav2vec2 torch tutorial here, and as you heard, I'm not going to use TensorFlow this time; I will use PyTorch, and at the end of this tutorial I will explain why. The reason is pretty simple. The requirements for this tutorial are: you need Torch of at least this version, because that's what I tested on and I'm not sure whether it will work on higher versions; you need transformers so you can download the pre-trained model; you need the mltu package that I recently released (I'll note the version you should use); and of course ONNX and onnxruntime. Okay, let's import these. That's it, and now let's go straight to the training code, and I'll explain what I do there.
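As a quick point of reference before the training script, here is a minimal, self-contained sketch of what "feed raw audio straight into the model" looks like with the published checkpoint. This is not the fine-tuning code from the tutorial; the facebook/wav2vec2-base-960h checkpoint name and the sample.wav path are assumptions used purely for illustration.

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the published checkpoint: a CNN feature encoder plus a transformer
# encoder with a linear CTC head on top.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# wav2vec 2.0 consumes raw 16 kHz waveforms, no spectrogram features needed.
audio, _ = librosa.load("sample.wav", sr=16000)  # "sample.wav" is a placeholder path
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # shape: (batch, time, vocab)

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])  # greedy CTC decoding
```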
So basically I import a lot of stuff, most of it from my mltu package. As you can see, there is the model, the CTC loss that I use, the data provider, metrics to track character error rate and word error rate, and some callbacks: early stopping, warm-up cosine decay, model-to-ONNX conversion, TensorBoard so we can track the metrics while we train, and model checkpointing. There are also a few audio augmentations: random audio noise, random audio pitch shift and time stretch. I implemented these augmentations to enrich our data so the model generalizes better. As you can see, I also import Wav2Vec2ForCTC from the Transformers library, which we need to have installed, and a few more functions from PyTorch, but you don't need to focus on those right now.

What I'm doing first, as you can see, is downloading the LJSpeech dataset. That's a pretty large dataset with around 13,000 labeled speech samples, and we simply need to download it; I use this function to do that. It will be placed in the Datasets folder as LJSpeech with the metadata and so on. If I open my Datasets folder you can see that I already have this data downloaded, and it's pretty simple. So let's move on.

Next I have a vocabulary here; these are the usual characters that appear in this dataset. Then I need to preprocess the dataset. I read the metadata file, which uses the separator you see here, and I read the transcription strings from it. I join these labels and call lower(), as you can see. This means I don't keep capitalized characters; I lowercase all of them. That's for simplicity, because when we are recognizing speech we usually don't care whether a letter is capital or not, so it's much easier to train the model when all characters are lowercase, and it simplifies the vocabulary we need to use.

Great. Then I use a data provider. It's very similar to my TensorFlow data provider, and I actually inherited everything from it, but you might see it has some additional parameters, such as workers, whether to use multiprocessing or not, and so on. When we are working with sound data it's really hard on the CPU side, because it needs to load all this audio from disk, and if we need to augment it, it's even harder. So if you have a strong CPU I recommend using it with multiprocessing, but if something goes wrong and multiprocessing can't be used, it will fall back to a thread pool executor. It simply iterates over all our data.

So what do we do here? We'll skip over the validation split for now. I have the configuration here that you can dig into and check: how many epochs I want to train, what my batch size will be, what the learning rate will be, how many warm-up epochs, whether I want to use mixed precision, and so on. There are many different settings, and this config is saved along with the trained model. Let's move on. As you might see, I'm going to use these data preprocessors. This one is the audio reader, which means it will read audio at a sample rate of 16,000 and create a specific Audio object from my mltu package.
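To make the metadata and vocabulary step concrete, here is a rough sketch of it in plain Python, without the mltu download helper and DataProvider that the tutorial actually uses. The dataset folder name follows the standard LJSpeech-1.1 release layout and is an assumption about where the files end up on disk.

```python
import os

# Rough sketch of the metadata/vocabulary step described above.
# Layout follows the standard LJSpeech-1.1 release, where metadata.csv uses "|"
# as the separator: file id | raw transcription | normalized transcription.
dataset_path = os.path.join("Datasets", "LJSpeech-1.1")   # assumed download location
metadata_path = os.path.join(dataset_path, "metadata.csv")

dataset, vocab = [], set()
with open(metadata_path, "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split("|")
        label = parts[-1].lower()                          # lowercase to shrink the vocabulary
        wav_path = os.path.join(dataset_path, "wavs", parts[0] + ".wav")
        dataset.append([wav_path, label])
        vocab.update(label)

vocab = sorted(vocab)
print(len(dataset), "samples, vocabulary size:", len(vocab))
```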
Then we use a label indexer that maps our characters to an integer representation, and then we use these batch postprocessors. When we train the model we batch the audio, for example eight clips in one batch, and all the audio in a batch has to be the same length. That means we use audio padding, and I recommend doing it per batch, so each clip is padded only to the maximum length present in that batch. We apply the same to the labels: if we have sentences with 20 characters, we don't need to pad them to 100 characters. That would make the model a little harder to train, so for efficiency it's better to pad only to the maximum length that exists in the batch. That's the purpose of these two, and this, as I mentioned, controls whether we want to use multiprocessing or not. If we are training on a weaker CPU it might not handle it very well; for example on my computer I can't use multiprocessing, at least on Windows, but if I run this on Linux it works. So it depends on what operating system you have, what CPU you have, and so on. This data provider is a very good worker when we deal with audio, because it loads everything for the trainable model in the background. It's pretty amazing how it works.

Then of course we split our data into training and validation sets, and we save those splits along with the model. The validation set is used simply for validating the trained model. If we want, we can use augmentation, but I am not using it here because it eats a lot of CPU power and makes training harder; it takes much more time to train on augmented audio.

The crucial part is right here. We need to create the Wav2Vec2 model from the PyTorch Transformers library. We use Wav2Vec2ForCTC and load it with from_pretrained. As you see, I am using a pretrained name, which is the specific model name from Hugging Face, and we simply download this model and load it. What I'm doing differently is with the hidden states: I'm changing the vocabulary size and ignoring mismatched sizes. This means I'm replacing the head of the model and putting on my own head, which will be trained for classification. In the forward pass we input our audio batches and so on, we take the logits from the outputs, pass them through a log_softmax function, and the whole output goes into the CTC loss. Of course you can implement this differently, but I found this to be a pretty good implementation, so I follow it. So here we create this custom PyTorch model, and I put it on the GPU, because I absolutely do not recommend trying to train it on a CPU; I believe you would need to wait months. And if you want to train it on a pretty large dataset, you need a very strong GPU, or multiple GPUs, and a pretty strong CPU as well. Remember that training transformers on audio takes a lot of time, so it's up to you.

Next we define a warm-up cosine decay scheduler, and I'll explain what it does, but it's pretty much necessary when training transformers. It means we start from a really low learning rate and increase it every epoch during warm-up so the model trains faster and better. Then I use the TensorBoard callback, which is pretty self-explanatory.
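Here is a minimal sketch of what such a wrapper can look like. The class name, the checkpoint name and the vocabulary size are illustrative assumptions rather than the exact code from the repository; the point is simply to replace the CTC head via vocab_size and ignore_mismatched_sizes=True and to return log-probabilities that a CTC loss can consume.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2ForCTC

class CustomWav2Vec2Model(nn.Module):
    """Thin wrapper that swaps in a CTC head sized for our own character vocabulary."""

    def __init__(self, pretrained_name: str, vocab_size: int):
        super().__init__()
        # ignore_mismatched_sizes lets us drop the original head and get a fresh,
        # randomly initialized one matching our (lowercase) vocabulary.
        self.model = Wav2Vec2ForCTC.from_pretrained(
            pretrained_name,
            vocab_size=vocab_size,
            ignore_mismatched_sizes=True,
        )

    def forward(self, inputs):
        output = self.model(inputs, attention_mask=None)
        # Log-probabilities over characters; the trainer feeds these to the CTC loss.
        return F.log_softmax(output.logits, dim=-1)

# Hypothetical instantiation -- vocab_size must cover the characters from the
# preprocessing step plus the CTC blank token.
vocab_size = 30
model = CustomWav2Vec2Model("facebook/wav2vec2-base-960h", vocab_size=vocab_size)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```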
We use early stopping and track the validation character error rate; of course we could track the word error rate instead, but I chose the character error rate. It's up to you. Next I use a model checkpoint, which means I save my model weights as model.pt, keeping the best weights according to the validation character error rate. And at the end I want to save my best model in ONNX format, because with PyTorch alone it's pretty hard to load the trained model back on another platform. For example, if I want to use it on a Raspberry Pi, it's practically impossible to load it back without knowing the architecture or without installing PyTorch. So I convert the model to ONNX, and then it's very simple to run it on any other device I want.

So that's it, and here you can see that I have a custom model object that handles all the training stuff: the callbacks, the metrics and so on. I pass it a model, as you can see, my custom Wav2Vec2 model, I use the CTC loss, which is a custom loss that takes our vocabulary for this dataset, and then I use the AdamW optimizer. You could use plain Adam, both work well, but it has been shown that AdamW works slightly better, so why not use it. Then I track the character error rate and word error rate, which is the usual thing when we work with speech recognition, whether at the word or character level. Then there is mixed precision, which you can find in the configs; if you have a good GPU it will dramatically speed up training, at least two times, and it will take less GPU memory, so we can use larger batches. It's up to you, but I set mixed precision to true. Then I save the train and validation CSV files from my split, so later I can test how the model I trained and converted to ONNX performs on the same validation data, because on training data it will of course do well.

That's it: we call fit and wait for it to finish. I don't want to train it right now, because if I start training, my CPU jumps to 100% and I might get a blue screen, or the video recording might start lagging insanely. But to prove that it works, I can start it and kill it once it begins training. So let's run it and see how it loads everything here. Okay, as you can see it takes 1474 batches, and that's great. As you can see it can't handle multiprocessing, so it switches to multithreading; that's what I expected, because I know it doesn't work like that on my machine. After the first batch it runs fine. Okay, and as you can see it starts training and shows us the character error rate and word error rate, and we simply need to wait until it finishes. But right now, because I'm recording, it's much slower, and I believe it might start lagging on my side, so I don't want it to keep training while I'm recording. Great, I'll kill it, because I've already trained this model and tested everything. Let's kill this one. Great. And I believe you're interested in how it trained and how I can prove it. I have a pre-trained model: if I go to my Models folder, I have wav2vec2, and here is my trained model that already has the .pt weights and model.onnx.
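For reference, exporting such a wrapper to ONNX can look roughly like the sketch below. This is not the Model2onnx callback itself: the dummy input length, output path, opset version and axis names are assumptions, and `model` refers to the wrapper sketched earlier.

```python
import torch

# `model` is the CustomWav2Vec2Model wrapper sketched earlier; the output path,
# dummy input length, opset version and axis names below are assumptions.
model.eval().cpu()
dummy_input = torch.randn(1, 16000)  # one second of raw 16 kHz audio

torch.onnx.export(
    model,
    dummy_input,
    "Models/wav2vec2/model.onnx",
    opset_version=14,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={  # allow variable batch size and audio length at inference time
        "input": {0: "batch_size", 1: "sequence_length"},
        "output": {0: "batch_size", 1: "sequence_length"},
    },
)
```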
That's the trained model, but before going to the test part, let's look at TensorBoard to see how it trained. I have a TensorBoard run for exactly what I showed you; let's maximize it. You can see that it started training, there were 10 epochs of warm-up, and it wasn't improving at all. That's the worst part with this model: while it's training through those epochs and the loss isn't decreasing, we aren't sure whether it's learning or not. But look at this: after 12 epochs it dropped dramatically somewhere here, and you can see the training character error rate is 0.21 while validation is only 0.01, so that's a huge difference, and if we scroll to the end you can see that it decreased dramatically. Let's look at the word error rate, because that relates to whole words: at the end our training word error rate was only 2% and validation was 1.8%. This means it works on our audio data, and our loss curve looks exactly the same.

Now let's look at the learning rate. That's exactly what I was talking about with the warm-up: you can see that it increases the learning rate up to the value we define. So 1.1e-5 is the learning rate it should reach during warm-up, and it reached it, then training continued, the word error rate kept decreasing, and everything was just great. The loss looks pretty much the same, so we don't need to look at it; it was high, high, high and then dropped dramatically. Great, it works.

Now you might say: well, prove that it works. Okay, why not? If I go to my Models folder, I have a test.py script that I created, and here are my ONNX inference model, the CTC decoder and the character and word error rate calculation utilities. What I do here: I point it at my trained model folder, you can see the name here, and I load the model.onnx with this object. How do we run a prediction? Here is our audio path and the true label, and maybe we are interested in seeing what the label is. Great, not a problem, let's print it out here, print(label), and it will print it in the terminal for us. Now I read the validation CSV file that was saved along with the model, and we'll iterate through it to see whether it works, and we'll see its true labels. We are not that interested in the predictions themselves, because the error rate is so low that we wouldn't notice the differences without putting real time into it. So if I run this over the whole validation set, we will see the character error rate and word error rate against the true labels. Let's simply run it and you'll see that it's pretty nice and it works. What we do is use librosa to load the audio, we get the raw audio data, we put it into the prediction model, which expands the dimensions, runs the inference and uses a CTC decoder to decode the text. That's it. You can see here the output of my model: there is the character error rate, the word error rate, the results, and so on. Let's stop this now, I'm not that interested, and let's move on and see the true labels. Oh, I see, it's not perfect: the true labels contain capitalized letters, because all the sentences start with a capital. So that's my problem.
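For completeness, here is a rough sketch of ONNX inference with greedy CTC decoding, written directly against onnxruntime rather than the mltu inference object used in test.py. The character list, the position of the blank token and the file paths are assumptions and must match whatever the model was actually trained with.

```python
import librosa
import numpy as np
import onnxruntime as ort

# Assumed character set, with the CTC blank treated as the last index.
# This must match the vocabulary the model was trained with.
vocab = list("abcdefghijklmnopqrstuvwxyz' ")
blank_id = len(vocab)

session = ort.InferenceSession("Models/wav2vec2/model.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Placeholder audio path; in test.py the paths come from the saved val.csv split.
audio, _ = librosa.load("Datasets/LJSpeech-1.1/wavs/LJ001-0001.wav", sr=16000)
logits = session.run(None, {input_name: audio[np.newaxis, :].astype(np.float32)})[0]

# Greedy CTC decode: take the argmax per frame, collapse repeats, drop blanks.
ids = np.argmax(logits[0], axis=-1)
decoded, prev = [], -1
for i in ids:
    if i != prev and i != blank_id:
        decoded.append(vocab[i])
    prev = i
print("".join(decoded))
```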
Coming back to the capitalization issue: I would need to adjust the transcript comparison to prove that it matches exactly, and basically that's my mistake, I trained the model to predict speech without capitalized letters. So yeah, that's on me, I need to look at it more closely. Anyway, I'll fix this training code for you; maybe I'll train another model, but I'm not recording another tutorial right now because I'm not interested in doing so. Of course, I'll publish the model and everything else in my GitHub repository; you can find the link in the video description below, and everything is there. But this is the idea of how simple it is to fine-tune this Wav2Vec2 model, and you can use whatever audio you want. So basically it works, and you can try to record your own audio and run the prediction, but it will be pretty hard and not that accurate, because the dataset I used for training is very specific, with the accent of people reading books, so it might not work that well for you. Still, it works, and you can use your own data to fine-tune this model, and you will see that it works pretty nicely. But I don't want to invest a lot of time into explaining all of that here.

So basically it works, and if you liked this video please don't hesitate to smash the like button and subscribe, and if you have any questions drop a comment below and we'll see how to solve them. Now let's go back to one more thing. If you remember, I said I do not recommend training this with TensorFlow, and there is a big problem there: we can load the same Transformers model that we use in PyTorch, but the difference is that training takes around five times longer. I don't know why; it uses the same CPU and the same GPU as PyTorch, but it trains much slower, and I can't explain it. So I don't see why we would want to train this model on TensorFlow if we can train it faster in PyTorch, especially since we're going to deploy it in ONNX format anyway. It's up to you, you can try to train this on TensorFlow, but I do not recommend it; I recommend using PyTorch only.

So that's it for this introduction to Wav2Vec2. You can find the code on my GitHub, the link is in the description, and you can try to train your own model, or you can run my trained model, which I'll also link in the description. That's it. We'll see you in the next video tutorial. Thank you again for watching, and we'll see you next time. Bye.