Speaker 1: Welcome to this Whisper AI fine-tuning tutorial. To implement Whisper fine-tuning we have an action plan. Step number one: load the dataset. Here we will load the dataset according to Whisper's requirements. Step number two: prepare the mandatory components for fine-tuning, the feature extractor, the tokenizer, and the data. What is the feature extractor? Whisper ships with a WhisperFeatureExtractor which pre-processes the raw audio inputs in two steps: it pads and truncates the audio input to 30 seconds in length, and it converts the audio input to the log-Mel spectrogram input features that Whisper expects. What is the tokenizer? The Whisper model performs a sequence-to-sequence mapping and outputs a sequence of token IDs. Whisper provides a WhisperTokenizer which post-processes the model output into a text string, mapping each of these token IDs to its corresponding text. One thing to consider here: there is a specific tokenizer configuration for each language. So first we will prepare the feature extractor and the tokenizer, and then we can combine them with the WhisperProcessor. This is step number three. And what does the WhisperProcessor do? It simplifies using the WhisperFeatureExtractor and WhisperTokenizer together, so it can be applied to the audio inputs and model predictions as required. After that it is a good time to prepare our data. This is step number four. Pre-processing the data to Whisper's requirements has four main steps: load the audio data, resample it to a 16,000 Hz sampling rate, compute the log-Mel spectrogram, and encode the transcriptions to label IDs using the Whisper tokenizer. And the most exciting step, training and evaluation, is step number five. In step 5.1 we will define one important object, a data collator. It batches the input features into PyTorch tensors using the Whisper feature extractor, while the Whisper tokenizer pads the labels and replaces padding tokens with -100 so they are correctly ignored by the loss. And one last thing: if a BOS token was appended previously, we cut it, as it will be appended again later anyway. After that, in step 5.2, we will define the evaluation metric, which is WER, word error rate; we will load the WER metric from Hugging Face. Then in step 5.3 we will load a pre-trained checkpoint, the original Whisper model, the small version in this tutorial. Then in step 5.4 we will define the training configuration; I will provide some documentation on that. One thing to mention: I strongly recommend using a GPU in order to avoid errors like the one you can see right now on the screen. And finally, we will start training.
Speaker 2: In other words, fine-tuning it to a different language.
Speaker 1: I recommend installing PyTorch for your system by following the guidelines on the official PyTorch website, and then jumping into the hands-on session. We are now hands-on with Whisper AI. First of all, I suggest setting the runtime engine to GPU. If you are using Google Colab, go to Notebook settings and set Hardware accelerator to GPU. For this you should use Colab Pro, which costs approximately $10 USD per month. To be sure that we will be using the GPU for this tutorial, let's restart the runtime at the beginning. So, before doing all the steps, let's check that our GPU is active in the session. Perfect, a Tesla T4 is running, so we can start with the steps. One last thing before that is to install the ffmpeg package on the machine; these three lines do it, and I strongly recommend doing it on your side as well. Ffmpeg is a complete cross-platform solution to record, convert, and stream audio and video, which is very beneficial when used together with the Whisper model. OK, it's installed now, so we can delete the cell; we don't need to see it anymore. Next, we need to install the Python packages we will use: datasets to get the required dataset for Whisper, the official Transformers repository from Hugging Face, librosa to extract audio information, evaluate to define our evaluation metric (word error rate), jiwer for similarity measurements for automatic speech recognition evaluation, and Gradio to build a web application if you want.
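(A minimal sketch of these install cells, assuming a Google Colab notebook; exact package versions are not shown in the video:)

    # Install ffmpeg on the Colab machine (system package).
    !sudo apt update
    !sudo apt install -y ffmpeg

    # Install the Python packages used in this tutorial.
    !pip install datasets transformers librosa evaluate jiwer gradio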
Speaker 3: All dependencies are being installed into the Python virtual environment. And it is finished, so we can close the cell.
Speaker 2: So now we can import the main dependencies that we will use in the tutorial: dataclasses, typing, and torch.
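(A minimal sketch of that import cell:)

    import torch
    from dataclasses import dataclass
    from typing import Any, Dict, List, Union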
Speaker 3: Fine.
Speaker 2: Now step number 0.
Speaker 1: Logging in to Hugging Face. Here are the dependencies needed for this step, and here is my token. You should go to the Hugging Face portal to generate the token needed for your login; you can create it by clicking on New token right here. Back to the code. Once we have our Hugging Face token we can log in to the portal. To do it in a readable manner I prepared a simple function where you pass your Hugging Face token as a string parameter. Perfect, that works, and now we are connected to Hugging Face. Step number 1. As planned, I now come to download a language dataset for the Whisper model; that means I need to download the dataset for the language I want to fine-tune Whisper to. As you can see, we will have a train and a test set for this. You can choose from many available languages if you are using the Mozilla Foundation data source. I want to fine-tune Whisper to Lithuanian, so I choose lt as the short name for that. Also, you can see that I am using my Hugging Face token to download this data. Yep, I set it to lt to get the dataset for the Lithuanian language. I also remove the columns which are not required by the model, such as accent, age, gender, and the other ones you see in this list. OK, let's get the dataset right now. Here you go, the download is finished, and here is the output of my dataset metadata: I have the columns audio and sentence, with one set for training and one for testing. All that I need so far. Step number 2. Prepare the feature extractor, tokenizer, and data. In this step we will use WhisperFeatureExtractor and WhisperTokenizer. We can define our feature extractor from WhisperFeatureExtractor using the from_pretrained method, and in the same way we define the tokenizer using WhisperTokenizer. Here you can see the parameter specifying the language I want to fine-tune the model to, in my case Lithuanian, and the task, which is transcribe. We can also specify which Whisper model version we want to use. My GPU is not super powerful, so I will use the small version. You can change it to large for the feature extractor and the tokenizer, or set it to medium, whatever you want, but I will keep the small version, and the language is Lithuanian. OK, our feature extractor and tokenizer are defined and they are downloading right now.
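(A sketch of steps 0 through 2 following the flow described above; the Common Voice version, the login helper name, and the exact column list are assumptions, since the video does not show them verbatim:)

    from huggingface_hub import login
    from datasets import load_dataset, DatasetDict
    from transformers import WhisperFeatureExtractor, WhisperTokenizer

    # Step 0: log in to Hugging Face with your own token.
    def login_hugging_face(token: str) -> None:
        """Log in to the Hugging Face portal with a given token."""
        login(token=token)

    login_hugging_face("hf_...")  # replace with your token

    # Step 1: download the Lithuanian ("lt") Common Voice dataset.
    common_voice = DatasetDict()
    common_voice["train"] = load_dataset(
        "mozilla-foundation/common_voice_11_0", "lt",
        split="train+validation", use_auth_token=True,
    )
    common_voice["test"] = load_dataset(
        "mozilla-foundation/common_voice_11_0", "lt",
        split="test", use_auth_token=True,
    )

    # Drop the columns the model does not need.
    common_voice = common_voice.remove_columns(
        ["accent", "age", "client_id", "down_votes", "gender",
         "locale", "path", "segment", "up_votes"]
    )

    # Step 2: feature extractor and tokenizer for the small checkpoint.
    feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
    tokenizer = WhisperTokenizer.from_pretrained(
        "openai/whisper-small", language="Lithuanian", task="transcribe"
    )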
Speaker 3: It can take some time to complete. Step number 3.
Speaker 1: Combine the elements with the WhisperProcessor. To complete this step we will use the WhisperProcessor, which combines our feature extractor and tokenizer. Here we need to specify the Whisper model version, the language we want to tune our model to, and the task the model should do; the task is transcribe. Very simple. Fine, this cell completed. Let's just take a quick look at what is inside the processor; if you scroll down a little bit you can find some interesting metadata about our dataset.
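(A minimal sketch of step 3, matching the parameters used above:)

    from transformers import WhisperProcessor

    # Wrap the feature extractor and tokenizer in a single object.
    processor = WhisperProcessor.from_pretrained(
        "openai/whisper-small", language="Lithuanian", task="transcribe"
    )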
Speaker 2: OK, that is just what I wanted you to know, and we can go on. Step number 4. Prepare the data.
Speaker 1: This is a very important and slightly more complex step than the ones before. To better understand what is going on, we will print a random audio example from our dataset. Next, very important: we will change the sampling rate from 48,000 to 16,000 Hz, as required by the original Whisper model. And here we can add an extra print statement to check that this downsampling works. Below that we use a map function which applies the preprocessing steps to the full dataset. For this purpose I prepared a function named prepare_dataset. It has three small steps: first, resample the audio batch to 16 kHz; second, compute the log-Mel input spectrogram from the audio sample; and last, encode the target text to label IDs, which the tokenizer will do. Finally, this function returns a preprocessed batch of audio data matching Whisper's requirements. So let's run everything right now. In this print statement you can see the effect of the downsampling: first we had 48 kHz, and after downsampling it became 16 kHz. Let's wait until the full preprocessing pipeline completes; it can take 10 or 15 minutes depending on your machine. Please note that you can apply multiprocessing to accelerate this step. In my Colab with a GPU it can take up to 15 minutes.
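(A sketch of step 4, assuming the feature_extractor and tokenizer defined earlier; the num_proc value is an illustrative assumption:)

    from datasets import Audio

    # Resample from 48 kHz to the 16 kHz rate Whisper expects.
    common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

    def prepare_dataset(batch):
        # Load (and resample) the audio data to 16 kHz.
        audio = batch["audio"]
        # Compute log-Mel input features from the audio array.
        batch["input_features"] = feature_extractor(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        # Encode the target text to label IDs.
        batch["labels"] = tokenizer(batch["sentence"]).input_ids
        return batch

    # Apply the preprocessing to the full dataset; num_proc adds multiprocessing.
    common_voice = common_voice.map(
        prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2
    )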
Speaker 2: The second batch.
Speaker 3: Let's wait a few more seconds.
Speaker 2: Perfect, the prepare_dataset step is finally completed. Step number 5. Training and evaluation.
Speaker 1: This is a critical step in the tutorial, and it consists of a few smaller steps. Step number 5.1. Initiate the data collator. To initiate the data collator I wrote this dataclass, which will do everything: it is a data collator for speech sequence-to-sequence models with padding. So what will it do? First of all, it splits inputs and labels, since they have different lengths and need different padding methods. It creates a PyTorch tensor from the input features. Then it takes the tokenized label sequences and pads the labels to the max length; this also returns a PyTorch tensor. Then, per Whisper's requirements, it replaces padding with -100 so that padding is ignored by the loss. And finally, it checks the beginning of the sequence and, if a BOS token was appended during tokenization, cuts it, since it will be appended again later. This function returns the given batch, but with preprocessed labels. You can take a look at what is inside our data collator: you can see metadata, for example the sampling rate, as well as tokens and other parameters, which are self-explanatory.
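(A sketch of this collator, closely following the description above; it assumes the processor defined in step 3:)

    import torch
    from dataclasses import dataclass
    from typing import Any, Dict, List, Union

    @dataclass
    class DataCollatorSpeechSeq2SeqWithPadding:
        processor: Any

        def __call__(
            self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
        ) -> Dict[str, torch.Tensor]:
            # Split inputs and labels: they have different lengths and
            # need different padding methods.
            input_features = [{"input_features": f["input_features"]} for f in features]
            batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

            # Pad the tokenized label sequences to the max length in the batch.
            label_features = [{"input_ids": f["labels"]} for f in features]
            labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

            # Replace padding with -100 so it is ignored by the loss.
            labels = labels_batch["input_ids"].masked_fill(
                labels_batch.attention_mask.ne(1), -100
            )

            # If a BOS token was appended during tokenization, cut it here;
            # it is appended again later anyway.
            if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
                labels = labels[:, 1:]

            batch["labels"] = labels
            return batch

    data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)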
Speaker 2: Perfect. Everything we need. Step number 5.2.
Speaker 1: Define the evaluation metric. Whisper evaluation uses the WER metric, which means word error rate. We can load this metric into a variable right here from the evaluate package. The metric is prepared now, and we can check what exactly the metric is about. As you can see, it returns a float value, which is the word error rate, and you can see some examples of how to use this metric just below. That's it, let's go on. Step number 5.3. Load a pre-trained checkpoint. Here we load a pre-trained Whisper model into the variable model. Hugging Face has special functionality for this, WhisperForConditionalGeneration, so we will use it. For this tutorial let's load the small version of the Whisper model, which we will fine-tune to the Lithuanian language. So let's run this cell and download the pre-trained checkpoint to our local storage.
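(A minimal sketch of steps 5.2 and 5.3:)

    import evaluate
    from transformers import WhisperForConditionalGeneration

    # Word error rate: the lower, the better.
    metric = evaluate.load("wer")

    # Load the small pre-trained Whisper checkpoint to fine-tune from.
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")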
Speaker 2: The small version takes approximately 1 gigabyte, not too much. And it is done.
Speaker 1: Let's quickly check what the model variable has inside. Here you can see the full neural network architecture with its parameters, which is super interesting; you can check the layers, nodes, and other details, such as the activation functions used. Brilliant. So we will use this for fine-tuning. We just apply a few argument overrides which are directly related to tokens and the Whisper decoder; you can read more about that at the following link.
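(The overrides shown at this point are most likely the standard ones from the Hugging Face Whisper fine-tuning guide; a sketch:)

    # No tokens are forced as decoder outputs, and no tokens are
    # suppressed during generation.
    model.config.forced_decoder_ids = None
    model.config.suppress_tokens = []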
Speaker 2: So let's do it quickly before we train our model for fine-tuning. Step number 5.4.
Speaker 1: Define the training configuration. You can check all the arguments you can pass as training arguments via this link; there you can see the full list of hyperparameters with their default values, which you can adjust for your model training. Here is the combination that I will use for fine-tuning the Whisper model. Don't forget to specify an output directory that corresponds to the model version you are using; in my case, again, it is the small Whisper version plus the language indicator, in my case Lithuanian, which means lt. Here I set push_to_hub equal to false, but if you want to push the model to the Hugging Face Hub after fine-tuning you can change this value to true. So run the cell now. And now we are ready to initiate a trainer, where we put everything we did before in one place: the training arguments from here, the model to train, the training and test sets, the data collator, the compute-metrics function, and the tokenizer. Oh, I forgot to add the compute_metrics function, which will calculate the evaluation metric during the training process. So here it is; it also does some manipulation with padding, and finally it calculates the numerical value of the metric for a given sample, right here. So no error anymore, and I think we are ready to start training, in other words, to fine-tune the Whisper model to the Lithuanian language. Step number 5.5. Training. Training will take approximately 5 to 10 hours depending on your GPU. The training starts when you trigger this line of code, with some print statements indicating its start and end. In advance, thank you for watching this hands-on tutorial; subscribe to this channel and I promise more high-quality content soon. So we can start fine-tuning our model right now. Training has started. I think the steps we made during this tutorial are now clear to you. You can find the full code in the GitHub repo; the link is in the description of this video. I see that training can take almost 6 hours to complete, so for now I say: see you in the next video. Good job learning.
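(A sketch of steps 5.4 and 5.5; the hyperparameter values here are plausible ones taken from the standard Whisper fine-tuning recipe, not necessarily the exact values used in the video:)

    from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

    training_args = Seq2SeqTrainingArguments(
        output_dir="./whisper-small-lt",  # model version + language indicator
        per_device_train_batch_size=16,
        gradient_accumulation_steps=1,
        learning_rate=1e-5,
        warmup_steps=500,
        max_steps=4000,
        gradient_checkpointing=True,
        fp16=True,  # requires a GPU
        evaluation_strategy="steps",
        per_device_eval_batch_size=8,
        predict_with_generate=True,
        generation_max_length=225,
        save_steps=1000,
        eval_steps=1000,
        logging_steps=25,
        load_best_model_at_end=True,
        metric_for_best_model="wer",
        greater_is_better=False,
        push_to_hub=False,  # set to True to push the model to the Hub
    )

    def compute_metrics(pred):
        pred_ids = pred.predictions
        label_ids = pred.label_ids
        # Replace -100 with the pad token id before decoding.
        label_ids[label_ids == -100] = tokenizer.pad_token_id
        pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
        label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
        # WER as a percentage: the numerical value of the metric.
        wer = 100 * metric.compute(predictions=pred_str, references=label_str)
        return {"wer": wer}

    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=common_voice["train"],
        eval_dataset=common_voice["test"],
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        tokenizer=processor.feature_extractor,
    )

    trainer.train()  # kicks off the multi-hour fine-tuning run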