Speaker 1: I'm going to explain how speech-to-text models work, and then I'll show you how to fine-tune a speech-to-text model like OpenAI's Whisper, so that you can add new words it's not familiar with, or even familiarize it with new accents or languages that are lesser known. So this video is all about fine-tuning Whisper for speech transcription. We'll start off, I'll tell you what Whisper is. It's a speech-to-text model by OpenAI. I want to give you a very quick demo with a tool that I think you'll actually find useful if you want to do transcriptions yourself, for example, of YouTube videos. Then I'll talk about use cases for why you might want to fine-tune these models further. Maybe you want to add new words to the vocabulary, as I'll show you in this video, or maybe you want to fine-tune on a difficult accent or a difficult language. I'll have a little theory section explaining how transcription models work. Actually, they borrow a lot, or they're very similar to the language models I talk about in other videos that are of the GPT type. I'll then talk very quickly about how we prepare for fine-tuning. In other words, how do we prepare a dataset in order to fine-tune a speech-to-text model? I'll give you an overview of a repo that you can purchase access to that will give you all the scripts you need. Now, there are many free scripts available online as well, and I'm going to put those below too in the description. Then I'll go through a full worked example all the way from creating sound snippets to fine-tune on, through training a Whisper small model, and then evaluating the performance after that fine-tuning. And I'll finish off with a few pro tips. This is a speech-to-text model by OpenAI, and it's called Whisper. It's available under an Apache 2 license, which means it can be freely used for commercial purposes or research purposes. Now, we're going to focus on the Whisper small model. There are a few Whisper models available. I'll just show you the sizes here, going from a tiny model with 39 million parameters all the way up to a large model with 1.5 billion parameters. So already, if you're familiar with language models, these are a lot smaller. The smallest Llama 2 model from Facebook is 7 billion parameters. And you can see here the largest speech-to-text model is 1.5 billion. So already you can see that, as a task, in terms of complexity, converting from speech to text is quite a bit easier than trying to predict the next token in a sequence as we do with GPTs. Now, something that I've found is the small model actually performs very well. So in a lot of cases, you can just do transcription using a model that has a quarter of a billion parameters and get some very good performance. Let's just take a very quick look at the files in the repo. We can see here the model itself. It should be the largest file. So let's look here for the largest file. Here it is, model.safetensors. It's just under a gigabyte in size, 967 megabytes. And you can see, if we look at the configuration file, it looks somewhat like a language model. It has still got attention heads. It's got layers, dropout. So it has a lot of the features that a language model has. In fact, in many ways, it's a language model with kind of an audio portion that's connected onto it to feed in audio information that's then used to predict the tokens. And we'll go into that a little more later on in the video. Now, just to get you into this, I want to straight away show you an example.
This is a notebook that I'll link below and you can check it out yourself for free. I actually use this notebook if I want to transcribe YouTube videos, which is what I'm going to show you today. Here, I'm going to install Whisper, which is an OpenAI library. And I'm going to install this quick package that allows me to grab audio from YouTube. And over here, I have a YouTube video. This is a video I made some time ago. It's about the fine-tuning repository. It's a short video, so suitable for a quick demo here. And you can see that you can turn on captions in YouTube. These are the auto captions that are generated on a word-by-word basis. And one drawback I find is that it doesn't break it into sentences. So actually, it's not the easiest to read these automatic YouTube captions. And so I typically use Whisper to generate some captions instead. And that's what we're going to do right here. So I'm going to just copy this short code of the YouTube video and head back over to the notebook. And I'll paste in here the address, or that short code rather. And that will be sufficient now for me to download the audio for that YouTube video and then run Whisper using the small model, using English as the language. So it should extract fairly quickly because it's an audio file. Audio files tend to be fairly small in size. It's just going to save it locally here to my folder structure. And then immediately, Whisper should start to transcribe. And we're going to generate a transcript file, which is called VTT. It's a typical format that combines timestamps with text. And here you go. So we're already generating the transcript. And you can see here, it's matching up with the content of this advanced fine-tuning video. And it's pretty quick in terms of speed. In just a few more seconds, we're going to have the full transcript complete. And notice as well that it completes the sentences nicely, adding in commas. So I find that this is much easier to read on a YouTube video than the automatic captions. Now, when that's done, it will have created a variety of files here, including a text file. If you want to take a look at the text of that video, that's also helpful if you want to use a transcript for machine learning. And also there's the VTT file, which is the one that's of use. You can actually upload it to YouTube. This is a caption format. And you can see here that it gives timestamps in addition to the text that's provided. So this is really a beautiful model. It was released, I think, in 2022, so that's a little while ago now, but the quality I find is quite good. Now, something you'll notice straight away in this transcript is, of course, the model is not going to be familiar with certain words. For example, the spelling of Trelis, at least for my website, is with one L. And here it's got two Ls, but of course it doesn't know that because that wasn't in the training data. So it might be nice if I fine-tuned the model on some audio with Trelis and then gave a transcript with just one L. That would be a way to fine-tune it. Also, it's not going to be able to get words around certain models like Llama 70B here. It just doesn't really appreciate that Llama 70B exists as a model. And so it would be nice to fine-tune this model so that it knows some of the more modern terms. And that's actually, here you can see, instead of safe tensors, it has safe hounsers, which is the best it could do with my Irish accent. It does get PyTorch though.
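To make that demo concrete, here is a minimal sketch of the same flow in code. It assumes yt-dlp for grabbing the audio (the notebook may use a different package) and the openai-whisper package for the whisper command line tool; the video ID is a placeholder.

```python
# A rough sketch of the demo flow, assuming yt-dlp and openai-whisper are installed:
#   pip install -U yt-dlp openai-whisper
import subprocess

video_id = "XXXXXXXXXXX"  # placeholder for the YouTube short code

# Download just the audio track of the video
subprocess.run(
    ["yt-dlp", "-f", "bestaudio", "-o", "audio.m4a",
     f"https://www.youtube.com/watch?v={video_id}"],
    check=True,
)

# Transcribe with the Whisper small model in English; this writes audio.vtt,
# audio.txt, audio.srt and other transcript files next to the audio.
subprocess.run(
    ["whisper", "audio.m4a", "--model", "small", "--language", "en"],
    check=True,
)
```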
What I'm going to show you later on is an example of me fine-tuning on a series of terms, actually language model terms, like Mixtral 8x7B or Mistral or Llama 2, all these terms that would not be familiar to the language model. And we'll see how, by doing a simple fine-tuning, we're able to accustom the model to a wider set of terms, let's say, so that it can generate more accurate captions for our application. Now that you can see a little bit what Whisper does, let's move on and talk about some of the use cases before describing the technical approach to getting from audio to text, and then moving on to the full example, going all the way from some audio for fine-tuning through to a fine-tuned model. There are a few use cases, I'm sure there are more, but the ones that come to mind for me are first adding new vocabulary that the model isn't familiar with, so new words or phrases. Also, you can fine-tune to improve on an accent. Maybe the Kildare accent is difficult to understand for certain models. I actually think most models are fine with my accent. But if there is difficulty, you can always fine-tune on some pieces of audio of that accent. Likewise, if you have some languages that are not commonly used, for example, the Irish language, it doesn't seem like there are many models for transcribing Irish. So there's an opportunity there: I could fine-tune a model so that it's able to transcribe better in Irish. Now, let's move on to how transcription works. And what I'll talk about here is how we convert sound from some kind of audio file into the text, the captions that I just showed in the demo. And there are kind of four steps that I'll break this down into. And first off, sound is a vibration. So if you think of maybe the membrane of a drum and it's vibrating as it hears a sound, you can imagine making a measurement of where the surface of that drum is every millisecond. And you measure how far it is from the middle position. So if you do that, or perhaps if you think of the membrane moving like this in response to the sound, you can record that displacement over time at a certain sampling frequency. And when you do that, you see this typical diagram of sound, which is this kind of a streaky line that's going across the X-axis. For example, maybe I can show you a diagram from this notebook here. What I'm trying to say is, if you have the surface of a drum or a membrane that's vibrating due to sound, and you just record its position over time at tiny, tiny increments, this is the kind of graph that you get. So here's a very large displacement, smaller displacements, smaller displacements. So this is a measure of the amplitude of where that kind of membrane is moving. And the more often you sample it, the sampling frequency, the better you can represent the movement of that drum. And typically for Whisper models, the sampling is at 16,000 Hertz. So 16,000 times per second, there's going to be a recording of that amplitude. So this is the first step to think about in transcription. We have a sound and it's represented by a graph, which has an amplitude. Now, the next step is we want to convert it from this graph here that I just showed. We want to convert it actually into frequencies, because underlying what looks like this mess here, there are actually different frequencies that are all overlapping. And the best graph I have to show this is right here on this Fourier transform notebook. So here you can see that same graph. It's just an amplitude graph that I was showing here.
So we have an amplitude graph here and underlying all these amplitudes, we have various frequencies. So we might have low frequencies and higher frequencies. And when you crash all of those frequencies together at the right amplitude, you end up getting back to this kind of a messy graph here. But the key insight is that this messy graph of our voice can be broken into a series of frequencies, a series of distinct frequencies. And the technique used to do that is called the Fourier transform. So doing the Fourier transform, you can convert into individual frequencies. And let me see if I can show here. So instead of just having a messy graph like this, you can convert into a graph like this where the X-axis is frequency. And you will see that actually the human voice is made up of more distinct frequencies. Now, this isn't a human voice. It's a lot cleaner. A human voice has got many, many more frequencies blended in. But this is an example of how you can have three distinct frequencies. And by doing a Fourier transform on the raw sound, you can pull them out to identify what those frequencies are. So all of this is well-known and we haven't gotten to language models or even to neural networks yet. We've got our sound that's been recorded as an amplitude. It's converted then into frequencies. Now, there's one more twist to this, which is that the human ear doesn't hear all frequencies the same. Furthermore, the human ear doesn't hear loudness the same, at least not in a linear way. The human ear is more logarithmic. So as the frequencies increase, it's actually harder for us to distinguish the same difference between two frequencies. We might hear a difference of 10 Hertz between 100 and 110 Hertz, but it would be very hard to hear a difference between 1,000 and 1,010 Hertz. So the human ear is actually not linear in frequency response and it's not linear either in loudness. And because we listen to human voices, and this is what we're trying to transcribe, it makes sense for our model to actually use the data as the human ear experiences it, which is in a more logarithmic pattern. And the transformation for this is called the Mel spectrum. So rather than actually using a direct Fourier transform that just shows the frequencies, we show the frequencies as though it's a human ear that's listening. And indeed, in this article, there's another piece here. So it shows you the relationship between the frequency and the pitch in Mels, which is a measure of how the human ear responds. So just to recap, we get the amplitudes of the sound versus time, we extract what the underlying frequencies are, and then we adjust the frequencies and the loudness so that it's represented in a way that the human ear would interpret it. And this is called a Mel spectrum. And you can see here a Mel spectrum drawn out. So here we go. And this is a snapshot of sound versus time. And here are the frequencies. And here is the amplitude in decibels, which is a logarithmic scale. And so you can see, at, say, time two, which frequencies are being heard at which decibel level. And it's this representation here of a snippet of sound that's going to be used as an input to our model. And we're going to represent this by a series of frequencies and amplitudes. So a series of frequencies in Hertz and amplitudes in decibels. And in the same way that for a GPT model, for recursive language prediction, we use tokens, when we're doing speech to text those tokens won't represent sub-words.
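As a concrete illustration of this step, here is a minimal sketch of turning a waveform into the log-Mel representation using Hugging Face's WhisperFeatureExtractor; the file name is just a placeholder.

```python
# Minimal sketch: raw amplitudes -> log-Mel spectrogram features for Whisper.
import librosa
from transformers import WhisperFeatureExtractor

# Load the waveform at Whisper's expected 16,000 Hz sampling rate
waveform, sampling_rate = librosa.load("speech.wav", sr=16_000)  # placeholder file

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# Pads/truncates to a 30-second window and returns the log-Mel spectrogram:
# shape (1, 80, 3000) = 80 Mel frequency bins x 3000 time frames
features = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
print(features.input_features.shape)
```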
Instead of sub-words, the inputs are going to represent essentially decibels and the key frequencies of the sounds within a given time slot. So just to say that once more, if you're thinking about a Llama 2 model for language, you might have a context length of, let's say, 512. So you have 512 sub-words, different words like a and the, that are being fed in one by one. Whereas if you look at a model like Whisper, it's taking in tiny fractions of a second instead of sub-words. And within each time slot, it's representing the sound as a combination of frequencies at certain decibel levels. So I've explained how the sound that we hear is characterized. And the answer is, it's characterized with frequencies and decibels according to a Mel spectrum. Now, the next thing is we need to input these into a transformer. And the good news here is that much of what we'll see is very similar to a GPT transformer for causal inference, where we're predicting the next token. And this is a really great diagram explaining how it works. So on the right-hand side, and I'm actually going to zoom out for once because I need to see the full diagram. Here we have actually what is a GPT. So this is a causal transformer where we're trying to predict the next token. So you might have the tokens the and quick that are being input. And given the and quick, the prediction will be brown. And then once we have brown here, we'll be able to predict the following token here, which I assume is fox. And so this is recursive token prediction with words. And actually, this is the very same thing we do with transcription, except we additionally will put in information from the sound that we have. So the sound, as I said, is represented by a log-Mel spectrogram. So we have this representation of frequencies and decibels. And we have that representation for each tiny time step. And I briefly mentioned that Whisper typically operates with a 30-second total window input. And that is segmented into tiny increments that actually overlap as well so that we don't cut off any key signals kind of halfway. So we have these tiny, you can think of them as sub-words, but they're actually just audio representations now because we're taking in a sound. So this log-Mel spectrogram, this data will be represented as frequencies and amplitudes. And it's going to go into what we call an encoder block, which is a neural network that is gonna process this through many layers and then feed that information into the decoder, which is our language model. So basically the language model has got information from the input sound for that 30-second block, but it's also got tokens that are coming in. Now, the key thing is the very first pass, there won't be any tokens that are coming in because we won't have predicted what the first token is. So to predict the very first token, it's just going to be relying on the sound. But let's say the first word of the transcript is the and the next word is quick. Well, in that case, there's gonna be audio plus the and quick as information that can be used to predict the next word, which is brown. Now, there's a little bit more subtlety here, which is that when we train these models, we often will train using data that has got a prefix for the language and a prefix for the task. So you can see the EN here. This is indicating to the model that it's English. And this is indicating here that it's a transcription task.
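As a sketch of how those language and task prefix tokens are supplied in practice, here is what that looks like with the Transformers API; this isn't the notebook's exact code, just an illustration using openai/whisper-small and a placeholder audio file.

```python
# Minimal sketch: feeding the <|en|> and <|transcribe|> prefix tokens to the decoder.
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

waveform, _ = librosa.load("speech.wav", sr=16_000)  # placeholder clip
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

# The language and task prefix tokens; omit them to let the model guess the language.
prompt_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=prompt_ids)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```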
So when you run this model with an audio segment, actually there will be some tokens you put in, but those tokens are to say what the task is and what the language is. Now, actually you can leave out the language token and get it to predict the language token, and the stronger models will even predict what language it is in some cases. So you could basically start with very little other than the audio coming in as information. But I think more commonly, or more robustly, you would feed in a token to say what language it is and to say what the task is. And then you would start to get the model to predict the first, the second and a few more tokens for that snippet of audio. And that brings the whole process together. It's starting off with a sound. You have a vibration that's recorded. It's converted to frequencies. Importantly though, frequencies in a human sense. And then those frequencies, the log-Mel spectrogram, are input into the model. They're sent in through an encoder, which then provides extra information to the language decoder, which is very similar to the GPT models I've talked about in many videos before. With that overview of how a model like Whisper works, let's dive into the fine-tuning. The good news is a lot of the fine-tuning is going to be quite similar to what I've already gone through for GPT models that are used for next token prediction. As usual, we need a dataset if we're going to fine-tune, and we're going to need the dataset to have both sound, which is what we're transcribing from, and we need to have high quality text. So a high quality transcription that corresponds to that sound. More specifically, we need some kind of MP3 or WAV file and we need a transcript, which typically we would save in VTT format. And for these sound and text pairs, we're going to need to have a training set and we're also going to need to have a validation set to check the performance as the training progresses. So I'm going to move over now to the advanced transcription repo. This is a paid repo, but as I said, you can try out some of the free Colab notebooks instead, if you prefer, which I'll put in the description. Just a very quick overview. This repo will allow you to prepare a dataset and push it to Hugging Face, a train and a validation split. It will take in an audio file and automatically split it into 30-second chunks with the corresponding transcript chunks for each of those sound chunks. And once you have that data, there's a script here, Whisper Fine Tuning, that I'll run through step by step now in this video to fine-tune a model and then evaluate the performance. What I'm going to do is clone this over to VS Code. And I have here the advanced transcription repo. And I'm going to set up some data that we're going to train on. So remember, we need some audio and we need a transcript. So I'm going to show you a handy way to create some audio if you want to fine-tune for an accent. So this could be fine-tuning for an Irish accent or fine-tuning for some specific words. Now, what I want to do is I want to get the Whisper model, I want to get it more accurate on some language model terms. So what I've done is gone through some different terms I found online, some technical model names, I'm sure you've heard of some of these, like Phi-2 or Mixtral 8x7B or maybe OpenChat. So I've put a whole series of words here and I'm simply going to read them out into an audio file. And that's going to create my training set.
And then I'm going to transcribe that audio file and that will give me the transcript that I need. Now, I've further just taken that text file and I've just mixed up the words into a different order and I'll read these out with some comments in between to create a validation set. And by recording this and then transcribing this data set, I'll have created my validation sound and my validation transcript. Actually, I've already done that and I've got the MP3 here. I'm not sure if this will be possible to hear. Spin. Self-play fine-tuning that improves LLMs. Trixie, it's a form of fast inference. So yeah, I have some sound there and I've also recorded, you can see in this two-minute snippet, validation sound. MMLU is a means of testing performance. So I have the sound ready and next I'm going to move to run the Whisper fine-tuning notebook, because this notebook will allow you to take in the sound I've just played, the train and the validation audio, and it's going to generate a transcript for that sound. Now, it's easier if I just show you how that works. So the way you'll do this is either you can run Whisper fine-tuning locally. It will actually run locally because the model is small, so you probably could find ways to run it on your laptop, but it's quite easy to run in a free Colab notebook. So that's the approach that I'm going to take here. So basically just upload this to Google Colab and you want to make sure that you connect to a GPU for fastest inference. If you're not connected to a GPU, you just go to change runtime type and select a T4 GPU. So it looks like we're up and running and we're connected, and I'm going to scroll all the way to the top of this script here and get started with some installation. Here, we're just setting up some different handling of files and we're going to set up some of the HuggingFace libraries that will help us handle the datasets and the fine-tuning. As often in these tutorials, I'm going to connect to HuggingFace Hub so that I can push and pull models, including from private repos. And I've just clicked on this button so I can go over and grab a token here for authorization. Now, once authorized, we'll move down here and select the base model for training, which is going to be the Whisper small model. You could pick a larger model if you want to improve quality, but you'll see the quality is quite good. I'll set the language. The task is set to transcribe. Let me just increase my screen size a little bit here. And I'm going to define for later on the paths for pushing the fine-tuned model. So I've got the Trelis org; set that to your org or username. And then I've defined a repo called LLMLingoAdapters and then LLMLingo. And then I've just set up the repo, which is a combination of the org and the adapter name. So once that's done, we'll move on and we're going to try and generate a transcript for the audio that I've just created. I'd like to show you this live. So what we're doing is we're creating a pipeline here consisting of the Whisper model, which is the base model. And we're going to ask it to transcribe in 30-second chunks. And we're going to ask it to use, if possible, CUDA, which is the NVIDIA GPU. And once we have the pipeline set up, we're going to set up a function called process audio and create VTT. Basically, this is going to run the pipeline on the audio file that we specify. And it's going to return the text and also the timestamps.
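Here is a minimal sketch of what such a pipeline and VTT-writing helper might look like; the function name, timestamp formatting, and file names are illustrative rather than the repo's exact code.

```python
# Minimal sketch: a Whisper transcription pipeline plus a simple VTT writer.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",                    # the base model, before fine-tuning
    chunk_length_s=30,                               # transcribe in 30-second chunks
    device=0 if torch.cuda.is_available() else -1,   # CUDA (NVIDIA GPU) if available
)

def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm for a VTT file."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

def process_audio_and_create_vtt(audio_path: str, vtt_path: str) -> None:
    # return_timestamps=True yields chunks of {"timestamp": (start, end), "text": ...}
    result = asr(audio_path, return_timestamps=True)
    with open(vtt_path, "w") as f:
        f.write("WEBVTT\n\n")
        for chunk in result["chunks"]:
            start, end = chunk["timestamp"]
            f.write(f"{format_timestamp(start)} --> {format_timestamp(end)}\n")
            f.write(chunk["text"].strip() + "\n\n")

process_audio_and_create_vtt("train.mp3", "train.vtt")            # illustrative file names
process_audio_and_create_vtt("validation.mp3", "validation.vtt")
```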
And with the text and timestamps, we're going to be able to create a VTT file just like this here. Now, I'm just going to go up here and log in. And I'll define this here and run these cells. And next up, I'm going to be able to run this function on the train MP3 and also the validation MP3. Now, first, I need to upload those two files. So I'm going to just upload them here locally. And you can see the files are now available locally. And my pipeline is loaded. So I'm ready now to run this cell, which will process the training. And then I'll run the same to process the validation. Now, I'm running this with the base model. So what we should expect is that it's not going to get the new words correct. That's what we're expecting. When that cell has run, we should be able to see here on the side panel the appearance of the VTT files. So let's just take a look at the train. So here we have: spin, self-play fine-tuning that improves LLMs. Trixie, you can see, is incorrect. Phi-2 is incorrect. And Mixtral is incorrect. And it will be the same with the validation file. But the reason I've gone through this is because it's much faster to generate the VTT and then correct it than to try and write the transcript from scratch. So I recommend making the transcripts like this and then just go in and read it and manually correct it. So, say, I'll type in Phi-2. Here we have an alternative to flash attention; I'll just make a small correction to this. And Mixtral, I'll fix this up here. So we've got Mixtral, that should be Mixtral 8x7B, and it's actually a mixture-of-experts model. Solar 10.7B, that's actually a Mistral model, not Mixtral. And it's giving me some annoying autocorrects here. Here we go, I fixed this. Solar is a Mistral model. Yep, that's fine. OpenChat is a fine-tune of Mistral. So you get the picture. Basically, I've taken the train.vtt that I've generated, I've done the same with the validation.vtt, and that's allowed me to create a clean set of VTT files that will be paired with the audio snippets just in a moment in order to create our data set. Now, it is possible to use a GPT in order to do the correction of your transcript. For example, you can say: I want your help in correcting a VTT file transcript. I'll give a list of words that the ASR, which is automatic speech recognition, was not familiar with. Respond in a code block with the contents of the updated VTT file. And then you can give the raw VTT by the base model. And then you can give a list of the keywords that I showed you right at the start, the keywords that we're trying to fine-tune for. Now, this is not necessarily fully robust. It can mess up the timestamps, even with GPT-4. So I recommend either writing a more elaborate and robust piece of code if you want to automate this, or just manually correct it yourself. OK, so we have these base files and you can assume that you've now gone ahead and corrected the VTT files as I have done. And here's something I prepared earlier. So this is a fully corrected file. You can see that all the terms are correct because I read through it. So Microsoft Phi-2, something like this, SCLM, GPT-4. So all of the terms, Claude Instant 1, everything here should be in order. And once you have your VTT files and your MP3 files, and by the way, it's fine if you record in something like M4A, it doesn't matter, or WAV. You can always convert it online just by using a free converter, or adapt the scripts accordingly. But once you've got that done, you can just run Python.
And next, let's see, I'm going to run the prepare_data.py file. And I am in a virtual environment here. There are full instructions in the repo, in the readme, for how you should set up. You should always set up a virtual environment if you're going to install Python packages. So I show the instructions here, setting up a virtual environment, then installing the requirements from requirements.txt. Okay, in the meantime, because I've run python prepare_data.py, let's take a quick look at what the script has done. It's basically taken in these files, the train data and the validation data in MP3 and VTT format. And it's going to split that into 30-second segments and pair the text with the audio and prepare everything as needed for Hugging Face. And finally, it's going to push it to a target repo that I've defined on Hugging Face. So we can take a look at what that is by going over to Trelis LLM Lingo, which is the data set I've set up. And you can see here, we do have these 30-second snippets. Maybe I need to refresh. Here we go. So here's our 23-second snippet. By the way, I keep each line together. So if adding a new line of the transcript takes you over 30 seconds, then the code won't add that line. So that's why everything is always under 30 seconds. And you can see here the text that goes along with it. And there is a validation as well as a training set. And this is a public data set. So you can check it out yourself just under LLM Lingo. So to recap, we have used the fine-tuning notebook on our audio files to generate transcripts. We've then corrected the transcripts. So we now have a data set that has both audio and text. And now that we have that, we're in a position to load the actual training and validation sets from Hugging Face. So I'm going to go ahead and load those data sets from Hugging Face. And next, we're going to go through a few more steps of loading. So in a causal language model where you predict the next token, you need to tokenize your inputs. But in an audio model, you need to convert the MP3 file, or the file of your raw amplitudes, into the feature set. In other words, the frequencies and the decibels, or the representation of that. And so instead of a tokenizer for your audio, you need a feature extractor. So here's how we load the feature extractor for the base Whisper model. We do still need a tokenizer because we have to decode, or de-tokenize, the tokens that are predicted at the end of the model. And then there's also what's called a processor, which is really a wrapper for the feature extractor and the tokenizer. So we have that loaded as well. Now here, I've just printed one of the elements; we could just print the zeroth element of the training data set. And we can just take a quick look at what's in there. And let's just run these cells as well. And we can see that the data set is indeed six rows of training data and then five rows of validation, which matches what I just showed in Hugging Face. And here you can see the audio. So the audio has got a path to a segment of audio. That's the first segment, or the zeroth segment. And then there's the sampling rate. So this is sampled at 48,000 hertz. Actually for Whisper, it needs to be 16,000. So we'll downsample that later. And then you can see the text corresponding with that first snippet here. And note that the text also has a start time and an end time. So there's extra information as provided on this repo here. So I've also pushed start and end times for each snippet; not that you have to, but it's a nice feature.
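For reference, a minimal sketch of that loading step might look like this; the dataset id is written as Trelis/llm-lingo to match the public dataset mentioned, but treat the exact identifier as an assumption.

```python
# Minimal sketch: load the train/validation splits and the Whisper preprocessing objects.
from datasets import DatasetDict, load_dataset
from transformers import WhisperFeatureExtractor, WhisperProcessor, WhisperTokenizer

base_model = "openai/whisper-small"

dataset = DatasetDict({
    "train": load_dataset("Trelis/llm-lingo", split="train"),         # assumed dataset id
    "validation": load_dataset("Trelis/llm-lingo", split="validation"),
})

feature_extractor = WhisperFeatureExtractor.from_pretrained(base_model)  # audio -> log-Mel features
tokenizer = WhisperTokenizer.from_pretrained(base_model, language="English", task="transcribe")
processor = WhisperProcessor.from_pretrained(base_model, language="English", task="transcribe")  # wraps both

print(dataset)
print(dataset["train"][0])  # shows the audio path, sampling rate, text, start/end times
```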
Okay, so as I said, we need to downsample because the data is at 48,000 hertz. So first of all, we're going to cast the audio column in the data set to the correct sampling frequency. So here we're adding a column to say that every row should be at 16,000 hertz. So if we run this here, you can see now the sampling rate is shown at 16,000 hertz. But now we actually need to resample it at 16,000 hertz. So this is why we're going to run through a batching process. And we're going to use the feature extractor to... Actually, when we reload the audio, it'll automatically now be loaded at 16 kilohertz because that's what's specified in the audio column. This is all handled by the library in the background once we have that column for the correct sampling rate. And the next thing we're going to do is, for each piece of audio in amplitudes, we're going to convert it into the Mel spectrogram, so the frequency and decibel representation, using the feature extractor. And so that's why that comes in there. And then we're going to extract the text and tokenize it. So that's going to be called labels. So basically, we want to convert our data rows into rows that have two key columns: one column for the input features, which represents the audio, and one column for the labels, which represents the text. OK, so moving on here, we are going to apply that to the full batch of data. I think I can just go ahead and define that function. And here we have a data set that has audio, text, start and end time. And we want to replace that with one that has not audio, but the features, and not text, but labels, which are tokens. So here's how we do that. And after we have applied that, we should get a train and a validation set with just input features and labels, which is indeed what we get. And indeed, when we run a print of the data set, we are going to get this here. OK, so we have the data ready now. We have the features for the audio. We have the tokens representing the text. So we're now going to set up a data collator. This is going to organize the data into a batch, as is required, and it's going to set up padding. If some audio is too short, shorter than 30 seconds, it will just add zeros. It's going to do the same with the tokens, and if there are any padding tokens, it will mark them to be ignored when calculating the loss, because that's not what we're interested in for the purpose of updating the model. So all of this is handled here by the data collator. And we're going to just initialize the data collator like this. And move on to evaluation. Now, one of the evaluation metrics in speech to text is the word error rate. So basically, you predict a given word and you compare it to what it should have been. And you can see if there's an error or not and calculate an error rate off of that. This is all prepackaged as a metric. So we just have to run a very short cell. And here there's a little bit of code around which words to consider. We don't want to consider pad tokens. So we need to allow for that when we define the metrics.
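A minimal sketch of this preparation, continuing from the loading sketch above and following the standard Hugging Face Whisper fine-tuning recipe (the repo's script may differ in its details):

```python
# Minimal sketch: resample to 16 kHz, build input_features/labels, pad in a collator,
# and compute the word error rate (WER).
from dataclasses import dataclass

import evaluate
from datasets import Audio
from transformers import WhisperProcessor

# Re-cast the audio column so every clip is decoded at 16,000 Hz when loaded
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # Amplitudes -> log-Mel spectrogram features
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # Transcript text -> token ids ("labels")
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

dataset = dataset.map(prepare_dataset, remove_columns=dataset["train"].column_names)

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: WhisperProcessor

    def __call__(self, features):
        # Pad the audio features and the label token ids separately
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace padding with -100 so padded positions are ignored in the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids, label_ids = pred.predictions, pred.label_ids
    label_ids[label_ids == -100] = tokenizer.pad_token_id   # undo the loss mask before decoding
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": 100 * wer_metric.compute(predictions=pred_str, references=label_str)}
```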
All right, so our data is all ready at this point, and we're prepared to load a model and pass it through the training process. So as we do with causal language generation for next token prediction, we're going to load the model here. We're not going to load it in 8-bit. It is possible, but it's a small model, so I don't really see the need. And we're going to use the GPU, so I'll set the device map to auto. So here, this base model is the Whisper small. And we're going to set some defaults here that we're not going to force any decoder token IDs. This is where you might want to force the model to always output the language as being first, or the task. But we're not going to bother with that. In certain cases, it might help with training because it guides the model in knowing initially what the language is. But we're going to avoid that for now. Next up, we're going to apply LoRA. So I talk about this in many of my videos. I have a very short video on LoRA you can look at yourself. But let me just briefly show you here in the slides. Rather than training each of the large matrices in this model, we're actually going to freeze this matrix. And we're going to apply some new small matrices called a low-rank adapter. And these two matrices have a lot fewer parameters in them. And we're going to train these while keeping these frozen. So when we back-propagate the information down through this matrix here, we're not going to change anything. We're just going to update these smaller matrices. And when we're done at the end, we're going to merge this LoRA on top. And there's going to be a LoRA for all the matrices in certain modules within the overall transformer, specifically two of the attention projections. So with that said, we're going to set up the LoRA configuration. This is the rank of those smaller matrices. You can think of it like the height or the width of the LoRA matrices. And we're using a LoRA alpha of 64, which, with a rank of 32, implies that the learning rate, relative to the learning rate we define, is going to be scaled by 64 over 32, so two. So the effective learning rate is twice what we define in the trainer. And the modules we're going to target are the Q and the V projections of the attention. And we will use some dropout. So I'm just going to define that. And we're going to get a parameter-efficient fine-tuning model, which basically means we take the base model and we set up these adapters off to the side that are going to be trained. And by doing this, we'd only have to train 1.4% of the total parameters in this combined model plus adapters. So using LoRA is a way to improve the efficiency of training. And actually, it turns out that using a small number of adapter parameters performs better; it converges more quickly than trying to train every single parameter. Okay, next, we're going to set up the training arguments. We're going to output results of the training to this trained model name directory, which will be saved locally here. And we're going to use a batch size of three. Remember that the training set currently only has six rows, the data set I showed you, and I'm using a batch size of three, so there are only going to be two steps in each epoch. But I like having a batch size of more than one because it kind of averages things across multiple data points. And that smoothing can help with stability of training. The learning rate is quite high, but this is a very small model. And typically, the smaller the model, the higher the learning rate you can use. And I've set five epochs. We'll see how it progresses throughout. And what else should I highlight? I'm using a batch size of one for eval. You could use a batch size of five, since there are just five eval rows, but I think that's fine.
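A minimal sketch of the LoRA setup and training arguments, continuing from the sketches above. The alpha of 64, the Q and V target modules, the batch size of three, and the five epochs mirror what's described; the rank of 32, the learning rate, and the output directory name are illustrative assumptions.

```python
# Minimal sketch: load the base model, attach LoRA adapters, and define training arguments.
from peft import LoraConfig, get_peft_model
from transformers import Seq2SeqTrainingArguments, WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", device_map="auto")
model.config.forced_decoder_ids = None     # don't force language/task tokens during training

lora_config = LoraConfig(
    r=32,                                  # rank: the "width" of the small adapter matrices
    lora_alpha=64,                         # alpha / rank = 2x effective learning rate
    target_modules=["q_proj", "v_proj"],   # the attention query and value projections
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only a small percentage of weights are trainable

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-llm-lingo",  # illustrative local output directory
    per_device_train_batch_size=3,         # 6 training rows -> 2 steps per epoch
    per_device_eval_batch_size=1,
    learning_rate=1e-3,                    # quite high, but this is a small model (assumed value)
    num_train_epochs=5,
    evaluation_strategy="steps",
    eval_steps=0.2,                        # evaluate every 20% of the run, i.e. each epoch
    save_steps=0.2,                        # save a checkpoint every epoch as well
    predict_with_generate=True,
    generation_max_length=225,             # longer than any 30-second snippet's transcript
    fp16=True,
    report_to=["none"],
)
```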
The max generation length needs to be longer than any of the 30-second snippets, which is fine. And the save steps, we've got five epochs and save steps of 0.2. That means we're going to save every 0.2, or every 20%, of the full run. So that means every one epoch. So next, we're going to pass those arguments into the trainer. We'll take in the model. We'll take in the training and validation sets. And we're going to compute the metrics. Here, I'm just pointing to what we should use as the tokenizer. It's a little misleading to say it's a tokenizer because it's a feature extractor here that's being used for converting the audio into features. And when we run through the training, it's actually really fast, takes 1.35 minutes. You can see that the training loss is falling, the validation loss is falling, and the word error rate is going down. Now, you won't expect the word error rate to go down to zero because we're really just training on a few words, which are a small proportion of the training dataset. The model is still going to make some mistakes unless it was a really powerful model. And this is indeed a small model. But you can see it's clearly improving here. Okay, so we're done with the training here. And what we're going to do now is grab one of these checkpoints. I've actually had to pick up the recording again, but I've re-run the training. And in this case too, you can see the best point according to the word error rate is the fourth checkpoint here. And when we check out the files on the left-hand side, you'll see that all the checkpoints are saved. So there's two, four, six, eight, and 10. Now, there are two steps per epoch because there are six rows of data in batches of three. So checkpoint 10 would correspond to the end of the fifth epoch and checkpoint eight is probably the best one. So I'm going to set the adapter to push as checkpoint eight. You could load an adapter here from the Hub; maybe you want to compare with some other adapter and do a quick test on that. But generally you want to pick an adapter from the training you've just done. You can see when we print that, it prints checkpoint eight. I'm going to apply that adapter now onto the base model, using PEFT, the parameter-efficient fine-tuning library. And once that's done, you can also push the adapter to the Hub if you'd like. I did that on an earlier run of the script. Next, we're going to merge the adapter onto the base model. So this is taking checkpoint eight and merging it in. This is basically getting rid of the adapters here by merging them onto the base model. So now we're just back to having a single base model. Well, not that it's simple, but we have one base model. And if you print the model, you can see all of the layers. There's the encoder, which takes the features from the audio, puts it through a number of layers. And then there's the decoder, which is a text-to-text model. It takes in tokens and then generates the predicted transcript. So I typically like to save this model. This is the merged model. You can see it's appearing here in the Whisper small LLM Lingo folder, all the files. Also, I've saved the processor, which saves the feature extractor and the tokenizer. So that's all present here.
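Here is a minimal sketch of the trainer, the merge of the best adapter checkpoint, and the optional push to the Hub, continuing from the sketches above; the checkpoint path and repo names are illustrative.

```python
# Minimal sketch: train, merge the LoRA adapter into the base model, save, and push.
from peft import PeftModel
from transformers import Seq2SeqTrainer, WhisperForConditionalGeneration

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,                             # the base model with LoRA adapters attached
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,   # used for padding/saving, despite the name
)
trainer.train()

# Re-load a clean base model and apply the best adapter checkpoint (checkpoint-8 here)
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
peft_model = PeftModel.from_pretrained(base, "whisper-small-llm-lingo/checkpoint-8")

# Fold the LoRA weights into the base weights, leaving a plain Whisper model
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("whisper-small-llm-lingo-merged")
processor.save_pretrained("whisper-small-llm-lingo-merged")

# Optionally push the merged model and processor to the Hub (repo name is a placeholder)
merged_model.push_to_hub("your-org/whisper-small-llm-lingo", safe_serialization=True)
processor.push_to_hub("your-org/whisper-small-llm-lingo")
```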
And next, we're going to set up a pipeline to run an evaluation. The model that we want to evaluate is the trained model, so I'm actually picking the trained model that's saved on the disk here. We're going to process it in chunks of 30 seconds and use the GPU if possible. Now, for stability, you can force the first tokens in each chunk to be the language and the task, but the model is actually pretty robust, so we don't need to do that. And so I've gone ahead and I've run this process audio and create VTT. What it does is it takes in the validation file. So this is not what we trained on. It's the validation audio, the MP3 file, and, using the fine-tuned automatic speech recognition model that we've loaded here, we're now going to get a transcript. And when that is run, we can open up the transcript that's generated, evaluation.vtt. And you can see here some improved performance. Now, I'll show you where it has done well and where it still has room to improve. So it's, for example, getting a lot of terms correct, like Yi 34B Chat. It's getting that perfectly. It's picking up Mixtral 8x7B. And again, up here, Mistral 7B, it's picking up. And Claude 2.1, it's picking up. So it's getting all of this very well. It's getting Phi-2; previously, it was spelling that "phy", so it's getting that correct now. Trixie it's rendering as "cky", so it's not getting that correct; that one's not exactly right. And also, this here is Notux; there's no "it". This here, "itby", is actually an 8; it's not recognizing me saying eight in Irish English. But you can see it's already, with just some quick fine-tuning, getting very good performance on some added words. I've only said each keyword about once in the training set; I think if I made a longer transcript where I said the same word in multiple contexts, I'm pretty sure it would be able to pick up pretty much all of the words that I'm training it on. So with that, now that the model is evaluated and working well, you can go ahead and push it to the Hub. You can use the safe serialization parameter to push it as safetensors. And then you also want to push the processor to the Hub as well so that people can make use of that model. And indeed, after you've done that, you'll see the model should appear. Here, you can see the files. And we have a lot of files: added tokens, the configuration, the safetensors, about a gigabyte. So it's the same size as the base model, which makes sense, because we've merged the model, so it should be the same size as the original. And you can see the tokenizer as well. And that brings us to the end of the script on fine-tuning. That's it. Before you go, you'll find all of the free Colab notebooks linked below if you want to check those out, and also a link if you want to pay for this repo. Now, a few final tips. If you want stronger performance than what I showed, you can try out the Whisper medium or even the large model. If you want to improve the fine-tuning performance, I'd recommend doing more recording than what I did, which was just a two-minute sample. If you have a list of words that you want to fine-tune on, you can just read out those words even more times, maybe in a different order, and/or with different phrasing and different explanations between them. Actually providing a little context on what the words mean can also help, because the transformer considers the surrounding context, up to about 30 seconds, when it's decoding to produce your transcript output. As per usual, let me know if you have any questions right down in the comments. Cheers.