Exploring Whisper V3: Setup, Usage, and Benefits
Discover how to use Whisper V3 for speech-to-text transcription, including setup on Google Colab, error rates, and comparisons with previous versions.

Speaker 1: On DevDay, along with the other products, OpenAI also released Whisper v3, which is their state-of-the-art speech-to-text model, and it's available through their API. However, they also released an open-source version that you can run on your local machine, and that's exactly what we're going to be doing today. I'll also show you when not to use this model.

The Whisper model comes in five main sizes: tiny, base, small, medium, and large, and on top of that you have large-v2 as well as large-v3, which is the latest model. The smaller models come in either English-only or multilingual variants, whereas the large models are multilingual only. First, let me quickly walk you through the memory requirements for each of these models before showing you how to run this in a Google Colab notebook. In terms of VRAM, you need around 10 gigabytes to run the large variant of Whisper, and this holds true for both v2 and v3 of the Whisper large model. For the tiny model you need only about one gigabyte of VRAM. For the smaller models, if you want to transcribe English only, you add the .en suffix to the model name, but if you want the multilingual version, you just provide the plain model name. The large models support only multilingual data.

Here is a quick comparison between the error rates of v2 and v3. As you can see, v3 seems to perform better than the previous version of Whisper. However, there are cases in which v2 is still better, and I'll show you that in a bit.

So I'll quickly demonstrate how to use the third version of Whisper in this Google Colab notebook that I've put together. First, keep in mind that we can run this on the free tier of Google Colab because Whisper v3 needs around 10 gigabytes of VRAM and the T4 GPUs available on the free tier have around 16 gigabytes of VRAM. First we need to install some packages: the transformers package, the accelerate package, and I'm also installing the datasets package, although in this video I'll show you how to upload your own audio file to the Colab notebook and transcribe just that. The T4 GPU does not support Flash Attention, but if your local GPU supports it, you can install flash-attn as well, and that will improve the speed of transcription. After that we import the required packages: torch, then AutoModelForSpeechSeq2Seq as well as AutoProcessor and pipeline. Next we check whether a GPU is available on this machine. If a GPU is not available, it will run on the CPU, but that is going to be extremely slow.
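A minimal sketch of that setup, assuming a Colab-style environment with the Hugging Face transformers stack; the flash-attn install is optional and only applies when the GPU supports Flash Attention:

```python
# Setup sketch for the steps described above. In a Colab cell you would run:
#   !pip install --upgrade transformers accelerate datasets
#   !pip install flash-attn --no-build-isolation  # optional, GPU must support it

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU when available; Whisper will also run on CPU, but far more slowly.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")
```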
After that, we set the floating-point precision: 16-bit if we're running this on a GPU, otherwise 32-bit for the CPU. As you can see, we have a GPU available, which is why we're running in float16. Next we set the model that we want to use, which is Whisper large-v3, and we load that model. We provide the model ID, then the data type (the floating-point precision we chose). Since we're running this on free Google Colab, I set low_cpu_mem_usage to true, and we also set use_safetensors to true. If you're running this with Flash Attention, I'll show you that you just need to set one more parameter here. The model is initially loaded on the CPU, which is why you then need to move it to the GPU. After that we set up our AutoProcessor, which is used both for tokenization and for feature extraction from the speech signal. There is one major change when it comes to the processor: the number of feature bins it computes is different. It now computes 128 mel frequency bins instead of the 80 used in v2.

To do speech-to-text transcription we create a pipeline, which is a class within the transformers package. We want automatic speech recognition, so we provide the model that we created in the previous step, and for tokenization as well as feature extraction we use the processor that we created. Then we set the maximum number of new tokens it's supposed to generate. You can set this to a higher number depending on how much VRAM you have; I think if I go beyond 128 tokens, the T4 actually starts running into issues, so that's why I'm setting it to a low number. Then there is the chunk length in seconds and the batch size. You can also decide whether you want timestamps in the transcription, which are very helpful for something like YouTube videos; set this to true and it will return timestamps. Finally we set the floating-point precision as well as the device we want to run this on.

Okay, so we created the pipeline, but we still need an audio file to transcribe. For that I'm using the audio from one of the videos that I created, on GPT-4 Vision; that's the file name. The way you upload files here is to simply click on the file icon and then click upload. I'm just uploading this WAV file and waiting for it to finish. Once the WAV file is uploaded, I just provide the path of the file. To get the transcription, you only need to pass that path to the pipeline you created and you will get the transcription back. You will also see the %timeit magic function that I'm using: this file is around 12 or 13 minutes long and I wanted to see how long it would actually take to transcribe. The way %timeit works is that it repeats the same operation multiple times. For example, you can see here it ran this seven times with an average duration of 7.14 seconds and a pretty small standard deviation. That means for a 12- or 13-minute file, it's taking around eight seconds to do the transcription. There are some other options that you can set, and we're going to look at those in a bit. There are two fields within the results: one is the text and the other is the chunks.
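A minimal sketch of the model and pipeline setup walked through above, assuming the Hugging Face checkpoint openai/whisper-large-v3; the chunk length, batch size, and the file name gpt4-vision.wav are illustrative placeholders rather than the exact values used in the video:

```python
# Float16 on GPU, float32 on CPU, as described above.
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,   # easier on the free Colab tier
    use_safetensors=True,
)
model.to(device)  # the model loads on CPU first, so move it to the GPU

# One processor for both tokenization and feature extraction
# (large-v3 computes 128 mel bins instead of the 80 used by v2).
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,      # higher values can exhaust a 16 GB T4
    chunk_length_s=30,       # illustrative chunk length in seconds
    batch_size=16,           # illustrative batch size
    return_timestamps=True,  # sentence-level timestamps
    torch_dtype=torch_dtype,
    device=device,
)

# Hypothetical path to the uploaded audio file.
result = pipe("gpt4-vision.wav")
print(result["text"])    # full transcription
print(result["chunks"])  # text segments with their timestamps
```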
So if you access the text key on the result, it gives you the transcription, and it's actually pretty accurate. For example, look at this: this is a video I created on how to use GPT-4 Vision to look at frames of a video and create a narration for it, and the transcription it came up with is pretty accurate. I'll put a link to that video in the description; if you're interested, watch that. If you use the chunks key on the result instead, you get, for example, a sentence and then the corresponding timestamps, and so on and so forth.

Now there are a couple of other options that you can set. For example, you can set return_timestamps to true, so it will return a list of timestamps, very similar to what we saw before. You can also do this at the word level, so it returns timestamps for each word rather than just for each sentence.

There is one case in which v2 of Whisper is better than v3. If you don't know the language that is being spoken and you want the model to recognize the language automatically, it seems to be better to use the second version of Whisper than the third. For v3, it's better to explicitly specify the language spoken in the audio. So that is something to consider: if you don't know the language of the speech, just use v2; if you know the language, then explicitly define it here.

Whisper is a very capable model. Another feature you get is direct translation from one language to another. For example, if the speech you want to transcribe is in French but you want the transcription in English, you can provide an extra key, the task, and set the task to translate. It will take the speech in French and give you the transcription in English, which is a pretty amazing capability to have.

Now, before we wrap up, let me show you a couple of other things. The first is that you can use Flash Attention directly within the model if your GPU supports it. The way you do it is to simply set the Flash Attention parameter when loading the model and then create the pipeline again based on that model.

At the end, I also want to show you another amazing open-source project called Distil-Whisper. This is basically a distilled, smaller version of the Whisper model that is six times faster and 49% smaller, while the error rate stays within 1% of the original model, which is pretty amazing. Right now we don't have a distilled version of large-v3, but there are already distilled versions of large-v2 as well as of medium English. So let me show you how to use this medium English model in your own code. The code is very similar to what we have seen before. We again use the AutoModelForSpeechSeq2Seq class. In this case the model ID is different: we provide distil-whisper medium, and we're specifically using the English version. Then again you create the model object and load it onto the GPU available on your machine. The rest of the parameters are exactly the same as before. Again we create the processor, both for tokenization and for feature extraction, we create a pipeline, and we provide our audio to this pipeline. In this case it took around six seconds, though this really depends on the available VRAM and so on.
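The extra decoding options mentioned above can be sketched roughly as follows, reusing the pipeline from the earlier sketch. The generate_kwargs route reflects how recent transformers releases expose Whisper's language and task settings, and the Flash Attention flag has changed names across versions, so treat both as assumptions to check against your installed version:

```python
# Word-level timestamps instead of sentence-level ones.
result = pipe("gpt4-vision.wav", return_timestamps="word")

# Explicitly tell Whisper v3 which language is being spoken.
result = pipe("gpt4-vision.wav", generate_kwargs={"language": "english"})

# French speech in, English transcription out.
result = pipe(
    "french-audio.wav",  # hypothetical French recording
    generate_kwargs={"language": "french", "task": "translate"},
)

# If your GPU supports Flash Attention 2, reload the model with it enabled
# (older transformers versions used use_flash_attention_2=True instead),
# then rebuild the pipeline on top of this model as shown earlier.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="flash_attention_2",
).to(device)
```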
But I think you will see a lot more improvement on much larger audio files; I just wanted to show you that this option is available. The results we get here are actually pretty good. For this distilled version there are definitely some mistakes, though. For example, if you look here, it says "JPD 4", which should actually be "GPT-4". In comparison, Whisper v3 was able to correctly transcribe that as "GPT-4 with Vision". These are excellent options if you're building a speech-to-text system. One solution that I'm currently exploring is to enable talking with your documents through speech in the localGPT project. You will be able to have audio communication, and in that case I'm going to use the Whisper model to convert speech to text, and then probably some sort of open-source model to convert the responses coming back from the model as text into speech, so you can have a two-way conversation with your documents in something like localGPT. This is an amazing model; I would encourage everybody to explore it, play around with it, and see how good it actually is. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
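For completeness, a minimal sketch of the Distil-Whisper medium English pipeline described in the walkthrough, assuming the published checkpoint distil-whisper/distil-medium.en and reusing the device, dtype, and imports from the earlier sketches:

```python
# Same recipe as large-v3, only the checkpoint changes.
distil_id = "distil-whisper/distil-medium.en"

distil_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    distil_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)

distil_processor = AutoProcessor.from_pretrained(distil_id)

distil_pipe = pipeline(
    "automatic-speech-recognition",
    model=distil_model,
    tokenizer=distil_processor.tokenizer,
    feature_extractor=distil_processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Same hypothetical audio file as before.
result = distil_pipe("gpt4-vision.wav")
print(result["text"])
```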
