Speaker 1: If you have been doing any sort of speech recognition with AI, you must have heard of Whisper models. Whisper models are a family of AI models trained with large-scale weak supervision to recognize and transcribe spoken language. They were developed by OpenAI. Whisper models are trained on a large dataset of speech and can be fine-tuned for specific tasks like speech recognition, speaker identification, and language translation. Despite the name, they have nothing to do with whispered speech; "Whisper" comes from the acronym WSPSR, Web-scale Supervised Pretraining for Speech Recognition, and the models can transcribe audio recordings with high accuracy. Whisper models have many potential applications, such as improving voice assistants, generating subtitles for videos, and helping machines better understand human speech.

In this video, we will be looking at a new project called Whisper WebGPU. WebGPU is a JavaScript API that provides access to a device's graphics processing unit, or GPU, enabling more efficient and powerful graphics processing and computation in web applications. In simple words, it allows you to run machine learning models within your browser. That's it. The problem with WebGPU at the moment is that it is only properly supported by the Chrome browser, so if you're using any other browser, you might not be able to run it. There is some experimental support in Firefox, but that is still patchy. If you're looking to run this demo on the Hugging Face Space, like you can see on your screen at the moment, it will only run in Chrome. There may also be some sort of support in the Microsoft Edge browser now, but I'm not sure, because I tried it out in Chrome and only there was I able to work through it.

Anyway, coming back to this project: because it uses Transformers.js from Hugging Face, you can not only run it in your browser, you can also install it locally and then play around with it. I'm going to give you the steps for running it locally, so let me take you to my VS Code editor, where I will show you how to get it installed.

This is my VS Code editor on Windows. The first step is to create a new directory or folder and then git clone the whisper-web repository; I will drop the link to it in the video description. Let's wait for it to get cloned; it shouldn't take too long, and that is done. Then, of course, you need to cd into that whisper-web directory. Let me do that. You can see that I have cd'd into it. That is done. Let me clear the screen.

One of the prerequisites is that you should have npm installed, which I do; let me show you the version. Also make sure that you have Node.js installed, and that is there as well. If you don't know how to install it, just Google it or ask any LLM and it will tell you; it's a simple download-and-run installer. Let me clear the screen and run the development server. Also, if you are using Firefox, you need to set dom.workers.modules.enabled to true in about:config to enable web workers. So let me start the server here; you can then access it on localhost at port 5173. Let me press Enter. You can see that it is now running on localhost in Windows at port 5173. Let me try to access it in the browser. And now you can see that it is running in my browser. You can load your audio or speech file from a URL, from a local file, or you can even record it.
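For reference, the local setup just described boils down to a few commands. This is a minimal sketch, assuming the xenova/whisper-web repository on GitHub (the exact link is in the video description) and a working Node.js/npm installation:

```bash
# Clone the whisper-web project (repository URL assumed; see the video description)
git clone https://github.com/xenova/whisper-web.git
cd whisper-web

# Verify the prerequisites; npm ships with Node.js
node --version
npm --version

# Install dependencies and start the Vite development server
npm install
npm run dev
# The app is then served at http://localhost:5173
```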
And if I quickly take you back to my VS Code, you will see that it is also showing some logs of what I have done. Okay, let's go back to the browser. Now, for loading from a URL, you can just click here and give it any URL. I'm going to go with this sample file, but you can give it your own. Click on load and it will load it in your browser, and then you can transcribe it. Let's play it first.
Speaker 2: So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know, you get started maybe a little slowly, but...

Speaker 1: You can see, this is the audio file.
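As an aside on what the load step is doing: a browser app like this typically fetches the audio and decodes it to 16 kHz mono PCM, which is the input format Whisper models expect. Here is a minimal sketch using the standard Web Audio API; the function name and the details are illustrative, not taken from the project's source:

```js
// Sketch: fetch an audio URL and decode it into the Float32 PCM a Whisper model expects.
async function loadAudio(audioUrl) {
  const response = await fetch(audioUrl);
  const arrayBuffer = await response.arrayBuffer();

  // Whisper models are trained on 16 kHz mono audio, so decode at that rate.
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);

  // Take the first channel as a mono Float32Array.
  return audioBuffer.getChannelData(0);
}
```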
Speaker 1: Let me click on transcribe audio, and this is all running in the browser. It is downloading the model for the ONNX runtime; it has already loaded the tokenizer, and it is also decoding. Let's wait for it. You can see all the status at the bottom; it should not take too long. And I believe this is a one-time download of the model. Yeah, it says it only runs once. It is transcribing. Amazing. Let's wait for it to finish. There you go. You saw me play it back, and look how good this is. It's a one-minute audio, it is all running in your browser, for free, with nothing else to do. There you go.

Now you can export the result as a txt file, or you can export it as a JSON file to your local system. And if you click here, you can download the audio. If you click on this cog button, you can go multilingual, so it is not just for English; you can even go with Urdu. You can also quantize the model. And there are a few models to choose from: for now it uses the tiny Whisper model, and you can also go with the base one. I think let's go with the tiny one. Then you can click here on multilingual, and when you do, you can select the language; there are heaps of languages here. For example, there are some I don't even know, like Nynorsk; I don't know which language that is. And I didn't know that Luxembourg had its own language. There are a lot and a lot of languages here. Even the Pashto language is there, which is a regional Pakistani language, also spoken across much of Afghanistan, I believe. It's amazing. The Tamil language is there. You can select your language of choice; I believe almost every language is there. Urdu is there, which is a really awesome language, so let me select it.

Let me try to speak in Urdu. I'll just click on record. I'm not sure I have my mic with me, but I'll try. I'm just going to click on record and start recording, and I have to allow the microphone in my browser, which I did. Let me speak in Urdu. So I just spoke a few bits in Urdu. Let's load it and play it back. Okay, so you can see that this audio contains both Urdu and English, mainly Urdu. Let's transcribe it.

So it is loading the model; let's see what happens. I'm not sure why at first, but okay, it is the encoder model; maybe it is for the multilingual Urdu setting. Let's wait for it. It has loaded the model. There you go. Okay. I don't think I have Urdu fonts installed, but as far as I can make it out, it is writing it quite nicely. Amazing stuff. So don't forget to install the fonts if you need them. And as I said, you can also pick a local file from your system and transcribe it. Amazing stuff.

I will drop the link to this project in the video's description. Let me know what you think. I think I'm going to use it more and more locally; it looks really cool. If you like the content, please consider subscribing to the channel, and if you're already subscribed, then please share it among your network, as it helps a lot. Thanks for watching.
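For viewers who would rather script this than click through the UI: whisper-web is built on Transformers.js, and the same in-browser transcription can be sketched with the library's pipeline API. This is a rough sketch, assuming the Xenova/whisper-tiny checkpoint shown in the demo; the sample URL and option values are illustrative:

```js
import { pipeline } from '@xenova/transformers';

// Download (one time) and cache the tiny multilingual Whisper model;
// it runs in the browser via the ONNX runtime.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny',
  { quantized: true } // corresponds to the demo's "quantized" toggle
);

// Transcribe an audio URL; the options mirror the demo's multilingual settings.
const output = await transcriber('https://example.com/sample.wav', {
  chunk_length_s: 30,      // process long audio in 30-second chunks
  return_timestamps: true, // needed for subtitle-style output
  language: 'urdu',        // or omit to auto-detect the language
  task: 'transcribe',
});

console.log(output.text);
```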