Whisper vs. Vosk & Kaldi in Subtitle Edit Comparison
David compares Vosk/Kaldi and Whisper models in Subtitle Edit 3.6.7 for efficient audio-to-text transcription.
File
Audio to Text Vosk/Kaldi Models vs. Audio to Text Whisper Models in Subtitle Edit 3.6.7 BETA
Added on 01/29/2025

Speaker 1: Hi, everyone. My name is David, and in this video I'll be comparing the audio to text using Vosk and Kaldi models in Subtitle Edit versus the audio to text using Whisper in Subtitle Edit 3.6.7, the beta version. Which version, or which model, is better? Now, I'll say that it depends, but let's see. So this is the beta version that I'm actually using, so I'll just double-click on it to run the Subtitle Edit beta. Help, check for updates. There's no update as of now. But let's see about this: it's Subtitle Edit 3.6.7 beta 4. So now I finally got Whisper to work. But let's begin by getting an audio file. Let's go for audio file, let's go for this one, and click on open. It's a two minute 30 second file.

Now, our first test will be using the Vosk models. So I'll go to Video, Audio to text (Vosk/Kaldi) and click on that. If you don't already have the models installed, you'll need to install them, and you may also need to install libvosk for this. You can also go to the website to see the different models available. If you already have the models installed, you can choose from this section; or, if you don't have them installed, you can click on these three dots and download the Vosk model for the language that you want to transcribe. Now, I'm going for English, and I believe I have the medium one, which is 128 MB. So what I'll need to do now is just ensure that I use post-processing: line merge, fix casing, punctuation and more. And then, if I'm satisfied with that, click on Generate. Now, it's going to load, transcribe the audio, and show you the time remaining for this to be done.

Now, just to mention this: the Vosk and Kaldi speech recognition is faster than Whisper, OpenAI Whisper, because I've already tested it. It gives you better separation for your subtitles, a better breakdown of your subtitles than Whisper, but it seems to transcribe everything, including ahs, ums, or in this case, fillers. Still, it's way faster. If you're looking for something that will auto-generate subtitles for you faster, I believe the Vosk models and the Kaldi speech recognition are the way to go. So that's almost done, and it shows you the time. So it's already done. If I play this, you'll see that the subtitles are already appended, like "hi, hi, everyone, this is David" from the beginning. So if I play this: "Hi, everyone. This is David. And this is a clarification. I did this video on how to..." That "ah" is more or less a filler word, and it doesn't handle it well. But now you'll also notice this one: "I did this video on how to export SRT subtitles in DaVinci Resolve" — it is "DaVinci Resolve". You'll notice that when we use OpenAI Whisper, or the audio to text via Whisper, it picks up everything correctly. It's really, really crazy that it does this. So if we keep playing: "put SRT subtitles in DaVinci Resolve. And unfortunately, I mentioned that it's not possible to export an SRT..." For example, there, it actually puts a full stop when I'm not done talking; where it thinks there should be a full stop, there should be none. But that's the Vosk/Kaldi models: really awesome for creating quick subtitles, and it shows you how long you have to go. But I feel Whisper has better quality and takes longer. So let's just go to File, New. No, I don't want to save the changes. And then go to Video, Open video, get the same audio file, testing, and click on open. Still the same audio. And then we go to Video, Audio to text (Whisper), and click on that.
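For reference, Subtitle Edit drives this recognition through the Vosk/Kaldi models and libvosk. A rough idea of what that same recognition loop looks like outside Subtitle Edit is the following Python sketch using the vosk package; the model directory and audio file names are placeholders (use your downloaded English model and a 16-bit mono PCM WAV), and Subtitle Edit's own integration may differ in detail.

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Placeholder paths: point these at your downloaded Vosk model folder
# and a 16-bit mono PCM WAV file.
wf = wave.open("testing.wav", "rb")
model = Model("vosk-model-en-us-0.22")

rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # include per-word timestamps in the JSON results

results = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):  # a chunk of speech was finalized
        results.append(json.loads(rec.Result()))
results.append(json.loads(rec.FinalResult()))

for res in results:
    if res.get("text"):
        print(res["text"])
```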
And in a previous video, I actually walked you through how you can do all this, so hopefully it's understandable for anyone that's trying to run Whisper speech recognition. So you can generate text from audio via Whisper speech recognition. And the languages — this supports over 100, including, if I go down, Swahili. I'll need to test that out and see how it works; hopefully it does a good job. And then you can choose the model. I already have a couple of models available. If you're working with English, choose the versions written with .en. There's not much difference between the small and medium models and their .en versions, but there's probably a difference between the base and the base.en. And if you choose tiny.en, it's faster but with poorer results, that is, in terms of transcription, punctuation, etc. So I'll go with the base model. It does take some time: I've noticed that if you use the base model, it takes almost the same amount of time as your audio file. So once we select that, ensure that we have use post-processing: line merge, fix casing, punctuation and more. And then let's click on Generate.

So it's transcribing the audio. This is still experimental in Subtitle Edit 3.6.7 beta; hopefully it's going to be fully implemented. And as I mentioned in the previous video, Nikolaj could probably think about creating a transcription tool based on this particular Whisper model — a really, really awesome transcription and translation tool, especially if you're working with different languages and you'd like to translate them to English. Hopefully there would also be a way to put the result back into Subtitle Edit or something of that sort. But a free transcription application from Nikolaj would be really welcome.

So this takes about the same time as the audio length. That's probably the downside with Whisper: it takes longer to transcribe your audio to text. But from my own assessment, it has better results than the Vosk or Kaldi models. It's been trained on over 680,000 hours of multilingual audio, with different accents, etc., so I think it's really, really good at what it does. The other downside is that when it comes to the subtitles it appends to a video, they're really not in sync, if I may say that. And also, it's just dumping text. It doesn't respect a profile you've set of, say, a maximum of 42 characters per line and a two-line subtitle; it just hits you with a block of text. Even where there's a break, you'll always see the red lines. I wish it could do better, because then it would probably be the go-to application for quick subtitling. But as it is, it's still way, way better than most applications out there. It's doing a really good job, especially when it comes to transcription of audio.

So we're going to give this a couple more seconds and see what we get. I don't exactly remember what time it started, but it should already be almost done. Now, we also notice that with this there is no time remaining shown. As I've mentioned, this is still experimental. I wish we had a time remaining, even if, based on testing, we already know that it takes almost the same amount of time as the audio, depending on the model you've chosen — with the base model, that's about two minutes 30 seconds or three minutes thereabouts for this file. So it'd be awesome to know how much time is remaining before this is done.
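For anyone who wants to reproduce the Whisper side outside Subtitle Edit, here is a minimal sketch using the openai-whisper Python package. The "base.en" model name matches the choice discussed above; the audio file name is a placeholder, and this is only an illustration of the underlying model, not of how Subtitle Edit calls it.

```python
import whisper

# "base.en" is the English-only base model discussed above; swap in
# "tiny.en", "small.en", "medium.en", etc. to trade speed against quality.
model = whisper.load_model("base.en")

# "testing.wav" stands in for whatever audio file was loaded in Subtitle Edit.
result = model.transcribe("testing.wav")

print(result["text"])
for seg in result["segments"]:
    # Each segment carries its own start/end time in seconds.
    print(f'{seg["start"]:8.2f} --> {seg["end"]:8.2f}  {seg["text"].strip()}')
```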
Because it would be futile to just sit there waiting. So let's see what we have here. First of all, we'll notice that "hi everyone" begins way, way before I start speaking, so this is going to give us some synchronization issues. But let's just play it: "Hi everyone, this is David and this is a clarification. I did this video on how to..." So you'll notice that it does a very good job right there in this particular section: "Hi everyone, this is David and this is a clarification." It doesn't transcribe the "ah", that is, the filler word; it just ignores it. "I did this video on how to export..." Now you notice that it just hits us with a block of text. You'll also notice that it actually picks up "DaVinci Resolve" as it should be, so the R is capital — I just missed that. "SRT subtitles in DaVinci Resolve. And unfortunately, I mentioned that it's not possible to export an SRT file or VTT file separately without rendering the video. It's possible." So when you listen to that, my voice went a little bit slurry, but it still picked up the correct thing. That's why I'm saying that Whisper is doing a really good job in terms of transcription. The only place it's failing is the synchronization of the subtitles, and it probably should do some better splitting of the subtitles to make this as automated as it should be. But from my own tests and comparison, this audio to text via Whisper will give us better results. No doubt, it's going to be way, way better than using the Vosk models. Also, based on the research and the work that has gone into it, it's really going to do a good job at all this.

Now, feel free to test it out. I'll just open up DaVinci so you'll see the synchronization: it begins way, way early. And maybe that shows that this is the first time I've used DaVinci — I needed specifically the subtitling and captioning features, and I did not do a deep dive into it. But the transcript is really, really awesome. So it does a really, really good job, especially with the transcription, and I think that's a plus, because when it comes to transcription you need accuracy, especially when you're working with audio that you don't know.

And I've noticed that you can also run audio to text via Whisper on selected lines. So you can select a couple of lines, especially if you think they did not come out as you wanted, go to selected lines, audio, audio to text (Whisper) — you see those three lines — and click on Generate. And it starts transcribing the audio for those three lines. Let's see if we are going to get a different result. It's done with the first one; this one looks a little bit better. Let's keep going, let's see. It should give us some progress on this, but the audio to text for selected lines is working really well. It shows that there's some progress there, but it still seems to be taking too long. It's done with the second one, but I still believe it's going to give us the same result. This is revolutionary from OpenAI, just letting us have Whisper. It's done. So it's just the same result. Ah, it looks really good. Let's just look at it. I'll just open up DaVinci. So the second time it actually did not give as good a result. Did it use a smaller model, or what did it use? Let's go to selected lines, audio. It still used the base model, but gave us a poorer result. Ctrl Z.
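The 42-characters-per-line, two-line splitting wished for above is not something Whisper does on its own; a rough sketch of how the raw segments from the previous example could be re-wrapped into SRT cues under that profile is shown below. This is only an illustration, not how Subtitle Edit or Nikolaj's post-processing actually works; timings are split proportionally across the sub-cues, which a real subtitling tool would handle more carefully.

```python
import textwrap

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def segments_to_srt(segments, max_chars=42, max_lines=2):
    """Re-split Whisper segments to a 42-character, two-line profile and
    emit SRT blocks, spreading each segment's time across its sub-cues."""
    cues = []
    for seg in segments:
        lines = textwrap.wrap(seg["text"].strip(), width=max_chars) or [""]
        # Group the wrapped lines into chunks of at most `max_lines` lines.
        chunks = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
        start, end = seg["start"], seg["end"]
        step = (end - start) / len(chunks)
        for j, chunk in enumerate(chunks):
            cues.append((start + j * step, start + (j + 1) * step, "\n".join(chunk)))

    blocks = [
        f"{i}\n{srt_timestamp(s)} --> {srt_timestamp(e)}\n{text}"
        for i, (s, e, text) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"

# Example, using `result` from the Whisper sketch above:
# with open("testing.srt", "w", encoding="utf-8") as f:
#     f.write(segments_to_srt(result["segments"]))
```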
Yes, the original one was way better, because I believe it understood the context of what is going on and the application being talked about. So I know this video has been long, but that's a comparison between the audio to text via Vosk/Kaldi and the audio to text via Whisper. I feel Whisper has better results, but I wish the subtitle synchronization was better, and the splitting of the subtitles too, at least to save people time and all that. I wish it was better, or looked more or less like what we have with Vosk. So each probably has its own advantages, depending on where you're looking at it from, but I think Whisper will come out on top in the long run. That's it from me. That's my comparison of the Vosk/Kaldi models and Whisper in Subtitle Edit. Thank you for watching this video, and I hope it is of value to you.
