Comparing Audio-to-Text Models in Subtitle Edit (Full Transcript)

Explore the differences between Vosk, Kaldi, and Whisper models for audio transcription in Subtitle Edit 3.6.10 to enhance subtitle creation.

Speaker 1: In this video, I'll do an update to a video I did a while back, a couple of months ago. This is a comparison of the audio-to-text feature in Subtitle Edit using Vosk/Kaldi models versus audio-to-text using Whisper models. There is a big change, and this is in Subtitle Edit 3.6.10, the official version.

Let's begin by importing a video or an audio file. Let's do an audio file. Let's go to audio files, go to testing, and then click Open. Our file is available. Click on this to add the waveform. And now let's begin our test.

I already have everything downloaded, so I'll go to Video, where there's Audio to text (Vosk/Kaldi) and Audio to text (Whisper). Which one you begin with really depends on what you want, but let's follow the order and go with the first one. Click Audio to text. Because this audio is in English, the model that I have ready is an English model, so that's the only thing we need to set there. Then there's "Use post-processing (line merge, fix casing, punctuation, and more)". If you're satisfied with that, let's click Generate. There we go: it's loading the model and transcribing audio to text, and it gives you the time remaining.

The first difference you'll notice is that with audio-to-text using Vosk, at the time of recording this video, you see how much time is remaining. In Whisper, and there are two models or versions or modes of Whisper, you see a count-up timer instead of a countdown like what you're seeing here. So with this, you actually know how much time is remaining.

Next up, we'll see what separates Vosk and Kaldi from Whisper. But a quick mention: with Vosk, for example, if a subtitle begins here, it begins there. With Whisper, it begins at the start, so there's some work. For example, this is what it looks like without doing anything, if I play: "Hi, everyone. This is David. And this is a clarification."
Now, that is just a filler word. I don't know why it recognizes that as an "I". But let's keep going. "I did this video on how to export SRT subtitles in DaVinci Resolve." I love the way that it does the subtitles. The segmentation is really, really awesome. It knows where to stop, leave a space, and then continue. But you can already see that we have a mistake: "I did this video on how to export SRT subtitles in DaVinci Resolve", not "Venture Resolve". We'll see and compare with Whisper. "And unfortunately, I mentioned that it's not possible to export." So, for example, there is a full stop here. I just paused, I didn't stop. "An SRT file or VTT file separately without rendering the video." It doesn't do a good job there, but that's what we get.

I believe I can pull up another instance of Subtitle Edit. Let's see if we can actually do that. Let's go with testing, Open, and then go to Video, Audio to text (Whisper). For audio-to-text using Whisper, there are two modes of Whisper in Subtitle Edit at the time of recording this video. If you right-click on this section, and I hope they improve this, you'll see there's Whisper PHP original, and there's Whisper CPP, that is, C++. Whisper C++ is faster. It puts less text on the subtitles and is therefore much better than Whisper PHP. Whisper CPP also takes a shorter time. So we'll use Whisper CPP, and you'll see it here. Because our audio is in English, we'll use the base model that I have here. Then ensure "Use post-processing (line merge, fix casing, punctuation, and more)" is on, and click Generate. With this, you see that it counts up the minutes or the seconds. So unlike the Vosk models, where we have a good estimate of how much time is remaining, with this one we have to guess how much time remains. But it's almost done.
Because that's a 2 minute 30 second clip, I believe it's going to be done in a second or so. There we go. This is what it looks like. In terms of the transcript, I think it has done a good job. But notice that the text begins somewhere here, before I even speak. So you have a lot of manual work to do here. You have to ensure that this begins here, so you'll have to pull this over to there. For example, let's play: "Hi, everyone. This is David. And this is a clarification."

You see the way in Vosk we have segmentation. This is not available in Whisper. I wish it was, because Whisper is doing a spectacular job in terms of the transcript. If we play this: "I did this video on how to export SRT subtitles in DaVinci Resolve." You see it just picks up everything. I don't know what they're using in the Vosk models, because there we have a transcript error: "I did this video on how to export SRT subtitles in DaVinci Resolve." The other problem with Vosk is that there's a full stop. In Whisper, there's no full stop, but the "and" comes in here. "And unfortunately..." You see it actually even cuts out some of this, so you can just pull that back. I don't think it would be relevant to leave this much space here. And then we have the "and" here, so you'd rather move it to this other line. Let's see: "And unfortunately, I mentioned that it's not possible to..."

It's doing a good job on the transcript. I wish it just segmented the subtitles the way we have them in the Vosk models. I believe it would be a game changer and a time-saving application. But simply put, you see that Whisper is faster. It has some varied results, especially in the way it's not respecting some of these spaces. For example, here we had this text all the way up to here. But this is an update to a video I did a while back.
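[Editor's note: The pause-based segmentation the speaker wishes Whisper had can be illustrated with a small sketch. This is not how Subtitle Edit, Vosk, or Whisper works internally. It assumes you already have word-level timestamps from some recognizer, and the names `Word`, `Cue`, `segment_on_pauses`, and the 0.6-second `max_gap` threshold are all hypothetical choices made for illustration.]

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float


@dataclass
class Cue:
    start: float
    end: float
    text: str


def _to_cue(words):
    """Collapse a run of words into one subtitle cue."""
    return Cue(words[0].start, words[-1].end, " ".join(w.text for w in words))


def segment_on_pauses(words, max_gap=0.6):
    """Start a new cue whenever the silence between two consecutive
    words is longer than max_gap seconds (an assumed threshold)."""
    cues, current = [], []
    for word in words:
        if current and word.start - current[-1].end > max_gap:
            cues.append(_to_cue(current))
            current = []
        current.append(word)
    if current:
        cues.append(_to_cue(current))
    return cues


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


if __name__ == "__main__":
    # Made-up timings, only to show the grouping.
    demo = [
        Word("Hi,", 1.00, 1.20), Word("everyone.", 1.25, 1.80),
        Word("This", 2.90, 3.10), Word("is", 3.15, 3.30), Word("David.", 3.35, 3.90),
    ]
    for i, cue in enumerate(segment_on_pauses(demo), start=1):
        print(i)
        print(f"{srt_time(cue.start)} --> {srt_time(cue.end)}")
        print(cue.text)
        print()
```

[Because each cue starts at its first word's start time, this approach would also avoid the "text begins before I even speak" problem shown above, provided the word timestamps themselves are accurate.]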
And now Whisper has come of age; it's much faster. We have a countdown timer, and we have a character limit. I believe it's 86. This is actually the Whisper version, and it's 86 characters. If you decide to use the Whisper original, that one is going to give a lot of text, and it's going to take much more time. But hopefully Whisper will keep getting better and better, and the Vosk models too, because all of these are free options that we can use to automatically transcribe audio to text and into subtitles in Subtitle Edit. Maybe this video is going to be of value to you in deciding what to use and in saving time. Thank you for watching this video. Until next time, stay safe and never stop learning.
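[Editor's note: As a rough illustration of what a per-subtitle character limit means, the sketch below splits a long transcript string into chunks of at most 86 characters at word boundaries, using Python's standard `textwrap` module. The figure 86 is the one the speaker quotes and is treated here as an assumption; `split_for_subtitles` is a hypothetical helper, not part of Subtitle Edit or Whisper.]

```python
import textwrap

MAX_CHARS = 86  # limit quoted in the video; an assumption, not a verified setting


def split_for_subtitles(text, max_chars=MAX_CHARS):
    """Split text into chunks no longer than max_chars,
    breaking at whitespace where possible."""
    return textwrap.wrap(text, width=max_chars)


if __name__ == "__main__":
    transcript = (
        "I did this video on how to export SRT subtitles in DaVinci Resolve. "
        "And unfortunately, I mentioned that it's not possible to export an "
        "SRT file or VTT file separately without rendering the video."
    )
    for chunk in split_for_subtitles(transcript):
        print(len(chunk), chunk)
```

[A length cap alone says nothing about where the speaker pauses, which is why the Whisper output in the demo still needed manual adjustment.]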
