Optimizing Audio Transcription with Vosk and Kaldi
Discusses training Vosk models, Ruby automation, and NLP tools. Covers language model interpolation, setup, and future exploration with VideoGrep.
Notes on Vosk and Vosk Utilities
Added on 01/29/2025

Speaker 2: At the moment, I'm just doing some initial research on training a Vosk model. So my use case here is, I have a collection of audio and video recordings that I'm going to try to transcribe. And

Speaker 3: I need something that's sort of specific

Speaker 2: More specific than what the Whisper models can currently do.

Speaker 3: So there's this version and then there's an lgraph version.

Speaker 2: What is the difference between the Vosk Model English US and the Vosk Model English US lgraph?

Speaker 3: Okay, so the Vosk Model English US is the standard model,

Speaker 2: which has a fixed vocabulary. The vocabulary is determined at training time and cannot be changed at runtime, which makes it efficient and suitable for situations where you know what kind of speech you'll be transcribing. The lgraph model, on the other hand, uses a dynamic graph structure, which allows you to modify the vocabulary on the fly, even while the speech recognition process is running. That makes it ideal for situations where you might encounter new or unexpected words, which for this use case is definitely the ideal choice.
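
A minimal sketch of the runtime-vocabulary behavior described above, using the vosk Python package; the model directory, audio file, and phrase list are placeholders, not anything from this session:

```python
# Sketch: supplying the vocabulary at runtime, which lgraph models support.
# Model path, audio file, and phrases are hypothetical.
import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("vosk-model-en-us-0.22-lgraph")  # hypothetical local path
wf = wave.open("recording.wav", "rb")          # 16 kHz mono PCM assumed

# With an lgraph model, the recognizer accepts a JSON list of phrases,
# so the vocabulary can be changed per run without retraining.
grammar = json.dumps(["start recording", "stop recording", "[unk]"])
rec = KaldiRecognizer(model, wf.getframerate(), grammar)

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```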

Speaker 3: I think I'm asking the wrong question here.

Speaker 2: So, what I'm hoping to do is automate some of this process using Ruby. I'm just trying to find out if there's anything that already exists. There sort of was for PocketSphinx, but I don't think there's much Vosk-related material. So, when it talks about language model interpolation, it's referring to the step where, once the models are trained, you interpolate them to create a new model that combines the strengths of both. This can be done using a variety of methods. What I had in mind was the pre-processing steps before you would interpolate the models: collecting the data (the text corpus), cleaning the data, and then training the generic model as well as the domain-specific model.
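
A rough sketch of the pre-processing step described here, assuming plain-text input; the file names are placeholders. It normalizes a raw corpus into the one-sentence-per-line form that n-gram training tools generally expect:

```python
# Sketch: clean a raw text corpus before language model training.
# Input/output file names are hypothetical.
import re

def clean_line(line: str) -> str:
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)     # keep letters and apostrophes
    return re.sub(r"\s+", " ", line).strip()   # collapse whitespace

with open("raw_corpus.txt") as src, open("corpus_clean.txt", "w") as dst:
    for raw in src:
        cleaned = clean_line(raw)
        if cleaned:
            dst.write(cleaned + "\n")
```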

Speaker 3: So I was thinking of

Speaker 2: Ruby gems related to natural language processing, but I think it caught me out pretty good, because it then asked me, well, which libraries do you want to use? And I'm like, shoot, I don't think there really are any already for Vosk or Kaldi. So I suggested using TTS (text-to-speech) and Rake; we'll see about that. The Hugging Face blog post discusses using n-gram language models to boost the performance of the Wav2Vec2 speech recognition model. It claimed this could be implemented using libraries like TTS or Rake that provide n-gram language modeling capabilities, which isn't true at all, by the way; as far as I can tell, that functionality doesn't exist, unless I missed something. Let me check out the link here. It's clearly really good at aggregating search results, though. I was about to ask a deeper question about how to go about this programmatically, but at least with the basic service, it's not the best at that.
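
Since those gems don't actually provide this, here is a toy illustration (not any particular library) of what n-gram language modeling means: counting word pairs to estimate the probability of the next word given the current one:

```python
# Toy bigram model: estimate P(next | word) from a cleaned corpus.
# Meant only to illustrate the concept the blog post relies on;
# the corpus file name is hypothetical.
from collections import Counter, defaultdict

bigrams = defaultdict(Counter)
with open("corpus_clean.txt") as f:
    for line in f:
        words = line.split()
        for a, b in zip(words, words[1:]):
            bigrams[a][b] += 1

def p_next(word: str, nxt: str) -> float:
    total = sum(bigrams[word].values())
    return bigrams[word][nxt] / total if total else 0.0

print(p_next("speech", "recognition"))
```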

Speaker 3: I failed to find the interpreter.

Speaker 4: Now what am I talking about?

Speaker 2: Setuptools, system site-packages.

Speaker 3: I don't know if that's what it was.

Speaker 2: I always thought these two were the same thing, kind of like RVM and rbenv in Ruby. Not exactly.

Speaker 3: Oh, I want 3.10.

Speaker 2: All right, oh, okay, you have to install it first, I see. Okay, so you have to install the specified version first with pyenv. Then you can create a virtual environment with that version.
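
The workflow, assuming pyenv with the pyenv-virtualenv plugin; the version number and environment name are just examples:

```sh
pyenv install 3.10.13              # install the interpreter first
pyenv virtualenv 3.10.13 vosk-env  # then create a virtualenv from it
pyenv activate vosk-env
```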

Speaker 2: So this is another reason why I'm taking the time to go through this: these tools are a bit further along in development for models like Vosk or PocketSphinx, or, I forget the other one, Kaldi. But as you can see, it does take some configuring.

Speaker 4: Okay.

Speaker 2: I think

Speaker 4: And you can find references to...

Speaker 3: Eight point seven.

Speaker 4: 18.7. Why is this necessary? Oh, okay.

Speaker 3: Oh, I see.

Speaker 2: So the problem might be the fact that I had rustup installed, as opposed to just Rust.

Speaker 3: Thank you.

Speaker 2: Well, given that it's some sort of packaging system, I'm not sure if it's actually required for this particular piece. The first thing I'm actually giving it is an audio track here.

Speaker 5: [Test audio: a long stream of voice-command dictation ("cap", "space", "slash", "tab", "yank", "scribe", and similar spoken editing commands), transcribed phonetically rather than as natural speech.]

Speaker 6: From spontaneous generation of contextualized sense. Contact.

Speaker 3: Contextualized sense. Electro-rhythmic pulses propagate, signaling the transcending self-referential state.

Speaker 6: Propagate signaling the transcending self-referential. Emergent patterns at recursive depths.

Speaker 3: Conscious flow engaged, a quick tremor of deep resonance.

Speaker 2: All right, so now I'm going to give it a go.

Speaker 3: Let's see. All right, building a database of words.

Speaker 2: All right, building a database of words. So, yeah, the scripts for parsing the transcripts can probably be adapted from the ones that are parsing the files. I guess we're going to start with the LangChain chunker, since that seemed to be the chunker to use.
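
A sketch of the "database of words" step, assuming the transcripts are plain-text files in one directory; the folder and output names are placeholders:

```python
# Sketch: build a word-frequency list from a folder of transcripts.
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("transcripts").glob("*.txt"):   # hypothetical folder
    counts.update(path.read_text().lower().split())

with open("wordlist.txt", "w") as f:
    for word, n in counts.most_common():
        f.write(f"{word}\t{n}\n")
```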

Speaker 3: So yeah, sorry, it's a little loud.

Speaker 2: Yeah, this works pretty well, pretty impressive. Later on today, I hope to explore the VideoGrep and MP4Grep tools.

Speaker 3: So yeah, so the next step will be to look into how...

Speaker 2: the text corpus is to be structured. The Kaldi toolkit requires a specific structure for the text it's going to be trained on. And that'll take a few days; it's a fairly grindy set of tasks that you have to really just zero in on for a while.
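
For reference, Kaldi's standard data preparation expects a per-utterance layout along these lines: a `text` file of utterance IDs and transcripts, plus `wav.scp` and `utt2spk`. The IDs and paths below are hypothetical:

```python
# Sketch: write a Kaldi-style data directory from a list of utterances.
# Utterance IDs are prefixed with the speaker so Kaldi's sorted-order
# checks pass.
import os

utterances = [
    ("spk1-utt001", "recordings/utt001.wav", "spk1", "hello world"),
    ("spk1-utt002", "recordings/utt002.wav", "spk1", "testing vosk"),
]

os.makedirs("data/train", exist_ok=True)
with open("data/train/text", "w") as text, \
     open("data/train/wav.scp", "w") as scp, \
     open("data/train/utt2spk", "w") as u2s:
    for utt_id, wav, spk, transcript in utterances:
        text.write(f"{utt_id} {transcript}\n")  # utterance ID + transcript
        scp.write(f"{utt_id} {wav}\n")          # utterance ID + audio path
        u2s.write(f"{utt_id} {spk}\n")          # utterance ID + speaker ID
```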

Speaker 3: Bye-bye.
