Practical Introduction to Kaldi ASR Toolkit

Explore Kaldi for speech processing tasks through a guided tutorial, using innovative examples like animal sequences for model training.

Kaldi ASR - Hello World Tutorial

Added on 01/29/2025

Speaker 1: Okay, so this video will be a very practical intro to Kaldi. Kaldi is a toolkit for ASR and other speech processing tasks. I will be doing something very similar to a Hello World, largely inspired by the Kaldi for Dummies tutorial. After you have installed Kaldi, all you have to do is go to this public repo and clone it into the egs directory inside Kaldi. There it is. Now, while in the Kaldi for Dummies tutorial the suggested training and test recordings contain sequences of numbers, I decided to use sequences of animals instead, hence the file names, which contain the names of animals in Portuguese.

Now, as mentioned in the tutorial, you have four files in the train and test directories. You have spk2gender, which maps each speaker to a gender. In this case we have just one speaker, which is me. Then you have wav.scp, which maps an utterance ID to the full path of a recording. The full path is missing on the screen. Since my full path is different from yours, because we have different machines, all you have to do to get the proper full path is run one of my scripts. Go to Pedro's scripts and run format wav.scp. Then you also need a file named text, which maps each utterance ID to the sequence of words said in the recording. And finally you need utt2spk, which maps each utterance ID to a speaker. You do this for both the training and the test set. The content of these files needs to be sorted, so if you did not sort it, you can use one of my scripts to sort the train and test directories. Lastly, there is a file called corpus.txt, which should contain every utterance transcription that can occur in your ASR system. This file goes into the local directory.

We have just covered the acoustic data. Now let's go over the language data. Everything we need is located inside the dict folder.
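As a brief aside before the language data, the four acoustic data files described above can be sketched in Python. This is only an illustration, not the scripts from the speaker's repo: the function name `build_data_files`, the utterance IDs, the file names, and the audio path are all hypothetical. It assumes one speaker and utterance IDs prefixed with the speaker ID, and it sorts the lines because Kaldi expects these files to be sorted.

```python
def build_data_files(recordings, speaker, gender, audio_root):
    """Build the lines of wav.scp, text, utt2spk and spk2gender.

    recordings maps a (hypothetical) utterance ID to a tuple of
    (wav file name, transcription). Returns a dict of file name -> lines.
    """
    # Sort by utterance ID. Python compares strings by code point, which
    # agrees with C-locale ordering for plain ASCII IDs.
    utt_ids = sorted(recordings)

    # wav.scp: <utterance ID> <full path to the recording>
    wav_scp = [f"{u} {audio_root}/{recordings[u][0]}" for u in utt_ids]
    # text: <utterance ID> <words said in the recording>
    text = [f"{u} {recordings[u][1]}" for u in utt_ids]
    # utt2spk: <utterance ID> <speaker ID>
    utt2spk = [f"{u} {speaker}" for u in utt_ids]
    # spk2gender: <speaker ID> <m or f>
    spk2gender = [f"{speaker} {gender}"]

    return {
        "wav.scp": wav_scp,
        "text": text,
        "utt2spk": utt2spk,
        "spk2gender": spk2gender,
    }


if __name__ == "__main__":
    # Hypothetical recordings with Portuguese animal names.
    example = {
        "pedro_gato_pato": ("gato_pato.wav", "gato pato"),
        "pedro_lobo_gato": ("lobo_gato.wav", "lobo gato"),
    }
    files = build_data_files(example, "pedro", "m", "/path/to/audio")
    for name, lines in files.items():
        print(name)
        for line in lines:
            print("  " + line)
```

Writing each list to a file with one entry per line, once for the train directory and once for the test directory, would give the layout the tutorial describes.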
There is lexicon.txt, known in the ASR literature as a pronunciation lexicon or pronunciation dictionary, which essentially maps each word to its phonemic representation. It is important to remember that one word can have more than one representation. Then there is a list of non-silence phones, a list of silence phones, and optional silence. The rest of the files in the repository are just copied from other folders in egs, according to the Kaldi for Dummies tutorial.

Now, all you need to do to train your first ASR system and test it using Kaldi is run format wav.scp, to get the full paths of the audio files into the wav.scp of the training and test sets, and then just hit run. This should be fast, since both the training and test sets are very small. Everything went all right: the model was trained and decoding was performed on the test set.

As you can see, we have a 5% word error rate after decoding, which means there is a mistake somewhere. To find it, we can use this line from the repo, which shows us the transcription generated by our system. It essentially writes the words matching the best path to an output file called out.txt. And if we analyze out.txt, we can see that we have a total of 20 words in the recordings of our test set, and we have a mistake here. The word dog is being wrongly inserted. Let's hear the recording where that is happening. You can hear a small breath at the start. It might have to do with that. According to the word error rate formula, we have 1 insertion, 0 deletions and 0 substitutions, which makes 1 over 20, explaining our word error rate of 5%.

Hope you find this video helpful. Shoutout to everybody maintaining and developing Kaldi, and good luck developing your speech models.
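The word error rate arithmetic mentioned above is WER = (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions in a minimum edit-distance alignment and N is the number of words in the reference. Below is a minimal Python sketch of that calculation. It is not Kaldi's own scoring code, and the function names and placeholder words are hypothetical. The real test transcriptions are not reproduced here, so the example only mimics the situation described: 20 reference words and one extra "dog" at the start of the hypothesis.

```python
def word_error_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimum
    edit-distance alignment of two word lists."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (total errors, subs, dels, ins) for
    # reference[:i] aligned against hypothesis[:j].
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, d, ins)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = cost[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            # Tuples compare by total errors first; ties between equally
            # cheap alignments are broken arbitrarily.
            cost[i][j] = min(diag, deletion, insertion)

    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, with N the number of reference words."""
    subs, dels, ins = word_error_counts(reference, hypothesis)
    return (subs + dels + ins) / len(reference)


if __name__ == "__main__":
    # Placeholder words standing in for the 20 words of the test set.
    reference = [f"animal{i}" for i in range(1, 21)]
    # Hypothetical decoder output with one wrongly inserted word.
    hypothesis = ["dog"] + reference
    print(word_error_counts(reference, hypothesis))
    print(f"{word_error_rate(reference, hypothesis):.0%}")
```

With one insertion against 20 reference words, the sketch gives 1/20, matching the 5% reported in the video.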