Understanding Speech Recognition Systems: A Deep Dive
Explore the processes of training and testing ASR systems, feature extraction, language model preparation, and more with hands-on activities and breakout sessions.
6. Basics of Kaldi Toolkit and Data Preparation for ASR system development
Added on 01/29/2025

Speaker 1: That is all about the training process. The block diagram we are showing now is for the testing process: once we get the test data, we do feature extraction, pass it through the acoustic model, then decoding happens along with the language model, and we get the text output from the speech signal. So that is the overall flow of an automatic speech recognition system. In this session, today and tomorrow, we will work with the mini-librispeech database, which is English-language data. The training set contains around five hours of speech data, and the test set is around two hours. For the language model, we have used the whole LibriSpeech text, which is around 100 hours, or I think 1,000 hours, of data; the whole text corpus was taken for language model building, and from that, using pruning techniques that Professor Samudra Vijay will discuss tomorrow, we have taken a small, pruned language model. The next part will be how to download the data and how to prepare it for further processing with the Kaldi toolkit. After that comes language model preparation, which has two parts: dictionary preparation and the language model itself. We will do this in two ways: one using a pre-trained model available in open source, and one custom, where, given the vocabulary and lexicon, we design it ourselves. That will be helpful when we are designing for Indian languages, because for Indian languages we cannot get a pre-trained model, so the custom approach is the one that will help. Then we will go for feature extraction; we are using Mel Frequency Cepstral Coefficients (MFCC) here. In the morning session Professor Umesh discussed very clearly why MFCC features are useful for the ASR task, and Professor Prasanna will take a session in the evening on feature extraction. Then we will go for monophone training. Today we will go up to the feature extraction stage, and from tomorrow we can go for monophone training and decoding, then triphone training and decoding. At the end we will apply linear discriminant analysis (LDA), maximum likelihood linear transform (MLLT), and speaker adaptive training (SAT); these three techniques are used to further improve the performance of the ASR system. There are some prerequisites and some small commands; I think Professor Samudra Vijay has already covered these commands in the prerequisite sessions. So now we will clone the mini-librispeech data and the recipes from GitHub.

Speaker 2: I think I will go to the website.

Speaker 1: People can also do this in parallel; I am pasting it in the chat box. Let me also find it, right?

Speaker 2: Yeah.

Speaker 1: So I think we have already explained how to make this corpus folder, and I think we have already requested all the participants to download these seven files, so I hope all of you have downloaded them. I will just clone this as I have written here, and I will go one by one. First, please open the terminal; I will open a new terminal. It is written on this website. Then I will go to my Kaldi root directory with cd, and then into egs. There are many other examples in the egs folder. Now, inside this folder, I have to clone the recipe, so I will write git clone, as written here. Is my screen visible to all?

Speaker 3: The screen is visible, but the font size could be slightly bigger.

Speaker 2: Font size, I have to, yeah.

Speaker 3: We'll make that.

Speaker 1: Is this font size fine? Yeah. So from this repository I have cloned this folder; I hope you people are also following. Now I will go to my Kaldi folder, and inside egs is the folder I have cloned. Inside that, I will go to s5. As I have mentioned here, we have to cd into this folder and then into s5. So now we are in egs; I cd into the cloned recipe, then into s5, and inside s5 I can see what files are there: these are the files. Now I have to create one folder called corpus, with mkdir. If you run ls now, you can see this corpus folder has been created. Inside this corpus folder I have to copy these seven files. Actually, we generally do not require all seven files; we require five of them. Out of the seven, I need the LibriSpeech lexicon, the vocab (vocabulary), the training data, the development data, and the pruned 3-gram model (pruned at 3e-7), which is the small, pre-trained language model. These five files I need. I have already downloaded them, so I will take them from my downloads; I think you people have also downloaded them. I will take these five files and place them in the corpus folder I created. So, up to this point, is everyone following?
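
For reference, the command sequence just described looks roughly like this; the repository URL, the Kaldi path, and the exact download file names are placeholders based on the description, not copied from the session:

    cd ~/kaldi/egs                              # go to the Kaldi examples directory (path is an assumption)
    git clone <recipe-repository-url>           # the URL shared in the chat box
    cd <cloned-recipe-folder>/s5                # enter the recipe's s5 directory
    mkdir corpus                                # create the corpus folder
    # copy the five downloaded files (names shown are the usual mini-librispeech ones, assumed here)
    cp ~/Downloads/librispeech-vocab.txt ~/Downloads/librispeech-lexicon.txt \
       ~/Downloads/train-clean-5.tar.gz ~/Downloads/dev-clean-2.tar.gz \
       ~/Downloads/3-gram.pruned.3e-7.arpa.gz corpus/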

Speaker 4: So, actually, sir, for the link you have sent on github.com: first we need to go to Kaldi, and then we need to run this command, this GitHub command?

Speaker 1: Yeah, you have to open one terminal, and inside that you have to cd to this folder, the egs folder.

Speaker 5: OK.

Speaker 1: Once you reach in the EGS folder, there you do git clone.

Speaker 4: OK, OK, OK.

Speaker 6: So, sir, please paste the link once more in the chat box, because I was on my mobile and I switched to my laptop, so I am not able to see the link. Okay. Okay.

Speaker 2: Okay. I will do that. Yeah, I think Aditya Jain have said.

Speaker 1: So now, inside, as we have here, if you see: run_custom.sh and run_custom_prelanguagemodel.sh. We will start with the custom language model one. Initially, if we look at this code, I think we understand these first six lines. Is this screen visible, or do I have to increase the font here also?

Speaker 4: Please increase the font size, Ravinder.

Speaker 1: Yeah.

Speaker 5: Yes. Yes. Yes.

Speaker 3: Participants can also open it using gedit.

Speaker 2: What is the quality? Reference is not working. OK, here.

Speaker 1: So, is it fine now?

Speaker 3: Okay, yeah, they can also open it on their laptop as well.

Speaker 1: Yeah, you can open it on your own system. So now, I have declared my data as ./corpus; the data will be taken from inside this corpus folder. And I have set this stage variable to minus 1. That is needed if I have already run everything once and want to run it again: then I need to delete all the generated files and start from scratch, from the starting point, so I set the stage to minus 1, which will delete all of that. If I am starting fresh, I will always keep the stage equal to 0. And this export KALDI_ROOT exports my Kaldi root directory, which in my case will be /home/jagbandhu/kaldi. Now we will go to the next part. The first part, yeah, I will go back to the document. We have completed this: we have taken the training data, test data, vocabulary, lexicon, and the language model, and we have copied these files. Now, for the data preparation part, we have to create our data. The first thing is that the data we downloaded here comes as tar files, so we have to untar them. And these two files, the lexicon and the vocab, also have to be pushed to the path Kaldi expects, in the format Kaldi expects; for that, these two scripts have been written. If you look at them, these are the unique vocabulary words present in the English text of the LibriSpeech dataset, and these are the lexicon entries. The lexicon means that on one side is the vocabulary word and on the other side its phone-level transcription, OK? If I go here, I am making the directory data, and the corpus folder we have already created. Then we go to local and see what this untar.sh is doing: generally it takes these two archives, untars them, and creates two folders. Now let us run it and see what it does. I will comment out all the other blocks and run only this 0th block.
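
A minimal sketch of what the top of the run script just described might look like; the variable names, the cleanup behaviour at stage -1, and the untar.sh arguments are assumptions based on the explanation, not copied from the actual recipe:

    #!/usr/bin/env bash
    data=./corpus                              # downloaded corpus files live here
    stage=0                                    # set to -1 to wipe generated folders and restart from scratch
    export KALDI_ROOT=/home/jagbandhu/kaldi    # path to the local Kaldi installation (example path)

    if [ $stage -le -1 ]; then
      rm -rf data exp mfcc                     # delete everything produced by an earlier run (assumed behaviour)
    fi

    if [ $stage -le 0 ]; then
      mkdir -p data
      local/untar.sh "$data" data              # untar train/dev archives (script name from the session, arguments assumed)
    fi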

Speaker 3: We do not have to do anything right now.

Speaker 1: Yes, you can comment it out and start. What I am planning is that once I reach some third or fourth stage, I will give a breakout of some 10-15 minutes to catch up on these things. So I have commented this out. Now I will run it.

Speaker 7: Hello, Jagbandhu, can you show me from the git clone part? Actually, that part itself has not been done for me. What was the first step, the git clone? I think somehow I missed it.

Speaker 1: Yeah, this is the step. You have to go to kaldi/egs, and inside that you have to do this.

Speaker 7: OK, just wait a second. OK.

Speaker 1: Maybe what I will do, ma'am, is quickly take some 15-20 minutes, and then we will divide into breakout rooms and the volunteers will help you out. It will be better, I think, and easier to follow.

Speaker 4: Yeah, so, Jagbandhu, please repeat once again so that others can follow. Yeah. The steps, whatever you have discussed, after cloning.

Speaker 1: OK. So after cloning, we just have to go into this. Up to this point we are doing the cloning. Once we have done the clone, we create a corpus folder inside that recipe in the egs folder. Maybe it will be easier to show it from here. Yeah, is it visible now, Om Prakash sir?

Speaker 2: Hello.

Speaker 1: Yes, it is visible.

Speaker 4: Yeah, so once you have done this,

Speaker 1: So once you have done the git clone, in the same terminal you just type cd into this folder, then into s5. Then you create a directory inside that called corpus. Then you copy these files, these five files, whatever you downloaded earlier from these five links, and paste them inside that corpus folder. Then we will go and run this script. Is it fine? Do we need all the files or only the five files? Only five files, not all seven; the first two files are not required. Which five?

Speaker 8: Can you just say which five?

Speaker 1: Yeah, the last five files; the first two are not required, OK. So next we can go. I am just running the 0th stage, commenting out all the other parts here. Let us see what we are getting.

Speaker 1: So yeah, how to prepare a lexicon for a language? OK, that is a good question. Let me show you the lexicon.txt file. If you have text data, first you have to find the unique words from it: if you have a text corpus, you have to find the unique words in that corpus. Once you have got the unique words, you have to prepare the phone transcription for each of them. Is it clear? Nobody is asking a question? Yeah, for example, Hindi: if you have some Hindi text, some sentences, you have to find the unique Hindi words from that. Once you have got the unique words, suppose you got 1,000 unique words, then for each word you have to write its phone transcription. Suppose kamal is one Hindi word; you can write ka, and so on. In this case, if you see the entry in the seventh line, A-G-A, they are writing it like this; in a similar way you can write it for Hindi and for other languages. Yeah. So we will go to the next part. The first stage has been executed; now I will show what outputs we are getting from it. Once it has executed, look in the corpus folder: this LibriSpeech folder has been created, and inside it this structure has been created. If you see, this is the untarred train-clean data; inside it are my speech files, all in .flac format, and the corresponding transcriptions are here. Transcriptions means the sentences the speaker has spoken. These are the IDs, 19, 198, then 0001, 0002 and so on; these are my utterance IDs, and these are the utterance transcriptions. Whenever you want to create things from scratch, you also have to maintain your data in this format. So the first part is done. Now we will go for the second part: we will set the stage and start from stage 2.
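
As a rough illustration of how the unique word list can be pulled out of a text corpus with shell commands (the file names are placeholders, and the phone transcriptions in the lexicon still have to be written or generated separately):

    # collect the unique words from a plain-text corpus (one sentence per line)
    tr ' ' '\n' < corpus_text.txt | grep -v '^$' | sort -u > vocab.txt

    # the lexicon then maps each word to a phone sequence, one word per line, e.g. (entries are illustrative):
    #   KAMAL   k a m a l
    #   AGA     AA1 G AH0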

Speaker 2: Stage 0 we have done, stage 1.

Speaker 1: Now we will run only this stage 1 and see. We have done this for the speech data; now we will do it for the language model part. So what is required for that? In local, this is my data_lm.sh. It takes the destination directory and the local directory. It will create a folder called data; inside data it will create a folder called local, and inside local it will copy these two files, the LibriSpeech lexicon text and vocab text, OK? I will do that and show you. So, now it has finished successfully; let us see what it has created. You see data/local, and inside that lm, so it has created two shortcuts, symbolic links, inside this. This is my text one and this is my vocabulary one, OK. Up to this point, is it fine? Now, if you look at my data in the corpus, all the data is in FLAC format, so you have to check whether flac is installed on your system. How do you check that? You just type flac. If you are getting this output, then flac is installed on your system. If it is not installed, you can try to install it using these commands; if you have cloned the repository on your system, you can see these commands there and try to install it.
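
A quick way to check for and install flac, roughly as described; the install line assumes a Debian/Ubuntu-style system, which is an assumption here:

    flac                          # if installed, this prints flac's usage/version text instead of "command not found"
    sudo apt-get install flac     # one way to install it on Debian/Ubuntu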

Speaker 4: Before you go to the next part, would you please copy and paste those flac-related commands?

Speaker 1: Yeah, those commands, okay. I think you have already cloned this file, no? You can get them from there. I will show you: this run_custom.sh, which we are running now; if you open it, you will find these commands. Is that fine, or do you need me to copy and paste them over there?

Speaker 4: It is. OK, got it, got it, yeah. Yeah.

Speaker 1: So now, once that is done, we have to do our data preparation for the speech part. When we go to the Kaldi-specific format, we have to create a train folder and a test folder, since we have training data and test data, inside the data folder; that is the Kaldi-specific format. For that we need these five files: wav.scp, text, utt2spk (utterance to speaker), spk2utt (speaker to utterance), and spk2gender (speaker to gender). How are they created? wav.scp contains each utterance ID and the corresponding location of its audio. Similarly, text contains the utterance ID and its corresponding text transcription. utt2spk contains the utterance ID and its corresponding speaker; spk2utt maps each speaker ID to its utterance IDs; and spk2gender maps each speaker ID to a gender. Now we will generate them. Kaldi already has a data preparation script for this inside local. If we give it the files in the format I showed you earlier, that is, for LibriSpeech, in the corpus folder, inside train-clean, the 19/198/... kind of layout, it will automatically create all of these: speaker to utterance, utterance to speaker, and so on. So now we will run it and see. Yeah, for both sets, training and testing, these files have now been created; let us go into that folder and see what they look like. Go to the data folder: earlier we created this local folder for the language model, and now these two folders have been created. Go inside train_clean_5. First, wav.scp. Before that, flac has to be installed on your system; only then can we run this command. This is my utterance ID; Kaldi generally takes wav files, but the data is in FLAC format, so this command converts FLAC to wav on the fly, which is why the command is quoted here, and then comes the location of the file corresponding to this utterance ID. That is my wav.scp. If I go to text, this is the utterance ID and its corresponding transcription. spk2utt: this is the speaker ID and its corresponding utterances, OK. Next is spk2gender: you can see there are 29 speakers in the training set and their corresponding gender labels. Then utt2spk: the utterance ID and its corresponding speaker ID. Now let us see what is inside the development set: the same five files are generated; you see this wav.scp is like that, and the same kind of files are there. The development set is for testing purposes and the training set is for training. So now we will go for the feature extraction part; maybe after that I will give a break so you can catch up. We have prepared our speech files; now comes MFCC feature extraction. Kaldi has make_mfcc.sh for this: once we have created the train and test folders inside the data folder, if we give them as input, we can extract these features. This part is written for parallel processing; if you want to run it serially, that part is not required. It will create an MFCC directory, which I have named mfcc, inside this s5 folder, and the MFCC files will be generated inside it. It takes two things as input. Since it is a for loop over two sets, the testing set and the training set, it first takes the testing set and then the training set; that is defined here. Then it goes to make_mfcc: the log files will go inside an exp (experiment) folder that it creates, and the MFCC files, which are generally stored in .ark format, will be stored in the mfcc directory. After that it does cepstral mean and variance normalization: once the MFCCs are generated, it computes the normalization statistics and again stores them in the mfcc directory. Let us run it and see. I have to run the third stage, so I set the stage to 3 here. It will take a little time.
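
To make the file formats and calls concrete, here is roughly what they look like in a LibriSpeech-style setup; the utterance IDs, paths, and job count are illustrative, not copied from the session:

    # data/train_clean_5/wav.scp    -> utterance ID, then a quoted flac-to-wav command and the file location, e.g.
    #   19-198-0001  flac -c -d -s /path/to/corpus/LibriSpeech/train-clean-5/19/198/19-198-0001.flac |
    # data/train_clean_5/text       -> utterance ID followed by its word transcription
    # data/train_clean_5/utt2spk    -> utterance ID followed by speaker ID
    # data/train_clean_5/spk2utt    -> speaker ID followed by all of its utterance IDs
    # data/train_clean_5/spk2gender -> speaker ID followed by m or f

    # stage 3: MFCC extraction and CMVN for both sets (a sketch of the usual Kaldi calls)
    for part in dev_clean_2 train_clean_5; do
      steps/make_mfcc.sh --cmd run.pl --nj 12 data/$part exp/make_mfcc/$part mfcc
      steps/compute_cmvn_stats.sh data/$part exp/make_mfcc/$part mfcc
    done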

Speaker 4: So, for running, suppose I have some database: what files do we actually require beforehand to run this recipe? As far as I have understood, one thing required is the transcript, this trans.txt file. Yes? Yeah, we need, yes, two files:

Speaker 1: one will be the lexicon and the other the vocabulary. The vocabulary is the unique words present in whatever text corpus you have taken, whatever unique words it has, OK? Maybe the text corpus is a larger one, and whatever speech files are there may cover a subset of that text corpus.

Speaker 4: So manually, we need to prepare this.

Speaker 1: Yeah, you need to prepare this yourself. It may not require manual work entirely, though: once you have the text files, you can use some shell commands to get the unique words.

Speaker 9: OK.

Speaker 1: So once you got that, then you have to write your transcription manually. This part you have to write manually. Am I correct, Samudra Vijay sir?

Speaker 4: Lexicon part.

Speaker 3: Yes, the lexicon has to be manually prepared. There are some grapheme-to-phoneme (G2P) rules, but it is still best to check the result manually.

Speaker 4: So sir, actually, suppose in the first one, A AH0 is written. What does it actually mean, given that the file has two columns?

Speaker 3: Yes. The second column is preceded by a tab after the first column. Look at the first line: 'A' is a letter; how should it be pronounced? It can be pronounced in two ways: one is 'uh', as in "a mango" or something, or it can be 'ay'. So there are two pronunciations; the first line and the second line say that whenever you see the separate word capital A, it can be pronounced in two ways. What is there on the right side is the labels for the speech sounds, the phonemes.

Speaker 4: Okay. So, sir, we have to prepare this for all the phonemes, that is, we have to prepare this for all the unique words in the word transcripts? Okay, unique words. And there should be some particular convention for the labels used in the second column?

Speaker 3: The second column is the list of the phoneme labels, yes. That tells how that word has to be pronounced.

Speaker 4: OK. Thank you, sir.

Speaker 10: Sir, how is this number decided, like 0, 1? Is there an order or something?

Speaker 3: The number in EY1, EY2, et cetera: this is the convention that this particular lexicon has followed. We have our own convention for Indian languages; we talked about it. I can quickly show that; I will spend a minute on it tomorrow.

Speaker 10: OK, sir. Thank you.

Speaker 3: Then I can proceed now.

Speaker 1: Please proceed. Yeah, I think the MFCC files have been created successfully; you can see here, feature extraction completed successfully. Now I can show you what these MFCC files look like. If we go to the mfcc folder, you can see the .ark files. In my case it is taking 12 as the number of parallel jobs; that depends on the number of cores you have in your system. If you have eight cores, you will generate eight .ark files: it actually divides the utterances into that number of jobs and processes them in parallel, so you will get different numbers of .ark files for training and testing. The dev-clean set is for testing and train-clean for training. You can see two files here with the same name, one .ark and one .scp. The .scp file tells you, for example from line number 21, that for this utterance (this is the utterance ID) here is the location of its MFCC feature vectors. Since we have only one file, this ...1.ark, and inside it are the MFCC features of multiple utterances (in this case the MFCC feature vectors of 91 utterances are stored in one file), these numbers, 21, 14298 and so on, are the starting points of each utterance's feature vectors inside that file, OK? Now, we cannot view the .ark file directly, as it is a binary file; you can convert it into a text file, and then you can read it, OK? How do we convert it to a text file? There is a command called copy-feats, which is generally used to convert these binary files to text files. I will do it for one case. This is the general way we use copy-feats. copy-feats is a Kaldi binary, so in general we cannot run it just like this: if I run it directly, it will give me the error that the copy-feats command is not found. For that, I first have to take path.sh and source it here, so that the binaries are exported to my current terminal. Now if I run it, I can convert it; see, I have converted it into a text file. Inside this, you can see a text file has been created. You can see this number 21, you see the column number over here, so this feature matrix starts at that location, and then these are all 13-dimensional feature vectors. Sorry, not 39-dimensional: these are 13-dimensional feature vectors, and this is excluding C0. How am I able to tell this? I will show you in the conf file: I have set energy to false, so C0 will not come in, and I am taking all the other default values; this is the only non-default setting. If you take the default, it considers the energy value, so I am setting energy to false and it is not taken. The other defaults for MFCC: we take a 25 millisecond frame size with a 10 millisecond frame shift. Professor Prasanna will discuss frame size, frame shift, and the MFCC features in detail in the evening session. And here a Hamming window is taken as the default for extracting these MFCC features.
So, by default the number of feature coefficients is 13, and I think the number of filters is 24 for an 8 kHz signal; this is a 16 kHz signal, so the number of filters is generally taken as 40. Now, if you go back to run_custom.sh, we have finished up to this feature extraction part. Let me show that text file once more: for each 25 millisecond frame, with 10 millisecond shift, we get one feature vector of 13 dimensions. Then, once it is extracted, you can check against the .scp file: the next utterance starts at 14298, the next feature vector's location, so you can see; if you search for it, you can find it. I think I can search. The next feature matrix starts from here, you see. Everything sits inside the square brackets, so I am able to locate each utterance. This is the 13-dimensional vector we are getting. So, up to this point, is it fine? Hello? If anyone is lagging, maybe we can separate out into breakout rooms. You can run up to this point, and after some 15-20 minutes we can start the language model preparation. Is that fine, Samudra Vijay sir?
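
A sketch of the copy-feats step being demonstrated, plus the one non-default MFCC setting mentioned; the .ark file name and output path are illustrative:

    # conf/mfcc.conf (as described, the only non-default option)
    #   --use-energy=false

    . ./path.sh                                                             # put the Kaldi binaries, including copy-feats, on PATH
    copy-feats ark:mfcc/raw_mfcc_train_clean_5.1.ark ark,t:mfcc_text.txt    # binary .ark -> readable text
    head mfcc_text.txt    # each utterance: "utt-id  [" then rows of 13-dim MFCC vectors, ending with "]"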

Speaker 3: Yeah, sounds good.

Speaker 1: Yeah. I will request Lalaram to initiate the breakout rooms.

Speaker 11: OK, I'm initiating. Thank you.

Speaker 12: OK, ma'am.

Speaker 11: Okay.

Speaker 1: I will request all of you: you will get a message from the host; you have to accept it and go to the breakout room.

Speaker 11: Participants, please join your respective group; you may have got the joining link for the breakout. Thank you.

Speaker 13: Okay, so the names should come in that order. Okay, so what is your group? I don't know. I don't know what your group is. I just came. I am talking and I am assigning. Okay.

Speaker 9: Please mute and jump. Okay, sure, sure.

Speaker 10: So, there is no one in breakout 11.

Speaker 11: Okay, actually other group members are not connected in the workshops, so that's why no one is there. In the group 11, from the group 11, it's only you.

Speaker 10: So, you can put me in another group, sir. Yeah, okay, okay. I'm putting.

Speaker 14: Okay, no problem.

Speaker 15: Even I didn't receive a link for the breakout room.

Speaker 11: So, are you connected or. No, I did not connected, I did not receive anything. Ok, I am connected, ok.

Speaker 12: Shrikant, Vidhya Sagar, Surbhi, Shrikant, Narayan Chaudhary, hello, hello is am I audible?

Speaker 14: Hello, sir, I was just removed from the group for some reason.

Speaker 11: Okay. So, which one?

Speaker 14: Four. Four.

Speaker 11: Four. Okay.

Speaker 14: I'm connecting. And someone missing in that group, sir. Actually, I'm a volunteer. Yeah.

Speaker 11: So, actually, only 66 people have joined; that's why. Okay. Okay. Fine.

Speaker 16: Hi Lalaram sir, I am Vidyadhar Raju. I am in

Speaker 11: group 8. Group 8, okay. How do I join that? Okay, I'm assigning you. Sure, sure. Sir,

Speaker 17: I'm Ghoom. You're Ghoom? I have also not been assigned to any group. So, which one are you

Speaker 12: group has been assigned. Okay. Okay. I'm assigning you. Yeah.

Speaker 15: I'm from group 9, group number 9, so can you please assign me to that group?

Speaker 11: Okay, okay, okay.

Speaker 15: Thank you, sir.

Speaker 11: So, Devi, are you connected by mobile? Are you connected by phone or laptop? Actually, I'm not, I can't assign those who are connected with the application.

Speaker 12: So, yeah, so since we reached here, after this, since we are not getting this error,

Speaker 9: once we do that, then I think…

Speaker 16: Sir, I am from group number 9, can you just assign me there?

Speaker 11: Ok. Ok.

Speaker 1: So, in which file, at which address, do you have multiple feature vectors now? Yes, there are many; see, there are many addresses there. No, I showed you, you know.

Speaker 12: But from the .scp side, you can see which target location it points to in the .ark file. OK. So then from the text file, you can see.

Speaker 11: Arun?

Speaker 18: Yes, sir.

Speaker 11: So are you connected with a mobile phone?

Speaker 19: Yes.

Speaker 11: Yeah, so actually, those who have joined through the mobile application, I am not able to assign them in the breakout options.

Speaker 16: Yes, I just can answer. I'll just leave the meeting and I'll join again, sir.

Speaker 11: OK, OK. Please join with the laptop.

Speaker 16: Yes, sir. Yes.

Speaker 10: Hello, sir.

Speaker 11: Yeah, hello.

Speaker 10: Oh, sir, I'm from group three.

Speaker 11: OK, OK.

Speaker 10: Assign me.

Speaker 11: Uh-huh. Yeah, OK, OK. Salam, I'm assigning you.

Speaker 5: Please assign me also to some group, sir.

Speaker 11: OK, OK.

Speaker 15: Excuse me, so yeah, sir. I have switched to my laptop now. So I'm from group number nine. Yeah, so thank you, sir. OK. OK.

Speaker 12: So what's the time?

Speaker 7: Done. Oh.

Speaker 12: Yeah. Good thing, man.

Speaker 9: Same. My good thing, man. I don't know what's going on in my mind. Yeah.

Speaker 11: Anybody else who have not assigned any group?

Speaker 9: Hello, is Amiruddin?

Speaker 12: I am not very sure about it.

Speaker 11: Yeah, so Tanmay, which one is your group?

Speaker 18: Can you just assign me now, sir? Yeah, okay.

Speaker 11: Yes, I got the notification. Mahesh Chandra, Srikanth, I have sent you a joining link, please join. Sir, I can't hear you, sir. Yeah, yeah.

Speaker 16: Yes, sir. Can you just assign me now, sir, group number 9?

Speaker 11: Group number 9?

Speaker 16: Yes, sir. Yes, sir. Rosie? Rosie? Yes, sir. So, which one is your group?

Speaker 11: Group number 9.

Speaker 9: Sir, I am from group 3 ok ok and Nandini, Nandini Sethi, Suresh sir, Mahesh sir, Nandini

Speaker 11: I'm in group ten. Okay. Are you already there? No, no, I'm not. I don't know whether I joined or not. Okay. Actually, I have sent you maybe you are not accept. So now I'm sending you

Speaker 8: In mail?

Speaker 11: No, no. Just on meeting itself you got a link.

Speaker 18: Okay, okay.

Speaker 11: You got a request, so you need to accept.

Speaker 18: You are invited to Breakout 3, it is coming. Yeah, yeah. So there only I have to join? Breakout 3?

Speaker 11: Yeah, yeah. No, I think now I... Yeah, yeah. Whatever the group I have assigned, please join.

Speaker 18: So it is showing you are invited to Breakout 3.

Speaker 11: Yeah, yeah. So please join, sir.

Speaker 18: OK.

Speaker 4: Hello, sir. Add me to group number eight.

Speaker 11: Yeah, OK.

Speaker 20: Sir, we took complete up to feature extraction, right?

Speaker 11: Sorry?

Speaker 20: We have to run the script up to feature extraction.

Speaker 11: Yeah, yeah. So actually, now we are sorting out the doubts. After sorting out the doubts, we will again have a 10-minute, maybe 30-minute session for the language model. Then after that we have to continue.

Speaker 20: OK, actually, by mistake, I tried to run the whole script. The feature extraction was completed, but I was getting some error after it. And then I shared my script with Jagadish sir while I was in the group.

Speaker 11: So now, which group are you in? Group number six? Group number six; so maybe you can contact those who are connected in your group and ask them. Your volunteer may be there.

Speaker 20: I've already been contacted and already been checked

Speaker 11: okay okay actually group six I am in the volunteer but I am handling that this so I'm not signing you in that other group.

Speaker 20: Yeah, I can understand. He asked me to join the main call, so I just returned it.

Speaker 13: OK, OK, OK.

Speaker 19: Hi, Lalaram, are you? Yeah. Hello. Yeah. Actually, in group number 10, the volunteer is not available. Can you add to the some of the... No, no.

Speaker 16: Actually, he's joined. I think Arun is there. Okay. I think. Who's there? Yeah.

Speaker 11: No, it is visible. I do not know why this file, okay. If you have unfriended me, just

Speaker 9: delete that file. You said there are only four, seven files to be there. Now, the data folder, how you are getting rid of this data folder also, you have deleted, isn't it?

Speaker 1: No, this data folder is not there.

Speaker 9: How? From where you got that? No, from where is it selected? Not detected. Wait. Oh.

Speaker 11: Vaishnavi? No, I'll tell you. Edit the whole custom folder. Vaishnavi? And then copy again.

Speaker 9: Yes, sir. Yeah.

Speaker 11: So, are you not, have you not opened the learning group?

Speaker 21: Sir, just wait for a second, sir.

Speaker 11: Yeah, so is this one correct, Guru? Breakout 8, sir. 8? Okay. Yes.

Speaker 21: Sir, hello, sir. This is Vaishnavi, sir.

Speaker 11: You are asking something? Yeah. So, are you in any group?

Speaker 21: So, right now I haven't joined, sir. There they have given Pooja Gambhir, Suresh. They have, uh, you know, I did a

Speaker 11: without five yes sir okay I'm sending you a request please accept so I have I think already sent you okay so what I have to do now sir just you got a request you need to accept that okay so will I get the mail or here only I'll No, no. Here only. Okay, sir.

Speaker 21: I think you may. I didn't get sir anything right now.

Speaker 11: So are you connected with a laptop or something else?

Speaker 21: Sir, laptop only, sir.

Speaker 11: Yeah, so actually, you have to receive it. I'm checking again.

Speaker 21: So it didn't came, sir. If it comes, I'll join, sir.

Speaker 13: OK.

Speaker 21: Yes, sir, now I got, sir.

Speaker 11: Yeah, please. One by one, a host will come to each and every group so you can sort out your doubts; until then, you can discuss within your own group. That is the way we are working. So if

Speaker 5: If someone have any doubts, you can ask there also.

Speaker 21: Hello, sir. Actually, I have a doubt, sir.

Speaker 11: Hello. Yeah.

Speaker 21: So you have, sir, there is a, like a pop-up. I have to join in six. After that, I have to return to the main call or I have to stay there only, sir?

Speaker 11: No, no, you stay there. One by one, the main host will come in each and every group. You can ask your doubt, actually. This is the STLC.

Speaker 21: Okay. Can it be sent again, sir? again. Sorry, I came to the main call once again. Okay. Can you send me to group number

Speaker 20: six because it's asking me to join group eight. Okay. So can I get the notification again

Speaker 21: But once again, I came back to the main call, sir.

Speaker 5: Actually, I already sent and now I'm sending again.

Speaker 11: Maybe you received.

Speaker 12: Yeah.

Speaker 11: Okay.

Speaker 12: Yes, sir.

Speaker 11: Yes, sir. So, okay, so I am closing the breakout rooms. So now we are going for the language model.

Speaker 21: Sir, I didn't get, sir.

Speaker 5: So we will be closing the breakout room. We will be proceeding to the language model. And we can even discuss it offline.

Speaker 21: Sir, I didn't get the model.

Speaker 11: Yeah, actually, now we are closing the breakout session. Now we are going to discuss the language model. Please wait in the main room.

Speaker 21: OK. OK, sir.

Speaker 11: Okay, thank you, thank you.

Speaker 17: We are past our way, how are you?

Speaker 11: Everybody, please join in the main room.

Speaker 18: Jagbandhu is already back in the main room from the breakout session.

Speaker 11: Yeah.

Speaker 21: Yeah. Yes. Good.

Speaker 11: We're also back. Yeah, the system is not that efficient, so that's why it keeps getting stuck.

Speaker 1: Yeah, I think all of you have returned to the main session. Can we start? Can I start now with the language model building part? Yes, sir. Yeah, OK. So we'll go ahead. I think most of you have run up to this point comfortably, or do you have any doubts at this point? Does anyone have any doubt? If you have a doubt, after some 15-20 minutes we will again go to the breakout rooms and sort it out there. Is my screen visible to all?

Speaker 20: Yes, visible.

Speaker 1: Yeah. So, language model preparation. For this, we have already created one lm folder inside data/local, and there we have placed only two things: one is the vocabulary, the other is the lexicon. Correct. The ARPA model that we will talk about next, maybe tomorrow, is actually a pre-trained model; here we will create our own model, and at the end we will compare the performance of the two. That ARPA model was created with a huge amount of text data, whereas here we are taking only this training text data to create our language model. So now, first, we will go for dictionary preparation; let's see how we are going to do that. Once you run the dictionary preparation, it will generate these five files: extra_questions.txt, lexicon.txt, nonsilence_phones.txt, optional_silence.txt, and silence_phones.txt. Let's see what they are and how they get created. Now, I have commented things out here first; this is generally the best shell-based method for commenting out multiple lines: you just give that << sign followed by "com", and at the end you give another "com", and that way you can comment out multiple lines at a time, OK? So now we have to go to stage 4: stage 3 we have executed, so I will specify stage 4 and run it, OK? Now, what input files does the dictionary preparation actually require? The inputs required are the language model files we generated during stage 1. In stage 1 we generated these files; where are they? data/local/lm. I can show you once: inside data there is a local folder, and inside lm these two files are there, the lexicon and the vocab. It will take these two as input and then generate the five files I told you about. Let's see how it does that. We'll check with pwd; I am in the right folder.
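
As an illustration of what the generated dictionary folder typically contains in a LibriSpeech-style recipe (the folder name and the example entries are assumptions, not copied from the session):

    ls data/local/dict_nosp
    #   silence_phones.txt     -> SIL  SPN
    #   optional_silence.txt   -> SIL
    #   nonsilence_phones.txt  -> the 39 non-silence phones, with stress variants grouped per line
    #   extra_questions.txt    -> clusters of related phones, used later for decision-tree questions
    #   lexicon.txt            -> <word>  <phone sequence>, plus entries for silence/unknown
    head data/local/dict_nosp/lexicon.txt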

Speaker 2: Okay, sorry, I had not saved that, so it is computing again; I have to, yeah, now it has

Speaker 1: done that part. Now it has prepared two silence phones, saved in silence_phones.txt, one optional silence (SIL) saved here, 39 non-silence phones, five extra questions for triphone clustering, and the lexicon file, lexicon.txt. Now let's go and see what is in the dict_nosp folder. As I mentioned, we have created these files. The silence phones in this dataset are these two, SIL and SPN; the optional silence phone is SIL. These are my non-silence phones; these, for example, are the variants of AA. We have already seen the lexicon without silence; now, with silence, this is the way it has to be made. You don't have to worry about this: the script automatically generates it by taking the unique word list and the lexicon file. These extra questions tell us there are five clusters, each cluster containing similar kinds of phones; it clusters them together, and this will be used for the decision tree during training and decoding. Professor Samudra Vijay will explain it clearly in tomorrow's sessions. So now, once we are done with this dictionary preparation, we go for the language model preparation. Up to this point we have done the dictionary preparation; now we will create the lang directory and see how that happens. You have to change the stage to five. It takes this dict_nosp dictionary and gives two folders, lang_tmp_nosp and lang_nosp. Let us generate them and see what is there. Generally it generates these files. This L.fst is the FST built from the lexicon. If you want to go into the details, you can always look them up; I will share the slides, and you can find what these things are from there. L.fst is the phonetic dictionary as a finite state transducer, which converts phone sequences into words: it encodes that graph and stores the probabilities of how to go from phones to words. Let us see what files it has generated here. This is words.txt. This is the initial HMM topology, from which we will start the monophone training. This is the phones list with its corresponding labels: if you look, each phone appears in four forms, beginning, end, intermediate, and standalone, so for each phone we get four position-dependent versions; this is used for the HMM state generation in monophone training. This is my out-of-vocabulary symbol, the unknown word. And here the lexicon is aligned with the phone IDs: the IDs we saw earlier have now been aligned to words. All of this concerns how to combine phones into words. Next, the G.fst will be used to combine words into sentences; that part we will do next. To generate this G.fst file we need two things. One is the transcription. We can actually generate the transcription file from the text file that is in the train folder; the only difference is that the text file has the utterance ID in front, and the transcription does not require any utterance ID. If you go and look in the data folder, in the development or training set, the text file has the utterance ID and then the corresponding text; but for the transcription we only need the text, not the ID. That's why we have written a small script:
It reads that text file from the train dataset and excludes the utterance ID: with the cut command it takes from field two onwards (cut -f 2-, meaning field 2 to the rest), and then it adds the sentence-start <s> and sentence-end </s> markers, OK. That creates my transcription file. Now let us run it and see.
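
A sketch of the prepare_lang call and the transcription-extraction step described here; the exact paths, the unknown-word symbol, and the output file name are assumptions:

    # build the lang directory (L.fst, words.txt, topo, ...) from the prepared dictionary
    utils/prepare_lang.sh data/local/dict_nosp "<UNK>" data/local/lang_tmp_nosp data/lang_nosp

    # strip the utterance IDs (field 1) from the Kaldi text file and add sentence markers
    cut -d ' ' -f 2- data/train_clean_5/text \
      | awk '{print "<s> " $0 " </s>"}' > data/train_clean_5/lm_train.txt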

Speaker 2: only run this stage 6 one.

Speaker 1: So now we can see the trans file has been generated here; yeah, the trans file has been generated, and these are my trans files. Now, another file I am creating is lm_train.txt: this line generates the trans files, then I read those trans files and generate this lm_train.txt, which will be inside this train_clean_5 folder. Let us see: as I told you, <s> marks the start of a sentence and </s> the end of a sentence, and this is my file. Now, if you have not installed IRSTLM, you have to install it by going to the tools folder: go to the tools folder, open a terminal, and run extras/install_irstlm.sh. You have to execute that file, and it will automatically install IRSTLM. I have already installed it. So now I will run it and show you how to generate this G.fst file. Let me take you to that script, lm_build.sh, which requires the Kaldi root directory (from that it gets the path to IRSTLM) and the n-gram order we are taking; in this case we are taking 3 as the n-gram order. Let me show you what is written inside. This is my lm_build.sh: the first argument is the Kaldi root directory, and the second argument is the n-gram order I am taking. If these two output files already exist, it deletes them and then creates them afresh. After that, it runs IRSTLM's build-lm.sh (build language model) on the lm_train.txt we generated, the one with the sentence-start and sentence-end markers, to produce a .gz language model file. Then from that we can find the out-of-vocabulary words with respect to the training set, and then we go and generate the G.fst file. Let's see how we generate it. Now the G.fst file has been generated successfully; you can see it over here, the G.fst file is generated. So, maybe from tomorrow we can go for starting the monophone training: we have the features now, and we have built our language model, so we can go for monophone training from tomorrow. Now, if you have any doubt, you can ask; or else we can go for a breakout session of maybe 15-20 minutes to sort out the doubts.
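
A rough sketch of the IRSTLM-based language model build being described, ending in G.fst; the intermediate file names and exact options are assumptions, and in the real recipe they come from lm_build.sh itself:

    export IRSTLM=$KALDI_ROOT/tools/irstlm && export PATH=$PATH:$IRSTLM/bin
    # build a 3-gram LM from the <s> ... </s> wrapped training text
    build-lm.sh -i data/train_clean_5/lm_train.txt -n 3 -o train_lm.ilm.gz
    compile-lm --text=yes train_lm.ilm.gz train_lm.arpa        # expand to ARPA text format
    # compile the ARPA LM into the grammar FST used by Kaldi
    arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_nosp/words.txt \
      train_lm.arpa data/lang_nosp/G.fst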

Speaker 9: Hello. Excuse me, sir. Yeah, please.

Speaker 22: Sir, in that folder you told us to download, the one that we cloned, there is no data folder. Have you included it, or is it missing?

Speaker 1: No, the data folder will be generated when you run the script. Before that, you have to make a corpus folder and put the files inside it.

Speaker 22: And I did that. But it was showing me some error, in prepare_lang.sh. Yeah, so I couldn't proceed after that.

Speaker 1: OK. OK. What is your group ID? In the breakout session I will go to your group and take a look. I will request, yeah, Suryakant sir, do you want to say something?

Speaker 22: Now breakout session is there?

Speaker 1: Hello? Now breakout session is there? Yeah, breakout session is there after this, maybe for 10, 20 minutes.

Speaker 18: OK, OK, OK. How are we going there? Just disconnect and log in again, right?

Speaker 1: No, no, no, no, sir. When they will initiate, they will come with the option.

Speaker 22: We are into that, yeah.

Speaker 11: Yeah, OK.

Speaker 22: Not automatically, sir.

Speaker 11: You have to make a.

Speaker 1: You have to accept no matter what.

Speaker 11: Same as earlier, I will initiate, sir.

Speaker 1: Sure.

Speaker 17: Yeah, Mr. Jagbandhu, you have shown the context-independent phones, right: begin, end, intermediate, and singleton. Why are silence and spoken noise modeled with five states, while all the rest have four states? Is there any specific reason?

Speaker 3: Yeah, let me come to that: why does silence have five states while other phones have three? Although we think of silence as a simple sound, silence can contain some kind of background noise, fan noise, and things like that; all of these get put into silence. Therefore silence may actually have many more varieties, and that is why it is modeled with 5 states rather than 3 emitting states.

Speaker 17: Sir, then, we also take spoken noise, right? Spoken noise is also 5 states. Does the same reasoning hold there: silence and spoken noise are modeled with 5, and all the rest are 4, right?

Speaker 3: Yeah, silence and certain other particular phones have a lot more variety, therefore the number of HMM states modeling that phone or label is higher. Of course, the data required to train five states would be higher, but in general there will be plenty of pauses or silence in speech, so this should not be a problem. Yes.

Speaker 18: Do not give up.

Speaker 14: So, one doubt, sir. Yeah, please. In data/local, in lang_tmp_nosp, after the language model is done, there is a file called lex_ndisambig. What is that, sir? There is a number in it, 16, sir. Disambiguation. Yeah. So what is that number 16, sir?

Speaker 11: You are not audible.

Speaker 14: Hello?

Speaker 1: Sir, can you clarify that one, that 16? I also don't know about that disambiguation.

Speaker 3: Right now, I don't recollect the details. But disambiguation is when one particular entry may correspond to multiple phones; one-to-many, many-to-one kinds of mapping are what need disambiguation. I also don't remember where that number is picked up from.

Speaker 14: Sir, maybe my doubt is that, as we do in an HMM model, where we consider previous states, does it mean 16 states or something, sir? That is my doubt, sir.

Speaker 3: Sorry, what is the question about the HMM states? No, sir, that disambiguation number of 16, right? Frankly, I do not know; it comes through the lexicon, nothing to do with the HMM states. It is purely a lexicon-related count. Okay, okay.

Speaker 14: Thanks. Okay. Okay, okay. Please join the recess. Okay.

Speaker 11: Please join the respective group. Thank you. Please join the respective group. So, most of the people are getting stuck at this point. Please join the respective group, or those who have not got any request, please let me know.

Speaker 23: I am facing an issue while downloading these tar files: it shows the HTTP request and "awaiting response", and then it shows "connection timed out" in the headers, for all the files that are mentioned in the...

Speaker 11: Data, for data downloading.

Speaker 23: Yes sir, download content, seven files mentioned below, all these, vocab, lexicon and all. So how to access it, because I'm not able to download it.

Speaker 11: Yes, whatever link I have given, you just need to type it in the browser and it automatically starts downloading. If it's not downloading, that means you do not have a good internet connection; that is the only thing that can happen, otherwise there is no issue. Are you downloading with mobile data or something else?

Speaker 23: No, no, with my college internet. I am not facing any issue because the call and all everything is running smoothly. It's just that it's not getting downloaded.

Speaker 11: That should work, actually. If you are working with the college network, then it should work.

Speaker 23: Okay, sir. So there is one request. Sir, after this meeting ends, can you please upload this

Speaker 11: This particular session? The recording? Recording, yes, sir. We will upload all the recordings; all 10 days of workshop recordings will be uploaded to the Drive and also to YouTube.

Speaker 23: Yes. Yes, no, actually, my request is if you upload it today itself, then by night I'll be able to download and do everything.

Speaker 11: Yeah, we will upload it the same day, maybe by night; we have a target of uploading by 9.

Speaker 19: OK, OK, thank you so much, sir.

Speaker 11: If something does not work, it may be delayed. But after the meeting, we will start uploading it.

Speaker 23: OK, sir. Thank you, sir.

Speaker 11: OK, thanks. Okay, so you were in which group?

Speaker 16: Group number 10, but no one was there, actually. I wanted to ask something, yeah, related to how we can create a lexicon file for any language, for example Kannada. If you have a large text, how can we convert it into a lexicon, how can we generate a lexicon out of it?

Speaker 11: Okay, so the lexicon you basically need to create yourself, according to the context of the language, according to whatever different pronunciations you have. Suppose you have one word, and for that word some number of pronunciations can occur; according to that, you need to create the lexicon.

Speaker 16: OK, like, is there any tool or something like that? Or do we have to contact some linguist for that?

Speaker 11: Yeah. Suppose you have a word; suppose I say "Kishore". Some people can pronounce that name differently. So, on one side we put the word Kishore, and against it we write the phonetic representation of each different pronunciation of Kishore. In that way we can create the lexicon.

Speaker 16: OK, is there any lexicon file available for Kannada language?

Speaker 11: Actually, that I don't know. You can explore and check. Or if I come to know, I will let you know.

Speaker 16: OK.

Speaker 11: Yeah, thank you.
