Speaker 1: So one window, OK. Here it comes. I'm just looking at the readme file, since we are reviewing whatever we did. We have already installed and tested the Kaldi system. We need three types of linguistic resources, and we have them. We have one text for the Assamese number recognition system; we are copying it for both testing and training. These are the typical inputs: this particular text file that is required, the MIRA file, and the WAV files we already know. These details are with respect to the Assamese number recognition database; they can vary from one database to another. After Kaldi, we installed certain things; let us call it the CLST-Kaldi package, which is database independent. We unzipped it. Then we unzipped one database related to Assamese number recognition, which contained the WAV files, some transcriptions, and so on. Then we created soft links, shortcuts, to the local steps and utils directories, which contain the various programs that help us train and test systems. Then we went through the data preparation part: whatever text files were there, we moved into place, et cetera. The lexicon for the speech database we named lexicon.txt, because that is the name the scripts expect. Now we are going to examine the data preparation scripts. They expect lexicon.txt in the etc directory. Then we executed this command, a shell script, and checked that our lexicon is OK. Then we executed this command, which generates these files, and then these two commands. We'll have a quick look at the four scripts that we executed successfully this morning and see what they generate, what their inputs are, and what their outputs are. The first script that we ran was this one. I will just open that file to show what it contains. Firstly, if you look at the script, it has one argument, an input argument; this is what we are going to feed it. And it should generate certain files. What files? It is written here: one for the test set, another for the train set. Each contains information about the utterances: which file, who spoke it, and the corresponding transcription — three columns. So let me open that file. And I will share the entire screen; maybe that's better. So where am I? Yeah, I'm here. gedit, give me a minute. I'll stop sharing and share the entire screen. Entire screen, start sharing. And just to cover things up, I'm just going to do this. Where is my gedit?
Speaker 2: Here.
Speaker 1: Oh, gedit is already here.
Speaker 2: OK, that's fine.
Speaker 1: Now, this is the file; let me open it. What is the purpose of the file? To generate files containing the file ID, that is, the utterance; the speaker ID, the SPK; and the transcriptions corresponding to all the WAV files in the wav directory. So for every WAV file, we will have not only the utterance ID, but also the speaker information in the second column and the corresponding transcription in the third column. Now, this format is not required by Kaldi, but it will be used by our later programs to generate files in the format Kaldi does require. Why is this necessary at all? Because we have one input argument, a text file containing the sentences preceded by a serial number. I can show that sentence file here; it is etc/numbers80 or something. Yes. See, the sentence number there is just 01, whereas in the file ID the sentence number is 0001. Why did we do it this way? Because we had 80 sentences in one file, and this file was used by the recording script to prompt people to read; every sentence was recorded by multiple speakers. In fact, every speaker read all these sentences, because we were interested in this small vocabulary. Therefore, this intermediate script is necessary. But if you already have your information in this particular format, you can skip this particular script; you can use any script you like. So the format is: file ID, speaker ID, and transcription. The input to this program is this numbers file, whose content is shown here. Its outputs are implicit, which means you don't have to type them in; it generates them on its own. It generates two such files, with lines like this: one for the training data, another for the test data. If you look at the conventions, there is some documentation: if you invoke the script with no arguments, it prints the usage with an example and exits. That is a good habit. The script assumes we are already in the AS-number directory. And it reads the input file. What is the input file? The first argument. People familiar with scripting will understand this, but even if you don't, it doesn't matter. So the variable called infile now contains this file name. And we create an empty file, a new file; that's why the touch command is used. Now, what do we do? We use a for loop to repeat something for each WAV file in the wav directory. In our case, there are 640 WAV files. There is also a readme file there; therefore, I specifically listed only the WAV files in the wav directory. We already put the WAV files in the wav directory under the current directory, under the number-code-21 directory. So it lists them. You see this command is enclosed in two backticks; therefore, it runs in a subshell, and its output is captured and stored in this list. So this contains a list of all 640 files. For each file, the file name is taken as ff, and we do the following text processing. Text processing like this is best done with a Perl script. If you are familiar with Perl, you will immediately understand what is happening; even if you are not, it doesn't matter.
I'll just explain the essence of this. Now, let me look at the first file name. So what is the command that I need? I need ls. Let me just show that: ls wav/*.wav. I'll look at only the top line, because I'm interested in the first file name in the list. That I'm going to assign to ff, and I'm going to do all this processing. Let me see what it gives. It gave me this. So you see that, to begin with, ff contains this string. Now, I am not interested in the directory, and I am not interested in the extension; I am interested in the file ID, that is, the name. So I am going to extract that file ID: anything that begins with a capital AS and is followed by .wav. Whatever is highlighted is selected by Perl; this is a Perl regular expression. It is printed and stored in a temporary file called temp. If you do not understand Perl, don't worry. So right now there is a temp file; it was empty (touch), and now it contains whatever I highlighted here on the terminal. So temp contains this highlighted AS011M001, and note that that is what I need in the first column. So in this line I am pulling the file ID out of the file name and printing it. Next, I want only the speaker ID. How do I know which part is the speaker ID? The speaker ID is the four characters after the AS. So after the AS, but excluding AS; this bracket takes care of that. After the AS, there will be three digits and one letter. \w in Perl matches a word character — alphanumeric, A to Z, uppercase or lowercase, and digits. So it picks up three digits and a letter, which is what is highlighted here. That is the speaker ID that we need. Look at the top here: we are going to print it, but after a tab. So in the first line we printed this, and then we pulled out this speaker ID, which is highlighted here at the bottom left, and printed it after a tab. You see, there's a backslash-t, a tab, and then, OK. Then what do we need? We need to print a tab and the transcription corresponding to sentence number 0001. Where is that? We have seen it; the transcript corresponding to 01 is here. Now, how do I know where it is located? I will use a grep command, a new regular expression. I know that the sentence number is 01, so I need to pull out the sentence number 01 and look for a line which begins with 01. Grep does that. So this is what is done in the third line. In the third line, I am looking for a sequence of four digits followed by .wav; if I look at the file name, that is this. Notice the backslash-d here: \d matches a digit. But I am going to pick up only the last two digits — the two digits preceded by two digits and followed by .wav — and print them. That means right now the shell variable called sentid contains the digits 01, only these two digits, because those are the two digits preceded by two digits and followed by .wav at the end of the word. So it picks up these two digits, which are highlighted at the bottom left, and prints them. So now we know the sentence ID. That's not sufficient. Now we look in this file of 80 numbered sentences for a line containing 01 at the beginning.
And that's what this grep command does. It takes the infile. What is the infile? It is this file, whose top 10 lines are shown here. It looks for the line which begins with one or more digits; I ignore the digits and pick up only what comes later, whatever appears after the whitespace. So that is what is written here. The line contains digits here, then some whitespace, which includes a tab, and then everything till the end. Whatever comes after that whitespace is pulled out and printed to temp, with a tab before the transcription and a newline at the end. Then we have processed one file, and we do this for every file. So corresponding to every WAV file in the wav directory, we have a line like this in temp. Now, temp is the temporary file that I talked about; that is not sufficient. What I do is: if an old version of the output file is there, I remove it, and I rename temp to the correct file name. Why do I write to temp first instead of directly to the output? Because sometimes we make errors, and in that case the old file should not be clobbered. So I rename temp to this name. And of course, the test data is the same as the train data here; therefore, we just copy the train file to the test file. End of the script. Any questions here? I think it's clear, OK? So what we have now is two files: one is this, and the other is this. We already had a look at them, but since we already ran this job, let us look again. I'll just print the first three lines. Yes. So it is in the format shown in the documentation of the script. And what you see here: all the speaker IDs are the same; it's the first speaker speaking, therefore the second column is the same. But the sentence numbers are different. So corresponding to the third sentence, the transcript of the third sentence, ek na, is there; it is sentence number 3, and it's transcribed like this. Any questions? I'll wait for 10 seconds. If there are no questions or comments, I'll assume that things are clear and move on.
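A minimal sketch of what this script does, assuming the file-naming scheme described above (AS, then a three-digit-plus-letter speaker ID, then a four-digit sentence ID, e.g. wav/AS011M0001.wav); the actual script and output file names may differ:

    #!/bin/bash
    # Sketch of the fileid/speaker/transcription step; names are illustrative.
    # Usage: ./make_fileid_spk_text.sh etc/numbers80.txt

    infile=$1      # numbered sentence list: "01 <sentence>", "02 <sentence>", ...
    touch temp     # start with an empty scratch file

    for ff in `ls wav/*.wav`; do
        # column 1 -- file ID: everything from "AS" up to (but excluding) ".wav"
        echo $ff | perl -ne 'print "$1"   if /(AS.*)\.wav/;'  >> temp
        # column 2 -- speaker ID: the three digits and one letter right after "AS"
        echo $ff | perl -ne 'print "\t$1" if /AS(\d{3}\w)/;'  >> temp
        # sentence ID: last two of the four digits before ".wav" (0001 -> 01)
        sentid=`echo $ff | perl -ne 'print "$1" if /\d\d(\d\d)\.wav/;'`
        # column 3 -- transcription: the input line that starts with that number
        grep "^$sentid" $infile | perl -ne 'print "\t$1\n" if /^\d+\s+(.*)$/;' >> temp
    done

    # rename only after the loop succeeds, so a failed run cannot clobber old output
    mv temp etc/train_fileid_spk_text
    cp etc/train_fileid_spk_text etc/test_fileid_spk_text   # test set = train set here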
Speaker 3: Hello, sir. Yeah. Sir, if I want to understand that code properly, then we have to learn bash scripting?
Speaker 1: Yes, bash scripting. Bash is necessary for anything you do on Linux, let me put it that way.
Speaker 3: Like in line 22, the AS dot star — AS dot star means after AS, what is there?
Speaker 1: Which line number? 22. 22 is this. Are you referring to — oh, yeah, here. See, that is a Perl command. These are all bash commands, the for-each and so on, but from here onwards it is a Perl command. So you need to know regular expressions; a little bit of Perl is necessary to do the text processing, OK? Now let me come back to your question. In Perl, this is the command for pattern matching. For every line, it does some pattern matching, some text analysis. In this particular case, this command says: take the entire line and check whether a pattern of this type exists. The pattern is delimited by the forward slashes. So what is the line? This is a typical line. Look for a pattern that begins with AS, and capture everything up to, but excluding, the dot. What you capture is whatever is inside the bracket. Now look at this second line: what we capture is the four characters; that's where the bracket is. Whatever is inside the first bracket is implicitly stored in a variable called $1 at the end of this Perl command. And the next Perl command, separated by a semicolon, prints whatever was captured inside the bracket. Yes, you can have multiple brackets; they will be $1, $2, $3, and so on.
Speaker 3: Actually, $1 — so when we set a variable equal to $1, the matched text actually gets shifted into it?
Speaker 1: Whatever matched inside the bracket is stored in $1. And the value of the $1 variable, which is whatever was inside the bracket, is printed and appended to the temp file. This double greater-than sign means append.
Speaker 3: And the three \d's are for three digits, and one character — that is the \w?
Speaker 2: That's right, the \w. Yeah, that's right, exactly.
Speaker 3: OK, thank you, sir.
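To summarize the capture-group discussion, a toy one-liner (the file name here is hypothetical):

    # Two capture groups: $1 is the speaker ID, $2 the sentence ID.
    echo "wav/AS011M0001.wav" \
        | perl -ne 'if (/AS(\d{3}\w)(\d{4})\.wav/) { print "spk=$1 sent=$2\n"; }'
    # prints: spk=011M sent=0001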
Speaker 2: Any other questions?
Speaker 1: All right, now let's go back to our readme file. We have generated these two files by running this shell script, and we know what the content looks like. Then, what we did: for every word in the transcription — what is the transcription? all the words in this column — there should be an entry in lexicon.txt. That is a requirement, obviously. Therefore it needs to be checked; sometimes we forget. So I wrote a Perl script which does the job. We'll spend just three minutes on it. Since we are engineers and there are no linguists here, let me check the lexicon directly. In the lexicon directory there may be other files that will be useful later; we'll see them later. Notice that this one is a Perl script now; bash will not execute it, Perl will. And -w means: print warnings if there is anything suspicious. What is the purpose? If a word in the transcription does not have an entry in the dictionary, print the word. The input is this and this, and the output is the list of words not present in the lexicon. OK, that's trivial. I will not go through it line by line, because that requires knowledge of Perl, but I'll tell you the essence of what it does. When you invoke the script, you give two inputs: one is the lexicon file, the other is the file generated by the earlier step. The lexicon file name is stored in a variable, so the $lexicon variable now refers to this file, and the input text that contains the transcription in the third column is referred to similarly. What we do is, at line number 17, open the lexicon, read every line, split it, and make a hash table of all the words — a unique list. So essentially this while loop reads the lexicon and makes a list of all the words. What does the lexicon look like? Let me look at the first two lines of lexicon.txt. Yeah. So if you look at the lexicon file, it has a word here and these symbols here. We just need the list of the words, a unique list of the words in the lexicon. It is made; and not only that, for every word there is a label sequence. For example, for the word dui here, d u i is the label sequence. So for every word, the hash contains the corresponding label sequence. So not only have we made a unique list of words, we have made a hash where, for every word you give it, it can output the corresponding labels. So a sentence like this one can be replaced by the labels for sunya, then the labels for the next word, and so on; the word sequence can be converted into a phoneme sequence. OK. Now the next step: we read the input text file. Of course, if there is an error opening it, the script gives a warning message and dies, comes out. This you have already seen here. Now look at the input text. What is the input text? It is in etc, the file we generated earlier. I want only the first two lines, because otherwise we can't see. OK, yes, the input text contains this. And we again chomp the lines to remove the newline; that is necessary. The first line is already shown here; it's good documentation.
You split each line into words, and you pick up only the transcription: we ignore column 0 and column 1 and pick up column 2 onwards. That's what is written here; the split does that. (An older version had something here which used to give a problem; ignore that.) So what this does is take the third column onwards in the file, all the words, and for every word it uses the hash that we built earlier. Actually, it doesn't print the label sequence; all it checks is: for every word here, is there an entry in the hash? If there is no hash entry, then it is a word which exists in the transcription but not in the lexicon. That's all it is. That's it. Any questions here?
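A minimal sketch of this check in Perl, under the same two-argument convention (the script and file names are illustrative):

    #!/usr/bin/perl -w
    # Sketch of the lexicon check: print transcript words missing from the lexicon.
    # Usage: check_lexicon.pl <lexicon.txt> <fileid_spk_text file>
    use strict;

    my ($lexicon, $intext) = @ARGV;
    my %pron;                                    # hash: word -> label sequence

    open(LEX, "<", $lexicon) or die "cannot open $lexicon: $!";
    while (my $line = <LEX>) {
        chomp $line;
        next if $line =~ /^\s*$/;                # skip empty lines
        my ($word, @labels) = split /\s+/, $line;
        $pron{$word} = join(" ", @labels);       # e.g. dui -> "d u i"
    }
    close LEX;

    open(TXT, "<", $intext) or die "cannot open $intext: $!";
    while (my $line = <TXT>) {
        chomp $line;
        my @cols = split /\s+/, $line;           # col 0: file ID, col 1: speaker
        foreach my $w (@cols[2 .. $#cols]) {     # col 2 onwards: the transcription
            print "$w\n" unless exists $pron{$w};
        }
    }
    close TXT;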
Speaker 2: I'll wait for 10 seconds.
Speaker 3: Yeah. Suppose there is a lexicon; in the lexicon we have some unique words and their phonetic transcriptions. And in the text file we have the input sentence transcriptions.
Speaker 2: Correct.
Speaker 3: OK. So suppose we are using 0 to 9, that is, 10 unique words in this particular system. If we miss some word in the phonetic lexicon, like sunna, then it will show an error: there are some unknown words which are not present in the lexicon. But suppose the reverse happens: everything is present in the lexicon, all the unique words, but some extra word is present in the text transcription. Will there be any error?
Speaker 1: No — your sentence was a little confusing; let me put it the other way. Say we have only 10 words in our transcription, but our lexicon, the pronunciation dictionary, contains 15 words: the 10 words plus 5 extra. Then there is no problem. Because what it does is the following — see the if-condition here. When does it print an error? For every word in the transcript, every $w ($w is any word here), there should be an entry in the lexicon. The lexicon can have extra entries, no problem. OK.
Speaker 3: So it will separate by the space? In this, there are some spaces: 8, 2, 3, 4, 5. Ah, yeah, it's separating by the space. Using the space, it will separate the words.
Speaker 1: Correct, yes. See this: for each $w, there's a split here — sorry, line number 31. That does it. It separates using spaces; the backslash-space means a space.
Speaker 3: I don't have the programming background, but I am thinking about what happens internally; that's why I asked.
Speaker 1: Very good, very good. No, fine, yeah. Let me not assume knowledge of Perl here. Yes: it looks for space as the word separator; one or more spaces, no problem. And of course, we know that Linux is case-sensitive; that we understand. And if you put a full stop after aath, that will also cause a problem. That's understood.
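As a toy illustration of that whitespace split (the words are from this session's vocabulary; the one-liner itself is just a sketch):

    # Split a transcript line on whitespace and print only the words (column 2 onwards).
    echo "AS011M0001  011M  sunya ek dui" \
        | perl -ne '@w = split /\s+/; print "$_\n" for @w[2 .. $#w];'
    # prints: sunya, ek, dui -- one per line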
Speaker 2: That's fine.
Speaker 1: Any other questions?
Speaker 2: OK.
Speaker 1: If not, let us go here. So it is checking that for every word in the transcription of the train set, there is an entry in lexicon.txt. That is what the check-lexicon script does. And the same thing for the test set. In our case it's the same file, but in general, when we report results, we train on train data and report results on test data. In fact, we do cross-validation, but we will come to that later. OK. So when a new database comes, assuming things are in place: so far we have run one shell script and one Perl script. The shell script is a bash script, not a C shell one, and the Perl script checks that the lexicon is OK. After that, we did line number 83. So let's go to line number 83. We ran this script. These are the three scripts that we ran; there are two other scripts, and we'll come to those. The third script that we ran was this. It generates files of this type under the data directory. utt means utterance ID; spk means speaker information. utt2spk records which speaker spoke each utterance, and there is a reverse one: which utterances did one speaker speak? That is also stored. All this information can be figured out from this file, and I'm highlighting here what we generated just now, because the first column is the utterance ID and the next column is the speaker ID. So we can find the list of utterances just by looking at the first column of this particular file. If you want a unique list of speakers, you just pick up the second column. As a human I'm used to saying first column and second column, but in Perl it will be column zero and column one. So spk can be made. And by looking at these two columns, we know which speaker spoke each utterance; the first two columns are stored in the utt2spk file. The reverse we'll talk about a little later: for one speaker, what are all the utterances that he spoke? We make a list of utterances for every speaker; there will be one line per speaker. We need to generate these, and they have to be stored under the data/train and data/test directories. That organization is done by this script. So we will first look at the script and then see the output. This script doesn't require any input arguments; it assumes some inputs. So anyway, let me open this file. OK, this one is a C shell script, not bash; that's why I point it out. It generates files needed by Kaldi for training. It assumes implicit input: nothing is given as an argument to the script, but it assumes these three files. We know what these files are, and just for example's sake, typical lines of these files are shown in these three lines. We know this, fine. Now, what are the files that it will generate? Let me just show you those files: these kinds of files under the data directory, OK? I have written four here; maybe a couple more will also come, but these are the important ones for our understanding. Whenever we implement a system, nothing goes smoothly the first time. This is very common; we are all humans, we make mistakes.
Therefore, when we encounter an error, we go and correct it and run the code again. When we run the code again, suppose there is already a file called utt2spk. If every file has read-write permission, it will simply be overwritten. I have become older and more careful; I have deleted useful files because of exactly this. Therefore, I do not let files be overwritten unless I explicitly tell the command to overwrite; by default, nothing should be clobbered silently. I'm following that convention. So if there is some old version of utt2spk in the directory, which is wrong — we know the system didn't go through all right — it has to be replaced. So I am going to remove any such files and start afresh, with a clean slate. Any files under the data directory, I just clean up first. That is the function of the prepare-for-Kaldi-train script. What does it do? Step number one: clean old files and directories if they are present. You remove the data directory itself — not only the data directory, but also its children; it removes recursively and forcibly, -r for recursive, -f for forcibly. So the entire directory is gone. Which directories? data is gone, exp (experiments) is gone, mfcc is gone, because we are going to start with a clean slate. OK. So we are at step number one, clean old directories if present. We are always in the AS-number-21 directory. Let me look with ls. There's already a data directory there; there is an exp directory. mfcc I have not made, so it's not there, but if it were there it would be removed. Whatever is inside right now will be removed. Let's see what is inside by default: it has some directories. These will be cleaned out; they may contain old files, and I don't want that. So I'm removing all these directories, starting with a new, clean slate. And I'm making a directory; -p means create parents as necessary. I make a directory called exp, which will be empty; let me just show this. exp is empty. And it will also create a directory called data/train — well, we saw that under data there is the train directory and also a test directory. Line number 16 makes those directories. Similarly, under data it also makes a directory called lang_bigram, which we see here, and under local it creates a directory called dict. Let me just check that: under data, under local, there is a dict directory. We create these because we are going to put files there. In data/local/dict we'll put the lexicon.txt, a reformed version of it. In data/train we will put files which give the list of the WAV files, who spoke what, and things like that. There's a corresponding directory for test. And under data there's lang_bigram. I said bigram because we are using just a bigram here; it's a simple process, and you can always do a trigram, four-gram, whatever you want. The grammar file that is generated will be stored there. So in general, when we run ls in the AS-number-21 directory, there's a data directory, and the data directory contains these two folders; lang_bigram will contain the bigram language model, and data/local will contain a directory called dict where a version of the dictionary, in the format required by Kaldi, will be generated. Who generates all this? This particular script, right now.
We are looking at it; it will generate that. Now let us generate the lexicon first; that's the first step we do here. Please note that wherever there's a hash, anything on that line after the hash is a comment. So this is the first executable line. What happens is, when you write lexicon.txt, normally we write it in alphabetical order, but sometimes you forget; and if there is one extra word inserted at the wrong place, Kaldi will complain. Therefore we sort whatever lexicon.txt we have, automatically, and write it into the lexicon file called lexicon.txt under the data/local/dict directory. Since we already ran this program, that file should exist; whatever file I highlighted should exist. Let me just check: data/local/dict/ — oh, there are nine files here. Let me see whether lexicon.txt is there. Yeah, it is there. Just for the heck of it, let me look at its content. I don't want all the lines, just the top three. OK, there was one lexicon.txt; there was also lexiconp.txt, which has probabilities. We are not using that here; there is only one pronunciation per word. If there are multiple alternate pronunciations, say two, the probability of each pronunciation would be given there. Beautiful. But this is what we said: line number 22 here has created this particular file. What about the other files? They are created later. A consolidated sketch of the steps so far appears below. Any questions?
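A minimal sketch of the cleanup and setup steps just described (directory names as in the session; the real script is C shell and has more steps):

    #!/bin/bash
    # Sketch of prepare-for-Kaldi-train, steps discussed so far.

    # Step 1: clean old outputs so a failed earlier run cannot contaminate this one
    rm -rf data exp mfcc

    # Step 2: recreate the directory tree (-p creates parents as necessary)
    mkdir -p exp data/train data/test data/lang_bigram data/local/dict

    # Step 3: a sorted copy of the lexicon goes under data/local/dict
    sort -u etc/lexicon.txt > data/local/dict/lexicon.txt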
Speaker 3: Sir, the multiple pronunciation probabilities you are talking about — can they be used for dialect detection?
Speaker 1: Yes — OK, let me not say dialect detection. Let me express my knowledge of phonetics: dialect means region-dependent variation, where some words are used more than other words. So this probability is not the probability of different words; it is the probability of one word being pronounced in different ways.
Speaker 3: Sir, but I thought dialect was a different tonal style, a pronunciation style.
Speaker 1: If you take it that way, then one normally talks of accent, but OK. Let's say the British will say pip rather than pen, but we say pen, OK? It is the same word with alternate pronunciations; that is the accent kind of thing. On the other hand, in Hindi there may be multiple words for zyada, "more"; in some region, say North Bihar, one word may be used more, and in South UP another, et cetera. That is what is called dialect. But whatever — you understand.
Speaker 4: I may have the concept wrong there, but dialect and accent are different. Accent is something that varies person to person, whereas dialect is a social variation, depending on geographical area. Lots of things vary, but accent differs individually.
Speaker 1: Yes, accent can differ individually also, correct.
Speaker 4: Yes.
Speaker 1: And it may also depend on whether it is your second language. My Hindi has influence from my mother tongue, and so on. Yeah.
Speaker 5: Yes, that's an accent.
Speaker 1: Yes, accent. Dialect is variation. The best thing is to look at the handbook on this; I wrote it down somewhere. I think I showed it yesterday, I don't know. Somewhere I wrote it, because that's what it said: dialect is more a preference for some words rather than others, more geographical, whereas accent is speaker-dependent.
Speaker 4: I have a question regarding data, sir. Hello? Yes, sir. As we saw, we collect speech data by making people read some text. Yes. Sir, my question is about the text we provide them — in Maithili, for example, some people use aich as the helping verb, where Hindi uses hai.
Speaker 1: Correct.
Speaker 4: So if we provide them the written text with aich as the helping verb, sir, how would he pronounce it — as hai?
Speaker 5: Because in his mother tongue it is hai; naturally, he will say hai. But I have provided him aich.
Speaker 4: Usually, he will read it aloud as aich. Then how do we capture his variation?
Speaker 1: So the best thing is not to provide text to read. Why do we ask people to read these texts? That is what we did; in the Assamese number recognition too, we asked people to read. Why? Because it is easiest. For literate people, reading sentences is easy; this was collected at IIT, so students would read it. Our data collection job becomes easier.
Speaker 4: But in this case, we won't be able to capture the variation.
Speaker 1: I agree — exactly, we won't be able to capture the natural variation. So the right way of doing things is to record natural speech, listen to it, and transcribe it. Yes, I agree. But listening and transcribing is a painful job, and if you have to do it for 100 hours of speech, it will take years, or several people working. We poor research scholars are not able to do that. That's where we use Google to give us an approximate Hindi transcription and work with that.
Speaker 5: I'm working on exactly this, sir; that's why I pointed it out. I'm working on this very topic. (You are working on?) Methods of speech transcription, sir.
Speaker 1: Oh, OK. So the right thing is to get some help from somebody to listen and type it in. The first thing is, I do not know how close Maithili is to Hindi. One possibility is that you feed your Maithili speech into the Hindi Google ASR, and it will type out something; I don't know how different the words are. Then somebody listens to it and just corrects the transcription. Sometimes that may be quicker, but sometimes it may be simpler to just listen and type it in directly. But there is another problem that happens. When people speak naturally, what do they speak? "OK, OK, ee, ooh, kya, ho..." — false starts and fillers. These are called speech disfluencies. They happen in natural speech. If you want a good speech recognition system, they should be captured. So when we come to the lexicon, we'll talk about speech disfluencies also. That makes it much more complicated, but if you have a new language, you have to do it. I know there are people working with the Adi language and with other such languages: there may not be any dictionary, there is no ASR, and you are the first person building it. Well, then you make the lexicon and the transcriptions yourself. That's why you get a paper — that's why you get recognition from the community. Let's put it that way.
Speaker 6: Yes, OK, sir.
Speaker 3: Hello, sir.
Speaker 6: Yeah.
Speaker 3: And one more thing. Here we are building an ASR for a digit recognition system. Normally, with a dictionary and language model, in a particular sentence we check whether the next word is permissible or not: "I am going" is permissible, "I am mango" is not. But here, every digit is permissible: after one, there can be two, three, any digit at all. So what is the purpose of the dictionary, or of such bigrams and trigrams, here?
Speaker 1: Yeah, in this particular case, a bigram would be of no use. What you say is right: the grammar has no role to play, because any digit can come anywhere in this current context. However, look at natural numbers. Ask somebody to speak his mobile number: he will never begin with seven, because seven is reserved for defense or something; he will never tell his defense telephone number. And whenever something begins with zero, it is an ISD code; when something begins with one, it is either a three-digit or four-digit number. So there are some rules, and those will be captured by the grammar. Here we are using a bigram on what is called an ergodic model: any word can come after any word with equal probability, one by ten, because there are ten words. But for natural sentences we need a language model, and Kaldi has provided language-model tools. We are not really using them in this particular number database, because I designed these 80 sentences intentionally so that every digit pair occurs at least once; there was a reason for that. The grammar does not play a role here, but we will still go through the grammar process. The lexicon is necessary; what we have done so far is necessary; but the grammar creation is redundant here.
Speaker 2: Oh, thank you, sir. Next question.
Speaker 1: OK, if there aren't any, let me see. We have prepared for Kaldi. OK. So we are at line number 22. We generated lexicon.txt, which looks like this. From the lexicon.txt, some other files are also generated later; we'll bother about that later. Now, the next line: we copy this particular file that we generated just now, which is in the etc directory, to the train directory, data/train. This is the train one; we copy it to data/train. Similarly, the next one is for test; we copy it to the data/test directory, because we are going to process this particular file there. And in data/test we expect files called text, utt, and so on; we'll talk about that. OK. Now let us process the files, first in the train directory, and then repeat the process in the test directory. So let's do that: go to the train directory. Now we have moved out of AS-number-21 into data/train. And we are creating a file called text. What text contains is column 0 and column 2 onwards. This is a Perl one-liner, so don't worry about it. Let me just look at the content. OK, let me print the top three lines, head -3, of the file we copied into data/train. We are familiar with this content. So what text will contain is this and this, without the middle column. There's a Perl command which prints column 0 and column 2 onwards, so it will contain all of this and this, without the middle column. That's it; that is the content of the text file that Kaldi demands. OK. The comments are here; an example line is like this. Let me see whether it is actually there: in data/train I should have a file called text. Yes, that's it. So text contains essentially this, but without the speaker column, as explained here. Fine. Now, we need to generate a file called utt. What does it contain? The first column of the text file; therefore it will contain only the file IDs. The utterance ID is the file ID. Let me just check whether it has been created. Yes, fine. Now, you see that I have overwritten here; if the file exists, it has to be overwritten, and that's why there's an exclamation mark there — that's the C shell convention for forcing an overwrite. Now we'll create a file called spk. Obviously, the speaker information is here in the middle column, but the middle column contains lots of repetitions; 011M occurs repeatedly. So we take column number 1 of this particular file — which file? the train file, whose content is here — and print it to a file called spk. There is no sorting yet. So the above command creates a file called spk that contains the second column of this file, the speaker IDs. Now we need to create another file called utt2spk, which says: this utterance was spoken by this speaker — only column 0 and column 1. That can be done in two ways. We already have utt here, and we created spk, this column. The file called utt contains only the first column of this file, whereas the file called spk contains only the second column. You can use a command called paste, which pastes the contents of this and this side by side.
So this file, utt2spk, contains the first two columns of this file. If there are 640 files, there will be 640 lines, and each line will contain just this. And of course it does some sorting; that is necessary — Kaldi demands sorted files. So let me see the content of this file, utt2spk. Yes, that's right. Let me look at the tail, the last line: it is this female speaker; she spoke this. And how many lines are there? There are 640 lines, because eight speakers times 80 sentences. So utt2spk is done. Now, for generating the reverse, spk2utt, the Kaldi folks have already provided a Perl script which does the job. Let us look at the head of spk2utt — no, it's a long line, so I'll print only the top line and expand it. You see here, this is one line of the file spk2utt. It says that speaker number 011, male, has spoken all these 80 files. This is actually one single line, and the file will contain eight lines, one per speaker. Any questions? It's trivial; I call it a demonstration. Fine. Now I'll go back to my original input file. Next, we put only the transcription, column 2 onwards, into a file called trans. We take it from the file called text. What does the text file contain? It is actually in data/train — I had not changed my directory; that is why it complained. So line number 45 picks up everything from column 1 onwards of text and prints it as the transcription. Therefore trans will contain only this last part. Let me just check: it is trans, right? Yes, obviously it contains only that. Then: we created a text file which contains the file ID and its transcription, but Kaldi — or I shouldn't say Kaldi, the language-model generation scripts — demand it in a slightly different way. They say that every utterance has silence at the beginning and silence at the end, so the transcription should contain a symbol for silence here and a symbol for silence there. The silence symbols can be the same, and they are added here. You see, <s> says it is the start of an utterance, and </s> says it is the end of the utterance, the end of the WAV file. This is what the language modeling program demands, and we store it in lm_train.txt. Since we have already run this program, let us just check the content of lm_train.txt. trans contained only this, but lm_train.txt contains a beginning silence and an end silence; you see the slash here, the end marker. It looks like HTML markup, and the language-model generation program expects this as its input file. Any questions up to this point? OK. So now we have generated the files for acoustic model training and the files for language model training. And of course we have to repeat the same thing for the test data, for decoding. OK, so that part is done. Now we need to process the files for the grammar. What files are required for grammar generation? lexicon.txt. And these grammar tools also require us to generate certain other files, such as extra_questions.txt and so on for this step; we'll not bother about those right now.
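A consolidated sketch of the data/train files just built (file and script names are illustrative, except utils/utt2spk_to_spk2utt.pl, which is a standard Kaldi utility):

    #!/bin/bash
    # Sketch of the per-utterance files; input columns are 0-based:
    # col 0 = file ID, col 1 = speaker ID, col 2.. = transcription.
    cd data/train
    in=../../etc/train_fileid_spk_text          # the three-column file from earlier

    # text: utterance ID + transcription (Kaldi format; speaker column dropped)
    perl -ane 'print "$F[0] ", join(" ", @F[2..$#F]), "\n";' $in > text

    # utt: only the utterance IDs;  spk: only the speaker IDs
    perl -ane 'print "$F[0]\n";' $in > utt
    perl -ane 'print "$F[1]\n";' $in > spk

    # utt2spk: the two columns side by side, sorted as Kaldi demands
    paste utt spk | sort > utt2spk

    # spk2utt: the reverse map (one line per speaker), via Kaldi's own helper
    ../../utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt

    # trans: transcription only;  lm_train.txt: wrapped in <s> ... </s> for the LM tools
    perl -ane 'print join(" ", @F[1..$#F]), "\n";' text > trans
    perl -ne  'chomp; print "<s> $_ </s>\n";' trans > lm_train.txt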
Let's continue. Now, there are some requirements on this lexicon.txt. Let me see which directory we are in: we have moved to data/local/dict. Let me just move there too, so that I don't have to type the full path all the time. The touch command creates an empty file; it will create these two files. In fact, if a file already exists, touch doesn't do anything except update its modification time to the current time. But if the file doesn't exist — and why wouldn't it exist? because we had cleaned up this directory — then touch creates an empty file. For lexicon.txt, there are some requirements. The file must necessarily contain two particular entries; this is a requirement of the grammar generation software. We'll just do that. So we use one Perl command to append these two lines. But before appending, we rename the old lexicon.txt as ori.lexicon.txt; this is my convention, I always keep an old copy. And to that I just append these two lines, because lexicon.txt demands it. So far, whatever is highlighted here is done by lines 71 to 72. Then there are other requirements, which many times we forget; that's why I have automated the checking. The lexicon file should not contain this or this: lexicon.txt should not have duplicate lines. What do I mean by duplicate lines? One word can have multiple pronunciations; that is OK, that is not a duplicate, because the second column differs. But sometimes, by mistake, we write genuinely duplicate lines, and they should be removed. Secondly, lexicon.txt should not contain any empty lines, because the tools crib later. So we drop any duplicate lines, empty lines, et cetera; that is line number 74. Associate the comments written here in blue with the next four lines. Fine. Now, if I look at lexicon.txt, it will have no duplicate lines, no empty lines, and so on. Let me just check lexicon.txt: it contains these two lines, as needed, and it has been alphabetically sorted. We added these two lines, but we sorted the file; that's what is here, sort unique — the sort must be somewhere here; yes, there's a sort here. OK, it does it. Then we also need to create — now we'll go to line number 75 — a list of the phones in this language; I shouldn't say in this language, in this database. What are the phones? They are all here in lexicon.txt, in column 1 onwards (0-based), that is, the second column on. So we pull out all the symbols there and make a unique list. Line number 75 does that: it takes lexicon.txt as input, pulls out everything from column 1 onwards — only these symbols — sorts them, and makes a unique list, phones.txt. So let me see that list of phones; let me do a head of phones.txt. You see that there is a phone called a. There's also another one — can you guess what it is? aa is the long a. What do you think aan is, with the suffix n? Nasalized. See, in Hindi, how is paanch, five, pronounced? With a nasalized aa. In Marathi, it is a normal aa.
But in Hindi, it is nasalized. So there is not just aa — aa is there in aath — but this nasalized aa is separate. So in the list of phones, not only is aa there; aan is also there.
Speaker 3: So in Bengali, it should be paanch; at the end, there will be a chh.
Speaker 1: Yes, yes. Here there will be a chh, correct. This is Assamese; they probably pronounce it as paas. But yes, right, exactly.
Speaker 3: In place of the s sound, they pronounce a ch. We say Sonu Nigam; they say Chonu Nigam.
Speaker 1: Ah, chh, chh, yeah, OK, right. Yes, there are language-dependent rules; all those scripts are similar, but there are language-dependent rules, correct. OK, leave it, that's fine. So, the number of phones: I can check with the wc (word count) command how many phones there are. There are 16 phones. In total, if I look at lexicon.txt, there are 10 or 12 lines, and in column 1 (0-based; let me call it column 1) there are 16 distinct symbols. How do I know 16? The wc told me. So I have created phones.txt, and I have created lexicon.txt according to what the grammar tools demand. And the tools also demand a list of the phones with silence separated out. Out of the 16, some phones correspond to silence. Let me just check which: there's a symbol — is sil there? — sil is the symbol for silence. And the Kaldi tools require a separate list which does not contain silence. That is just a grep; we'll take care of it. It's also created. So there's an explanation here: the last two commands create phones.txt, which contains a unique list of all phone-level labels present in column 1 onwards of lexicon.txt, and nonsilence_phones.txt, which contains the list of all unique phones excluding sil. Therefore phones.txt contains 16 and nonsilence_phones.txt contains only 15. You can see that. Fine. These are requirements; we'll not worry about them. And somebody said there can be cough sounds and things like that: you can add them, add them to nonsilence_phones.txt. We are not bothered about that here. Fine. Then two more, please: these are required by Kaldi. It requires something called silence_phones.txt, which should contain only one line, and another one that also should contain only one line. There are variations; we'll not bother about them. So these two files are needed by Kaldi; we create them, and done. OK. So how many files are there right now? Where am I? I am in the dict directory, and in the dict directory there are nine files: not just lexicon.txt but ori.lexicon.txt, various things. Don't worry about those. Any question up to here? No? OK.
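A minimal sketch of those dictionary-preparation commands (the silence symbol and the two mandatory lexicon entries shown are illustrative placeholders; use whatever your recipe actually demands):

    #!/bin/bash
    # Sketch of the dict preparation; assumes space-separated lexicon entries.
    cd data/local/dict

    # keep an original copy, append the two mandatory entries (placeholders here),
    # then drop empty lines and exact duplicates
    mv lexicon.txt ori.lexicon.txt
    { cat ori.lexicon.txt; echo '!SIL sil'; echo '<UNK> sil'; } \
        | grep -v '^[[:space:]]*$' | sort -u > lexicon.txt

    # phones.txt: unique list of every label in column 1 onwards of the lexicon
    cut -d' ' -f2- lexicon.txt | tr ' ' '\n' \
        | grep -v '^[[:space:]]*$' | sort -u > phones.txt

    # nonsilence_phones.txt: the same list minus the silence symbol
    grep -v -w 'sil' phones.txt > nonsilence_phones.txt

    # silence_phones.txt and optional_silence.txt: one line each
    echo 'sil' > silence_phones.txt
    echo 'sil' > optional_silence.txt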
So now the script ended; it exits, and when it exits we go back to our AS-number-21 directory. Let me go back there. I see I'm still in data — that's why I did pwd; I get confused. Of course the prompt shows it, but I am a little careful, because I have made mistakes. pwd: I am in the number-21 directory. Now, there are no questions here, so let me close this and go back to the readme. So we understood this: the prepare-for-Kaldi-train script assumes the inputs we discussed, and it created files such as these. Now, let us do the other two things; we stopped at this particular point. Let us look at this script, and then we'll take a small break. The first script we look at is the one that creates wav.scp. In Kaldi, there are files called script files, which are processed by shell scripts, possibly in parallel; therefore they are called script files, and their extension is .scp, an abbreviation of "script". Kaldi demands this file, and it is created by this shell script. Where is that shell script? It is in the current directory itself, at the lower bottom. So let me open that file, the create-wav-scp script. This one was written in bash itself. By whom? The person who wrote it was Abhishek Day, a project staff member at IIT Guwahati three or four years ago, and I edited it to suit our AS number recognition system. Now, what is the goal of this script? To create the wav.scp file for the train data and the test data. What is the input? The utt file is necessary. Let me just have a look at it: data/train/utt. Yeah, it just contains the utterance IDs — it's also written here: utt contains the names of the WAV files without the extension, and the following is an example line. OK, we saw that. The output will be the wav.scp file, which will be in the train directory as well as the test directory. And wav.scp contains, for each WAV file, its ID and its full name with the absolute path. What is an absolute path? utt provided the ID; that's column 0. Column 1 of wav.scp contains the entire path, because when this is processed by the system, it doesn't deal in relative paths — this could even be in the cloud, whatever it is. Therefore there is a path that begins with a slash. In my older setup it was under samudravijaya, so you see various things there; that is the convention. But now our paths are slightly different; it will become /home/something/kaldi or so. That's OK; this is the required output. OK, let us do a full trace; this is very small. This line is for the train data, and the rest is for the test data. So: wavpath is where the WAV files are stored — the present working directory slash wav. The present working directory is here, and under that there is a wav directory; yes, it contains the WAV files. datapath is where the output should be stored: data/train. These two lines define two local variables, and then this single-line command does the job. It takes the utt file, which is like this — each line contains only an ID — and pipes it to another program called awk. Awk is similar to Perl; somebody wrote it, so I'm just using it. It is useful for processing tabular information. What it does is: it writes the first column, then the wav path, followed by a slash, followed by the file ID, which is shown here, and it also adds the extension .wav. So essentially it prints this first, and then this entire thing, where the file ID is preceded by the wav path given here and a slash, and followed by .wav. So lines of this kind are generated. Where? In data/train. Let's look at whether the file has been generated. What is the input to this program? This. What is the output? A file called wav.scp; scp is for script. That's what it contains. Last night I installed Kaldi afresh here, under kaldi; earlier it was on the desktop, now it lives here, and it's running. Any questions here? All right, so that was trivial.
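The essence of that one-liner, as a sketch (variable names illustrative):

    #!/bin/bash
    # Build wav.scp for the train half; the test half is analogous.
    wavpath=$PWD/wav        # absolute path to the wav files
    datapath=data/train     # where the output goes

    # each utt line "AS011M0001" becomes "AS011M0001 /abs/path/wav/AS011M0001.wav"
    cat $datapath/utt \
        | awk -v w=$wavpath '{printf "%s %s/%s.wav\n", $1, w, $1}' \
        > $datapath/wav.scp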
So this is the file that Kaldi's signal processing will take; the MFCCs, mel-frequency cepstral coefficients, will be computed starting from here. Let's go back to the readme. The next one, where some of us had a problem, was this. It creates a grammar file, a language model. Which language model? A bigram language model. And this is a shell script; let's look at it. It gave lots of messages, and for some people, errors; one person still has a problem. My suggestion is to do what I did: reinstall Kaldi, and it will be all right. It happened to me. OK, so let's look at this file, create-bigram-language-model. How many lines are there? There are lots of lines, but I'll tell you the essentials; the rest we can more or less forget. This is also a bash file. Generally, I put a date command at the beginning and at the end so that I know how much time it took; that's a personal practice of mine. It defines some variables: the training dictionary and so on.
Speaker 4: Sir, I reinstalled Kaldi today itself, but I'm getting the same error I got yesterday, at line number 95, sir.
Speaker 1: OK, we will address that at the end of today's session. My guess — did you change 1.7.2? You did that, didn't you? There is some clash somewhere. You see, whenever we install, the output should be put into a log.txt and checked for any errors. Sometimes it happens; I don't know. We'll see it later; hopefully we will be able to solve it, but let us see. All right. Now, what is in this file, create-bigram-language-model? It runs cmd.sh; see, if you look at the terminal output here, there's a cmd.sh. This is something provided by the Kaldi folks themselves, slightly modified. Let me see what it contains: whether you are using a GPU or not, and things like that. We ignore that; we are all using CPUs, no CUDA and all those things. We are going to use run.pl; no further options, forget it. It will run that. And it also checks whether there's a path.sh; if there is, it executes path.sh. Of course, we looked at path.sh yesterday, and we changed one number in it too. This path.sh is also almost standard, except for this particular part; actually, we added it — we meaning IIT Guwahati — and I changed this number as well. OK. Now, where is the dictionary? What is the name of the dictionary directory? dict. What is the name of the output language directory? lang_bigram. In which file is the training data information? It's entered here. These are local variables; don't worry. This line, line number 15, says that we are going to use a bigram: ngram=2. There should not be any space on the left or right of the equals sign; that is a bash convention. Then echo just prints messages: training the n-gram language model. If any old output is there, it removes it, and it makes the necessary directory. We'll run through this quickly. It prepares the language directory: this preparation script takes the dict under data/local as input and generates some files. Those same files are used by the next command, the build-language-model script, which generates a language model, but gzipped; that's what they do, since big language models can be stored efficiently that way. And then there is the ARPA language model — ARPA for the Advanced Research Projects Agency in the US, three or four decades ago, an old story. That generates the language model. It also checks whether there are any out-of-vocabulary words; but we have been smart, we ran the check-lexicon script and ensured there are no OOV words. If there were OOV words, the WAV files whose transcriptions contain such words — words not in the pronunciation dictionary — would simply be ignored; that's why all this checking is done. OOV means out of vocabulary. And then comes the big command: there are lots of pipes; it does various checks and so on, and it compiles a finite state transducer, a graph representation of the language model. At the end, it generates this file, G.fst, and it also checks whether the generated G.fst is stochastic: is the finite state transducer stochastic? So when you run this script, it reports whether it is stochastic or not.
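The heart of that pipeline, sketched with Kaldi's arpa2fst and OpenFst's fstisstochastic (the ARPA file name and symbol-table path here are illustrative):

    #!/bin/bash
    . ./path.sh    # put the Kaldi and OpenFst binaries on PATH

    # compile the gzipped ARPA bigram into a grammar FST, G.fst
    gunzip -c data/local/lm_train.arpa.gz \
        | arpa2fst --disambig-symbol='#0' \
                   --read-symbol-table=data/lang_bigram/words.txt \
                   - data/lang_bigram/G.fst

    # report whether the resulting transducer is stochastic
    fstisstochastic data/lang_bigram/G.fst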
That is the output of this particular script. The grammar (it is called capital G for grammar, FST for finite state transducer) is a binary file, so we can't read it, but it is generated at this line of the log. Any questions? Where is it generated? It is generated from the training data. It will not touch the transcriptions of the test data; obviously, it is not supposed to know them. So let me check whether the file exists. Of course, our messages said that it exists, but let me check. Where should it be? Under data; not data/train, it is the lang_bigram directory. G.fst: yes, it is there. Is it a text file? Let me check the nature of the file. The file command gives me the nature of this particular file. It says that this particular file is an OpenFst file; OpenFst is the open-source FST library. It's a binary file containing finite state transducer data. And what is the type of the object? A vector FST, arc type standard, version number 2, number of states 12, something like that. So we have done up to here. Any questions? I'll wait for 10 seconds. No? So let me just go back to my README. We have done up to here. Well, one person still has an error, but we'll see that later; otherwise, at least four of us have had no problem. Then the last step is this. We'll do this, and then we can say: ah, data preparation is over. This will take a little bit of explanation, about 10 minutes, and then we'll take a 10-minute break. Now, the last thing. Where am I? I am here. What are the files which are there? These are the files, and I see a folder called conf here. Let me list the contents of the folder called conf. You will see two or three types of files; look at the extensions. One is .config, a configuration file. Another is just .conf, another configuration file. The third extension is .proto, prototype: this is the HMM topology. There are two files with the extension .config: one is the decode config, the other is the decode-with-deep-neural-network config. We'll not bother about that one; we'll start with the default. But let us look at the files with the extension .conf. There are two kinds: one set begins with fbank (fbank is filterbank), the other begins with mfcc-something. You see mfcc underscore various sampling frequencies.
Speaker 4: What?
Speaker 1: Give me 10 seconds. The sampling frequency of our WAV files is 16,000 hertz. How do I know? Well, I can check using the file command. Let me take one WAV file; let me say 010, capital M, probably. No. 645. No, no. 011, 010, OK? There's one file. Just look at what kind of file it is. It says: WAV audio, 16-bit, mono, 16,000 hertz. So if you take a new database and you don't know the sampling frequency, whether it is 16-bit, whether it is mono or stereo, and things like that, all of that can be found out with the file command. OK. So there are two or three questions. Yeah, please ask.
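The check just described looks like this (the file name is made up for illustration; substitute one of your own WAV files):

    file wav/010_0001.wav
    # typical output:
    # wav/010_0001.wav: RIFF (little-endian) data, WAVE audio,
    #                   Microsoft PCM, 16 bit, mono 16000 Hz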
Speaker 3: Hello, sir.
Speaker 1: Yeah.
Speaker 3: Sir, suppose we build a model at some particular frequency, say 16 kilohertz or 8 kilohertz. But if there are some files present with a different frequency (suppose we are going with 16 kilohertz and there are some files at 8 kilohertz), will there be any problem with efficiency?
Speaker 1: Yes, though not efficiency; your program will run. I mean, Kaldi expects all the files to have the same characteristics. That is, all the files should have been sampled at the same frequency, and all the files should be mono, not stereo. All the files should have the same bit depth, 8-bit or 16-bit; either is OK. But every file, whether it is in the train data or the test data, should have exactly the same characteristics. This should not change.
Speaker 3: And I came across an article where they had done some work. At 8 kilohertz they got some accuracy, but at 16 kilohertz their accuracy increased, suppose by 1% or 2%. So what do you say?
Speaker 1: Now, one minute. There's a distinction here. They might have recorded at 16,000 hertz and then downsampled to 8 kilohertz, and the accuracy at 8 kilohertz is lower. In general, accuracy is better with 16,000 hertz; that's understood, because there is more information. Will it take much more time to execute? No, no. With 16,000 hertz, the extra execution time is only in the MFCC computation, which is maybe 1% of your execution time, because most of the time goes in decoding using the HMMs, not in signal processing. So if you have the option, record at 16 kilohertz or 44,100 hertz; it doesn't matter, because you can always downsample. When you reduce the sampling frequency, you do not lose information (well, a negligible amount; it doesn't matter). But if you record at 8,000 hertz and you upsample it to 16,000 hertz, you don't gain any information; you are just matching the format. That's not good.
Speaker 7: Yeah. Actually, I had a question on the bit depth, because I had previously built a model; I tried an HMM on a normal English number system using 32-bit audio. That gave a really bad model. But then, after using an online converter or something to convert back to 16-bit, the accuracy didn't change much. So when you're collecting data, is it important to set the bit depth initially to 16, or...?
Speaker 1: Yes and no; it's up to you. 16-bit is good enough for most speech applications, because 16 bits allow the amplitude to be plus or minus 2 to the power 15, about 32,000, which takes care of an SNR of something like 120 or 140 dB, which never occurs in real life. Therefore, 16-bit is good enough. But 8-bit is not adequate. OK?
Speaker 7: Thank you.
Speaker 1: OK, so the summary is: the recommended bit depth is 16-bit; that is good enough. And even if you record in stereo, you can always convert it into mono. It's a trivial one-line command; you don't even need an online tool. A SoX command will do it. And when you record, do not record at 8 kilohertz; record at least at 16,000 hertz. You may want to record at 44,100 hertz, because that's the default sampling frequency of your handheld recorder. Many times (suppose it's the Adi language) I may have to go to a village and ask people to read, in which case I can't take my laptop; I can record only using a handheld audio recorder, and its default is 44,100 hertz. Converting from 44,100 hertz to 16,000 is a single one-line command; SoX will do that. That's trivial; I'll show you how, below. On the other hand, if you are recording in the lab, whatever is highlighted here on the bottom left is preferred. And now, since we are at it: many times you can coax your friends to come; two or three friends will come to your lab and record speech. Most people don't want to come. Who will come? On the other hand, people are willing to use their mobile phone and read a set of sentences, or speak, whatever it is. Now, the telephone network by default uses 8 kilohertz; this has been the standard for almost a century, and it uses mu-law and things like that. Doesn't matter, but 8 kilohertz. Of course, these days we don't use the regular landline; we use WhatsApp or some such thing over the internet data connection, and some of those do go to slightly higher frequencies and so on. So we may not have control over the frequency: if you are using a mobile phone, we do not have control over the sampling frequency, the bit depth, and things like that. But it is our responsibility to make sure that all our WAV files, whether train or test, conform to one standard, the same format, as highlighted on the bottom left. Got it? Any conversion you need to do, do it, preferably toward the slightly better format, preferably to something like this. This is general advice. Yeah, over.
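The SoX one-liners referred to above would look roughly like this (file names are hypothetical; the flags are standard SoX options):

    # 44,100 Hz stereo handheld recording -> 16 kHz, mono, 16-bit:
    sox field_recording.wav -r 16000 -c 1 -b 16 converted.wav
    # plain downsampling, 16 kHz -> 8 kHz (discards information, never invents it):
    sox input_16k.wav -r 8000 output_8k.wav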
Speaker 7: OK.
Speaker 1: Next question. All right, so let's execute this command. Now, these lines assume that the sampling is 16,000 hertz. Why? Because we saw it is 16,000 hertz. And mono is required by Kaldi, because Kaldi gains nothing from stereo data; it just increases the space, and Kaldi doesn't make use of it, at least as of now. It uses 16 bits. So this information, that our data is 16-bit, mono and 16,000 hertz (at least the sampling frequency and the bit information), has to be specified to Kaldi for training. And that is specified in these two files: one called mfcc.conf, another called fbank.conf. These two files should exist in the directory conf. When I listed conf, these two files were not there. Which files? This file and this file. Intentionally, I have not put them there, because it is your job to put the right mfcc file according to the sampling frequency. So let me look at the content of the file for our sampling frequency. What is the first file? mfcc_16000. Let me just display this file with more; it is in the conf directory, mfcc underscore 16000. OK, that's it. It has a format that we don't have to worry about, and this is a comment. use-energy equal to false: that means that when you compute the mel frequency cepstral coefficients, you do not use the zeroth cepstral coefficient, that is, you ignore the energy content. Because sometimes somebody may speak very softly and somebody may speak very loudly, the energy should not matter in speech recognition. That is the first line. And the second line specifies the sampling frequency. On the other hand, if I look at the 8,000 one, only that line has changed; that's all. This is normal. Now, I have been teaching students, and even while editing they make mistakes. So what I did was, when I provided the zip file, I created the various mfcc files: for 16,000, for 22,000, for 48,000, the various common ones. And what we will do now is just provide the shortcut. Kaldi looks for a file called mfcc.conf, and when we provide a soft link, this mfcc.conf will actually point to the right variant, and Kaldi will read that file; similarly with the filterbank. So let us execute these four lines and take the break. Before that, I want you people to do this too: ls conf. How many files do you see? 3, 3, 9 plus 2: 11 files. After executing these commands, you will have two more files. Let's execute these commands. Now I will do ls conf again: mfcc.conf and fbank.conf. Anybody got a different count? Everybody should get this; it has nothing to do with the bigram language model. OK, nobody, so data preparation is over. Now, these are notes on the files created in the above database; these are the files that are required, and we already talked about them. Now we want to do this job. It may not take much time if things are OK, but still, let us take a break. It is already 1.45. Is a 10-minute break OK? Yes, sir. OK, so do we come back at 1.55? Yeah.
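Putting the conf pieces together, the two-line config content and the soft links look like this; the variant file names (mfcc_16000.conf, fbank_16000.conf) are assumptions based on the listing described above, and --use-energy / --sample-frequency are the standard Kaldi feature options:

    # conf/mfcc_16000.conf:
    #   --use-energy=false          # ignore C0/energy; loudness should not matter
    #   --sample-frequency=16000
    cd conf
    ln -s mfcc_16000.conf mfcc.conf     # the names Kaldi actually looks for
    ln -s fbank_16000.conf fbank.conf
    cd ..
    ls conf                             # 11 files before, 13 entries now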
Speaker 2: OK, we'll wait. OK, let's continue.
Speaker 1: So now most of us are ready to train and test an automatic speech recognition system using the Assamese number database. A number is a sequence of digits. Now, OK, let's look at the files that are there in this directory. There is a file called run.sh; that's what we are going to use. Let's look at all the files in this directory and get an idea of what they will be used for. cmd.sh is for settings such as whether there is a GPU or not. The conf directory, in blue colour, is a directory; it contains these configuration files. This shell script is for creating the language model; right now bigram, and if you want, you can also make a trigram. The next script, create wav.scp, is to create a script file which contains two columns: the first column contains the file ID, and the second column contains the location of that file anywhere on the disk. So the audio need not necessarily be under wav in the current directory; it can be anywhere, as long as you change the script file. You don't have to copy it here; that's what I want to say. But the simpler way is just to provide a soft link to the directory; we can do that later. Then the data directory: we know what it contains. demo.sh: I also wrote a shell script, because after training the system it is fun if I can speak some Assamese number and it prints out the output. It will be wrong; it doesn't matter, but at least it's a working system. It's a demo; we can look at it later. The etc directory we have already seen. The exp (experiment) directory will be created, because that's where all the trained models and various other information from the experiments will be kept; right now it is empty. The local directory, and the steps and utils links, contain files created by the Kaldi people. path.sh tells where the FST tools are located and so on. And we talked about the prepare-for-Kaldi-train script; we looked at it today. Now we'll inspect run.sh. run.sh runs the Kaldi training and testing; it's a bash script. We'll have a look at it. Why do we want to look at it? Because if your laptop has multiple cores (these days some laptops have four cores and so on), many things can be done in parallel. For example, four speech files can be processed simultaneously and independently on four cores, and it helps even during decoding, that is, recognition. Therefore, you can set the number of parallel jobs (NJ stands for number of jobs) for training and for decoding. There are some constraints here that you can read on your own. So, for example, how do I know how many cores there are in my PC? I might have forgotten. For that, there's a command called nproc, for number of processors. In my case, nproc gives eight; my machine has eight cores. So remember this value, and let us edit run.sh and set these values. Let me open run.sh for editing. And I want you people also to run nproc, remember the value, and edit run.sh. Yeah, question.
Speaker 7: So, for example, let's say we have eight cores. Will setting the value to eight slow down the system, or should we set a smaller value? Some of us had this doubt in the morning.
Speaker 1: OK. Now, when you say slow down the system: if you have some other job running, well, that other job will slow down, OK? But if you're running only this, it doesn't. Your normal commands; I mean, if you're doing video on WhatsApp, of course both will slow down, but normal typing, email, checking, that kind of thing doesn't matter.
Speaker 7: Sir, also, I've seen that a lot of scripts set the number of cores to even numbers. Is there any rule for that, or can people set any odd number also, like three or five?
Speaker 1: You can set whatever you want, provided... OK, what are the rules? It should be at most the number of cores. Suppose I have eight; in my case the answer is eight here. I can set train NJ as seven, six, anything; doesn't matter. I can set it as one also, no problem. Got it? OK, good. Now, if you want to run some other thing in parallel and you don't mind this training process going a little slower, reduce it; no problem. Of course, there are other constraints. You also have to look at the number of training speakers and things like that; right now, we'll not worry about it, OK? Now, in fact, sometimes people even set this number greater than the number of cores. I have eight cores here, but in principle it is possible to set train NJ equal to nine also; Kaldi doesn't object. Where would that be useful? Look at the following. Processing a file involves file I/O, input-output. Reading a file from the disk takes time. So while one file is being read, another file can be processed, and therefore some people do this; but I wouldn't go that hard on the PC. I'll stick to this convention, as in the sketch below. OK. So let us edit the file. Let me go to run.sh. It is a bash script. It clears the screen and prints the date; that's fine. It runs cmd.sh, it runs path.sh; we don't have to worry about those. Certain things here we will not worry about now; when you know GMM-HMM well, you will understand them. Now, I am concentrating on these lines, 22 and 23. I have set it to four by default. I can set it to eight, but I will leave it at that; it doesn't matter, because I would assume that most of your laptops have four cores. All right. Anybody with more than four cores? Anybody competing with me? Six cores, I see. That's a nice number; I thought it was always a power of two, but it's six. OK, it's fine.
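In practice, the check and the edit amount to something like this; the variable names train_nj and decode_nj are typical of such recipes and should be confirmed against the actual run.sh:

    nproc              # prints the core count, e.g. 8
    # then, near lines 22-23 of run.sh:
    train_nj=4         # parallel jobs during training; keep <= core count
    decode_nj=4        # parallel jobs during decoding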
Speaker 4: Most laptops, you see... sorry, desktop. I'm using a desktop, sir.
Speaker 1: Yeah, yeah. Still, nproc will give you the number. How many? Six, sir. Ah, six, OK. It's a desktop, therefore six. OK, good. All right, so you can set it to six; it's up to you. See, this is a small job, so it really doesn't matter much. OK, all right. So I have set this. I'm not going to make a change, therefore I don't have to save. But let me go back to the README. Edit run.sh appropriately: that is the first thing. Second, what we are going to do is train the system in two stages. I'm a bit cautious here, because many times I make mistakes around the computation of the MFCC features. Why? Not that the MFCC program is wrong, but somewhere my file will have 8 kilohertz instead of 16, or 8-bit instead of 16-bit, and I have not checked. Some file is empty. Some file has only 0.2 seconds of duration. All these nonsense things happen, and they get caught in the first stage, when you extract the MFCC features. So I first want to check that my feature extraction runs all right. From then onwards there is no problem, because I have checked that the language model training runs all right. How did I check that? I ran this. There were errors there; those were the critical data preparation errors, and some data preparation errors still might not have gone. Therefore, step number one: I compute the MFCC features and do not train. If there are errors, resolve the errors. That's what I'm going to do first. So how do I do that? The CLST people have provided something called a switch. If a switch is one, that particular stage is run (see the sketch after this explanation). So let me go back to run.sh. See, it says: set switches. By default, this switch is set to one, and all the training and testing switches are set to zero. And you can do not only the monophone training, which is what we do today; you can set the switches one by one, and that is good too, because most of the errors are caught here, and if any error remains, it is caught at the next stage. If you can train a monophone system, running the rest is usually trivial. And suppose monophone training takes one hour; these next stages may take two or three hours, or four or five hours. The SGMM stage takes probably eight hours or more, and DNN may also take three or four hours (I'm just giving you an order of magnitude). So depending on your database size, you get an idea of how much time it takes to train and test a monophone system. This one doesn't take much time, maybe half an hour, 15 minutes, something like that. Don't worry about that; it's short. Signal processing doesn't take time; the training and testing take time. So first we run this; in the second stage, we run this; and later, at home, you run the rest. Your accuracy will not increase much, because it is a small database. Any questions here? OK.
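The switch convention can be pictured as in this minimal sketch; the variable names are hypothetical, not the actual CLST script's:

    mfcc_sw=1          # stage 1: MFCC feature extraction only
    mono_sw=0          # stage 2: monophone training and testing
    tri_sw=0           # later stages: triphone, SGMM, DNN, ...

    if [ "$mfcc_sw" == 1 ]; then
      echo "===== extracting MFCC features ====="
      # ... the feature extraction commands go here ...
    fi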
Speaker 8: Question. What if we just do the MFCC extraction and the monophone training together? If our database is right, will there be any problem?
Speaker 1: No issues. See, the reason why I don't do it is this: when you run it all together, lots of messages may come. I put them in a log file, and I want to know whether there is an error or not. In the case of a new database, I go in these steps: first this, then this, and so on. Actually, the data preparation part is always done; even if all switches are zero, it is always done. But my practical experience is that these two stages give problems. Well, I shouldn't say problems; in fact, they are helpful to me. They notify me if my data is not in the order that Kaldi expects. That is why I run this independently. And when I then run run.sh, it runs again, but it takes some one minute, so I don't bother about it. So in principle, if your database is very large, you can comment this out in the later stages. All right. So let me go back to my README. What else do I have to do? I have to set this switch to one and all other switches to zero, and then run it, after saving, obviously. Let me just check my switches in run.sh. By default, it is already like that: one, and all other switches zero. So I don't have to make any changes in run.sh; I will do only the MFCC extraction. Before I run it, let me see what all it does. There are some local directory names, database-dependent preparatory work, which we have already done. After that, there's a date command. A date command at the beginning, so I will know how many minutes setting the switches and running these commands took: one minute at most. And then, since this switch is set to one, the block up to the closing fi (if, then, fi) will be executed. What are these lines? First, it prints the command. I inserted this because I need a sign that something is going on; otherwise, for half an hour or an hour nothing appears, and you wonder whether it is running or not. OK, what does it do? It does the MFCC processing for the train data as well as the test data; therefore, there is a for loop here, once for the train data, once for the test data. For the train data, it uses a shell script in the steps directory. See, now the steps directory is useful. And it is using number of jobs 20 here, because file I/O is involved; therefore, 20 files are being processed simultaneously. These are all standard things; let's not worry about them. If this particular command with these arguments fails, then it exits. There's an OR here, the double pipe; this is actually a single line. The idea is: if this fails, it should not go to the next step. Only if this is successful (returns zero) does the next step execute; that is what the OR in bash gives you. After that, once the mel frequency cepstral coefficients are computed, their means are set to 0 and the variances are normalized to 1. This process is called CMVN: cepstral mean and variance normalization; zero mean, unit variance. OK, that's signal processing. And then sometimes funny things happen: some files are empty, some files are short, and all these nonsense things. These two lines check that everything is OK as far as the MFCCs are concerned, and validate it; a sketch of this whole block follows. I don't know their exact contents, but I know the kinds of errors they throw. If an error is thrown about your MFCC directory, you need to take action.
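The block being described corresponds closely to the standard Kaldi recipe idiom below; the exact arguments in the actual run.sh may differ slightly:

    for x in train test; do
      # 20 parallel jobs; '|| exit 1' aborts instead of continuing after a failure:
      steps/make_mfcc.sh --cmd "$train_cmd" --nj 20 \
        data/$x exp/make_mfcc/$x mfcc || exit 1
      # CMVN: cepstral mean and variance normalization statistics:
      steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc || exit 1
      # catch empty, short, or mismatched files before training starts:
      utils/validate_data_dir.sh data/$x || exit 1
    done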
So then it prints a date, and it's done. What we will be doing, essentially, is to execute this, because the MFCC extraction switch is 1. Any questions here? No. So now let me go back to the terminal. We don't have to make any change; therefore, we are not making any change. According to the README, you need to run run.sh. But my general convention is that I always put the output into a log file so that I can inspect it leisurely, because sometimes many things come. So let me run run.sh, sending not only the output but also the errors into log.txt. I could call it mfcc.txt, but let me call it log.txt, so that I know it's a log file. Let me just run it. Now, while it is running, the control is not returned to you; I didn't put an ampersand here. But yeah, one minute. So let me just look at this log.txt with more. OK. It printed the date, and it first builds the bigram language model. All these messages would otherwise keep coming on the terminal, and if there is an error message in between, I will not know; the error message may be somewhere in the middle, and you may miss it. That's why I put it into the log file. It will all be OK, because we have already done this: found no unexplainable problems, various checks. OK, since we are looking at it: checking topology, the HMM topology, OK. G.fst, OK. It took this much time to check the language model, and then it created the wav.scp. How much time did that take? Nothing; almost zero. And then it started doing the feature extraction. It did 20 parallel jobs. How much time did it take? This is 3.02; so seven minutes it took. Seven minutes? No, seven seconds; a short time. OK. And now, ah, there's an error here. Can you see this word, error? Somewhere, some funny character is there: a disallowed UTF-8 character. UTF-8 is not ASCII. Maybe by mistake that text file contains a UTF-8 character, a white space, a tab or something. This is a little bit problematic; it can create problems. So what I normally do is grep for error in the log.txt. It tells me: oh, there are two lines that contain it. So then I go and look at this file. Let me just check this file once, for my own satisfaction. I will do a more. If I do a more, it may not show anything, because UTF-8 will display without any problem; I won't even know where the non-ASCII character is. I would need a Perl script or something to check that. Even gedit will probably just interpret the UTF-8 characters. Let me just try this. So one of the files contains something funny; I don't know what that funny thing is. Will I be able to see it in gedit? Probably not. Is there an extra line or something? No, extra lines would have been taken care of. I do not know; right now, I don't see it. So I'll assume that this is a warning, not a fatal error. Let us hope that it will run. If it doesn't run, our original source is there, the numbers text file; we can type it in again. That's OK; that is not a big deal. So let me now assume that things are OK, and let's run run.sh with the next step, number 2: train and test the monophone-based system. We have already extracted the MFCCs; therefore, we'll set this switch to 0, and set the monophone training and testing switches here. And here, you see, I do all this logging also. So let me edit run.sh. What are the things that I have to do?
These things. run.sh: edit it. You people edit it too. Set this switch to 0, because I don't want to recompute and waste time. I'll set this to 1 and then set this to 1. That's it. I need to save it. Save. Before I run it, let us see what training and testing do, what kinds of things happen. It comes here. It just runs this: training the monophone system. And since testing is enabled, there is a little extra step: make graph. What is the graph? It's the graph representation of the language model, a finite state transducer. Now, when you test the data, you just have to run decode.sh. That is OK, but decode.sh will give you the output, whatever it recognized, and we want to compute the accuracy, or the word error rate. To compute the accuracy, you need the existing transcriptions; they will be used. But OK, this graph step is done for another reason too; I'll tell you what the reason is. Suppose I took Assamese speech data, general-purpose speech data: 100 hours of speech data, and text from all the Assamese text I can get on the internet. When I create a language model from such general-purpose text, I can build a big general-purpose language model; it can recognize 10,000 or 50,000 Assamese words. But now I am going to use this system only for recognizing certain specific words. What are the words? Let us say you are checking the status of your railway reservation. You want to book a railway ticket, or you want to check; you call an automated call center. It asks some questions, with words that are very specific. Therefore, the vocabulary is small, and the kinds of questions asked are also few. So in actual usage or deployment, we may have a specific task; the vocabulary size may be smaller, and the grammar (the types of questions or sentences that are spoken) is also very specific to the task. While decoding, that is, during recognition, we want to recognize only those words and only those types of sentences. So we rebuild the language model using only the test-set transcriptions, build a new language model, and build the finite state graph. That's the high-level description; we'll not worry about it further. It does the decoding, and when it does the decoding, it also computes the word error rate. But I added a command here: show me the various word error rates for the various language model weights. So let's run it. I have saved this. I will go to my directory, and I will run. What is the command that I want to give? I want to give this command. See, now I have put an ampersand here, so that even while it is running, I can check the log file. The log file is this. Let me give the commands one after another; it doesn't matter. So yeah, question?
Speaker 8: So what is the nohup?
Speaker 1: Nohup. Sorry, OK, yeah, let me answer this question first. So this is the command that we give. Actually, what we need to run is ./run.sh; sometimes it doesn't run otherwise. That is the command we are supposed to run, but we are putting the output into this file, not only the output but also the errors: this 2>&1 part says that errors should also be written into the file. And at the end of the command, if you put an ampersand, the command runs in the background. Your control returns; your terminal is free; you can type another command. Now, this job, for a large database, may take several hours or even days. Suppose you have done the MFCC computation and spent one hour, you have done the training and spent four hours, and just the testing was running when your session got cut. If you simply run the command again, it will waste one hour recomputing MFCCs, four hours retraining, and so on. We don't want that to happen. Linux has something called no hangup. Hangup: in the good old telephone systems in old movies, telephones were not kept on the table; they hung on the wall, and hanging up meant stopping the call. nohup makes the job ignore the hangup signal: even if your terminal is closed or your remote connection drops, the job keeps running instead of being killed. To be clear, it will not survive an actual power failure or a reboot; whether it really helps in this particular scenario, probably not, I'm not sure about that. But in general, when I run long commands, I prefix them with nohup and suffix them with an ampersand, like this. Have I answered your question, Mitesh? Yes, sir. OK, yeah, next question.
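So the full command pattern is the following; log.txt is just a naming convention:

    nohup ./run.sh > log.txt 2>&1 &   # 2>&1: errors also go to the log; &: background
    grep -i error log.txt             # inspect the log while the job runs
    tail -f log.txt                   # or watch it grow live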
Speaker 3: Hello, sir. Earlier I ran run.sh without changing any values, because by default it was set for the MFCC extraction. But now we have to change the values in run.sh?
Speaker 1: Correct, we have to edit run.sh. The first one, did we run it? Yeah, you ran it; we ran it and produced the log.txt. Now, edit it so that...
Speaker 3: So can we just open this file by clicking on it in that folder, or do we have to cat it?
Speaker 1: No, no, no catting. Edit it: a normal edit, gedit or vi, whatever you like; I don't care. Gedit space run.sh? Yes, yes, gedit space run.sh. I put an ampersand; that's my convention, but that's OK. Let me just show it here. I'll close this and say gedit run.sh; I always want to put it in the background, so I put an ampersand. It opened here. I just go there. Yeah, these three lines I will change.
Speaker 3: Oh, there, in line 29 we put 0, and 31 and 32 will be 1, 1. Correct?
Speaker 1: Correct, because we don't want to re-extract. MFCC will be 0, because we have already done that job. This is a nice way of doing it stage by stage. Once you make the change, you save. And then I'll follow the instructions here in the README. This is the command that I'm giving; let me run them one by one. Copy. I'll go to the terminal. Well, I never make a window full screen; I keep it like this so that I can look here and there and so on. OK, I'll paste it here and press Enter. Now, see, the first job, which was the gedit job, is done, but the second job is running. This number in the square brackets is the job number, for people who don't know; the background job number is 2, and this other number is the process ID. Anyway, we don't have to worry about that. If you run top or htop, you will see that some GMM alignment is happening; the program is running. I'll quit that; doesn't matter. But before it completes, let me also start this other job and keep examining the output file. Which command? This command; I'm going to execute it right now, line number 148. Let me do that. Enter. It said something: nohup, ignoring input. It printed the date. It is training the language model and all this, again and again; I know this is all fine. G.fst: fine. Monophone training has started. When did it start? It started at 16:28. It is using four cores. It initializes the monophone system, compiles the training graphs, and uses the training data with the forward-backward algorithm: expectation maximization, pass 1, 2, 3, 4. All right, it has now done the 35th pass; it will do 40 passes, which is the default value. And whenever this file changes, something shows up here saying a new version exists or something, and I have to refresh it. Let me try to refresh. Is there a refresh? Yeah, it said reload. I'll reload it, and the new output shows up. Let me go down. Unfortunately, I have to do this by hand. It completed all 40 passes; it has done the job. And for the training data it printed 1.1 hours, or whatever that figure is; I don't quite follow it. It has done the training, and the models are there in exp, under mono. It already started the testing: it compiled the test language model, and it does all the other things. And it's giving an error. What error is it giving? If I go here and press Enter: no, it has already done the job; this particular job was done, and I am just looking at the output it is showing. You see the word error rates here? Before I take questions, let me just go to the end. How much time did it take? I'm interested in that. So what I will do is grep today's date, 2021, in the mono.txt; that gives me all the timestamps. It started at 28:15 and completed at 29:39: less than one and a half minutes. It's a small database, so signal processing, decoding, training and testing all together take less than two minutes. Now, these outputs here are not part of the Kaldi script; this is a grep command that I put. I'll show you here: at the end of the monophone testing, the following line was introduced, so that I get to know what the word error rate is.
Otherwise, somebody has to go there and check. No, I want everything to be displayed on my terminal; it shows up here.
Speaker 2: Any questions here?
Speaker 3: Sir, when I ran this, I got some error.
Speaker 1: Ah. What is the error? We will have to see. OK. Any others who have had success? We'll come to that. But any other question from the others? If not, we will break for today. OK, so... Yes, sir, I have a question. Yes, question.
Speaker 8: Sir, when we ran that run.sh just now, we were basically doing just the mono thing, right? And after we ran it, we got some WER, correct? So we have actually trained our acoustic model, right?
Speaker 1: We trained our acoustic model. Is it saved? Yeah, yeah, it is saved, yes. Let me show you; give me a minute. Let me see where it says that it is saved: exp.
Speaker 9: OK.
Speaker 1: So let me just look at that. First, let me see what is there in exp, the experiment directory. You see two directories. The first directory is to do with the MFCCs: MFCCs for train and test. It has lots of log files; we will not worry about those. We already talked about cepstral mean and variance normalization; its logs are here too. Let's now come back to exp. Let me see what is there; I'll go to mono. If you see here, there are some files with the extension .mdl: model files. Is it an ASCII file? No, it isn't. Let me just check that. The data is binary; the model is actually stored in a binary format.
Speaker 8: And all the rest of these .occs files: are they a part of the model, or are they something else?
Speaker 1: OK, no. There are only two models here: 0.mdl and 40.mdl. 0.mdl is the initialization part; monophone training begins by initializing the monophone system, which gives 0.mdl. And how many passes are there? 40 passes. At the end of the 40th pass, whatever model is there is stored here. So this is the final model that will be used for decoding. In fact, there is a link here called final.mdl, and if you do ls space minus l, final.mdl points to 40.mdl. ls minus l, sorry. Can you see?
Speaker 8: OK, OK. So this is basically a soft link, right? It is a soft link, exactly.
Speaker 1: Because what happens is that sometimes... OK, now, since we are at it, let me just go to the WER figures. Which is better: a lower number or a higher number?
Speaker 8: Lower number is better.
Speaker 1: Lower number is better. OK, I'm a teacher, no? I have to ask questions of this type. And most of us are teachers.
Speaker 3: There are some insertions and deletions and substitutions.
Speaker 1: Yeah, that description will come; yes, we should talk about that also. But let us look at the one number first. Which number is the lowest? Here is 6, here 7, here it became 8. Some 5s I see, so I'll concentrate on the 5s. Ah, this number is the lowest. Can you see that? Yes, sir. OK. Now, this number is the lowest for this particular configuration. What is this configuration? It says that the WER when the system was trained with 17 passes and the language model weight is 1.0 was 5.87, the lowest. Therefore, it is best if I use these parameters. Now, I will not train the system for 40 iterations; I will take the model called 17.mdl and use it. The link helps us do that. If I henceforth want final.mdl to point not to 40.mdl but to 17.mdl, I can repoint the link. Of course, the script has deleted the older models, but they can be restored; that is where this link helps, because the decoder will always look at final.mdl rather than 1.mdl or 2.mdl or anything else. The command for that is shown below.
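The relinking itself is a single command; this sketch assumes 17.mdl is still present (or has been regenerated):

    cd exp/mono
    ls -l final.mdl           # final.mdl -> 40.mdl
    ln -sf 17.mdl final.mdl   # -f replaces the link; the decoder now uses 17.mdl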
Speaker 8: So, as you said, it is 17 passes. Does that mean it did the training 17 times, or did it take the first 17 sentences? I don't understand that part.
Speaker 1: OK. This is HMM training. You take the entire training data. You initialize the HMMs with some values (something like arbitrary; not really arbitrary, but some initial values), and re-estimate the parameters of the HMM. HMM re-estimation is primarily the transition probabilities, plus the means and variances of the normal distributions. So you use all the training data and re-estimate; then one new version of the model is created. This is called one pass. Let me repeat it. We start with 0.mdl, which is almost random. We use the entire database and train it once using the expectation maximization training algorithm. We get a model called 1.mdl, which is obviously better than 0.mdl because it has seen some data. And we keep doing this; each of these is called a pass. Typically, we know that it requires at least five or seven passes; this also can be set in run.sh. So in the decoding results, they start showing only the models trained with seven, eight, nine iterations onwards. So each pass is one iteration through the training data, right? Exactly: this is the iteration index.
Speaker 8: So basically, this one gave us the lowest error, that is, the highest accuracy.
Speaker 1: That is the evidence, yes. Now, what I can do is set the maximum number of passes to 17 and stop there; I will not go all the way up to 40. That number is acceptable for this configuration.
Speaker 8: So every time, will it learn the same parameters? Almost the same?
Speaker 1: Like for the first run? These numbers will not vary very much. Can they vary? In principle yes, when there is some kind of random initialization. But in our case, we actually use some rules for the initialization, so I don't think this number will vary. It may vary due to finite precision and things like that, but no; there is no random initialization here in that sense. Everything is more or less fixed, I guess; minor variations, but it really doesn't matter. In general, though, in an iterative algorithm you start with a random initialization, and some initializations do better; you know, gradient descent algorithms. Yes. OK, now let me also, since we are at it, explain the content of this line. It says that the information is in this directory: exp, monophone, decoding with the bigram of the test data. And there is a file called WER-something; you can look at it, and we will look at it quickly. I did a grep, so it showed only the line which is of interest to me. What does it show? The word error rate is 5.87%. It shows two decimal places, but I would stop at one decimal point, because the variation is at that level anyway; so 5.8, and that is good enough for me. But anyway, let us see how this error number came about. There were 640 test sentences, and each sentence contained seven words or so, so in total there were 4,480 words in the test data. Most of them were recognized correctly, but 263 words were in error. Therefore, the error percentage is this ratio into 100, which is about 6%; you can see that. Out of the 263 errors, there can be three types of errors. Some extra words were inserted: 114 insertion errors. Some words did not get recognized at all: they were deleted. And some words got misrecognized: substitution errors. This is the way the word error rate is calculated in speech recognition; I'll write it out below. In the Chinese language, there is no concept of a word (words are not separated by spaces), so there they talk of the character error rate, capital C-E-R; the same concept applies. Any question on this line?
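Written out with the numbers quoted above, the computation is:

    WER = (insertions + deletions + substitutions) / total reference words x 100
        = 263 / 4480 x 100
        = 5.87%

Of the 263 errors, 114 are insertions; deletions and substitutions account for the remaining 149.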
Speaker 3: Sir, in the case of the yes-no tutorial, there are only two possibilities, yes or no. That's why there was a 0% word error rate when we ran it. Yes, and also it is by one speaker. So insertions and substitutions are very improbable there. Here, there are 10 digits, so a digit can be replaced by any of the other nine digits, or deleted, or substituted; so there will be word errors. And suppose we are handling 1,000 or 10,000 words: then the word error rate may increase if the model is not proper.
Speaker 1: Exactly. And somebody actually said that they tried this in their language and the word error rate was very high, and things like that. That is where improving the performance comes in. But you made two points about why the word error rate is higher here and nearly 0 in the yes-no case. One: in yes-no, at any point in time, only two words are possible, whereas here, at any point in time, any of the 10 digits can come. There is a measure for this confusability. The confusability there was between two words; here there are 10 words. Therefore, we say that the perplexity is 2 in the yes-no case and 10 in our case. It is not the number of distinct words in the language; it is how many words can come at any point in time. In a natural language, even if there are many words, we follow grammar rules, so some linguistic constraints come into the picture. Such grammar rules are non-existent here; therefore, the perplexity is the full 10.
Speaker 3: That means that whenever someone writes a paper, suppose they report a word error rate of 9% or 10%, and I have a word error rate of 20%, that doesn't necessarily mean my model is not good. Because if I am taking all possible words in that particular language, and someone else is taking only 20% of the possible words of that language, then their word error rate will be lower. So whenever we report a word error rate or an accuracy, we should mention how many distinct words we are using.
Speaker 1: Exactly. So whenever we see papers: at the end of the experiment, there is always a discussion, "experimental results and discussion"; discussion is a separate section. And of course, reviewers will demand: show me how your system compares with the other systems. You will have to put a table. Don't put only the error numbers; also give the number of unique words in the language, the lexicon size, and write the perplexity there, because many people don't say it. Here we used a bigram; you can use a trigram and various other things, and the perplexity reduces slightly. In some cases, you write the perplexity. Also write the number of speakers. In the yes-no system, there is only one speaker; here, there are eight speakers. With more speakers, the variability increases. And then: is it recorded in quiet conditions, or is it normal natural conversation? Of course, that will make a difference. So whenever you want to compare two experiments, you need to give all these conditions, and then you should compare; scientists will agree with that. But if you just give the word error rate, some reviewer will say: your error rate is 20, that other fellow reported 10, rejected. I agree. Yes. OK. Without going into detail, I'll tell you one more tuning thing, some changes that we can make. Instead of using a bigram, you can use a trigram. Generally it helps; your word error rate comes down. You can run that experiment after the session. That's the first point. The second point: look at these errors. What we normally try to do is make these two types of errors, insertions and deletions, nearly equal. Now, let us compare two successive lines. Which lines? These lines. Sorry, one minute; let's look at these two lines. The word error rate is lower here and higher here, because this one is 17 passes and the second line is 7 passes. Now, let us compare the number of insertions and the number of deletions. Here, the number of insertions is nearly twice the number of deletions. And look here: the number of insertions is nearly 10 times the number of deletions. That is a hint; something you can improve. How can you improve it? There is a parameter called the word insertion penalty, WIP; that is the symbol you will see in the code. Let me now compare these three lines: this is the word error rate with 7 passes, this also with 7 passes, and this also with 7 passes. What varies is the word insertion penalty. If the penalty is 0, there is no penalty: insertions are more, deletions are less. When the penalty is 0.5, the insertions reduce from 255 to 237. By how much did the insertions reduce? From 255 to 237, by almost 20, while the deletions increased by only 2. And that is still OK, because my insertions reduced by 20 and my deletions increased by 2; this is better, and therefore the error rate is smaller. Yes: just by increasing the word insertion penalty, my error rate reduced, because the insertions were very large; they reduced at the cost of a slight increase in deletions. Now let us go to insertion penalty 1.0; let's be a little bit greedy. The number of insertions reduced further, from 237 to 225, by about 12, and the deletions increased by only 1. Obviously, this number is lower, because the total error count is smaller. Any question here? OK, I have a question for you. Why stop at 1.0? Why not make it 1.5, probably reducing the errors further? That is what you should do; see the note just below on where this penalty lives.
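A note on where that penalty lives: in typical Kaldi recipes, the list of penalties that decoding sweeps over sits in the scoring script (commonly local/score.sh) as a plain variable; the exact location in this particular setup may differ:

    word_ins_penalty=0.0,0.5,1.0        # the default sweep seen in the output
    # to try a stronger penalty, as suggested above:
    word_ins_penalty=0.0,0.5,1.0,1.5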
OK, that's up to you. But what has been done in the shell script is that, by default, it decodes with three word insertion penalties, because these are the typical ranges. Again, this is something that you need to tune. Why only word insertion penalties of 0.0, 0.5, 1.0? Why not 1.1? Why not 0.7? Of course, go ahead: you have a lab and a PC; run it, find out, plot the word error rate as a function of the word insertion penalty. You can do all these things. In your thesis, you will write all these things; in a paper, you do not write it, because this is a standard procedure. So I showed the word error rate for three word insertion penalties at 7 passes, and you saw a significant decrease in word error rate across the different penalty values. What about the best number of passes, 17? If I change the penalty from 0 to 0.5 to 1.0, it changes only slightly. Not by much: 6.16, 5.92, 5.17. So the changes are small; if you want, you can tune it further. So I'm talking about the tuning things that you need to do, which are clearly visible here. It can be done. Where do you go to change it? You have to go through a little bit of scripting; it is all there somewhere, and we can talk about it at a high level in another session. I don't remember exactly where it is, but I know I can go in and find out. That is the first point. Second point: these word error rates are low because our test data is the same as the training data. Why did we do that? Because we had very little data. If we split it into test and train, then we would not have enough data even for training the models. That's why we did that. But when you report results: well, I shouldn't say nobody wants to know the accuracy on the training data; there are some places where you need it. But in general, you report the accuracy on unseen data, data that has not been used for training, that is, test data. By convention, test data means data that has not been used for training. That is why, in the CLST README (let me go to the beginning of the file), I specifically talked of two types of transcriptions, one for the test data and one for the training data. The same file ID should not be there in both; the test and train file IDs should be distinct, and you should always report the decoding results on the test part. Therefore, in general, you should divide the data into at least two parts. I generally take two-thirds for training and one-third for testing, and then do a threefold cross-validation and so on; those are the details, and that information is there at the end. Where is it? Yeah. So we have done this. Sorry, let me now go back. We have done step number 2; this is step number 2, already done, and we were inspecting the word error rates. Now, of course, you can do this next part also and run it; you do it later. And then some descriptive notes are here. Note that some manual intervention is needed: splitting the speech data into train and test parts is a manual task. Depending upon the parameters, divide the data into two parts with mutually exclusive speaker sets; I talked about that. There are some extra things, such as which words are confusable and so on; there are some scripts for that. We'll talk about it a little later. Today, we'll stop here.
So let me... So we have this third part; I assume you can do it on your own, so that's not an issue. So let me go to... yeah, I guess this is as far as I wanted to go today. Any other questions? Yeah, Sajal.
Speaker 3: Hello, sir. Normally, as you said, we take two-thirds and one-third, that is, 30% for testing and 70% for training, or 20-80, or some such split. But suppose we have a small amount of data. Can we add some extra noise to it and use that to increase the volume of the data? Is it possible?
Speaker 1: Yes. In fact, you are perfectly right; your thinking is in the right direction. Suppose I have only 1,000 speech files; that is nothing. So I use 300 speech files for testing and 700 for training. Now, to these 700 training files, I can add a little bit of noise and reduce the SNR. Then I have 700 plus 700, and I give all these 1,400 files for training. Then my system becomes a little bit robust to noise. Still, the sentences have not changed, the words have not changed, the speakers have not changed. This is called data augmentation, a-u-g-mentation. And the beauty is: you are thinking along the same lines as the people who built Kaldi years before you. They also thought of this, and data augmentation is now a standard option in Kaldi; you can just switch it on, and it will add the augmented data automatically. Those are nice people: whatever good practices are there, they have already put them in. Now, where is that switch? Right now, I don't remember. And people in industry use it a lot: for TDNNs, and when you use recurrent neural networks, you need hundreds of hours of data, thousands of hours of data, and so on. So there is data augmentation by adding noise; you can add reverberation noise, you can add white noise. You can also do vocal tract length normalization. A person has spoken with some pitch; my pitch is about 140 hertz. You can artificially increase the pitch by, let's say, 20%, so at around 200 hertz it appears as if a female is speaking. This pitch variation is also data augmentation. But the most common data augmentation is speed variation. People at my age tend to speak slower (I'm speaking fast these days because I want to finish it off, but otherwise, generally, elders speak very slowly). Different people speak slowly or fast. So you take some speech and increase the speed by a factor of 1.1, or multiply it by 0.9 to reduce the speed. This is speed perturbation, and speed perturbation is a standard switch in Kaldi; a sketch follows.
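The speed perturbation step looks like this; the 3-way perturbation script ships with recent Kaldi versions, and the directory names follow the usual convention:

    # create 0.9x and 1.1x copies of the training data alongside the 1.0x original:
    utils/data/perturb_data_dir_speed_3way.sh data/train data/train_sp
    # then extract features for data/train_sp and train on it as usual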
Speaker 3: Some speakers are extraordinary, like Atal Bihari Vajpayee.
Speaker 1: Exactly, yeah. They think a lot. And politicians in India speak slowly. Do you know why? Not that they can't speak fast.
Speaker 3: With a slip of the tongue, some words may come out which cause controversy for them.
Speaker 1: These days there is a lot of that. Yes, various.
Speaker 3: Controversy. That is why Atal Bihari Vajpayee was never in controversy: he thinks, he writes, and then he speaks.
Speaker 1: That is true. And there is another reason: the people who listen to his talks, who go to election rallies, are mostly people from villages, from rural areas. They cannot absorb information as fast as we do; we are used to speaking very, very fast and all those things. Speaking slowly leaves enough time for cognitive absorption, so there is no cognitive overload. Anyhow. Any other question or comment?
Speaker 8: So what was there before Kaldi? Was there ASR before Kaldi?
Speaker 1: Of course. Hey, hey, come on — you people are ignoring people like me, huh? HTK, the Hidden Markov Model Toolkit. What toolkit? HTK. Correct. It was built by a PhD student at Cambridge University in the United Kingdom, around 1995. And then he built a company called Entropic, and it became a commercial system; they were selling it. In fact — yeah, this is 1993, acquired, right — so it was written in the early 90s, in 1993 it became a company, and at KFOR we bought it for 2 lakhs or something in the early 1990s. Remember, that 2 lakhs is likely about 2 crores now. It was expensive, but we wanted to do research, and fortunately I was in a national laboratory and there was a United Nations project, so we could buy it; otherwise we would never have bought it. I used it for about two decades. Only in the last 7-8 years have I been using Kaldi. Anybody asking what was there before HTK? Sir, I would like to know — yes, good. Anybody know? Have you heard the word Sphinx, as in Egypt? Yes: CMU Sphinx. Ah, that one is in the US. So let me go to Sphinx — this is showing something else — CMU Sphinx, the CMU Sphinx models. That is also open source; it started open source and it remained open source. And this, again, was the outcome of a PhD student. His name is Kai-Fu Lee. He was a Chinese student working at the School of Computer Science at Carnegie Mellon University in Pennsylvania, in the US, and he submitted his thesis in 1986. And in 1987 I visited Carnegie Mellon University, thanks to Professor Raj Reddy, who was the head of the department; he later won the Turing Award — the so-called Nobel Prize of computer science — for this speech recognition work. So I was one of the first few people who used the Sphinx model, that was Sphinx-1, and I built a Hindi speech recognition system with some data collected there. There were some 5-6 Indian speakers; I asked them to read some Hindi sentences, built the system, and called it Dhwani. Anyway. So that probably was the first HMM-based model which was openly available, open source. Before that, there were HMM models at IBM, and the people who popularized them were Rabiner and Jelinek, but that was inside the company. Okay, that is the history. And let me also tell you: I was there only for six months, and then I came back. A couple of my colleagues, one Bhiksha Raj and another Rita Singh, are still there at CMU; they are maintaining CMU Sphinx, but it has now gone a bit out of fashion. HTK, some people still use. The problem with Sphinx and HTK is that you have to do lots of steps by hand, whereas in Kaldi it is all automated; it is like Python programming. In C you have to write lots of steps; in Python, you call a library, get the thing done, and forget it. So everybody likes Python, and everybody likes Kaldi. Now Kaldi is also facing some competition: newer groups and foundations are going beyond this, everything is TensorFlow and things of that kind. That is how progress happens.
And there is actually a session at Interspeech a couple of months from now where the future of speech recognition research will be discussed by experts. I got an email yesterday; if anybody is interested, I can share the link. Next question.
Speaker 8: And so now — last night I was looking at various toolkits, and I came across ESPnet and SpeechBrain. These are both end-to-end model toolkits. ESPnet follows a Kaldi-like style; I think it is based on Kaldi. And it is in Python, so we can use PyTorch and such there as well.
Speaker 1: You are right. ESPnet is by a Japanese group. There was a recent paper also, some advanced version. And as you said, it is Python-based, so you can download it and use it. But again, since it is end-to-end, you have to use a GPU. That brings me to the point: whatever I showed here can be done on a laptop, because it is not a sophisticated model. It uses a DNN — you see the DNN switch here; if you set it, it will use the DNN, but the HMM is always there. If you want to get rid of the HMM and use a recurrent neural network — TDNN means time-delay neural network — then your computation increases, and then one essentially goes to the cloud, or you need a GPU. So the purpose of this session was to give you a base, a foundation, where you can build a first-level system on your speech database on your lab PC, whether you have a GPU or not. Now, if you have a GPU, it is better to go to Kaldi with the TDNN model, or ESPnet, or the other one — what is it called, SpeechBrain. But those require thousands of hours of data and that kind of thing. I mean, you can do transfer learning, but we only learnt about that in the ASR school, so even I am not comfortable with it. As I told you yesterday, I heard lots of things and watched the entire proceedings of the ASR school, and I did run the TDNN on the data that they provided, but it was like pressing a button and getting some numbers. I had some problems, re-ran it, and it worked all right. I can say that I did the job, but if you ask me, can you run it on my number recognition system, I do not know where to put what. That comes with practice. And now we have a group of at least half a dozen people who are actually interested. Some of us, with more enthusiasm or more free time or whatever, will learn and share that experience with the others, and soon we will need to go to cloud-based systems. There are some things we can do freely, and some for which you may have to pay a little bit; all of that will be the next part. But right now, I want all of us to be able to run an HMM-DNN-based system on our own databases. Why were you running all these things? Because all the checking we were doing — check lexicon and the rest — takes care of many errors in the WAV files, the transcriptions and so on. Then, if you later run it on ESPnet or anything like that, it will be very quick, and you will not waste any speech data.
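The 0/1 switch convention being referred to might look roughly like the sketch below in a run.sh; the variable names are assumptions for illustration, not the actual ones in this recipe.

```bash
#!/bin/bash
# Sketch of the switch convention (variable names are assumptions).
mono_train=1   # 1 = train the monophone model, 0 = skip
tri3_train=1   # third triphone pass
dnn_train=0    # DNN stage off by default: it needs much more compute

if [ $mono_train -eq 1 ]; then
  steps/train_mono.sh data/train data/lang exp/mono
fi
```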
Speaker 4: Sir, I have a question. Yes. Suppose we have got a word error rate and all. How do we implement it for real-life use?
Speaker 1: Do you want to build an app?
Speaker 4: An app or an online system, sir. As we witnessed during the session, sir, we spoke some Hindi sentences and they got recognized; we got the Hindi sentences as output. So how do we implement that?
Speaker 1: I'm sorry, let me go back; now I'll move out of this place and go to my 21 directory. Let me see whether I'm in the 21 directory. No, I am in AS. Do you see a file here, the demo.sh shell script? Yes, sir. If you want, we'll take 3-5 minutes to look at that. It is meant for showing a live demonstration: suppose we have a model here, you run this script, speak a sentence, and see the words displayed on the terminal. It's a variation of run.sh. Let's look at it. It's only one speaker, so I set it up for a single test file. The MFCC switch, yes. Testing only — I'm not going to do any training, so all the training switches are set to 0; I will run only testing. I didn't put the DNN, but that also can be set, so I can test the output using various models: monophone, tri1, tri2, tri3, SGMM. Generally tri3 works well enough, so it's enough to set the tri3 test switch to 1 and all the other switches to 0. That switch, and the signal processing: whatever you speak will be recorded, the MFCCs computed, and the result decoded using the triphone model. So demo.sh does the testing, but notice there is a stop here, a record-duration switch. By default I set it to 8 seconds; you can change it, up to you. What it does is use a sox command called rec — nowadays sox is installed by default, so the rec command will be there. It records for 8 seconds at the sampling frequency set here; you should change that to your 16,000 or whatever it is. It generates the utt2spk, spk2utt and all those files, runs the MFCC extraction and decoding, and at the end it shows the recognized sentences. When the testing is over, it displays all the outputs, the error rates and things like that. So essentially it is all there, but what you need to check and change — because this was written for some other database — is the sampling frequency to 16,000 and the duration to 8 seconds, 6 seconds, whatever it is. In between, do not speak anything else: once you press enter, start speaking normally and just wait for something to appear on the terminal. So the demo is already there. Any other question? Yes. So you can show a demonstration; that is what I am trying to say. Of course, this demonstration will show the words in Roman letters, 'ek' and so on. There are other ways of printing it in Devanagari or whatever script; those are all trivial things.
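A minimal sketch of that record-and-decode flow, assuming a trained tri3 model and the standard Kaldi wrapper scripts; the duration and the sampling rate are exactly the values you would edit for your own database, and the paths are placeholders.

```bash
#!/bin/bash
# Sketch of a demo.sh-style live test (paths and file names assumed).
dur=8        # recording duration in seconds; edit as needed
rate=16000   # sampling rate; must match the training data
rec -r $rate -c 1 -b 16 wav/demo/utt0001.wav trim 0 $dur   # sox's rec command

steps/make_mfcc.sh data/demo exp/make_mfcc/demo mfcc           # extract MFCCs
steps/compute_cmvn_stats.sh data/demo exp/make_mfcc/demo mfcc  # CMVN stats
steps/decode.sh exp/tri3/graph data/demo exp/tri3/decode_demo  # triphone decode
```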
Speaker 2: Next question.
Speaker 1: All right. So we spent about two hours today, with a 10-15 minute break, and three hours yesterday, to get an overview of using Kaldi for a simple database. Things that you need to do next, for people who have had success so far: step number three, that is one. And then I want you to repeat this job for your own database. How will you repeat it? Will you overwrite this? No, do not do that. So I will talk about that, and then we will break. One person had some problem here; let me try to resolve that. And one more person had a problem with the installation itself; we will look at that a little later. So now I am going to talk about how to implement a similar system for another language, or even the same language but another database. The language does not matter; the database is what matters. You know the requirements. I have put things in a particular form and followed some conventions, and I know that some people use just three digits for the sentence number, in which case you need to change these scripts appropriately. We went through the scripts: wherever a script is pulling out the sentence number, you should pull out the appropriate number of digits. So there are small, database-dependent changes. But where will you implement it? Where am I? I am in the AS number recognition system, and I will go one step up, double dot. I am in the chief directory. Now, if you look, there is a directory called AS number 21, but the old one, with the old database name, is still there; I did not move it, I asked you to re-copy it. That is why I varied the README — you see, I did not overwrite the old one. The old directory is still there, which means you do not have to unzip everything again and again. You just redo this for the AS number recognition or whatever it is: Tamil numbers, another AS database, whatever. And then go there. Of course, this contains the databases; that part is database-dependent. After that, almost everything follows the same procedure, but based on your file naming conventions, you need to edit certain scripts — the ones I edited. Which scripts? This data preparation script you need to edit, because it is pulling out the utterance ID, speaker ID and the transcription. And then you need to edit this one. These are the two basic ones; the rest does not matter, I think. Database-dependent means the file naming conventions that you need to take care of. So if you want to build it for another database, what will you do? You will start from here, cp -R. You have already done the cd; let us say right now I am in the zero chief directory. I will start from here and give another name. Notice that I have put the language and a number here, because I keep working in various languages, and even in a single language there will be several databases. And this is the year; sometimes I add it, not necessarily, it does not matter. Any questions? And I also put these two files here. This README file is not the latest, because I made some changes yesterday and today during our sessions, so I will also post it, just for our reference.
I assume that you also have made the changes. So you can replace it.
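Concretely, the clone-and-edit step being described might look like this; the directory names are placeholders following the language-number-year convention mentioned above.

```bash
cd ~/kaldi/egs/<your_group_dir>       # one level above the recipe (placeholder)
cp -R as_num_reco_21 ta_num_reco_21   # hypothetical: clone the recipe for a Tamil database
cd ta_num_reco_21
# Then edit only the database-dependent scripts: the data preparation script
# that pulls out file ID, speaker ID and transcription, and the one tied to
# your file naming convention. The rest of the pipeline stays the same.
```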
Speaker 2: Any questions by anybody?
Speaker 1: Then what we will do is — it is 8:21 — we will end the session. But Sajal, you do not log out, because somebody has faced some problem, so please stay there. And yeah, others can leave.
Speaker 3: Sir, I also faced some issue. It is not running.
Speaker 1: You have faced some issue. Who else has faced some issue in running it? Now, whoever faced the issue earlier, I want to attack that first. In which stage did Sajal face the issue?
Speaker 3: I faced it after MFCC, when I changed to the monophone training.
Speaker 1: Okay, so instead of monophone you went to the other one? That should be trivial. One minute — did you face the issue while training the monophone model? Yes, yes. Oh, you faced the issue with the monophone model itself. Well, the other person, please just wait; this may be a trivial issue, so I may be able to solve it. However, you need to show me the log file; I assume your errors are in the log file. So I will stop sharing.
Speaker 10: Sir, after changing the switches, I think it is working fine for everyone. But maybe he has not changed it; he has not replaced it with the new one.
Speaker 1: Yes, quite likely. So let us look at his log file and I will be able to. Yeah, you are right. That is why I want the log file. Let him display the log file. Thank you, Kim. Yes, we are at it. Kim, you are okay, right? Yes, sir. Yes. Okay. All right. So yeah, others can leave or wait. I have no problem.
Speaker 10: Sir, can I leave? It is working fine till now.
Speaker 1: Yes, yes. Everybody whose run has worked fine up to the monophone can leave now; the rest is all fine. So all people who had no problem with the monophone, you can leave.
Speaker 10: Please. Okay, sir. Thank you so much.
Speaker 1: Okay. All right, Sajal, show me the editor. It has run correctly, no? I thought it ran correctly — I saw the "ignore input failed" message. You ran it from somewhere else; run.sh, that is the problem. Okay, close it. Show me your terminal. Terminal, okay. One minute. Yeah, just do ls. See, old programs are still running; you should not rerun it without waiting for this step to finish. That could be the problem. But first let me check whether there is a run.sh. Yes, run.sh is there. Do a more run.sh — I just want to check that your switches are okay. More, or gedit. Do you have it in gedit? If you have already opened run.sh, show me that. Yeah, gedit run.sh. The monophone-train switch is 1; that is correct. And you have saved it. This is fine. Okay, now let us go back to the terminal and run that command again. One minute. Run it. It is running now. Okay, now open gedit log_mono.txt — go to the terminal, gedit log_mono.txt, right. Okay, return. Let us go down. This is all standard. Go to the bottom. This we have corrected. Was there a problem? No problem; I thought there was, but there is not. Oh, it is running. Continue, continue. Go down. Yeah, that is the end. So it is running; it will take a minute. Okay, now let me say why it probably got into a problem: if you start one job while another job is still running, there will be a problem. That is why people do not put that ampersand; if you are willing to wait, you do not put the ampersand, and it runs in the foreground. Right now it is still running. Press enter. Oh no — the first one is still running; your original job is still running. Okay, so what I want you to do is open gedit log_mono.txt and keep refreshing; reload it. See, it has changed. Reload. No, on the right side. Reload. It has gone some more. Go down, all the way down. Wait till the final WER comes; then you are done. All right. Okay, Sajal, so you are through. Okay, now who has the create-language-model problem? Sir, I have the problem. Shantanu, okay. Please show me your error file, then I will look. Sajal can leave or stay, as you like. Shantanu, please show me. Right, good. Thanks. One minute — I can see your screen, do not worry, this is good. Can you identify the error? I can see the error. What is the first error? That is correct. Yeah, no error there. Show me where the first error is. So this is a standard problem. Can you tell which line is showing the error? Highlight the error line. Yes, yes: command not found. So either you have not compiled something for the language model, or it is not in the path. So let us find out where it is. I know where it is; we faced this yesterday — somebody else also faced the fstaddselfloops problem, right? Okay, let us go to the terminal. Are we in the terminal? Yes, sir. Okay, fine. Now I want you to type cd ../../../tools, T-O-O-L-S, enter. ls. Okay, let us see. OpenFst is there, IRSTLM is also there. Now: ls openfst-1.7.2/bin/fstc* — probably 1.7.2; tab will complete it, and there is a dash in there too, right? Slash bin slash fstc-star, where the star is an asterisk. Enter.
Yeah. Now repeat the command, but instead of C, put A; we had the problem with the A one. fsta*, return, enter. fstarcsort is there, but fstaddselfloops is not. So it may be in IRSTLM. Let us do the following: ls irstlm/bin/fsta*, return. Let us see — okay, then ls irstlm; somewhere there is a bin, I do not know where that bin is. Just enter, return. Spelling mistake, sorry. What is it? IRSTLM, enter. There is a bin, but it could not find it, okay. Let us execute the third-previous command, the fsta one — yeah, just remove the A: fst*, enter. So, for some reason, this fstaddselfloops command does not seem to be there; something was not compiled correctly, but let me just confirm it. Type in the following command: find . -name 'fst*', with single quotes, enter. Okay, it said something — there are too many of them. Repeat the command, but after fst put an A: find . -name 'fsta*', enter. It said "add self loops" or something it could not find; I forgot exactly what. Okay, cd -, cd space minus, enter, just a single minus. pwd — I am not sure where you are, okay. Now let us go to the Kaldi directory: cd kaldi/egs/0-Shantanu, whatever it is, 0-something, slash AS, capital AS. Let us go to the directory; tab will work, enter. Now show me the error that was shown; I want to remember that error, because I need to see where it comes from. Show me that error message we got — gedit, probably. Sir, here I ran command number — yeah, but I cannot see it; if it is in the editor, you have to share the editor again. Yeah, yeah: fstaddselfloops. I am surprised. One minute, let me see where it is in my directory; then we will know. fstaddselfloops — no, it is not there. Who is asking for it? One minute: prepare_lang.sh. Did you reinstall it? You said you did it today, recently, right? Today, yes sir, today only. One minute, and first question: did you install it into the same Kaldi directory, or did you remove the old directory?
Speaker 4: Sir, I removed the old one and redownloaded and reinstalled it again today.
Speaker 1: Okay, you removed the old directory, right. That is something I suspect a little bit; let us see. One minute, let me just check on my terminal where this fstaddselfloops is required. Give me a few minutes; let me check why it is needed and why that script is asking for it. Let me do some investigation on my side. I am in — okay, zero chief. fstaddselfloops... OpenFst, and src/fstbin. Okay, let me see whether it is there in the src. Let's see.
Speaker 1: Let me just check whether it is still there: find . -name 'fstadd*' — yes, I found it; it is there in fstbin. Now do the following: head path.sh, enter. There it is — so you need to compile it or something. KALDI_ROOT, src, fstbin — I found it in fstbin; src/fstbin, correct. Now, your prompt is too long, so let me shorten it, at least for this session. Type in the following: capital P, capital S, 1, equals, double quote, dollar, space, double quote — PS1="$ " — double quotes, not single, and no space around the equals sign. Enter. Now that long prompt is gone. If you want this to be a permanent feature, I can tell you what you need to do.
Speaker 4: what you need to do. Yes, sir, I found this Shantanu Administrator
Speaker 1: So what we will do is — let us do that first: gedit ~/.bashrc, tilde slash dot bashrc, okay, enter. Now show me your gedit. Yes. Go to the end — okay, one minute, let me just — I will tell you. Go a little bit up, some more up; there is a PS1 line. Oh, these are things that I probably gave, I do not know. Go up. Now, this is the cause of all this. So what I want you to do is comment out line 69, yes, that line, and add a line after 69: PS1="$ " — equals, double quote, dollar, space, double quote. That is it; save it. Okay. From next time onwards, it will always be the dollar prompt. Okay, fine. We can ignore this, close it, and go to the terminal. Give me a minute. Okay, let us check whether it is there. Type in the following command: ls ../../../src/fstbin/fsta*. Do you see that? fstaddselfloops — the .cc is there, the source is there, but the program is not. That is because you have not compiled it; somewhere a step was not done properly. So we will do the following: type pushd ../../../src, return, enter. Okay, now we are in the src directory, and there is a make command. Sorry — more INSTALL first; let us see what is in it: more, space, INSTALL, all capitals, tab, return. Okay, see, there are some commands here, and for some reason not everything has been done — whether you did the tools install, which you are supposed to do first, I do not know. Let us do this installation check again, because if there is a problem there, it will show up here; we will spend some half an hour just running through it. It will probably not recompile everything, because it has already been compiled. So: cd ../tools, return. Okay, more INSTALL — the same earlier command, you can repeat it — return, okay, enter. q, just type q. So let us run that check first — that command, the extras one, the top command. Yes, that one, right, copy-paste. Okay, good. Now let us run the make. Yeah, I do not think that will matter; it may not matter, I do not know. Okay, now you need to run it. Let me try make, just make. It will take some 15 minutes just to check that everything is okay.
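For reference, the sequence being dictated here, gathered up; the dependency-check script name is an assumption based on current Kaldi, where it lives at tools/extras/check_dependencies.sh.

```bash
pushd ../../../src             # from the recipe directory up to kaldi/src
more INSTALL                   # read the build notes
cd ../tools
extras/check_dependencies.sh   # the check run first from kaldi/tools (name assumed)
make                           # rebuild the tools; this can take a while
```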
Speaker 4: Warning, sir — this one.
Speaker 1: No, that will not matter. This warning — relinking, that is okay; there is already a link, therefore it just relinks. What I should have done is redirect the output into a log file and see whether there are any error messages in it. IRSTLM you already installed, right? Yes, sir.
Speaker 4: Sir, it is there — it says IRSTLM is not installed by default anymore.
Speaker 1: Yeah, so just run that command for the extras install, the last-but-one line. Yeah, that one; it will probably do it. I do not know whether you did it — did you do it in the morning? Yes, sir, I did. You already did it; therefore, let us check that again: cd ../tools, enter. There was an ls command we looked at — what was it, fstbin? Can you go up, up — okay, do not do it. ls fstbin — oh no, it was in src, sorry; delete this command. cd ../src, enter. ls fstbin/fsta*, enter. It has not compiled, okay. Now, cat INSTALL or more INSTALL — yeah, anything. Yes, tab works, okay. What do you need to do? Now, step 3: ./configure --shared. Sir, all together? No, no, one by one; there may be errors, I do not know. Warning — wait, wait, let me see — that is okay, I do not think so. "Kaldi successfully configured", okay. How many cores do you have? 6 cores, sir. Okay, so: make depend, yeah, one by one. Yes, yes, it does not matter, let it be; it is just checking that everything is okay, and if something is not okay, it will compile it. Okay, there was one more make command — no, sorry, is it just make? I do not know; cat INSTALL, more INSTALL, I just want to be careful.
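The configure-and-build sequence being run piece by piece here is the standard one from Kaldi's src/INSTALL; on a 6-core machine it is, in order:

```bash
cd ../src            # from kaldi/tools over to kaldi/src
./configure --shared
make depend -j 6
make -j 6
```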
Speaker 4: Shall I run it all together, or?
Speaker 1: No, no, type in the following command: more INSTALL — no, it is just make, that command. Let us also put — it will run faster. There seemed to be a read error or something; we will see, we may have to rerun it. There was some error, and that is why I always redirect the output into a log file and inspect it. Anyway, let us wait. Oh, it is doing something; it is downloading something, so something was not done properly, probably in the last make step. There are some errors — it is okay, it is okay, press enter; it is still going, let it do it. Now let us try that command, ls — what was it, fstbin or something; the earlier command, go back, arrow, arrow, yeah, enter. It has not been done; therefore, there are some errors that we need to repair. So now do the make again: make >& log.txt, return. Just wait, let it complete; it will take a minute. Then we will inspect log.txt. Something is not right; I saw some errors, but I was not sure. But let us see.
Speaker 4: Sir, actually, I installed Kaldi at 2 pm only; I did not have permission to open the office earlier, so I came here at 2 pm, and in a hurry I reinstalled it again. Maybe I made some mistakes.
Speaker 1: Hmm, I do not know, because when we did the installation in the tools directory there was no problem. Do not worry. You are in the lab, is it? Yes, sir, the office lab, sir. The CIIL office, okay.
Speaker 4: Yes, sir, it is Sunday today, so no one is allowed in, but I was informed that I have work after 2 pm, and I have the keys of my office. So I am here only.
Speaker 1: Oh, yeah, yeah, I have been to CIIL, so you have to get an entry permission and all those usual things, yeah.
Speaker 4: But sir, the computer section is not open; I have been provided a separate office, a separate room. Okay, okay. A separate computer system — Narayan sir has provided all those facilities.
Speaker 1: Very, very good, where do you stay? In the hostel only, sir. Is there, is the hostel on the campus? I do not know. Yes, inside campus, sir. Oh, okay, that is good, okay, okay.
Speaker 4: So, I can work here till 9 pm, 9, 10, it is all my choice, sir.
Speaker 1: Yeah, yeah, it is a choice, but it is also responsibility.
Speaker 4: Yeah, yes, sir, a big responsibility, a huge responsibility. The computer unit is closed — it runs till 5:30 only, sir.
Speaker 1: Yes, they are regular 9 to 5.30 job guys, poor guys have to go home.
Speaker 4: That is why I connect by both the ways, mobile and system, because system does not accept my voice, audio calls, so I am not able, so whenever I have to talk, I connect by both ways, desktop and mobile.
Speaker 1: So, right now, right now meet, is it, Google meet, is it?
Speaker 4: Yes, sir, Google Meet. For audio calls, I connect by mobile, and for this work, I connect by desktop.
Speaker 1: Oh, why? The system does not have permission to use the audio, is it, maybe?
Speaker 4: Sir, I was trying during the workshop also, the summer school; my name was being called and I was not able to respond.
Speaker 1: So is this a security aspect of CIIL? No, sir. Is it a problem with your laptop, or?
Speaker 4: I have asked the technician here to check it, because I have to connect both ways.
Speaker 1: So it is a problem with the audio card on your desktop PC. Okay, yeah, that is okay; somebody will fix all that. Okay, Shantanu — yeah, it is still running. I am a little surprised that it is still going; okay, it will take a minute, but let us wait. So, this is your what, fourth month in Karnataka?
Speaker 4: No, sir, sixth month, six months already have passed, I came here during second week of February and since then I have been here only.
Speaker 1: So, how many Kannada words do you know? More than six?
Speaker 4: Yes, sir. First I started learning 'aytu' (okay), and then 'tindi' (snack), 'oota' (meal).
Speaker 1: 'Oota', yes — and there is a tone also, the Mysore tone. See, within CIIL it does not matter, it is a national institute; but what about when you go out, in the bus, in town, anywhere like that?
Speaker 4: The first day, sir, at the University of Mysore, I faced this language issue. I was there for an enquiry about my PhD admission, to go and ask about the process. I went there and was asking in English, and just by listening to my English he understood that I am not a Kannadiga. So, okay. What should I share, sir?
Speaker 5: First experience.
Speaker 4: But the best thing I have experienced here is that people value their mother tongue the most. In my Bihar and all, people prize the English language the most.
Speaker 1: Because the English language means a job; that is why — for work. But again, I will blame the educated people. People who want an education want to learn English from primary school, you know, from KG onwards.
Speaker 4: Sir, but English is not just a language; it has become a matter of prestige. That is why it has become like that. That's true. If you don't know Kannada, you don't feel ashamed; but if you don't know English, you should feel ashamed — that is what has happened.
Speaker 1: That is in the big cities, the small cities. In the villages, it is less, I think. I don't know.
Speaker 4: Sir, it is like that in the villages too. If someone speaks in English, people think he has come from another planet. It is such a scene, sir. In our Bihar, it is very dangerous.
Speaker 1: It is. It is like that everywhere in India, more or less.
Speaker 4: Sir, I really liked it here in Karnataka or in South India that people value mother tongue. That's a great thing. As a linguist.
Speaker 1: You are a linguist, that's why. That's why, sir. Okay, now I am looking at your terminal. Whether it is running or not, I don't know. Can you open another terminal and check? There are, of course, multiple ways of doing it. Are you in the same terminal?
Speaker 9: Yes, sir.
Speaker 1: Okay. Press Ctrl-Z and then type bg. B-G. Return. Okay. Now run top, T-O-P — the job is running in the background. Enter. I want to know whether it is running or not. cc1 — it is still compiling. Yeah, yeah, it is still compiling. Okay, q, type q, q for quit. Now gedit whatever log file we created — I don't remember the name. ls will tell you. Yes, log.txt. No, don't do anything, just ls; type ls and it will show you the log file name.
Speaker 2: Return.
Speaker 1: Enter. It is log.txt. gedit log.txt. Yeah, show me the log.txt; there may be errors there. Let us check that. Yeah, it is coming.
Speaker 2: It says "changed on disk"; you can reload it. Okay. Let us go down and see where the first error occurs. We have to page down. Yeah.
Speaker 1: You can do a Ctrl-F and search for "Err", uppercase — no, just "Err", not the complete word "error". Search. No? Okay, so far nothing. Okay, one minute: "WAR", warning, first. Uppercase gave nothing; let us try lowercase. Try lowercase again. Okay, no. Okay, fine, then close that and reload it, because now let us try it again: warning or error. Normally I would do a grep; Ctrl-F is simpler. No? Okay, "error", E-R-R. No? Okay, so it is running; I do not know why it is still going, but let us wait. Drag it to the end; let us see what it is doing. It is still compiling OpenFst. Why it did not compile earlier, I do not know, but let us see. "Changed on disk" — reload, reload. Let us go to the end and see what it is doing. OpenFst, still compiling.
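The grep the speaker mentions would do the same search from the terminal; a sketch, using the log file created above:

```bash
grep -in 'error' log.txt     # case-insensitive search, with line numbers
grep -in 'warning' log.txt
tail -20 log.txt             # or just look at the most recent output
```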
Speaker 1: Check in the terminal whether it has come back. Press enter. Has it done the job? Does it say "done" or something on the terminal? Just press enter, an ordinary enter. Yes? Okay, then it is still continuing, because if the job had completed it would say "completed" — square bracket number one, completed, something like that. So it is still running. Let us come back to log.txt. It is still running; let us wait. Probably, for whatever reason, some things were not compiled earlier, so it is compiling them now. It appears like that because it is taking more time. Yes, sir, because when we ran the make command earlier, it did not take this much time — one or two minutes — but this has taken five minutes or more. Yes, sir. Okay, let us wait; I will come back. Pardon? "So much patience you have, sir." Well, it always comes back; this is why I like sitting in the lab with the PhD students and project staff rather than sitting in the office room. That is okay. I will come back after two minutes.
Speaker 2: uh
Speaker 1: Let's see.
Speaker 6: And that's OK.
Speaker 1: Where is the first error? Let us see that, because we should look at the first error, not the last one. Search for E-R-R. Yeah. Okay, is there any warning before that? I think I know. Okay, warning — no. Okay, I got the error. Okay.
Speaker 6: Only one error.
Speaker 1: Yeah, yeah. See, what it says is: "please download and install portaudio manually". For some reason, it is not doing it. One minute, let me see. "Could not find a tarball." Why could it not find it? The pa_stable version. "Trying to download it via wget." Do you have any web restriction on downloading, or something? No, right? No, sir. Okay. So we will do that. Now, this is "port audio", two words, okay? So what we will do — and you also do this, because you learn in the process — is go to the browser and search for it. What I will do is take that message, "install portaudio manually", that string. Copy it and paste it into the Google search. In fact, I normally paste the entire line, but that is okay. Let us see what comes up. "Install portaudio manually." Actually, I should add "Ubuntu", but that is fine.
Speaker 2: Manually — "install portaudio manually". "Cannot open" or something.
Speaker 4: Oh, you are not doing anything, sir? Only log file is visible to you.
Speaker 1: No, that's okay, I don't care. But I am also doing it here: Ubuntu — how to install portaudio in Ubuntu. I'm just looking at that. I got a Stack Overflow answer, and I'm going there. There is a code example also: sudo apt-something is there. Let me just — because I like the answers rather than all the explanations. I will copy the commands into the chat; that is what I found. There are two commands; the first two lines go on a single line. Oh, you may not — on the terminal, you have to type it in. Yeah, sudo, of course, sudo. Don't share the screen — sudo is where you have to put in the password.
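The commands pasted into the chat are not visible in the recording; a typical pair for this on Ubuntu (an assumption, not necessarily the exact Stack Overflow answer used here) is:

```bash
sudo apt-get install portaudio19-dev libasound2-dev   # PortAudio dev headers + ALSA
sudo apt-get install python3-pyaudio                  # PyAudio bindings, if needed
```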
Speaker 9: Yes, sir.
Speaker 1: See, that may be the reason for your audio problem. For audio, okay. One: PortAudio, PyAudio and so on. Did it work without any problem? The first one worked, sir. Okay, now type in the second one. Let us see whether the second line works.
Speaker 4: It only installed PyAudio, sir, 0.2.11.
Speaker 1: One minute. Now you can share your terminal so that I can see what happened.
Speaker 2: Stop the sharing and share again, share terminal again. OK.
Speaker 1: Installed successfully. Okay, yeah. Now let us run the make again. Go to the earlier make command — go up. Put an ampersand — no, no, not that one. This itself; at the end, put an "and" sign. You know, ampersand is Shift-7. Return. Now enter, enter. Now let us go to log.txt, because we are rerunning it. Share the gedit of log.txt with me. See, this was not there before. There was some problem; your audio had some problem, and that is probably why you couldn't speak.
Speaker 2: It's coming. Reload. Yeah, it will take some time. Hmm.
Speaker 1: Ah, "could not find". Okay, let's go down. That's okay. wget. Still — let us go down. Did it stop? Oh, it stopped. One minute.
Speaker 9: Hmm.
Speaker 1: What does PortAudio do? See, then we can reason about it that way. How do I check that it has worked correctly? "Set up PortAudio and PyAudio in Ubuntu 20.04" — how do I check that it has worked correctly? That's what I want to know. Hmm. I'm searching the web about PortAudio.
Speaker 2: Ah. OK. OK.
Speaker 1: Let us go to this place, because I see some commands for downloading. Let us see whether we can. pa_stable_v19, 2011.
Speaker 4: Is it there? Yes, but it did not get installed.
Speaker 1: Oh, but the latest version is slightly different. Yeah. There are some installation commands. Yeah, if you go — who is this? Victor. You are Victor, okay. There's a command on the second page: sudo apt-get install libasound-dev. Yes, sir. Yeah, execute that command. Let's see.
Speaker 9: OK.
Speaker 4: libasound2-dev is already the newest version.
Speaker 1: It already said that, okay. Now go to that — and this is step number 2. Can you see step number 2? portaudio.com/download.html. Let's go there and download it. Let's just wait. Let me also check, because this guy is looking for one particular version, and we are downloading some other version. portaudio.com/download.html — it's taking time to load.
Speaker 4: Website not available.
Speaker 1: What is not available?
Speaker 4: Website is not available.
Speaker 1: Yeah, yeah. It is extremely slow. One minute. Let me look at your error messages. pa_stable. Here is my suggestion: copy that line, line 92 in the log.txt — "download of pa_stable ... failed", the last one. 92, 92. Right now I can see it; the last-but-three lines, the last-but-two lines. Ah, yeah. Put it into Google and see what happens. Just that line, including "failed". Google will hopefully help. Did it give you something useful?
Speaker 4: No, sir. I can write like that.
Speaker 1: No, did the Google search give you any? No, copy and paste it on the Google search. It will give some answers. Yeah, take your time.
Speaker 4: Now this message is there.
Speaker 2: Hello? Sorry, go back, go back.
Speaker 1: The third result. Go back; there's a third choice. Kaldi — further down, further down. Ah, you see? Yeah, because we are doing this with Kaldi. Click on that; let's see what it says. Oh, no: they say run install_portaudio.sh — that is a shell script. No. Let's go back. Go back to the Google search. Let me see — huh? This one, sir? Yeah, we can try that. Yes, because that's also GitHub, GitLab. No, that's also a shell script, so that's no good; it will not help. Now, for some reason — yeah, see, the site is not up, as we also found when we tried, right? Therefore, make could not complete that step, and therefore it crashed. That is my contention, my guesswork. So the only option is to try it again tomorrow. That's one, okay?
Speaker 9: Yes, sir.
Speaker 1: But before that, right now, let's go to the terminal and check whether, by our good luck, it has been built already. Go to the terminal, share the terminal. There was something like ls fstbin/fsta*, that command. Yes, sir. Yes, let's try that; you never know our luck. Oh, it has not done it. Wait. Type ls and return; let me see what is there. Now, this is in fstbin — and what do we have: bin, gmmbin, gst-plugin. Yeah, so cd fstbin. Enter. ls, enter. Let us see. Just type make, M-A-K-E, enter. It is compiling. Let us hope the PortAudio problem is not serious here and it compiles. See, the make in the parent directory, which we were trying five minutes ago, got stuck because of the PortAudio problem, and because of that it never got to this directory. Hopefully this make will do our job. Let's complete this, and then we will go back to our as-num-reco-21 directory and run that bigram language model step where we got into the problem. We will see. So anyway, even if this runs, tomorrow you should still come back to this src directory and run the make, and hope that the PortAudio download goes through.
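Compactly, the repair being attempted here is:

```bash
cd fstbin    # the kaldi/src subdirectory that owns fstaddselfloops
make         # build just this directory, since the top-level make got stuck earlier
ls fsta*     # confirm the binary now exists
```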
Speaker 4: Yes, sir. Again, I'll do.
Speaker 1: Tomorrow that you can do, yeah.
Speaker 4: Tomorrow, yes, sir. After office hour, after 6 PM.
Speaker 5: Thank you.
Speaker 1: Let me see. Oh, it did it! Just type ls fsta*, return. Can you see fstaddselfloops in green? Yes. So it has compiled. Now let us hope this is good enough. Let us go to our directory: cd ../../egs, tab, 0-something, tab, slash AS, capital AS — let us go to the directory; tab will work, enter. Now we will run that command we had — up, up, up arrow. Make? No, no, not make. Up arrow. What were you running? You forgot now.
Speaker 4: The create bigram LM, sir.
Speaker 1: Ah, that was where we were getting the problem. Right, right. Let's run that. Create, yes, yes, that create script. Yeah. What, "command not found"? arpa2fst: command not found. Okay, one more. See, this is the fallout of the PortAudio problem, can you see? arpa2fst. So let me just see: where is arpa2fst? We are now doing a quick fix.
Speaker 2: Where is arpa2fst? I know it is somewhere.
Speaker 1: It's in lmbin. Yeah: cd -, enter. cd .., enter. ls lmbin — better. See that. Okay, cd lmbin. We are doing make: type make, enter. See, all of this should have been built, but because of that PortAudio failure the top-level make never got here. We could have commented that step out, but never mind; let's just do it. Okay, now ls, enter — arpa2fst is there. cd .., enter. ls — now I want to check what other makefiles are there. Let us see.
Speaker 2: I am looking at the various directories — fstbin, hmm, right here.
Speaker 1: cd, well, wait, wait, wait. A cd space lm, enter, ls, is there a make file here? No make, yeah, there is a make file here. And has it compiled? It has not compiled, yeah. Just type in make, M-A-K-E, enter. Nothing to be done, OK. There, OK. Let us go back to our AS numerical, a cd space tilde slash Kaldi, tilde slash Kaldi, right? Yeah, Kaldi, I think. Kaldi, EGS, slash EGS, OK, enter. Now let's run the create language model.
Speaker 5: Oh, yeah.
Speaker 1: OK, Shantanu, some other time we'll meet. And of course, we'll continue this.
Speaker 4: Bye. Bye, sir. Good night. Thank you, sir.
Speaker 6: Thank you.