Exploring LibriSpeech run.sh: Stages Breakdown and Insights
Delve into the stages of the LibriSpeech run.sh script with Daniel Povey for insights into data downloading, preparation, and neural net training processes.
Dan Kaldi 10 LibriSpeech run.sh explained
Added on 01/29/2025

Speaker 1: Hello, today we're with Daniel Povey, and we're going to ask him to give us a quick overview of what all of the stages in the run.sh script do. Okay, so this is the LibriSpeech run.sh. If you're trying to run this and you're a beginner, you might want to try the mini_librispeech run.sh instead, because it's much faster; this one requires you to download a lot of data. The bits at the top are just setting some directories for where you're going to store the data, or where you have already stored the data. Then look at the variable stage=1; that's a bash shell variable. If you've already run some stages, you might want to increase it. You can also do run.sh --stage 1, which sets the stage from the command line, and then it'll start from that point.

So the first stage is downloading the data; you need a good web connection for this. Once it's downloaded, stage two basically prepares the data in Kaldi's directory format, the format Kaldi can understand. LibriSpeech, like all datasets, has its own way of formatting the data, and this stage puts it into a normalized form for Kaldi. Stage three, local/prepare_dict: it prepares a dictionary, which is a mapping from words to sequences of phones, so it uses a pronunciation dictionary. This is the older way people used to do speech recognition. These days we often don't use dictionaries, at least for English; we automatically break the words into pieces and just have little pieces of several characters be the things that we recognize. But this is the traditional approach. Okay, so: prepare the dictionary, then prepare_lang. That's preparing the lang directory, which is not just the language model; the lang directory is Kaldi's way of putting in one place things like the phone lists, the word lists, the language model, and various other information about the phones, such as which ones are the silence phones.

Stage four creates a language model. We're not going to need this language model until quite a bit later, because it's for language model rescoring with a big language model, the four-gram, which is quite a big N-gram language model. Stage five doesn't do much; it just makes some directories. Stage six is when it computes the MFCC features. MFCCs are, well, you can look them up; we dump them to disk in a compressed format. Stage seven creates some smaller data directories. This is very fast, it's just messing with some file lists on disk, because we use subsets of the data for the early stages. Stage eight trains a very simple system to start the alignment process. This is the kind of system people used to use, I guess, 30 or 40 years ago: it has just monophones, so we don't model phones in context; each phone is just one phone, so there are like 40 phones. It's a GMM-HMM system, which again is a technology from 20 or 30, maybe even more than 30 years ago now. In stage nine, we use that bootstrapped monophone system to align the training data; that will enable us to do the next stage much faster. Next, we train a better system with delta and delta-delta features. Actually, the monophone system also had delta and delta-delta features; the difference here is that we're using context dependency, triphone context.
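For readers following along in the recipe, here is a rough sketch of resuming run.sh from a later stage and of the Kaldi-format data directory that the preparation stage produces. The directory name and the commands shown are illustrative assumptions about the usual recipe layout, not commands quoted from the video:

    ./run.sh --stage 6                        # skip stages you have already completed

    head -n 2 data/train_clean_100/wav.scp    # utterance-id -> path or command that produces the audio
    head -n 2 data/train_clean_100/text       # utterance-id -> word-level transcript
    head -n 2 data/train_clean_100/utt2spk    # utterance-id -> speaker-id
    utils/validate_data_dir.sh --no-feats data/train_clean_100   # Kaldi's own consistency check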
We group phones, and we model them differently depending on what the previous and next phone were, because people pronounce phones differently depending on the context. Next, we train an LDA plus MLLT system; LDA and MLLT are two different linear transforms of the features, the MFCC features. Next, we train a SAT system. This is all about speaker adaptation. We don't do a lot of speaker adaptation these days, because deep neural nets don't benefit much from it, but in the GMM days we did a lot of speaker adaptation. The thing that you're probably mostly interested in is the very last stage of this script, but don't go there just yet. In the end we're going to train a neural net, but we do all of these stages beforehand to align the data and prepare it for the neural net training. It makes it work a little bit better than training a neural net from scratch. It definitely doesn't justify the complexity, but we had already built all of this stuff, because that's how we used to do speech recognition; that's why it's all there.

Okay, so stage 12, train SAT. That's training a speaker-adapted system where there's a linear transform of the features for each speaker. Stage 13 is about improving the dictionary. Some words have more than one pronunciation in the dictionary, and this stage is all about estimating probabilities for the different pronunciations; "the", say, has more than one pronunciation, and the pronunciations have different probabilities. Again, this is one of those things that gives us a quite small improvement, probably like 0.1 or 0.2 percent absolute, but we developed it before, so we still do it. Okay, stage 14 is another step of aligning the data; we align 100 hours of the data this time. This stage also seems to train a small neural net system; to be honest, that's a historical system we don't use anymore, and I see now that stage 14 is not done by default, so we can ignore it.

Okay, stage 15. We download some more data at this point, because LibriSpeech has a 100-hour subset, a 360-hour subset, and a 500-hour subset. So we're just making sure the 360-hour subset is downloaded, and we train a larger system with that. Again, some of these stages are slight overkill, but we just kept adding to the script; if we were designing this today, we probably wouldn't do so many separate stages. Okay, stage 16 is another stage of training with the entire data. Stage 17, we download the rest of the data, the 500 hours, if it's not downloaded already, and we make sure that we have features on it. At stage 18 there's a quick training of the GMM system on all of the 1,000 hours. The reason we didn't do this before is that after a few hundred hours, GMM-based systems don't really benefit much; they won't get very much better. But it does make a big difference for the deep neural nets; they are very data-hungry. In fact, these days even 1,000 hours is considered a relatively small dataset; people are working with 5,000, 10,000, even tens of thousands of hours.

Okay, stage 19 does some data cleaning. Now, this actually doesn't give much improvement on LibriSpeech, so it's probably not super necessary. We could maybe just have skipped this stage, and then we'd have to modify run_tdnn.sh to use the older, non-cleaned-up data directories. This data cleanup can potentially be useful if your data is very dirty, but LibriSpeech is very clean data; the transcripts and the audio match exactly.
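As a rough sketch of the align-then-train pattern these GMM stages follow, using Kaldi's standard steps/ scripts; the leaf and Gaussian counts and the directory names below are illustrative assumptions rather than the exact values in the recipe:

    steps/align_si.sh --nj 10 --cmd "$train_cmd" \
      data/train_clean_100 data/lang exp/tri1 exp/tri1_ali       # align with the previous model
    steps/train_lda_mllt.sh --cmd "$train_cmd" 2500 15000 \
      data/train_clean_100 data/lang exp/tri1_ali exp/tri2b      # LDA+MLLT feature transforms
    steps/align_si.sh --nj 10 --cmd "$train_cmd" --use-graphs true \
      data/train_clean_100 data/lang exp/tri2b exp/tri2b_ali
    steps/train_sat.sh --cmd "$train_cmd" 2500 15000 \
      data/train_clean_100 data/lang exp/tri2b_ali exp/tri3b     # speaker-adapted (fMLLR) training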
So we don't end up removing very much in the data cleanup script. Okay, so stage 20 is probably the most interesting stage. It has the most modern system, although it's not 100% modern; maybe five years ago it was totally state-of-the-art. These days it's a practical system, because you can build real-time recognition on it, which you can't easily do with a lot of the latest systems. So what it is, it's a neural net built from TDNN layers, which are the same thing as one-dimensional convolutions, and it's a special variant called the factored TDNN. It's just a particular neural net topology. Anyway, maybe another time we can go and look inside that run_tdnn.sh and talk about the different stages in there. But that's enough for this video, I think. Thank you. Bye. Bye.
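For reference, a hedged sketch of how that final stage is typically invoked and how results are usually summarized afterwards; the script path and experiment directory names are assumptions about the usual LibriSpeech recipe layout, not taken from the video:

    local/chain/run_tdnn.sh    # trains the factored TDNN ("chain") model on the full, cleaned data

    # print the best word error rate from every decode directory, a common Kaldi idiom:
    for d in exp/*/decode* exp/chain*/*/decode*; do
      [ -d "$d" ] && grep WER "$d"/wer_* 2>/dev/null | utils/best_wer.sh
    done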
