Speaker 1: Hello, this is Daniel Povey; yesterday he mentioned this script, and today he's going to be taking us through it.

Okay, so this is the main neural net training script in Kaldi's LibriSpeech recipe. It's called at the end of run.sh, and you have to run most of the stages in run.sh before you can get to it, because you need the alignments and so on. Now, before you run this script you have to make sure you have a GPU, or preferably more than one GPU, on your machine, and they have to be NVIDIA GPUs. They should have enough memory, at least a few gigabytes. And Kaldi has to be compiled for GPU, which means that when you configure Kaldi you have to have nvcc, the NVIDIA CUDA compiler, on your path. Anyway, it'll tell you if you got that wrong; it'll give you a warning or something.

Okay, let's scroll down a bit. Go down. Further down. Okay, this is all comments still... yeah, okay, here. This is the start of the actual script. Sorry, a bit up. So the first part, well, the first significant part apart from setting variables, is local/nnet3/run_ivector_common.sh. In Kaldi's neural net recipes we use i-vectors for a basic form of speaker adaptation; it's an extra input, alongside the MFCCs. I believe this trains an i-vector extractor, which is kind of a speaker-ID system, but it's a bunch of Gaussians and matrices and stuff; it's a way to get these i-vectors. So we run that. Let's go down a bit.

You don't have to change these directories as long as you ran run.sh, but they can be changed. If you want to use, for instance, a different base system, some of these things have to match each other; for instance, I think the alignment directory has to come from the same GMM as the GMM directory there, and things like that. GMM is Gaussian mixture model.

Okay, so run_chain_common.sh: this is a bunch of stuff that's common to most of the recipes in this directory, local/chain. It's things like getting alignments; building a simple left-biphone tree, which is what we use in these so-called chain systems, meaning lattice-free MMI; aligning the data with it; and dumping higher-dimensional MFCCs, because we use, I think, 13-dimensional MFCCs for the GMM systems, but we go up to 40 for these neural net systems. I believe that's one of the things it does; it's done either there or in run_ivector_common.sh.

Okay, so this is stage 14. This is the main part, where we get more specifically into the script. We set a bunch of variables, various options to configure the neural network, and this part of the script creates the network config, which is the kind of config file that specifies the network topology. So, these TDNN-F layers: this is a kind of factorized TDNN, and a TDNN is essentially the same as a 1-D convolution. It has some similarities to ResNet, but it's a slightly different topology. The feed-through dimension of this thing is 1536, and in each layer there's a small bottleneck that has a nonlinearity. It's a little bit different from conventional systems, which have a fairly small dimension that goes through and then a wide bottleneck; this just turned out to work better when we were tuning it. And then time-stride: you'll see time-stride equal to 1 and to 3.
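To make that a bit more concrete, here is a rough sketch of the kind of xconfig block that stage 14 writes out. This is illustrative only, not copied from the recipe; the layer names, dimensions, and the shell variables ($dir, $num_targets, $learning_rate_factor) are assumptions standing in for whatever the actual script defines.

```bash
# Illustrative sketch of the network config written in stage 14 (not the
# literal recipe text; names and dimensions are only meant to be plausible).
mkdir -p $dir/configs
cat <<EOF > $dir/configs/network.xconfig
  input dim=100 name=ivector                  # the i-vector input (speaker adaptation)
  input dim=40 name=input                     # high-resolution MFCC input

  # Factorized TDNN (TDNN-F) layers: a wide 1536-dim feed-through dimension
  # with a small linear bottleneck inside each layer.
  relu-batchnorm-dropout-layer name=tdnn1 dim=1536
  tdnnf-layer name=tdnnf2 l2-regularize=0.008 dim=1536 bottleneck-dim=160 time-stride=1
  tdnnf-layer name=tdnnf3 l2-regularize=0.008 dim=1536 bottleneck-dim=160 time-stride=1
  # ...later layers use time-stride=3, i.e. they splice frames three apart:
  tdnnf-layer name=tdnnf13 l2-regularize=0.008 dim=1536 bottleneck-dim=160 time-stride=3

  # Two outputs: one for the lattice-free MMI ("chain") objective and one for
  # the cross-entropy smoothing objective.
  output-layer name=output include-log-softmax=false dim=$num_targets
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor
EOF
```

In the actual recipe, per-layer options such as l2-regularize come from shell variables (like the tdnnf_opts variable mentioned further down) set just above this block.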
The time-stride refers to, kind of, whether we're doing it at 100 frames per second or at 33 frames per second. In something like Torch this would be done differently; there's no notion of time-stride in Torch, because when you subsample in Torch or TensorFlow you do it explicitly, you call something that subsamples the data. Kaldi's neural net tools do the subsampling implicitly: because you're asking for the neural net output once every three frames, and all of those last layers have a time-stride of three, it figures out that it only needs one in every three frames, and it just computes those. So it's a slightly different approach to specifying neural networks.

Anyway, that's the core of the network, the TDNN-F things. We have a couple of different outputs: an output layer with name=output, and one with name=output-xent, meaning cross-entropy. The output is for the lattice-free MMI objective, which is what we call chain. And output-xent is for smoothing with maximum likelihood over an alignment. The alignment is actually derived from the first output, the lattice-free MMI one, although it is constrained by the lattices that you originally dumped, so it doesn't deviate more than a few frames to the left or right from your training-data alignments.

Okay, let's go down a little bit. Okay, xconfig_to_configs.py: that takes this xconfig file and turns it into a format that the Kaldi binaries can directly use; it kind of expands it into components. As I said, Kaldi's neural net tools work quite differently from things like Torch. I'm not saying it's a better design, it's just different.

Okay, so stage 15 is train.py. This is the main training script. The main thing that you might have to change if you run this is num-jobs-initial and num-jobs-final, because this num-jobs is the number of GPUs that you're using. By default it starts at a certain number, let's say three, and it goes up to 16. The idea is that you don't need so many GPUs at the start, because it doesn't actually help you that much in speed; early on you're limited more by the probability of divergence than you are by, like, data noise. Now, this isn't always that well suited to your system. The way the system used to work at Hopkins, we were running the individual phases of training as separate jobs on a queue; but if you just log into a machine and run a certain number of jobs, you probably want that number fixed. So what I'm getting to is you probably want to set num-jobs-initial and num-jobs-final to be the same, and it could be four or eight or however many GPUs you have on your machine. It'll take more iterations, and a bit longer, to use fewer GPUs, but it shouldn't be too bad speed-wise. I don't know, maybe it'll take 12 or 24 hours to train this thing; that's very rough, though.

Okay. Now, before you run this thing, there is something that you have to watch out for, which is this: the Kaldi binaries really expect the GPUs to be in exclusive mode. You can set the GPUs to exclusive mode with nvidia-smi -c; I believe the mode you want is 3, exclusive process. Anyway, if you get this wrong, the logs from these programs will tell you what to do.
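To make those two points concrete, here is a rough sketch; the option names follow the standard chain recipe as far as I recall, but the values and directory names are placeholders rather than the recipe's.

```bash
# Put each GPU into exclusive-process compute mode (needs root; 3 = EXCLUSIVE_PROCESS).
sudo nvidia-smi -c 3

# The options discussed above, as they are passed to the chain training script.
# num-jobs-initial and num-jobs-final are set equal here, for a single machine
# with four GPUs; the directories are placeholders for the recipe's variables.
steps/nnet3/chain/train.py \
  --cmd "run.pl" \
  --trainer.optimization.num-jobs-initial 4 \
  --trainer.optimization.num-jobs-final 4 \
  --trainer.optimization.initial-effective-lrate 0.00015 \
  --trainer.optimization.final-effective-lrate 0.000015 \
  --feat-dir data/train_960_hires \
  --tree-dir exp/chain/tree_sp \
  --lat-dir exp/chain/tri6b_lats \
  --dir exp/chain/tdnn_1d_sp
  # ...plus the other --chain.*, --egs.* and --trainer.* options from the recipe.
```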
Now, there is a fix if you're not able to set the GPUs to exclusive mode because you don't have root. The problem that not being in exclusive mode causes is that the jobs won't know which GPU to grab: by default they'll select the GPU that has the most memory free, but if two jobs launch at the same time, they'll end up using the same GPU and then one of them will fail with out-of-memory. This is a little bit tricky to solve if you don't have root; there are ways you can introduce random delays into the script.

So, the way Kaldi works, this option on line 264, --cmd "$decode_cmd": normally you'd set that to run.pl or queue.pl or something like that, probably with options, and you can find examples in run.sh or in cmd.sh. It's basically a kind of wrapper script that submits your job using a standardized interface, so that Kaldi doesn't have to care about whether you're using Grid Engine, or just running on the command line, or Slurm, or whatever you're using. So, anyway, it might be possible to introduce a random delay where the jobs get launched, and I think people have found ways to fix this.

Anyway, let me see if there are any other interesting options. If we were tuning this script, probably the main things we'd try to tune are the initial-effective-lrate and the final-effective-lrate. This is an exponentially decreasing learning-rate schedule, and generally we find that a factor of 10 between the initial and final learning rates tends to work best. Personally, though, if I wanted to try to make it train faster, I might try to increase the L2, which we specified above when we created the network config. Can we go up again? A bit further; there are some variables a bit higher, like tdnnf_opts, which has the l2 setting... sorry, it's at the beginning of stage 14. Yeah, here we go.

So, l2-regularize=0.008. That's parameter decay on every step; it's decreasing the parameters, which is also called weight decay. That, times the actual learning rate, is what really sets the effective learning rate, because if the parameters are smaller and you change them by a certain amount, the relative change is larger. And because there are, I believe, batch-norm layers in this topology, inside the TDNN-F layer, it doesn't matter that the activations get smaller, because they'll just be normalized.

Okay, let's go back down to stage 15. Yeah, that stuff about hostname equals CLSP and so on, you can pretty much ignore; that's for splitting storage across multiple disks. These days you probably have a fast individual disk; with solid-state drives, disks are getting quite fast, so it may not be necessary to stripe across disks like that. Okay, let me see if there's anything else... most of the other things in this script you probably won't want to tune.

Okay, let's go down to stage 16. There's mkgraph.sh. This is creating a decoding graph, a finite-state transducer that encodes the language model information together with the dictionary. Stage 17: this is setting off a bunch of decoding jobs in the background. Basically, it's decoding one of the test sets, and then it's doing various LM rescoring steps.
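Just to make that concrete, here is roughly what those two stages boil down to. This is a sketch from memory rather than a copy of the recipe, so the directory names, the test set, and the job options are placeholders.

```bash
# Stage 16: build the decoding graph, an FST that combines the language model,
# the lexicon, and the model's tree/transition structure.
utils/mkgraph.sh --self-loop-scale 1.0 \
  data/lang_test_tgsmall exp/chain/tdnn_1d_sp exp/chain/tdnn_1d_sp/graph_tgsmall

# Stage 17: decode a test set with the chain model (acoustic scale 1.0 is
# standard for chain models), then rescore with a bigger language model.
steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
  --nj 20 --cmd "run.pl" \
  --online-ivector-dir exp/nnet3/ivectors_test_clean_hires \
  exp/chain/tdnn_1d_sp/graph_tgsmall data/test_clean_hires \
  exp/chain/tdnn_1d_sp/decode_test_clean_tgsmall

steps/lmrescore_const_arpa.sh --cmd "run.pl" \
  data/lang_test_{tgsmall,fglarge} data/test_clean_hires \
  exp/chain/tdnn_1d_sp/decode_test_clean_{tgsmall,fglarge}
```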
From stage 18 onward you can probably ignore it; that part is about online decoding. It's kind of a demo geared towards people who need to build real-time applications. Online decoding is a kind of decoding that simulates what you would do if you had a real-time problem, where you needed to recognize things as they were coming in.

Okay, let's go to the top, the very top of the script. Right at the top. There may be some kind of delay. Yeah, here we go. So these are the word error rates. We haven't edited this for a few years because we haven't really changed the system. Unfortunately, it's been hard inside the Kaldi framework to implement some of the new things people are doing, like transformers, and I've moved on to so-called next-gen Kaldi, like if you look at icefall and k2; I'm mostly working on that now. But these are the results we have: 3.3 on test-clean is our best result, I believe, that left column. Oh, sorry, 3.29 with this system on test-clean. Oh, sorry, no, that's dev. Let me look for test. Test is 3.8. So that's not a great number. These days, in the last few years, people have really improved on LibriSpeech; we're down to even like 1.9.

Now, the only caveat is this. You may think that, okay, modern end-to-end systems are doing so much better than Kaldi; they're getting half the word error rate on LibriSpeech test-clean, and that is great. But there are some caveats. Basically, if you want to build a product, it's still a lot easier to do so with this kind of system, because it's naturally real-time and also because it has a separate language model. So if you train your language model on a new type of data, you can easily recognize new data. Whereas these end-to-end systems generally don't have language models that are separate, and it's actually really hard to combine them with language models. So if you have a new customer and you want to train a language model on their data, it's actually really hard to do that with an end-to-end system. And things like adding words are also difficult. I mean, end-to-end systems may not have a vocabulary as such; they may have these word pieces or something. But still, adding new words is hard, because the system kind of learns implicitly, in a hard-to-debug way, what words there are. Unless you just add to the data and retrain or fine-tune, it's really hard to quickly adapt your model. But anyway, it is what it is. Maybe that's all for today. Thank you. Bye. Bye.