Understanding Checkpointing in TTS Model Training
Learn how to manage TTS model training process interruptions and optimize configurations using checkpoints. Explore restoration and continuation techniques.
Explaining checkpoints and restore/continue possibilities with Coqui TTS.
Added on 01/29/2025

Speaker 1: Hello. First of all, I would like to say thank you for your comments and your feedback on Twitter and on YouTube. It might take some time, but I am reading all of your comments and thinking about them. So if you have suggestions for videos you would like to see or topics I should talk about, feel free to comment, and I will consider doing them. That is the case today: one of my followers on YouTube requested information about checkpoints and about restoring or continuing a TTS model training, and that is what today's video is about.

First of all, most of you will probably know that training a text-to-speech model is a long-running process. Depending on your hardware, it might take hours, days, or, in my case, weeks. So what is the problem with restoring or continuing a training? Before we can think about continuing a training, we have to accept that there are situations where our training process gets stopped, and in my honest opinion there are two main categories.

The first is that training is stopped by accident: hitting Ctrl-Z in the wrong terminal tab, closing the wrong terminal tab and killing the training process, operating system failures, power failures, or, if you use a Google Colab notebook, the regular disconnects that happen by default. All of these can cause your training process to stop accidentally.

The second category is stopping the training by choice. Why would you stop your training process by choice? Let's say you set up your configuration for TTS training, you think everything is working well, you start the training, listen to the synthesized audio test samples, look at the graphs in TensorBoard, and everything looks good. Then, after thousands of training steps, you might hear from the audio samples or see in the TensorBoard graphs that the training is no longer going in the right direction. Maybe you trained 50,000 steps that were good, and then, through gradual training adjustments or whatever else, you see that the training is not going that well anymore. So you stop the training by choice, adjust the configuration, and start training again from the latest checkpoint, or, if you see that things went wrong some 10,000 steps earlier, you restore the training from an earlier checkpoint with a better-adjusted configuration.

So these are reasons why your training might be stopped, and there are two options: continue the training, typically if it was stopped by accident, or restore the training from a chosen checkpoint with an adjusted configuration. I would like to show you both on screen now, so let's go into it.

As you can see, I am connected to my training machine and have activated my Python virtual environment, and I am running Coqui's latest stable version 0.5.0 of TTS. I am inside the recipes folder, in my new, work-in-progress neutral dataset, with the training experiments for Tacotron2-DDC model training. Let's first take a look inside this folder. As you can see, I have two training directories, two trainings done in the past: one started on January 11th and one started on January 16th. Let's first look at the directory from the 11th, the first training run. Here you can see checkpoints starting with 10,000, 20,000, 30,000, and so on.
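For orientation, a training run's output folder might look roughly like this. This is only an illustrative sketch; the run name, step numbers, and file extensions are examples and can differ between Coqui TTS versions.

    # Illustrative listing of one training run's output folder
    # (names and extensions are examples, not taken from the video).
    $ ls recipes/my-dataset/tacotron2-DDC/run-January-11-2022/
    config.json
    checkpoint_10000.pth.tar
    checkpoint_20000.pth.tar
    checkpoint_30000.pth.tar
    best_model.pth.tar
    events.out.tfevents.*        # TensorBoard event files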
Coqui writes a checkpoint every 10,000 training steps by default, and you can use any of these checkpoint model files together with the config.json to synthesize speech. Since the synthesized audio generally gets better with more training, you will usually have better results with checkpoint files with higher numbers, meaning more training steps already done. So that is the output from January 11th.

Let's take a look at the next training, from January 16th. As you can see here, the checkpoints do not start with 10,000, 20,000, 30,000, and so on; the first checkpoint is at 130,000 steps. So obviously it was not trained from scratch: it is a new directory, and it is a restore training. I personally did that because I changed the configuration, in this case from phoneme-based training to character-based training. What I like about restore training is that you get a config.json file in every training experiment folder, which is really great, because you can do a simple comparison of what changed from one training run to another. As you probably know, TTS model training offers lots of configuration parameters and possibilities, so if you run several trainings, it can be a little confusing which configuration belongs to which run. In this case, I can simply diff one config against the other, and we can see that the description changed, character-based here versus phoneme-based initially, and use_phonemes went from true to false. So you can make an easy comparison. But that is just what I did in the past.

Let's say I would like to continue my training from the latest checkpoint, 130k... well, 160k, that is the last one. So I will switch to the output directory of the training run I would like to continue. I would recommend continuing a training when you do not want to change the configuration and the training was stopped by accident. Let's go into that directory and take a look: the latest checkpoint is 160k, and there is a best model from 141,000 steps. To continue training, just run Python against train_tts.py in the TTS/bin folder of your Coqui TTS root directory. Let's start with continue training: the command line argument is --continue_path, and as I am already in that directory, I can simply pass a dot; alternatively you can give that argument the fully qualified directory name, not a file name but the directory. Let's do that and see what happens.

It is starting, and what you can see now is this line: restoring from checkpoint 160,000. I did not name that specific checkpoint file, so the trainer looks in the directory and identifies that this one is the latest checkpoint. I will cancel it; I do not want to continue training for real, it is just to show you how it works. In addition to the latest checkpoint, you can see that the best losses are taken from the best model, which is the one at 141k. That is it for continuing training. If you keep that continued training process running, it will add more checkpoint files to the existing directory; it will not create a new directory structure.
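As a quick sketch of that continue workflow, assuming the Coqui TTS repository is checked out at ~/TTS and using an illustrative run directory name:

    # Continue an interrupted run in place; --continue_path takes the run
    # directory, and the trainer picks its newest checkpoint automatically.
    cd recipes/my-dataset/tacotron2-DDC/run-January-16-2022/
    python3 ~/TTS/TTS/bin/train_tts.py --continue_path .

    # Compare the configurations of two runs:
    diff ../run-January-11-2022/config.json ./config.json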
Now let's switch to the restore training path. Let's say I would like to continue my training from an earlier step, for example 80,000 steps, and in addition I would like to change the configuration. So I go to the output experiment folder that contains the checkpoint I would like to restore from, in this case the 80k one. Here it is. Now I will run the same again... no, let's change the configuration first. I can take this config as the initial version, so let's copy this config.json to a temporary directory and edit it there. For example, let's search for "best": let's change keep_all_best, which is set to false by default, meaning only the latest best model is kept in the output folder. I would like to keep all best models, so I change that to true. Now I have copied the config.json from my training experiment folder to a temporary directory and edited it in that temp folder, and I would like to restore training from the 80k checkpoint, which is not the latest one in this directory, with the adjusted configuration file.

So let's run Python 3. What I will do is run it two or three times without the correct command line arguments, just to show you which exceptions you might see if a command line argument is missing. So TTS/bin/train_tts.py, and the option is --restore_path. I will give it just the path, not the checkpoint file itself, and no configuration file. As I am in the right folder, I will just pass this directory as the path. I will probably get an error about reading the configuration, so let's see. And here we are: we have problems reading the config. So what we need in addition is to set --config_path, and here I would like to give the new, adjusted and hopefully improved config.json, so let's take this one. But I still have not told the command line that I would like to restore from checkpoint 80,000; what I entered is just the directory. Let's run it again and see what happens. Training stops because it is a directory: restore training does not accept a directory, it wants the specific checkpoint file to restore from. So let's go back to --restore_path and pass the 80k checkpoint file, and run it again. If things go right, training should now restore from the 80,000-step checkpoint with the adjusted configuration, so let's hope for the best. As you can see, the process is running and we are restoring from the checkpoint at 80,000 steps. So far, so good.

Let's check whether a new output directory was created, because with the restore option a new output directory should be created. Let's take a look: we have the previous experiments, and we have a new folder from today, so let's go into that folder. The first checkpoint in it is that 80k checkpoint, taken over from the original, previous training. Now let's make a comparison: let's diff the config.json here against the version from January 11th, where we took the original config from. Let's check config.json... there is no difference. Why is there no config.json difference? Let's take a look: keep_all_best is true, check this one... okay, got it, got it. Sorry, my mistake: because I ran training earlier with --config_path set but without the checkpoint file, the original config file was overwritten. So, to show you what not to do, I made this mistake. But if you call train_tts.py correctly, with --restore_path pointing at the checkpoint file and --config_path pointing at your config.json, you will probably not run into that issue.
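Here is a compact sketch of that restore workflow. The paths, run names, and the checkpoint filename are illustrative; keep_all_best is the config option discussed above.

    # Copy the original config somewhere safe and edit the copy there,
    # e.g. set "keep_all_best": true.
    mkdir -p /tmp/restore-config
    cp run-January-11-2022/config.json /tmp/restore-config/config.json

    # Restore from a specific checkpoint FILE (not the directory) and use
    # the edited config; a new output folder is created for this run.
    python3 ~/TTS/TTS/bin/train_tts.py \
        --restore_path run-January-11-2022/checkpoint_80000.pth.tar \
        --config_path /tmp/restore-config/config.json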
So that is it on checkpoints, --restore_path and --continue_path. Let's go back to the front camera. That was the video on checkpoints and the restore and continue training options. I hope you liked it; if so, feel free to share it with the community, and in any case I am happy if I could help you personally. Thank you a lot, and we will see and hear each other next time. Bye.
