Improving VITS Script and Voice Model Fine-Tuning
Explore methods for enhancing VITS model fine-tuning, focusing on voice restoration, multistage training strategies, and retraining the text encoder and duration predictor.
YourTTS Training Discussion Experiences, Multistage Training, Demos, Prior Training Preservation

Speaker 1: Hey again everyone, back a little sooner than expected with this one. I'm going to take a break from obsessing over training this model after this video because I have a couple of other projects I want to get on with, but I thought it might be worthwhile sharing some of this. I will probably try to improve the VITS fine-tuning script I posted a few weeks back and do a quick video covering any major changes soon. If you want a walkthrough of the Colab script used here, check the link down in the description to the video breakdown. In the description you'll also find links to the standalone scripts used to run the dataset maker, the transcription, and the fine-tuning on your own system, if you have a beefy GPU with 12GB of VRAM. The scripts are a mess, and you'll need to edit all the path names to be correct for your system and change the variable names.

None of what I'm discussing here should be taken as fact, best practice, or even effective. I am more or less throwing things at the wall and seeing what sticks. That said, I want to thank Khalid again for the comments, feedback, and help. It looks like we've been trying a few of the same things and a few different ones, so thank you for all the information.

If any of you are wondering why I'm not using public datasets for these tests, here's why: I like driving myself insane. Sort of joking. My primary area of interest is voice restoration, so I don't expect, or even aim, to get CD-quality audio out of these, because I'm not starting with high-fidelity samples. I'm basically doing as much as I can to break things and try to discern where the boundaries of usability are. LJ Speech, VCTK, and numerous others are high-quality, well-annotated datasets that can be used for training experiments, and you'll get great results with great input. My input, when I'm not experimenting, will never be high quality, and I want to see how the model output is affected. For example, VITS does an amazing job of recreating the sound of old radio recordings when fed enough noisy samples for the model to pick up the microphone amplifier hiss, some of the tape crackle, and all those powerful popping plosives from a lack of compression.

The first little tweak I want to cover is how to approach something closer to fine-tuning by not destroying the entirety of what's already in the model. In the first part of the script, find where the empty list called d_vector_files is created. After that, append the original model's speakers.json to it. This allows the speaker embeddings for the model's existing voices to be included, hopefully keeping some of the original voice quality; to what degree, I have no idea. The output quality of those voices is still affected by the other components of the model, but this gives you access to the data trained with the original speaker IDs without needing the original dataset. It won't train the text encoder, because there's no text, so you'll need to supply other high-quality datasets to rebuild that, and text-encoder quality seems to be affected a lot by poor transcription.
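To make that concrete, here's a minimal sketch of the tweak, assuming your script follows the layout of the Coqui YourTTS fine-tuning recipe; the list name and the path are placeholders, so match them to your own copy:

```python
# Minimal sketch of the speakers.json tweak, assuming the fine-tuning script
# builds a list of d-vector/speaker-embedding files the way the Coqui YourTTS
# recipe does. Names and paths are placeholders; match them to your script.

d_vector_files = []  # the script normally fills this with embedding files for your new datasets

# ...the script appends the embedding file computed for each new dataset here...

# Tweak: also append the speakers.json that ships with the original checkpoint,
# so the embeddings for the pretrained voices stay available during fine-tuning.
d_vector_files.append("/path/to/original_model/speakers.json")  # placeholder path

# The list is later handed to the model args, e.g.:
# model_args = VitsArgs(..., d_vector_file=d_vector_files, use_d_vector_file=True, ...)
```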
The VitsArgs section is where we can twiddle some more knobs. There are a handful of variables there that control how training is done by freezing various parts of the model and allowing others to train. When fine-tuning with the method here, it looks like the text encoder gets obliterated, so we need to retrain it.

At early step counts, the model goes from nonsense to relatively well-structured babble quickly, but the output fidelity is still pretty bad; from guttural noises to robotic baby talk is how I would describe it. If placed in the VitsArgs section, these two lines will reinitialize, that is, recreate, the text encoder and duration predictor: reinit_text_encoder = True and reinit_DP = True. You can do this with either a continue or a restore run, but only do it if you need to begin training these modules from fresh. If you're satisfied with the pronunciation and pauses, you can freeze the text encoder and duration predictor instead; don't leave the reinit lines in your config if you're not reinitializing the modules. Freeze the text encoder and duration predictor with freeze_encoder = True and freeze_DP = True. Here's a voice developing as the text encoder trains, so you can hear how the pronunciation develops and then degrades.
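In code, that works out to something like the following; the field names are from the Coqui TTS VitsArgs dataclass as I understand it, so check them against the version you have installed:

```python
from TTS.tts.models.vits import VitsArgs

# The two reinit flags mentioned above, plus the matching freeze flags.
# Only set the reinit flags on a run where you actually want those modules
# to start over from fresh weights.
model_args = VitsArgs(
    reinit_text_encoder=True,  # recreate the text encoder with fresh weights
    reinit_DP=True,            # recreate the duration predictor with fresh weights
)

# Once pronunciation and pacing sound right, drop the reinit flags and
# freeze the two modules instead on the next run:
# model_args = VitsArgs(freeze_encoder=True, freeze_DP=True)
```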

Speaker 2: If you're irritated by repeated sound effects, skip the next couple minutes.

Speaker 1: Here's how one of the voices develops as the encoder trains. I'm going to go through the samples to find one with good pacing and pronunciation. I don't want to just assume that the latest checkpoint has the best attributes, so I'm going to choose one around 13,000 steps here to use for future training runs.

Speaker 3: I'm going to go through the samples to find one around 13,000 steps here to use for future training runs. [The same test line repeats across checkpoints as the voice develops.]

Speaker 1: Continuing training doesn't really help the baby-talk issue. Syllables are still misplaced and phonemes are flipped, and as training continues this often gets worse; sometimes entire sounds are dropped. The text encoder seems to train at a different rate than the rest of the components: it takes very few steps and appears to overfit before the rest of the model catches up in terms of quality.

If you're getting issues with flipped phoneme sounds, you can improve things a lot by unfreezing the text encoder and duration predictor, setting detach_dp_input to False, and then training until it sounds good. Then, in the next restore session, freeze the text encoder and duration predictor and set detach_dp_input back to True. Make sure you don't reinitialize the text encoder or duration predictor when you use the restore function to start that second training session, or you'll undo all the training you've already done.

A second option is to reinitialize the text encoder and then retrain it using phonemes, which I found out by doing it by accident. The power of being an idiot, I guess; I know everything because I know nothing. In the VitsConfig section, set use_phonemes to True and the phonemizer to espeak. Set the phoneme language to English, the text cleaner to english_cleaners or multilingual_cleaners, and use the dataset cache path shown above to store the phoneme cache in your dataset directory. Here I'm going to show the difference between the raw text and phonemized input by clearing the text encoder and duration predictor again. Once again I'll be playing repeated samples, so skip ahead if you don't want to hear how the voice develops.
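Before the samples, here's roughly what those phoneme settings look like in the config; the field names follow the Coqui TTS VitsConfig, and the cache path is just a placeholder for your dataset folder:

```python
from TTS.tts.configs.vits_config import VitsConfig

# Rough sketch of the phoneme settings described above.
config = VitsConfig(
    use_phonemes=True,
    phonemizer="espeak",
    phoneme_language="en",            # adjust if your espeak build expects "en-us"
    text_cleaner="english_cleaners",  # "multilingual_cleaners" also works here
    phoneme_cache_path="/path/to/dataset/phoneme_cache",  # placeholder path
)
```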

Speaker 1: I've had decent luck with reinitializing the text encoder and duration predictor at the beginning of a new fine-tuning session, that is, when I'm using the restore function, and then listening to the samples generated in TensorBoard. When I'm satisfied with the structure of the output, I stop the training session; this is usually after a few thousand steps, around 7,000 to 13,000. For the next stage, I freeze the text encoder and duration predictor, set freeze_PE, freeze_flow_decoder, and freeze_waveform_decoder to False, and make sure I'm not reinitializing the duration predictor or text encoder. With any luck the speech patterns stay fixed for the rest of the training, and the fine-tuning can focus on the areas that affect the fidelity of the output audio. Here are a few samples after freezing the duration predictor and text encoder and allowing the training to focus on other areas.

Speaker 3: Again I'll be playing a couple minutes of repeated samples here.

Speaker 4: This cake is great, it's so delicious and moist.

Speaker 3: This cake is great, it's so delicious and moist.

Speaker 4: This cake is great, it's so delicious and moist. I haven't quite one time to develop a voice, and now that I have it, I'm not going to be silent.

Speaker 5: I haven't quite one time to develop a voice, and now that I have it, I'm not going to be silent. [The line repeats across a few more checkpoints.]

Speaker 4: Be a voice, not an echoes. You're a voice, not an echo.

Speaker 1: There are three voices in this training set: Johnny Cash, Duke Nukem, and Rod Serling. The Rod Serling dataset was the smallest and the lowest quality in terms of fidelity; however, its pacing and sentence structure are much more natural than the other two. The Duke voice comes from ripping samples from as many games as I could find; the samples are aligned somewhat awkwardly and need longer, more natural pauses between phrases. The Johnny Cash voice comes from a reading of the New Testament, but again, due to the nature of the reading, some of the pauses are a little unnatural compared to regular speech. Out of the three, it's the largest and probably the most uniform, though. These kinds of defects will carry over into the training, and you may be able to adapt around them by adjusting the training parameters.

The graphs in TensorBoard can be helpful for choosing which checkpoints to evaluate and continue training on. You can skim through to find which checkpoints have the lowest loss in the area you're focusing on; basically, try to get nice, smooth loss curves. Loss 1 and the average general loss are probably the easiest to watch while training.

Well, that's about all I wanted to cover in this one. I've probably forgotten a few things, but you can let me know down in the comments if you have any questions. If you want to help out the channel and support these kinds of videos, hit the like button, subscribe, and leave a comment down below; it helps get it out into the algorithm. Thanks for watching.
