Challenges and Progress in YourTTS Training
Exploring difficulties and tips in training YourTTS models, dataset usage, script insights, and fine-tuning results for improved voice synthesis.
Revisiting YourTTS - Details about Training, Datasets, and Experiences with Voice Cloning in Coqui TTS

Speaker 1: Hey again, everyone. This week I'm going to be taking another look at YourTTS. Initially I found it a little difficult to train consistently, and I got a little tired of fiddling with it after the inconsistency was compounded by a few bugs that have since been fixed in Coqui. Is it any easier now? Well, in short, not so much, but the process isn't really all that different from training any other model with Coqui. So I'll talk about that, the datasets that I've trained with, the training parameters that I've tried, some issues that have come up, training the model, using the model for voice conversion and speech generation, a couple of projects that aim to make dataset processing a little easier, and whatever else I can think of as a wrap-up to the VITS and YourTTS videos. I highly recommend reading over a few pieces of documentation before trying to train YourTTS with Coqui. First, the Coqui tutorial for nervous beginners. This covers most of what I'm going to be doing here, and I really won't be deviating much from the format of the training recipe shown there. The code I'll be using later on is just the YourTTS training script with the variables I want to set hardcoded, rather than editing a config.json file and using that. Checkpoints generated by the script will have the necessary config.json and metadata files. Skim the VITS documentation page for information about what the various variables do. YourTTS is based on VITS, and many of the same Coqui configuration options are applicable. The GitHub release page for YourTTS provides command-line entries for speaker voice substitution and zero-shot voice cloning. However, with both of these, my experience has been less than stellar. I don't know what sample length or quality you need to provide for this to work well, but it doesn't work well on my system; I'll go over that later. You can also find links to Google Colab scripts to replicate the training and inference experiments from the YourTTS paper. I'll link to all of those resources down in the description below. I'm not going to go through training a full model on Colab in this video; I've done enough of that in the past with the VITS training videos. The linked notebook is essentially just the Coqui YourTTS training script from the GitHub repo with the configuration options hardcoded for convenience's sake. Unlike the notebooks I've posted in past videos, there's no dataset creation or processing section here. You'll need to make datasets some other way. A lot of the audio processing tools have conflicting dependencies, which makes having them coexist with the Coqui training framework a bit of a nightmare. It seems like a lot of people had difficulty training YourTTS, so I thought it may be best to keep things simple and just stick to training the model in this video. First, this is the YourTTS model running out of the box.

Speaker 2: There are two built-in voices that I'm able to use to generate speech, though there appear to be six trained in. This cannot pronounce capitalized acronyms, so TTS had to be spelled phonetically using the name of the beverage and the sound the snake makes.

Speaker 3: This is the male voice speaking in English. The other voice keys have new-line characters after the name for some reason, and I can't seem to trigger them. I'm not sure if these are an amalgamation of training voices, or selected from the various open-source training datasets used.

Speaker 2: I have to say I'm not overly impressed with either of these voices. The model consistently mispronounces many common English words. For example, let's say datasets again. Deh-heh-sets, deh-heh-sets, deh-heh-sets.

Speaker 3: There isn't a natural prosody to medium and long sentences because emphasis is placed on the wrong syllables. Oh, if any of you are weebs, you'll understand this example, it sounds like Chris-Chan talking. I'll be really blunt here. This sounds like garbage, and I don't know why everyone was so excited about this release when VITS was already around. I sound like Microsoft Sam's younger brother. If Clippy had a voice, it would be this.

Speaker 2: Well, now that we're all thoroughly underwhelmed, let's take a look at the datasets I'll be using to fine-tune and retrain the model, then look at the script, and review some results.

Speaker 1: I didn't have a ton of luck with voice substitution either. Have a listen.

Speaker 4: Shame. I used to read word-up magazines, and now I have a DL up in the list.

Speaker 1: Zero-shot voice cloning did a bit better, but still quite far off from sounding realistic. It's about the level of a poor impersonation attempt.

Speaker 4: It was all a dream. I used to read word-up magazine. Salt and pepper and heady dup in the limousine. Page and pictures on my wall. Every Saturday rap attack, Mr. Magic Marley Marle.

Speaker 1: The datasets I'll be using here have been gathered from various commercial sources, so I can't share the trained model. Unfortunately, I don't have access to any open-source audio datasets where lines were expressively or emotively read. I also think it would be helpful to have known voices to use as examples, so you can hear how well or not they've been captured. But I don't really want to use the bog-standard demo character of Donald Trump. I'm going to have to hear enough of him over the next year. A breakdown of the datasets is as follows. I've got about 4 hours and 36 minutes in 5500 segments of Alex Trebek. Oscars, wild. Entertainment history, I resign. In the neck. I've got 5 hours and 42 minutes in 10,000 segments of Johnny Cash.

Speaker 5: And therefore these powers are at work in him, for Herod had laid hold of John and bound him and put him in prison for the sake of Herodias.

Speaker 1: I've got 1 hour and 5 minutes of Duke Nukem in 1900 segments.

Speaker 6: It ain't that kind of party, stinkfinger. You okay?

Speaker 1: Let's climb. This thing is holding on by an ass hair. I've got 3 hours and 36 minutes of Stephen Fry in 2850 segments.

Speaker 7: He just said, drink up. Thank you. The PA, ghastly noise. The boat spared on. They couldn't be more wrong, he said.

Speaker 1: 6 hours and 16 minutes of voice actor John McClane in 8400 samples.

Speaker 7: Rydell was a big, quiet Tennessean with a sad, shy grin, cheap sunglasses, and a walkie talkie screwed permanently into one ear.

Speaker 1: 3 hours and 44 minutes in 6400 samples of voice actor Richard Zindin.

Speaker 8: They were invisible in mirrors, but he knew that was untrue, as untrue as the belief that they transformed themselves into bats. That was a superstition that logic plus observation had easily disposed of.

Speaker 1: Ron Perlman at 1 hour and 11 minutes with 1470 samples.

Speaker 6: Got the key card. Come on, get up. The computer has stopped working. Hey, it's a punk ass ninja.

Speaker 1: And 5093 samples of some British woman at 1 hour and 42 minutes.

Speaker 9: Racing in events is a great way to increase your influence. Competing in events will raise your influence. Hans has new claims to investigate at the car files.

Speaker 1: These datasets are large enough that I'll be using them to try training a new YourTTS model from scratch once the fine-tuned model trains to a point where it sounds good. For fine-tuning the pre-trained model, the authors say that as little as a minute of speech can be used, but from my test runs, 15 minutes is probably going to get you closer to realistic speech without needing to endlessly fiddle with the training parameters. For training, the default rate of 2e-4 may work best. I have to use very small batches on my system, typically 16, so I decrease the learning rate. When fine-tuning, I tend to lower the learning rate to 1e-4 or 5e-5.

With Coqui, you can choose to train using characters or using phonemes, assuming the language is supported by the backend phonemizers. The Coqui-released YourTTS model was trained using characters, not phonemes, so fine-tuning may take longer than expected if you're using phonemes. The model will need to learn how to map the phonemes to sounds, because it only knows character mappings for the trained languages.

Before you get into your real training sessions, do a trial run of the trainer to dial in the ideal number of loader workers for your system. It'll probably be between 1 and 4, but let each value run for a couple hundred steps and keep watch of the step time and loader time. Increase the number of loader workers until you reach the lowest loader time and step time. To manage GPU memory, you can lower the batch size or lower the maximum text length. It may also be a good idea to set start_by_longest to true, so the longest samples are batched together and will trigger an out-of-memory error early if one is going to pop up. You can also set the mixed_precision flag in the VITS config to true, but you'll then need to run inference with mixed precision as well, and it doesn't seem to train as well.

Add a configuration for each of your datasets and then add it to the dataset config list. The compute_embeddings function will iterate through each dataset, compute a file of speaker embeddings, and store it in the dataset directory. If you alter your dataset's audio files, you'll also need to delete the embeddings file so it gets regenerated. If you're going to attempt multilingual training, set use_language_embedding to true, set the main language, and also set the phoneme language for each dataset. I think Coqui is set up to process phonemes per language now; I vaguely recall it being planned and I think it happened, but you may want to check.

I have a weird little loop in the script that generates test sentences for each speaker. Normally you need to specify the speaker names manually for each of the test sentences. If you have dozens or hundreds of speakers, you may want to remove this and return to the original test sentences, although I've used it for about 100 speakers and it was fine. Rather than read over the training script here, I'll link to a commented version in the description for this video. Don't feel obligated to use my modified version over the regular Coqui version, though; this is just what I use.

Once you've got a model training, you're going to want to keep watch of its progression using TensorBoard. If you're training on your own system, open another console window and launch TensorBoard with tensorboard --logdir followed by the path to your run directory. Then, in a browser window, navigate to the address shown in the console window.
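To tie those options together, here is a condensed sketch of the kind of hardcoded setup described above, loosely modeled on the Coqui YourTTS recipe. The dataset name and paths are placeholders, the speakers.pth filenames are assumed to match whatever you produce with compute_embeddings, and some field names vary between Coqui TTS versions, so treat this as a starting point rather than a drop-in script.

```python
import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs

OUT_PATH = "runs/yourtts_finetune"                   # placeholder output directory
RESTORE_PATH = "checkpoints/yourtts/model_file.pth"  # pre-trained YourTTS checkpoint

# One BaseDatasetConfig per voice; add each one to this list.
DATASETS_CONFIG_LIST = [
    BaseDatasetConfig(
        formatter="ljspeech",            # metadata.csv in LJSpeech layout
        dataset_name="stephen_fry",      # placeholder dataset name
        path="datasets/stephen_fry",     # placeholder path
        meta_file_train="metadata.csv",
        language="en",
    ),
    # ... more datasets here ...
]

model_args = VitsArgs(
    # Use external speaker embeddings (d-vectors); one embeddings file per dataset,
    # produced beforehand by Coqui's compute_embeddings helper. Delete and recompute
    # these files whenever the dataset audio changes.
    use_d_vector_file=True,
    d_vector_file=[os.path.join(d.path, "speakers.pth") for d in DATASETS_CONFIG_LIST],
    d_vector_dim=512,
    use_language_embedding=False,        # set True for multilingual training
)

config = VitsConfig(
    model_args=model_args,
    run_name="yourtts_finetune",
    output_path=OUT_PATH,
    datasets=DATASETS_CONFIG_LIST,
    batch_size=16,              # small batches, so the learning rate is lowered too
    lr_gen=1e-4,
    lr_disc=1e-4,
    num_loader_workers=4,       # dial this in with a short trial run
    use_phonemes=False,         # the released model was trained on characters
    max_text_len=200,           # lower this to rein in GPU memory
    start_by_longest=True,      # batch the longest samples first to surface OOMs early
    mixed_precision=False,      # saves memory, but didn't seem to train as well
)

train_samples, eval_samples = load_tts_samples(config.datasets, eval_split=True)
model = Vits.init_from_config(config)

Trainer(
    TrainerArgs(restore_path=RESTORE_PATH),   # restore the pre-trained checkpoint
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
).fit()
```

The per-speaker test-sentence loop mentioned above would populate config.test_sentences; it's left out here to keep the sketch short.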
There are a handful of things that should be monitored, and they don't necessarily train uniformly or in the same direction. One factor may degrade while another improves. This can be affected by your batch size and learning rate, but also by the dataset audio quality and the transcription accuracy, or whatever zodiac sign we're in, or cosmic rays. Loader time is affected by system load and loader workers. If you're using Linux on Windows with WSL, keep your dataset and training files on your WSL partition. Shuffling files between the Linux file system and the Windows file system is very slow, even if they're both on the same drive, and this can have a huge impact on your loading times.

There are two average loss values to watch, and they can often go in opposing directions. I'll link to a post by Edresson, one of the Coqui devs, that explains the loss functions a little, but in short, loss 0 is a sum of the other loss functions. Loss 1 is a sum of, well, I don't know. My past assumption was that loss 1 factored in the speaker encoder and loss 0 didn't, but Edresson's post says otherwise. I've tried staring at the code and trying to work out the calculation from the framework, but I ended up cross-eyed with a migraine. It's a little beyond me. From a practical standpoint, duration loss represents the prosody, the pacing of the speech and the length of silences. Feature loss is the overall tone and pitch, and mel loss is how well the audio is being synthesized. These should all be trending downward as training progresses, but they may not go down smoothly, or may not all trend down at the same time. Even as the model is nearing a state where the output sounds good and the correct sounds are being spoken, loss 1 and loss 0 will diverge. In these cases, loss 1 often seems to be the best indicator of model quality. This can make relying on the best checkpoint saved by the trainer a little difficult, so you'll want to navigate to the audio tab of TensorBoard and listen to the audio samples generated after every epoch.

In terms of training time, it depends on your data, how it's divided, and the training parameters. If you're fine-tuning with the original character set and not using phonemes, you should begin to hear reasonably clear speech after a few epochs. If you're fine-tuning with phonemes, you'll probably hear some speech sounds around 10,000 to 15,000 steps, and a reasonably clear voice between 30,000 and 50,000. This all sort of goes out the window if you're training more than one voice at a time, though. Using phonemes with the YourTTS model requires a few more initial fine-tuning steps, and it may not work as well as characters, because the original model was trained using characters, not phonemes.

Similar to the VITS model, it's possible to freeze or reset various parts of the YourTTS model during training. These take effect when restoring a checkpoint or beginning a new training run, not when continuing a run, because they alter the config file. If you continue a training run after setting one of these options, it will remain set and will reset or freeze the layers when training starts, potentially erasing any progress made. When you want to relaunch a training session, you'll need to do a restore run using the desired checkpoint, or edit the config.json file to disable the option you previously enabled. The reinit_text_encoder option will reset the text encoder, which is helpful if the model has learned misspeakings due to poor transcriptions in the past. The reinit_DP option resets the duration predictor.
This can be helpful if the model is clipping most or all words, or making whispering or sighing sounds where there are pauses in speech. The freeze_encoder option will freeze the text encoder, which can be particularly handy if you have noisy or poor-quality audio, where the text encoder may otherwise pick up the transcriptions before those sounds are learned cleanly. The posterior encoder, duration predictor, flow decoder, and waveform decoder can all be frozen as well. The only one I've used with any degree of success is freezing the duration predictor. I was training a model using some very noisy audio, and it picked up the characters and prosody but needed a lot more training before the speech sounded clear and not distorted.

For specifics about training in languages other than English, skim some of the other videos I've posted, or try searching the Coqui GitHub discussion board. I have a few videos up with VITS training in other languages, which I'll link down in the description. YourTTS is based on VITS, and most of the configuration options apply.

Once a model is cooked to your satisfaction, you can stop training and give it a test. Here I'm going to first fetch the list of speakers trained into the model. Using the Coqui TTS command-line application, specify the path to the model and config file, and then use the --list_speaker_idxs flag. Use the --list_language_idxs flag to fetch the list of trained-in languages. Now I'll try generating speech with a few of the new voices.
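For reference, the commands I'm running look roughly like the following. This is a sketch driven from Python with placeholder paths and a hypothetical speaker key; the flags shown are the ones in Coqui's tts command-line tool, but confirm them against your installed version.

```python
import subprocess

MODEL = "runs/yourtts_finetune/best_model.pth"   # placeholder checkpoint path
CONFIG = "runs/yourtts_finetune/config.json"     # placeholder config path

# List the speaker and language keys trained into the checkpoint.
subprocess.run(["tts", "--model_path", MODEL, "--config_path", CONFIG,
                "--list_speaker_idxs"], check=True)
subprocess.run(["tts", "--model_path", MODEL, "--config_path", CONFIG,
                "--list_language_idxs"], check=True)

# Generate speech with one of the trained-in voices.
subprocess.run(
    [
        "tts",
        "--model_path", MODEL,
        "--config_path", CONFIG,
        "--speaker_idx", "stephen_fry",   # hypothetical speaker key from the list above
        "--language_idx", "en",           # only needed for multilingual models
        "--text", "Here is a quick test sentence for the fine-tuned voice.",
        "--out_path", "test_output.wav",
    ],
    check=True,
)
```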

Speaker 6: This is the Ron Perlman voice. The model is still under-trained.

Speaker 7: Here is the Stephen Fry voice that was trained using the Received Pronunciation language code.

Speaker 10: This is how the female voice has progressed so far.

Speaker 1: To perform voice substitution with one of the trained voices, in addition to the model path and config path, specify the language index if it's a multilingual model, specify the speaker index for the voice you're using, provide the original source file using the --reference_wav flag, and specify an output file path for the new audio.
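As a rough sketch under the same assumptions as before (placeholder paths, a hypothetical speaker key, and flag names taken from Coqui's tts CLI), the substitution call looks something like this:

```python
import subprocess

# Convert an existing recording into one of the trained voices.
subprocess.run(
    [
        "tts",
        "--model_path", "runs/yourtts_finetune/best_model.pth",  # placeholder paths
        "--config_path", "runs/yourtts_finetune/config.json",
        "--language_idx", "en",                     # only for multilingual models
        "--speaker_idx", "british_female",          # hypothetical target voice key
        "--reference_wav", "source_recording.wav",  # original audio to convert
        "--out_path", "converted.wav",
    ],
    check=True,
)
```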

Speaker 10: Well, that result was a little better than the one with the Donald Trump voice earlier on in the video.

Speaker 1: So let me do a Deadpool moment here and break the fourth wall. I'm going to let that model continue training and get to editing some of this video together. At the end, I'll revisit the model and run a few more samples, then try that voice conversion out again and see how things have progressed. I wasn't really going to cover making datasets in this video, and I'm still not going to, but I want to point out a few projects that may make things a little easier if you're going to try building a pipeline for processing audio files. I have very little experience using either of these, but if I hadn't already mashed together my own mess of tools, I'd be using one of them regularly. VocalForge aims to be an end-to-end toolkit for creating datasets. When I first tried it out, it was early on in its development, and I think a lot of the kinks have since been worked out. It does use some pretty involved and large projects like NeMo, so it isn't the most lightweight, but give it a look if you want a semi-autonomous tool. I usually use a more manual approach, and this next set of scripts and tools does a lot of what I glued together on my own, but does it a lot cleaner and a lot faster. Someone stumbled across one of my TTS videos and was nice enough to share their dataset processing pipeline on GitHub. You'll need some basic bash scripting and Python skills with this one, but you can browse the scripts in the tools directory and then edit the pipeline shell script to set up your processing chain. After letting the multi-speaker celebrity voice model train for a while, I realized that the Alex Trebek voice was throwing off the alignment. Too many samples were being purged because I hadn't transcribed the numerical values in that dataset, and they made up a lot of it. It was throwing off the training of the other voices, leaving huge gaps between words and introducing a lot of odd noises. Removing the voice and then beginning a new training session helped move things in the right direction again, but this was a fine-tune of the Coqui model, and I wasn't really thrilled with it anyway. So I changed my mind, scrapped it, and began training a new YourTTS model from scratch using the VCTK dataset. I've posted a demo of the Irish-accented voices from the VCTK dataset. There were 13 in total, and I think the model has so far managed to keep them all distinct. The posted demo was still under-trained, and there were a couple of misspeakings that I hope will smooth out once the voices clear up again. However, I may have added the new voices too early. Adding more speakers before the prior speakers can clearly pronounce words tends to solidify those misspoken words or syllables.

Speaker 11: I'm sorry, but I don't want to be an emperor. That's not my business. I don't want to rule or conquer anyone. I should like to help everyone if possible to, gentile black man white. We all want to help one another.

Speaker 12: Human beings are like that. We want to live by each other's happiness, not by each other's misery. We don't want to hate and despise one another.

Speaker 13: In this world there is room for everyone, and a good earth is rich and can provide for everyone. The way of life can be free and beautiful, but we have lost the way.

Speaker 14: Greed has poisoned man's souls, has barricaded the world with hate, has goose-stabbed us into misery and bloodshed. We've developed speed, but we have shut ourselves in.

Speaker 15: Natury that gives abundance has left us want. Our knowledge has made us cynical. Our cleverness hard and unkind. We seem too much and feel too little.

Speaker 16: More than machinery, we need humanity. More than cleverness, we need kindness and gentleness. Without these qualities, life will be violent and all will be lost.

Speaker 1: But I got excited about how it was progressing, and tried to add in another difficult-to-capture accent. A few Indian-accented English speakers, also from the VCTK dataset.

Speaker 17: The impact of Artificial Intelligence on human rights, democracy and the rule of law is one of the most crucial factors that will define the period in which we live, and probably the whole century.

Speaker 18: Our Artificial Intelligence is designed and works is very complex, but its impact on our life can be easily explained with an example of a state that harms me, but can bring the state to court, which I cannot do with algorithms for harming me.

Speaker 19: I fear the technological development may uproot the human rights protection system we painstakingly built over the past 70 years.

Speaker 1: There's still a long way to go for training, but this seems like it's going the right way. I've added a lot of data here, so I'm going to need to let this train until a stable minimum is reached with the loss 1 value before adding any more voices. This is training with a batch size of 16 and a starting learning rate of 1e-4. If you have a beefy 24GB card, or are renting a server, you can increase the batch size and increase the learning rate, which should reduce training time by a lot. A larger batch size may need a little more data than working with a small batch size. If you have alignment issues with a higher batch size, decrease the learning rate. If you can only do batches of 16 or less and have a huge multi-hour dataset, you may find that the model overfits quickly. If this happens, reduce the learning rate or reduce the size of your dataset. Well, that's about all I wanted to cover in this one. If this model turns out well, I'll put up a download somewhere, if there aren't any licensing issues with the dataset. If you found this helpful, share it around, hit the like button, subscribe, or leave a comment. Everything helps the channel. I don't really post these anywhere, so they just kind of depend on the algorithm and viewers getting things out there. And a special thank you to everyone from the various Discord groups, and yes, even you, 4chan. You're fun people. Well, I'll be back soon with, well, I'm not quite sure yet. If there's something you want to see, let me know in the channel video requests post. Thanks for watching, and as always, stay human.
