Speaker 1: Hey again everyone, I've updated the Google Colab notebooks for the VITS and YourTTS training scripts. There are separate notebooks for multispeaker English VITS, for training a single speaker in a language other than English, and a multispeaker English YourTTS notebook. All three are linked down below in the description. All three notebooks are very similar, but they should be treated as separate, and by that I mean don't try arbitrarily swapping datasets between them without being aware of the structure or the sample rates of your prepared samples and whatnot. YourTTS is a 16kHz model, and I've been using 22.05kHz samples in the VITS notebooks. I'm going to go through preparing a dataset for the alternate language notebook and then training a Spanish speaking model as best I can. Then I'll go through the multispeaker English notebook and the YourTTS notebook, but I'll be skipping the dataset bit and only really looking at the differences between the notebooks. This video is mainly going to be focusing on the notebooks specifically, which are just the model training scripts provided by Coqui that I've adapted a little and added onto.

Before I get into the scripts, I want to touch on a few things that came up in the comments recently. First, these scripts are not really an all-in-one, point-and-click solution to training a voice model. They're pretty close, but you'll still need to put in some work. Second, the answer to almost any other question is going to be: it depends, but try it. How many training steps will it take? What will the output quality be? These types of questions have no real solid answer. It depends on your data, how much data you have, how the data is structured, how the data is loaded, and the infinite interactions with the training parameters. But after a few thousand steps, you should begin to hear words being formed, misspoken and with poor-quality audio. If the audio cleans up and the words are still being babbled, then it's likely a transcription error, or you've modified your dataset and not deleted the phoneme cache. So delete the phoneme cache if you altered your dataset in any way. Also, download your dataset and review each file before attempting training. If you have any blank transcriptions or a bad audio file, training will fail, and the errors won't always be obvious from the messages. If you have extremely short (below 1 second) or long (above 9 seconds) audio files, training may also fail.

I'm going to go through the alternate language VITS training script, but don't expect amazing results here out of the box. This is just a starting point if you want to try experimenting with this stuff yourself. For languages other than English, you're probably going to get the best results training a new model from scratch, unless your language is very similar to English. This notebook is set up to train using phonemes generated by eSpeak. That's only going to be possible if the language you want to train is supported by eSpeak, and it should only be done with officially supported languages, otherwise the phoneme input from eSpeak to Coqui will be of low quality. Check the supported languages list on the eSpeak NG site and then use the appropriate language code where necessary. For other languages, you'll need to set the use phonemes variable to false and change the phonemizer lines to none. This will train the model using raw text input, but that does seem to take considerably longer to reach a similar degree of quality.
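As a rough sketch, here's how those phoneme options map onto Coqui's VitsConfig. The language code, text cleaner, and cache path below are placeholder examples, not the notebook's exact values.

```python
# A minimal sketch of the phoneme settings discussed above, using Coqui's VitsConfig.
# The language code, text cleaner, and cache path are placeholders.
from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    text_cleaner="multilingual_cleaners",
    use_phonemes=True,           # train on phonemes instead of raw characters
    phonemizer="espeak",         # Coqui's eSpeak-NG wrapper
    phoneme_language="es",       # eSpeak language code for the dataset
    phoneme_cache_path="/content/drive/MyDrive/phoneme_cache",  # delete this if the dataset changes
)

# For a language eSpeak doesn't support, fall back to raw text input:
# config.use_phonemes = False
# config.phonemizer = None
```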
Your dataset needs to be as accurately transcribed as possible, with all of the numerals written out as their spoken counterparts, and characters like percent or dollar signs spelled out and removed. If your language of interest uses another character set, it needs to be defined in the model configuration before training. Use the code block that displays all of the characters in your loaded dataset, and then set the character configuration manually by creating another code block with the character definitions. With Coqui you can also define custom phonemes, but I don't have any experience doing that, so I'd recommend digging through the Coqui TTS GitHub and taking a look at the old Mozilla TTS pages to see if there's any information.

Alright, enough of that, let's go take a look at that notebook. I'm going to go through this pretty quick and time-lapse through any sections that take a while, like computing the speaker vectors or caching phonemes. Just a little disclaimer: I may know a few individual Spanish words, but I'm in no way a Spanish speaker. I have no way of evaluating this model that I'm training, or really knowing if the training data is accurate at all. It could be saying very energetic nonsense, I have no idea. There are a lot of difficulties in training new languages, so don't expect things to go smoothly or work the first time.

When you open the notebook, click the link to save a copy to your Google Drive so your changes will be saved. Open the copy, and then you can close the original. In your copy of the notebook, run the first cell to connect your Google Drive account. Model checkpoints, config files, logs, and your datasets will be saved there. In the next set of cells you can adjust the dataset name, upload directory, and run name. You can leave the trainer output and model file settings alone. If the sample uploads directory did not exist, it was created when the dataset settings cell was run. In that folder on your Google Drive, place the samples for each new speaker in a directory with the speaker's name. For example, I've called this speaker Spanish. If I was adding more speakers, I would place their samples in directories named, for example, Bob and Doug. There are four run types in the script: if you're beginning a new fine-tuning session, use restore; if you want to begin a new fine-tuning session based on a previous checkpoint, use restore checkpoint; if you want to resume an interrupted previous training session, use continue; and if you're training a new model from scratch, select new model. Run the next cell to download and build Xiph's RNNoise, a neural network denoising library which will be used later on to process your samples. The next cell will install OpenAI's Whisper speech-to-text and translation framework, its requirements, and SoX, which will be used to split audio samples later on. Run the next cell to install Coqui TTS and its requirements. The next cell is optional: running it lists all of the various pre-trained models available on the Coqui hub.
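If you want to poke at that model list outside the notebook, Coqui's model manager exposes the same catalogue from Python. Something like this, assuming a recent coqui-ai/TTS install; the model name below is only an example of the naming scheme.

```python
# A hedged sketch of listing and fetching pre-trained models from the Coqui hub.
# The model name used here is just an example.
from TTS.utils.manage import ModelManager

manager = ModelManager()
manager.list_models()  # prints every model name published on the Coqui hub

# Downloading a model returns local paths you can point the notebook at later:
model_path, config_path, model_item = manager.download_model("tts_models/en/vctk/vits")
print(model_path, config_path)
```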
Use the next cell to list all the folders in the sample upload directory. This is just for your reference if you're working with a lot of speakers. Each speaker's data should be uploaded to a separate subdirectory. Set the name of the speaker and the sample upload folder in the next input boxes, then run the cell. The next block sets the audio processing options. This is a bit janky and works best when all the options are enabled, but you can disable the filter, denoise, splitting, and normalization steps if you need to. The processing cell here is where a lot of the magic happens. You can use OGG, WAV, or MP3 files as your samples. Sample file extensions are renamed to catch any variations, and then the files are converted to 22.05kHz mono audio. The samples are run through RNNoise, then split based on silence, filtered and normalized, and split again into forced 9-second segments to catch any long files. Any leftover short snippets are deleted, and then the audio is converted to FLAC to follow the VCTK dataset format.

This notebook is a little different than the others, because we need to adjust the language settings in a few areas. The first area is OpenAI's Whisper speech-to-text. If a language isn't specified, Whisper will attempt to guess the language of each clip. This takes a lot of extra time, and on short audio clips like the ones we're using in training, it often gives bad transcripts. Some of the Whisper models are English-only, but the large models are multilingual. Check the Whisper site for a complete list of supported languages and their estimated accuracy. Enter the language name in the input box, and then run the cell. Select the model from the dropdown list, and then run the cell below it to set the variable. Run the next cell to download the chosen model. Run the cell directly below this to begin transcribing the sample audio clips. Repeat the cell that sets the processing directory, and the transcription cell, for each speaker that you are adding. The next cell will look at the transcripts generated by Whisper and move any blank transcripts and their corresponding audio files to a folder called Bad Files in your dataset directory. Set the dataset name and the speaker, and then run the cells.
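Under the hood, that transcription step is essentially a loop over your clips with the language pinned so Whisper doesn't have to guess. Here's a rough sketch of the idea, with placeholder paths and an illustrative pipe-delimited output rather than the notebook's exact code.

```python
# A rough sketch of per-clip transcription with OpenAI Whisper, language fixed.
# Paths, the speaker folder, and the output format are illustrative only.
import os
import whisper

model = whisper.load_model("large")                      # multilingual model
clips_dir = "/content/drive/MyDrive/dataset/spanish"     # placeholder speaker folder

for fname in sorted(os.listdir(clips_dir)):
    if not fname.endswith(".flac"):
        continue
    result = model.transcribe(os.path.join(clips_dir, fname), language="es")
    text = result["text"].strip()
    if not text:
        # Blank transcripts (and their audio) need to be set aside,
        # otherwise training will fail later with unhelpful errors.
        continue
    print(f"{fname}|{text}")
```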
If you're attempting to fine-tune a model, use the next cell to download the pre-trained English language VITS model from the Coqui hub. As mentioned earlier, attempting to fine-tune the English language model on another language is probably going to yield poor results. Load TensorBoard by running the next two cells. This is where you can look at graphs and listen to audio samples while your model is training. Run these cells before training, and then hit the refresh button inside the TensorBoard dashboard to load the latest log data as it's generated by the trainer. If you're trying to continue a previous run, set the run type to continue, and then use this cell to list all of the run folders. Use the next cell to list all the checkpoints in the run folder if you're trying to restore a run from a particular checkpoint, and then set the checkpoint in the cell below. Run the next code block to load the libraries for training. The next code block has the dataset configuration. Here you will need to set the dataset language using the language code and choose whether you're using a phonemizer or not; if not, leave the phonemizer field blank. The next block is for configuring the model arguments. Read the VITS documentation on the Coqui site and the VITS research paper for details on what each of these does. For this notebook, enable the speaker weighted sampler by setting the use weighted sampler option to true if you're training more than one speaker, and set it to false otherwise. Adjust the text cleaner if necessary and set the phonemizer options. If your language is well supported by eSpeak, you can train using phonemes by setting the phoneme language to your language code and setting use phonemes to true.

Enter test sentences following the format in the script, and then run the cell after setting your options. The next cell will load the audio processor, and then the cell after that will precompute the vectors for your dataset. If you alter your dataset, you'll also need to delete the cached vector file in your dataset directory. This can take a while, and if you're on the free tier of Google Colab, make a cup of tea and find a newspaper somewhere. You'll probably get a Turing test pop-up asking if robots can lie about being robots, and making you click pictures of things Teslas affectionately autopilot into, like houses and firetrucks. If Colab thinks your session is idle, you'll be disconnected and need to run the setup steps again. The next cell loads the speaker encoder and fetches the speaker information from the precomputed vector file. Run the next cells to load the tokenizer and initialize the model configuration, and then run the cell to load the training data. The next block of code is optional: it will dump the character list from your dataset. Use this to set a custom character configuration if you need to, or to identify any odd characters that may have appeared in your transcripts. The block of dropdowns after that is used to set the configuration options. These should all be set to false by default. Refer to the VITS documentation and the research paper for information on how these might affect training. Initialize the trainer by running the next cell. If you adjusted the training options above, you'll need to re-initialize the trainer before beginning. Run the final cell to begin your training. At the bottom of the script is an example of the Coqui TTS command line tool, which can be used to generate speech with your model. Configure the directories to point to your checkpoints. If you want to save the generated speech, you can just right click on the audio player and select save.
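If you'd rather generate from a script than from the command line, the same thing can be done through Coqui's Python API. This is a hedged equivalent, not the notebook's own code; the checkpoint, config, and speaker name are placeholders for your own run folder.

```python
# A hedged Python-API equivalent of the command-line generation step.
# Checkpoint, config, output path, and speaker name are placeholders.
from TTS.api import TTS

tts = TTS(
    model_path="/content/drive/MyDrive/tts_train/run/best_model.pth",
    config_path="/content/drive/MyDrive/tts_train/run/config.json",
)

# For a multispeaker model, pass the speaker name used in your dataset;
# for a single speaker model, leave the speaker argument out.
tts.tts_to_file(
    text="Hola, esto es una prueba.",
    speaker="spanish",
    file_path="test_output.wav",
)
```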
I'm going to go through and play some samples of the Spanish speaking model as it develops. Keep in mind that this was done using probably dirty data, because I have no way of evaluating the quality of the transcriptions.
Speaker 1: The multi-speaker English language VITS training notebook is pretty much the same as the alternate language notebook that I just went over, except that things are hard-coded for English. One of the bigger differences is that the test sentences in this one are generated using a loop that takes all of the found speaker names and generates test sentences for each. This notebook also uses the external speaker encoder released by Coqui to train using vectors, so you'll also need to download that to generate with your model. The model configuration here is very similar to the previous notebook. Things are set to use phonemized text for input, done with the eSpeak wrapper. Running this notebook should be the same as the alternate language notebook. If you're training a single speaker, find the use weighted sampler attribute and set that to false.
Speaker 3: Here's an early sample of a voice as it begins training.
Speaker 1: And finally, the YourTTS notebook has been updated. This one is a little more difficult to train well, but it does produce some great sounding voices, though I would probably recommend sticking to VITS unless you really want to get fiddling around with the training parameters. The major difference between this and the VITS notebooks is that I've been using 22.05kHz audio for the VITS notebooks, while the YourTTS notebook needs 16kHz samples, but that's handled by the processing section. If you've seen the YourTTS video that I posted, or can get through the VITS notebook, I don't think there's a lot here that needs to be explained in terms of running the script.
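Just to illustrate the sample rate point: the processing section handles the resampling for you, but the underlying idea is a straightforward resample to 16 kHz, something like this (filenames are placeholders).

```python
# Illustrative only: resampling a clip to the 16 kHz that YourTTS expects,
# versus the 22.05 kHz used in the VITS notebooks. The notebook's processing
# section already does this; filenames here are placeholders.
import librosa
import soundfile as sf

audio, sr = librosa.load("sample.wav", sr=None, mono=True)        # keep the original rate
audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # YourTTS rate
sf.write("sample_16k.wav", audio_16k, 16000)
```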
Speaker 4: Here's a very early sample of what the model sounds like as it begins training.
Speaker 1: Using vectors gives clearer sounding audio faster, but you won't get discernible speech until the encoder trains, and you won't get well-placed tonal variations of a phrase until the model develops that deeper intonation. That's about all I wanted to cover today in terms of the scripts. I'm going to try to do a short video about the TensorBoard graphs, and I may post short videos about various tools and scripts that can be used to work with audio datasets. I don't think there's much more I can do in terms of extending the Colab notebooks, aside from the obvious efficiency improvements that need to be done, but if you have any ideas, let me know down below in the comments. These videos have been getting more viewers than I thought they would, so thanks for showing them around. Unfortunately, I can't really be providing support for the scripts. They kind of work when you don't do anything weird, probably, but I'm not making any promises. The whole thing relies on a janky chain of open-source tools poorly glued together by me, and I am an idiot. That doesn't mean I won't reply if you post a question or run into a problem, I might, just don't expect me to. I don't work for Coqui, and I'm not affiliated with Coqui in any way, I'm just a mostly blind guy that's been enjoying playing around with their TTS tools. So thanks again to everyone that's shared these videos. If you're still here, I want to give a mention to a channel I really like called Unscripted Coding. He does, well, unscripted coding, and looks at a lot of open-source tools. It's been a fun way to stumble across new projects and ideas, so check out his channel if you're a like-minded weirdo. Alright, thanks for watching.