Comprehensive Guide to Updated Colab Notebook
Explore an updated Colab notebook for multi-speaker VITS training, speaker embeddings, and audio processing methods, aimed at improved speech synthesis.
Train a VITS Speech Model using Coqui TTS Updated Script and Audio Processing Tools
Added on 01/29/2025

Speaker 1: Hey again everyone. This video is going to be all over the place. For the most part it's going to be a look at the updated Colab notebook as I go through it, and I'll be taking asides to talk about other things in more detail when needed. The notebook is laid out to train a multi-speaker VITS model with speaker embeddings. This is different from the other Colab notebooks for VITS training that I've posted recently. Unlike the others, this one doesn't use the external speaker encoder released by Coqui to train using vectors. The downsides to this are that it may not train as quickly, and it may not train as well with dirty data at first. However, the training will likely improve after the initial struggle, especially if you're fine-tuning. I haven't explored the commercial offerings by Coqui much, but I know they're working on some spectacular-sounding TTS stuff with Coqui Studio. If you want production-quality models, I'd suggest checking them out and seeing what they have to offer. The open source Coqui framework is constantly being updated, which brings both new features and the occasional headache. Unfortunately, sometimes those headaches are pretty frequent on Colab. You may notice errors when installing or running the Coqui scripts in Colab. Keep an eye on the box where the console output is shown when Coqui TTS is installing. If you see a red notice that the runtime needs to be restarted, you won't be able to ignore it. Click Runtime from the menu and then Restart Runtime. Once the runtime restarts, you can continue with the rest of the script. This resets the variables, but the downloaded or installed packages will still be there. I'm going to go through the Colab script now and talk a little bit about the dataset scripts that I've been using. As I do, I'll try to show the steps for how to use these scripts to make a VCTK-format dataset for use with this notebook. I won't be going into extreme depth on any of the tools here. For that, check out the links in the description for each of the models and tools used. The authors can explain their use infinitely better than I can, and this is going to be a pretty long video already. For those that want to run Coqui on their own computer, it's certainly possible. It's much easier to use Linux, but Windows works pretty well for synthesizing speech. Training on Windows takes a little bit more fiddling, but Thorsten-Voice posted a video with a section on installing Coqui on Windows for training, if you're interested. So check that one out. I'll link it down below in the description. Before I go through the notebook, I just want to reiterate that this notebook applies to training on Google Colab only. If you want to train on other hardware, it may be best to copy the Python code to your own training script, or to use the Coqui-provided training scripts in the recipes subfolder on GitHub with a prepared config.json file. Jupyter is a pretty big resource hog, and isn't really needed for training. You can always install TensorBoard in your Linux or Windows virtual environment, and view the dashboard in a browser. I've already run the first cell in the notebook to begin installing Coqui TTS. As of this recording, due to version incompatibilities with some of the packages, the Colab runtime needs to be restarted after installing Coqui TTS. Click Runtime on the menu, and then Restart Runtime, or click the button in the output box. Run the next cell to connect your Google Drive account.
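As a rough sketch, those first two cells amount to something like this (the package name and mount point follow the usual Coqui TTS and Colab conventions; the notebook's actual cells may differ):

```python
# Cell 1: install Coqui TTS. If a red notice in the output asks for a restart,
# use Runtime > Restart runtime before continuing.
!pip install TTS

# Cell 2: mount Google Drive so samples, datasets, and training output persist.
from google.colab import drive
drive.mount("/content/drive")  # files appear under /content/drive/MyDrive
```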
This is where you'll need to store your samples for processing, store your dataset, and where the training output, including models, config files, and training logs, is saved. Your Google Drive account will be mounted at /content/drive/MyDrive, with My and Drive capitalized. The cell after connecting your Google Drive will set some variables. dsname is the root folder of your primary dataset. It should be stored in /content/drive/MyDrive. Training output will be stored in a directory named trainer output under this folder. uploaddir is a folder on your Google Drive where your samples are stored. If this doesn't exist it will be created. Model file is the download path to the pre-trained VITS model from the Coqui model hub. There are a few trained VITS models available now, but if you try to use another you'll need to manually change this path wherever it's found in the script. Run name is a short name describing your training run. Like the other notebooks I've posted, this one has four run types. Restore is for beginning a new fine-tuning session based on a pre-trained model downloaded from the Coqui model hub. Restore checkpoint is for restoring a file from your own fine-tuned checkpoint. Continue is for resuming an interrupted training session using the best-loss model in the run directory. This may not always be the highest quality output, though. New model is for training a new model. This section of the notebook is where you can do some audio processing. If you've used any of the other notebooks I've posted, this is going to be a bit different and treats your audio files differently. So if you don't want anything overwritten, don't blindly run things here. All of these are destructive processing methods. First up, I've been playing with Demucs from Facebook Research a bit. Demucs is described as a state-of-the-art music source separation model, currently capable of separating drums, bass, and vocals from the rest of the accompaniment. After trying it out on some music tracks and some isolated a cappella tracks, I found that the v4 Hybrid Transformer model does a good job at removing subtle noise. This section is set up to use the v4 Hybrid Transformer model. Run the cell to install the latest development build of Demucs. Set the directory and file extension to process in the next box. The script will move the files to a directory named backup, then run Demucs to generate two files, vocals.wav and novocals.wav. The vocals.wav is then converted to 22 kHz and renamed to replace the original audio sample. Run the cell to write the script to your Google Drive, and then run the next cell to process the audio files. Here are a couple samples of Rod Serling processed with Demucs. The removed noise can be heard in the novocals track.

Speaker 2: The world is getting noisier and noisier. It's hard to find a nice quiet spot. Even

Speaker 3: out here, noise. There's no getting away from it, right? Wrong. It holds a promise for anyone who smokes. It's Oasis. And as its name implies, it promises you the most refreshing, the softest taste of all. I'm Rod Serling, kind of an expatriate from the Twilight Zone. I'm Rod Serling, kind of an expatriate from the Twilight Zone. I'm a writer.

Speaker 2: They come from every human experience that you either witness or have heard about, translated into your brain. They come from every human experience that you've ever seen. I'm a writer. They come from every human experience that you either witness or have heard about, translated into your brain and your own sense of dialogue.
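For reference, the per-file processing that the Demucs cell's script performs is roughly the following. This is only a sketch: it assumes the standard Demucs and FFmpeg command-line interfaces, and the paths, output layout, and file names here are placeholders rather than the notebook's actual values.

```python
import shutil
import subprocess
from pathlib import Path

sample = Path("/content/drive/MyDrive/uploads/clip.wav")  # placeholder sample path

# Back up the original before any destructive processing.
backup_dir = sample.parent / "backup"
backup_dir.mkdir(exist_ok=True)
shutil.copy(sample, backup_dir / sample.name)

# Separate vocals with the v4 Hybrid Transformer model (htdemucs).
# --two-stems=vocals writes vocals.wav and no_vocals.wav under the output folder.
subprocess.run(
    ["demucs", "-n", "htdemucs", "--two-stems=vocals",
     "-o", "/content/separated", str(sample)],
    check=True,
)
vocals = Path("/content/separated/htdemucs") / sample.stem / "vocals.wav"

# Resample the isolated vocals to 22.05 kHz and replace the original sample.
subprocess.run(
    ["ffmpeg", "-y", "-i", str(vocals), "-ar", "22050", str(sample)],
    check=True,
)
```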

Speaker 1: I've been trying out a few different audio normalization options. A useful utility for evening out the audio in a dataset is the aptly named normalize-audio. Install the utility and set the file extension and the directory to process, then write the script to Google Drive and run the script with the last cell. The script is set to normalize all the audio in a directory to minus 27 dB. The other audio normalization option I've added here is FFmpeg Normalize. FFmpeg Normalize works best on files longer than three seconds, so it can fail when processing datasets. The script here is only for pre-processing samples before they're segmented. Run the cell to install the utility, set the file extension to process, set the directory of samples to process, and then run the cell to write the script to your Google Drive. Run the next cell to normalize the audio samples. The script is set to normalize to minus 27 dB and resample the audio to 22 kilohertz. The original files in the sample directory will be overwritten with the normalized files. The original Rod Serling samples were pretty uneven, so here's a sample of what they sound like after normalization.

Speaker 2: With the world getting noisier and noisier, it's hard to find a nice quiet spot. Even out here, noise. There's no getting away from it, right? Wrong.
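For reference, the FFmpeg Normalize pass on the raw samples is roughly equivalent to this sketch. The flags are assumed from the ffmpeg-normalize documentation, the paths are placeholders, and this version writes to a normalized subfolder rather than overwriting in place as the notebook's script does.

```python
import subprocess
from pathlib import Path

sample_dir = Path("/content/drive/MyDrive/uploads")  # placeholder sample folder
out_dir = sample_dir / "normalized"
out_dir.mkdir(exist_ok=True)

# RMS-normalize each sample to -27 dB and resample to 22.05 kHz.
for wav in sample_dir.glob("*.wav"):
    subprocess.run(
        ["ffmpeg-normalize", str(wav),
         "-nt", "rms", "-t", "-27",    # RMS normalization to a -27 dB target
         "-ar", "22050",               # resample to 22.05 kHz
         "-o", str(out_dir / wav.name), "-f"],
        check=True,
    )
```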

Speaker 1: This next segment can be used to split your long WAV samples into segments. The files are backed up to a folder called pre-split backup. The original files in the directory will be replaced by the split files. The first split attempts to break the WAV files based on silence. If there are remaining segments longer than nine seconds, they will be force-split with the next run of SoX. This may lead to clipped words. It's best to use a more advanced splitting method like voice activity detection, or tools in Audacity, which work very well and pretty quickly, but this is what I've got at the moment. Run the cell to write the script to your Google Drive, then run the next cell to split the audio files. The last block here will require some editing on your part. Edit new dataset to be the name of the subdirectory where you'd like your dataset stored. Edit new speaker name to be the name of the speaker you are adding. Sample WAV files will be converted to FLAC files and placed in a subdirectory with the speaker name in the wav48_silence_trimmed folder of the dataset directory. This next section can be used to transcribe audio samples using OpenAI's Whisper speech-to-text. The large-v2 model is the latest, but may perform worse than large-v1 in some instances. Select the model using the drop-down box, then run the cell. Run the next cell to download the model. Set the dataset root path name and the speaker name in the next cell, and then run the cell. Run the final cell to transcribe your clips. There are a couple tools here that can be used to crudely error-check your dataset. The first cell will search the transcripts for any empty files, or files containing characters not in the common Latin alphabet, and move them to a folder named bad files. The second block can be used to find files that have either a missing audio file or a missing transcript file. In either case the files will be moved to a folder named missing. The next section of the notebook is for training. Run the first cell to download the pre-trained VITS model from the Coqui model hub. The next two cells will load TensorBoard, pointing to the dataset directory set as dsname at the beginning of the notebook. TensorBoard can be used to listen to audio samples and view graphs of training data while your model is training. To load the latest logs, click the refresh button in the TensorBoard dashboard. If you've run the script before and are restoring from a previous checkpoint, set the restore checkpoint with these cells. Run the cell to list all the folders in the trainer output directory of your dataset. Then use the next cell to list the checkpoints in the run folder. Then set your checkpoint in the next cell and run it. This next section is where the training is actually done. This is essentially the Coqui VITS training recipe and the config.json settings in one script, so the variables can easily be managed without needing to modify any external files. With the current settings this will train or fine-tune a VITS model using English phonemes and speaker embeddings. Run the first code block here to load the Coqui TTS libraries. Run the next cell to set up some variables. This will set up the dataset using the VCTK formatter, the eSpeak phonemizer, and the American English alphabet. If you're training in another language that supports phonemes, change the language code here to your desired language. The next cell will configure the audio settings and set the model and training arguments. Speaker embeddings here are set to true.
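To make that concrete, here is a condensed sketch of what those training cells set up, modeled on the Coqui VCTK VITS recipe. It is not the notebook's exact code: paths, run name, batch sizes, and save steps are placeholders, and exact class and field names vary between TTS releases (for example, older versions use name="vctk" instead of formatter="vctk" in BaseDatasetConfig).

```python
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs, VitsAudioConfig
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "/content/drive/MyDrive/mydataset/trainer_output"  # placeholder

# VCTK-format dataset: per-speaker audio clips plus per-clip transcript files.
dataset_config = BaseDatasetConfig(
    formatter="vctk",
    path="/content/drive/MyDrive/mydataset",  # placeholder dataset root
    language="en-us",
)

audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256,
    num_mels=80, mel_fmin=0, mel_fmax=None,
)

# Learned speaker embeddings rather than an external speaker encoder / d-vectors.
model_args = VitsArgs(use_speaker_embedding=True)

config = VitsConfig(
    model_args=model_args,
    audio=audio_config,
    run_name="vits_multispeaker_finetune",  # placeholder run name
    batch_size=16,
    eval_batch_size=8,
    epochs=1000,
    use_phonemes=True,
    phoneme_language="en-us",     # change for another eSpeak-supported language
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    text_cleaner="english_cleaners",
    save_step=1000,               # checkpoint save interval
    save_best_after=1000,         # start tracking best-loss checkpoints after this step
    output_path=output_path,
    datasets=[dataset_config],
)

# Audio processor, tokenizer, and dataset samples.
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# The speaker manager maps speaker names to embedding IDs.
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

# Model and trainer; pass restore_path when fine-tuning from a checkpoint.
model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)
trainer = Trainer(
    TrainerArgs(restore_path=None),  # e.g. the downloaded pre-trained .pth for a restore run
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```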
In the VITS config section, you may want to adjust the checkpoint save steps and how many checkpoints should be saved based on best loss. If you're training in a language other than English that is supported by eSpeak, change the phoneme language code. Run the next cells to load the audio processor and set up the tokenizer. The next cell will load the dataset samples. If you have a dataset larger than a couple thousand samples, it could take a little while. Set up the speaker manager by running the next cell, and load the model using the following one. This cell will initialize the trainer using the run type and model checkpoint you specified, and you can begin training by running the final cell. Once your model is trained, download a checkpoint and the config.json file. If you've installed Coqui on your system, you can synthesize speech using the Coqui server demo application. I've been trying to train a Hindi language model, so here's how it can be loaded using the server script. In the console, from the TTS directory, run the server script, specifying the model path and config path on the command line. The console window will show what port the server is launched on; then, in a browser, go to localhost: followed by whatever port is shown in the console window. Now just select a speaker and type in your words. The generated sample can be saved if you right-click on the audio player.
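As a concrete sketch of that server command (the checkpoint and config paths are placeholders; use whatever port the console reports, typically 5002):

```python
import subprocess

# Launch the Coqui demo server with a fine-tuned checkpoint and its config file,
# then open http://localhost:<port> in a browser, pick a speaker, and enter text.
subprocess.run([
    "tts-server",
    "--model_path", "best_model.pth",   # placeholder checkpoint path
    "--config_path", "config.json",     # placeholder config path
])
```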

Speaker 4: You can also use the command line tool to generate speech and write it to a WAV file.

Speaker 1: Specify the model path, config path, and speaker name using the speaker_idx flag. Use the out_path flag to specify the output file name, and use the text flag with quoted text to specify your text.
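A hedged example of that command, with placeholder paths, speaker name, and text:

```python
import subprocess

# Generate speech and write it to a WAV file from the command line.
subprocess.run([
    "tts",
    "--model_path", "best_model.pth",          # placeholder checkpoint path
    "--config_path", "config.json",            # placeholder config path
    "--speaker_idx", "speaker_01",             # one of the speaker names in the model
    "--out_path", "output.wav",
    "--text", "Text to synthesize goes here.",
])
```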

Speaker 4: And I think that's all I wanted to cover in this one. I may add the RNNoise processing

Speaker 1: back into this notebook at a later time. If you want to try the Hindi model that I've partially trained, check the download links below in the description. I don't even know if the actual language is correct since I'm not a speaker, so someone that understands it will need to evaluate it. Hopefully it can serve as a base for further fine tuning. If you want to use it with the Colab script, check my other posts for the VITS Hindi training video. However, I have changed this model over from using the external speaker encoder to using speaker embeddings. That way it's easier to use in the server demo app and command line tool. Alright, that's enough rambling from me today. If you found this video helpful, hit the thumbs up button because it helps YouTube suggest the video. Thanks for watching.
