Comprehensive Guide to Coqui TTS and VITS Model Fine-Tuning
Learn to use Google Colab for Coqui TTS and VITS model fine-tuning, audio denoising, and speech-to-text processing, plus how to run the same workflow on Linux via WSL2 and Anaconda.
Near-Automated Voice Cloning Whisper STT Coqui TTS Fine Tune a VITS Model on Colab or Linux
Added on 01/29/2025

Speaker 1: Hey again everyone. This is a follow-up to the Coqui TTS and VITS model training video that I put up a few months ago. The Coqui TTS framework has been updated to version 0.10 with a ton of new voice models, bug fixes, and performance tweaks. I'm not going to cover any of that here, but check out the GitHub page for a full list of changes. In this video, I mainly want to go over using the Google Colab script that I put together, then I'll cover installing everything from the Colab script in Ubuntu Linux running on Windows 10 using WSL2. I won't be covering setting up WSL or installing Ubuntu, but it's easy enough if your OS supports it, and there are plenty of instructions and tutorials out there if you want to go looking for them.

This Colab script brings together Coqui TTS, OpenAI's Whisper speech-to-text, Xiph's RNNoise, FFmpeg, and a handful of command-line utilities to process audio files, create an LJSpeech-formatted dataset, and fine-tune a VITS model. You'll need a few gigabytes of free storage space on your Google Drive account for the script to run successfully. The model checkpoints are stored there, and they're about 900 megabytes each. A lot of the code and output is hidden in the notebook, but it can easily be toggled using the view options in the menu bar if you want to see it.

Run the first cell to connect to your Google Drive account. You'll get a pop-up window that you need to accept, and then your Google Drive root will be mounted at /content/drive/MyDrive. In the next cell, set your dataset name. Don't use any spaces; use dashes or underscores instead. The trainer output path is a subfolder of the dataset directory where the trainer output will be written, and the upload directory is the folder on your Google Drive where the sample WAV and MP3 files will be uploaded. Run this cell, and the directories will be created if they don't exist. The next cell will delete temporary files if the dataset processing has been run before in your Colab session. Skip this one if you don't need to clean up any temporary files.

The next cell will clone Xiph's RNNoise GitHub repo and install its requirements. Later on, this will be used to denoise the audio samples. Next, install OpenAI's Whisper speech-to-text and translation framework, and once that's done, run the next cell to install the eSpeak NG phonemizer and the Coqui text-to-speech framework.

The next cell will take the MP3s placed in the upload directory and convert them to mono, 22,050 Hz WAV files, and process any uploaded WAV files to ensure they're in a uniform format. You can use the file upload button in Google Colab to put the files in your upload directory, or upload them through your Google Drive. As it loops through, it does a little magic to remove the suffix from the file names. Then it uses the SoX utility to split the WAV files on silence. Adjust the first decimal value in the SoX command line to change the amount of silence that needs to be detected before a split is triggered. You may need to load your audio file into an editor like Audacity to measure the length of a pause, but somewhere between 0.2 and 0.5 seconds should split the audio file into sentences. The command at the end deletes files smaller than 15 kilobytes, which are too short to be useful clips. The next cell will run the split audio clips through RNNoise to denoise them, and then output them with uniform levels to a folder named converted.
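For reference, the conversion, splitting, and cleanup cells boil down to a few shell commands. Here's a minimal sketch of that pipeline; the file names, silence thresholds, and size cutoff are illustrative and may not match the notebook's exact values:

# Convert an MP3 to a mono, 22,050 Hz WAV file
ffmpeg -i sample.mp3 -ac 1 -ar 22050 sample.wav

# Split on silence with SoX; the 0.3 value is the length of the pause
# (in seconds) that has to be detected before a new clip is started
sox sample.wav split_.wav silence 1 0.1 1% 1 0.3 1% : newfile : restart

# Delete clips smaller than 15 kilobytes, which are too short to be useful
find . -name "*.wav" -size -15k -delete

Raising or lowering that 0.3 value is the main knob for getting clips that land roughly on sentence boundaries.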
Then you can use the next cell to copy the WAV files to your new dataset directory. You should download them from your Google Drive and listen to them after processing to make sure they sound correct before spending time training a model, though.

The next cell will run the audio clips through OpenAI's Whisper speech-to-text. I have the script set to use the medium model, which is reasonably fast and accurate. There are several other models available with differing accuracy and speed, so check out the OpenAI Whisper page for more details. This cell could take a while to run because Whisper needs to download the model, which for the medium version is about 1.4 gigabytes. The LJSpeech-format output here is the file name without its suffix and the transcript repeated twice, delimited by pipe characters (for example, clip_0001|Hello there.|Hello there.), and it's saved to a file called metadata.csv in your dataset directory. Be sure to proofread the transcript before training a model. Occasionally Whisper trips up and returns a string of demon-speak gibberish instead of anything resembling your audio clip. You can display the created dataset with the next cell.

Use the next cell to download the VITS model to your Colab session. It will be placed in a cache directory and then deleted when the session ends. The training script here comes from the Coqui TTS GitHub repo, and the next cell will save a version with some configured variables to your dataset directory as train_vits.py. The next cell will begin the fine-tuning. You may see an error about layers not being found once you run it. Allow the script to train until a best-model file is generated, and then click the cell's stop button to stop the training.

Use the next cell to list the trainer sessions in your output directory. Copy and paste the session you want to restore into the next box and run the cell. Run the next cell to list all the files in the session directory. Copy the model name you want to restore into the next cell and then run that. The final training cell here will begin a training session with the model file above. This allows the training to continue with the missing layers restored. If you want to use TensorBoard to look at your past training sessions, run the last two cells after setting your dataset variables at the top of the script. It may take a few minutes for TensorBoard to appear, and you may need to whitelist Google Colab in your ad or pop-up blocker if it's not working.

To follow the same steps as above but on Linux, you'll need to do a little bit more editing, but things are still pretty simple. I'm going to be using an Anaconda environment to keep things isolated. Once you download and install Anaconda, create a new environment with conda create -n, your environment name, and then git and pip. Activate the conda environment with conda activate and then your environment name. I've linked a list of the commands that can be copied and pasted into the Linux terminal to speed things up.

First, install pyloudnorm with pip install pyloudnorm. I'm going to be storing my GitHub repos in a directory called repos, so I'll make that directory and clone Xiph's RNNoise repo into it. Next, install the requirements for RNNoise, along with a few utilities like SoX and FFmpeg. Then build RNNoise following the instructions on the GitHub page. They're also in the document linked in the comments, so you can copy and paste them if you like.
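As a rough sketch, the Anaconda and RNNoise setup just described looks something like this; the environment name and the repos directory are placeholders, and the build steps are the standard autotools sequence from the RNNoise README:

# Create and activate an isolated environment with git and pip available
conda create -n vits-finetune git pip
conda activate vits-finetune

# Loudness normalization library used by the denoise/levelling script
pip install pyloudnorm

# Build tools and audio utilities, then clone and build RNNoise
sudo apt install autoconf automake libtool sox ffmpeg
mkdir -p ~/repos && cd ~/repos
git clone https://github.com/xiph/rnnoise.git
cd rnnoise && ./autogen.sh && ./configure && make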
The demo tool built by the RNNoise package will be used later on to denoise the samples from a shell script. I'm going to clone the Whisper repo with the others for personal reference later on, but you don't necessarily need to. You can install it with pip install git+ followed by the Whisper GitHub repo URL. Whisper will then install PyTorch if it isn't already in your environment, and this could take a little while. Next, install the eSpeak NG phonemizer, and clone the Coqui TTS GitHub repo if you'd like to; there are a lot of handy scripts and Jupyter notebooks in there. Install Coqui TTS using pip install tts. Type tts --list_models to check that Coqui installed correctly.

Make a directory somewhere to store your MP3 and WAV audio clips. I've called mine audio files here. To open the directory in Windows Explorer, change into it, type explorer followed by a period, and press Enter. Place your audio clips here. Make a directory in your audio clips directory called out, and then run the somewhat tangled command, which you can copy from the linked document, to convert your MP3 files into mono, 22,050 Hz WAV files. If you have WAV samples, also run the one for WAV files to convert and copy them.

Change to the out directory and make a directory called splits. Inside the splits directory, run the command that uses find to feed the WAV files to SoX and split the audio files based on silences. Change the first decimal place to the minimum amount of silence that needs to be detected to trigger a split. If you can't find a good split point between roughly 0.2 and 0.5 seconds, you can also run the alternative command to split the files into 8-second segments, but this can split words as well. Change to the splits directory and run the command to delete the short WAV files. And if you can, you should probably listen to them at this point to make sure they sound all right.

Load up the Python script for RNNoise in an editor and set the path to your audio clips. Run the script, and your files will be copied to the dataset directory in a subdirectory named converted. After the script is done, change to this directory and run the command to delete any short WAV files that may have been created. Rename the converted folder to wavs.

Leave that directory and open up the Whisper script. Set your dataset name and change the model name if you want to try out one of the others. Run the script, and if things work correctly, you should get a file named metadata.csv in your dataset directory with an LJSpeech-formatted dataset. You should probably proofread the transcripts before using them, because occasionally the model will output some nonsense.

Run the Coqui tts command to generate a sample using the VITS LJSpeech model, and the model will be downloaded to your system. Run the Python training script with the restore path pointing to the model file and the config path pointing to the model's config file. Follow the same steps as in the Colab section to fix the issue with the missing layers: allow the model to train until a best-model file is created, press Ctrl-C to stop training, and then reload the trainer pointing to the path of the best-model file just created, and it will resume with the layers restored.
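To make those last steps concrete, the sample generation and training invocations look roughly like this; the model name is Coqui's public VITS LJSpeech model, while the script name and the checkpoint/config paths are placeholders that depend on where the download cache and your trainer output actually live:

# Confirm the install and pull down the pretrained VITS LJSpeech model
tts --list_models
tts --text "This is a test." --model_name "tts_models/en/ljspeech/vits" --out_path test.wav

# Start fine-tuning from the downloaded checkpoint (paths are illustrative)
python train_vits.py \
    --restore_path ~/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth \
    --config_path ~/.local/share/tts/tts_models--en--ljspeech--vits/config.json

# Once a best_model.pth appears, stop with Ctrl-C and resume from it
# so training continues with the previously missing layers restored
python train_vits.py \
    --restore_path /path/to/trainer_output/run_name/best_model.pth \
    --config_path /path/to/trainer_output/run_name/config.json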
Well, that's all for now. I just wanted to go over this notebook and the scripts that may make it a little quicker for people to experiment with some of these models. If you have any questions, let me know in the comments. And if you found this video helpful, hit the like button so more people will be able to see it. Thanks for watching, and I'll see you in the next video.
