Optimizing Your TTS Scripts for Better Efficiency
Explore enhanced TTS script efficiency, improve audio quality, and manage dataset processing with detailed guidance and troubleshooting tips.
Updated Fine-Tuning YourTTS with Automated STT Datasets on Google Colab for AI Voice Cloning

Speaker 1: Hey again everyone. There's been a bit of interest in the YourTTS script I posted a couple of weeks ago, so I wanted to do a bit of a follow-up. I've tweaked the script a little to make things a bit more efficient and to hopefully improve the audio quality and the auto-generated transcripts a little bit. Now, I do a lot of my work on an ancient Dell E6500 laptop with an enormous extended-size 8-hour battery pack, and I keep forgetting it has a resolution of 1280x800, so apologies for the bars on the sides of some of the videos.

This is going to be a bit of a rehash of the first video, but hopefully a little more streamlined. If you haven't watched that one, you don't need to, unless you need the Coqui TTS local install instructions or a local copy of the Python code. But if you're installing things on your own machine, you can probably handle copy-pasting the Colab Python code, running the bash commands by hand, and changing the path names on your own. If not, just let me know in the comments and I'll try to sort it out when I have some time.

I'm going to go over the Colab script from start to finish, so I'll cover the changes as they pop up. We'll start with the beginning of the script, where we create a dataset for a new speaker; you can repeat this multiple times to process the samples for several new voices within one dataset. After going through the script to the point where the dataset is made, I'm going to loop back around and talk about pre-processing the data before uploading the samples. Then I'll continue on with the rest of the script and go over some samples of models that I've been training. Alright, let's get to it.

I haven't compiled the updated training notes yet. I've burnt a lot of computing time training bad models due to some sort of misconfiguration. I did dozens of runs using Coqui Trainer 0.0.21 and they all failed. I assumed it was something that I did, but then thought to check all my package versions. I noticed the trainer was mismatched with the one that ships with Coqui TTS 0.10.2, so after downgrading the Coqui Trainer back to 0.0.20, things went back to normal. Once I get a couple more models done, I'll update the notes in the notebook.

The first cell will connect to your Google Drive account. If you've got multiple Google accounts, switch the account from within Colab first, then go over to the web-based Google Drive and switch to the correct account. Then run the cell to connect to your Google Drive. For example, I have two Google accounts, but only one of them has extra paid Google Drive and Colab Pro access, and there's often a really annoying sign-in loop if the two accounts get mismatched.

After connecting the Google Drive, set some variables. If you're new to Colab, change the values first, then run the cell. The dataset name is the name of your dataset; name it anything you like, but don't use any weird characters or spaces. Your dataset will be stored on your Google Drive in a folder with this name. The trainer output directory is a sub-directory of the dataset directory, and this is where the trained models and associated files are stored. The upload directory is where the uploaded samples are stored; this is the base directory, and it will be created on your Google Drive. The model directory is the location of the pre-trained YourTTS checkpoint, which will be downloaded later. You can leave this as it is if you're on Google Colab. Finally, there's a simple name to describe your training run.
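
Roughly, those first cells boil down to something like this; the variable names and paths here are placeholders rather than the notebook's exact ones:

```python
# Minimal sketch of the Drive-mount and settings cells.
# All names and paths are illustrative, not the notebook's exact ones.
from google.colab import drive

drive.mount('/content/drive')  # connect your Google Drive

DS_NAME     = "my_dataset"                               # dataset name: no spaces or odd characters
DATASET_DIR = f"/content/drive/MyDrive/{DS_NAME}"        # dataset folder on Google Drive
TRAINER_OUT = f"{DATASET_DIR}/trainer_output"            # trained models and logs go here
UPLOAD_DIR  = "/content/drive/MyDrive/sample_uploads"    # base directory for uploaded samples
MODEL_DIR   = "/content/yourtts_base"                    # pre-trained YourTTS checkpoint location
RUN_NAME    = "my_first_run"                             # simple run description, stored in the config
RUN_TYPE    = "restore"                                  # "restore" = new fine-tune, "continue" = resume
```
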
The run name is stored in the config file; again, no weird characters or spaces. There are two types of runs for this training script, continue or restore. Continue is for resuming a previously interrupted training run, and restore is for beginning a new fine-tuning session using the base checkpoint mentioned above.

The next cell will build RNNoise and its requirements. This is a very effective library for denoising; the script later uses the example application that comes with RNNoise, plus some wrapper code, to denoise the audio samples. The next cell will install SoX, OpenAI Whisper, and their requirements. Then run the next cell to install Coqui TTS and force-install Coqui Trainer 0.0.20.

Upload each new voice to a subdirectory of the sample uploads directory. For example, if you have a new voice named Bob, go to your Google Drive, make a directory named Bob in your sample uploads directory, and put all of Bob's MP3 or WAV format samples in there. Set the subfolder name and the name of the new speaker within the next cell. You don't need to be creative; you can name them the same thing.

The next cell does a lot of the heavy lifting. The audio samples are converted to mono, 16 kHz files to standardize them, and then passed to RNNoise. RNNoise requires upsampled files, so SoX will upsample them, and the gain and volume flags are used to lower the chance of clipping. The WAVs are passed to RNNoise and, after denoising, are passed to SoX with a high-pass and low-pass filter. pyloudnorm is then used to peak- and loudness-normalize the final clips. The clips are then split using SoX, twice. The first pass splits the audio based on silence intervals; it's currently set to 0.2 seconds, but you may need to adjust this up or down depending on your speaker. This generally splits speech into sentence segments, but there can be instances of long sentences over 10 seconds. The split files are then passed through SoX again to force splits of 8 seconds. This may end up with some clipped words, but that might be better than simply discarding longer samples, because often the rest of the sentence is really high-quality speech. Files smaller than 35 kilobytes are then deleted, because they'll be too small to be useful. The split samples are then converted to FLAC format and renamed to correspond with the VCTK dataset format.

Run the next cell to download the Whisper speech-to-text model. The Colab script is currently set to load the large V1 model, but if that doesn't work on your instance, click Show Code and switch to the medium.en model. Reloading the model often crashes Colab. You can list the speaker directories with the next cell, and then set the speaker to process with the following cell. The next cell will run Whisper on your audio samples. The large V1 model is quite a bit slower than medium.en, but the output quality is spectacular, though you will need to re-transcribe a few things by hand later due to how it handles numbers and dollar signs. Repeat the speaker processing and transcription steps above for each new speaker in the dataset.

Coqui has some general-purpose guidelines for dataset creation on one of its GitHub pages that I'll link down below. All of these apply here as well, so it's a good idea to read it over. The samples I'm going to be working with in this video are going to be of terrible quality, so it might be difficult to get a stable model.
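
Roughly, the heavy-lifting and transcription cells amount to a chain like the sketch below; the paths, binary locations, filter cutoffs, and thresholds are illustrative stand-ins rather than the notebook's exact settings:

```python
# Rough sketch: standardize -> denoise -> filter -> normalize -> split -> transcribe.
# Paths, binaries, cutoffs, and thresholds are assumptions, not the notebook's settings.
import glob, os, subprocess
import soundfile as sf
import pyloudnorm as pyln
import whisper

def sh(cmd):
    subprocess.run(cmd, shell=True, check=True)

for d in ("clean", "split", "chunks"):
    os.makedirs(d, exist_ok=True)

for src in glob.glob("uploads/bob/*.wav"):
    base = os.path.splitext(os.path.basename(src))[0]

    # Standardize to mono 16 kHz and pull the gain down a little to avoid clipping.
    sh(f"sox '{src}' -c 1 -r 16000 tmp16k.wav gain -3")

    # The RNNoise demo expects 48 kHz raw PCM, so upsample, denoise, then come back
    # down with high-pass/low-pass filters to catch what the denoiser misses.
    sh("sox tmp16k.wav -r 48000 -b 16 -e signed-integer tmp48k.raw")
    sh("./rnnoise/examples/rnnoise_demo tmp48k.raw tmpdn.raw")
    sh("sox -r 48000 -b 16 -e signed-integer -c 1 tmpdn.raw -r 16000 tmpdn.wav "
       "highpass 65 lowpass 7500")

    # Peak- and loudness-normalize with pyloudnorm.
    audio, rate = sf.read("tmpdn.wav")
    audio = pyln.normalize.peak(audio, -1.0)
    audio = pyln.normalize.loudness(audio, pyln.Meter(rate).integrated_loudness(audio), -23.0)
    sf.write(f"clean/{base}.wav", audio, rate)

    # Split on ~0.2 s silences, then force anything longer into 8-second chunks.
    sh(f"sox clean/{base}.wav split/{base}_.wav silence 1 0.2 1% 1 0.2 1% : newfile : restart")
    for piece in glob.glob(f"split/{base}_*.wav"):
        sh(f"sox '{piece}' 'chunks/{os.path.basename(piece)}' trim 0 8 : newfile : restart")

# Drop clips too small to be useful (roughly the 35 kB rule mentioned above).
for piece in glob.glob("chunks/*.wav"):
    if os.path.getsize(piece) < 35 * 1024:
        os.remove(piece)

# Transcribe what's left; fall back to "medium.en" if large-v1 won't fit in memory.
stt = whisper.load_model("large-v1")
for piece in sorted(glob.glob("chunks/*.wav")):
    text = stt.transcribe(piece)["text"].strip()
    with open(piece.rsplit(".", 1)[0] + ".txt", "w", encoding="utf-8") as f:
        f.write(text)
```
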
In general, you want consistent recording quality, a high degree of fidelity, and as many samples as you can provide. To listen to audio samples in Windows, I like to use Sonic Visualiser; I'm a big fan of products with literal names. It lets you get a waveform or spectrographic view of the audio file and play and scrub through it. It's lightweight and has negligible loading times, so if you need to go through hundreds of samples, it can be really helpful. The view panes are entirely customizable, and you can add multiple views for a single file or change the colour scheme. I find something like the high-visibility Fruit Salad scheme works great when trying to discern peaks, while one of the green themes works great when looking at the total mix because the background noise really stands out. I'm going to go ahead and play through the sample, and I'll let one of the audio samples introduce our voice model.

Speaker 2: With the world getting noisier and noisier, it's hard to find a nice quiet spot these days. And it's becoming harder. Even out here. Noise. There's no getting away from it.

Speaker 1: Well, it's got a fun old-timey feel to it. This is definitely not an ideal recording to use for making a voice model. This particular Mike Wallace interview has amplifier hiss, tape crackle, and a bit of warble, but under it all there's still a richness to the recorded voice. This may be salvageable with some work. After sorting usable and unusable recordings, we can start the prep work.

For quick trimming of files, WaveShop is a really handy program. It's ancient, but should work on all Windows systems above Windows XP. It's a bare-bones, non-destructive editor, good for cutting things out of audio files while not altering the rest of the file, and it's about as speedy as you can get. The first thing I want to do is isolate the speaker. It's time-consuming to listen to the entire file and trim out things as they appear, but if you get a feel for the relative volume and tone of the speaker, you may be able to visually scrub the audio file and clip the voices out more easily. Here I'll need to trim out the bumper music and the host. As I go through, I'll also look for any lengthy pops. In addition, I'll try to trim out any sudden loud sounds like hands hitting the table, coughs, stutters, uhs, and really anything else that I don't want the model to pick up. Looking at the waveform view can help with noise removal later on.

Speaker 1: Unfortunately, this will be a bit difficult for me to show because OBS decided to record at about 3 frames per second. If any of you have any suggestions for another free screen recording application that also grabs audio, preferably for Windows, please let me know down in the comments. The audio is going to skip around here a little as I'm dropping the playhead throughout the track to find segments that need clipping out, which would be apparent if the video was actually keeping up. If you look over on the left-hand side of the waveform view, at the bottom you'll notice a sharp, periodic dip between 20 and around 80 Hz; these are some of the crackles. Above around 8,000 Hz you'll see a dense concentration of peaks; this is some of the hiss.

The high-pass and low-pass filters in the audio processing section of the Colab script will help cut some of these frequencies down, reducing any of the noise that RNNoise may miss. I find WaveShop a particularly useful editor because of its speed. Just hit play, then drag to select a segment and the playhead will reposition. Hit delete and the selected segment will be cleared. WaveShop will only manipulate the edited segments, so it won't alter any of the other audio. If you want to maintain the pause of a selected segment, you can use Insert Silence under the Edit menu, or the keyboard shortcut Alt+Insert; the selected segment will be replaced by an interval of silence.

Speaker 1: If you want to listen to the tracks pre-split but processed by RNNoise, they can be found in the denoise subfolder in the sample directory. Here's that same Mike Wallace interview after denoising, filtering, and some normalization. Even making up letters of, what do they call it, to plug in the speakers.

Speaker 3: I don't know. I want to ask, and I'll say, you never know, mark the food. Do not rebuy my feeds. I'm not talking, like, 2. I'm watching a period of making videos where he hurt my bad mouth. Not long and carrying him into the sack.

Speaker 4: So, yeah, if you've got something like that in your backlog, which can be insonas, or it's a little gross.

Speaker 1: For going through the transcripts, I like to use a text editor with good file and text manipulation options. Under Search, bring up the Find in Files option and select the path of your unzipped transcript files. Use *.txt as the filter and tell it to search within subfolders if you specify a base directory. I'm going to use the regular-expression option to search for a specific set of characters, or rather, for characters not within this specific set. The square brackets specify the set and the caret character negates it; lowercase a to z, uppercase A to Z, 0 to 9, space, period, apostrophe, question mark, and exclamation mark are the characters we want to ignore. Everything else will trigger a match. This will catch a few things: percentages, transcriptions containing slashes, dollar amounts, partial or full quotation marks, the odd mistranslation in demon-speak that Whisper produces every now and again, and ampersands. After searching, you can right-click on the search window and select Open All to open all the files that matched.

For mistranslations, I'll make a note of the file name and then delete both the clip and the transcript from the dataset. Windows Search makes this pretty easy: from within the base directory, type part of the file name and both components should pop up. Dollar amounts should be transcribed back into natural speech, but be aware of how the speaker said the original phrase. $100 could be a hundred dollars or one hundred dollars, and these are clearly not the same words, although the concept is the same. The other thing you should probably do is transcribe numerals into their spoken forms. Start with 0 and search each numeral through 9. Some speakers may say numerals close together or more rapidly than the rest of their speech, so you may want to join them with a dash.

If you alter your dataset locally, you'll need to delete it from your Google Drive and re-upload. Google Drive has a nasty habit of becoming desynchronized, and it's a real pain when it does. If you delete your original dataset, the best option is to re-upload it through the Google Drive application; it's a lot faster for many small files. Use the Google Drive web interface as a second option, but don't bother uploading anything through the file browser in Colab because it's complete trash.

Once the dataset is finalized, we'll continue the Colab script by computing the speaker embeddings. This will take some time, but only needs to be done once for your dataset. Speaker embeddings will be saved to speakers.json. Speaker IDs should probably be saved to speakers.pth, but I'll fix that later at some point; currently they're being saved to speaker_ids.json.

The next segment of the script will set up the audio configuration and model arguments. These are left as defaults from the original training recipe. Before running the next cell, you'll need to edit a few little things. If you're using more than one voice in your dataset, you can enable the weighted sampler. The script is set to keep 5 checkpoints, along with checkpoints with the best loss. If you're low on Google Drive storage, save fewer checkpoints and set save_all_best to false. But the only lines that you really need to change here are the test sentences: replace vctk_rod here with vctk_ followed by whatever your new speaker name is.
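
As a rough illustration, the edits being described look something like this; the field names follow Coqui's training config, but the values and the vctk_bob speaker are placeholders, and the test-sentence structure should be copied from the entries already in the cell:

```python
# Illustrative tweaks to the config object built earlier in the notebook;
# values and the "vctk_bob" speaker name are placeholders.
config.use_weighted_sampler = True    # balance speakers when the dataset has more than one voice
config.save_n_checkpoints = 5         # keep the last 5 checkpoints
config.save_all_best = False          # set False if Google Drive space is tight
config.test_sentences = [
    # mirror the structure of the existing entries, just swap in your speaker name:
    ["Be a voice, not an echo.", "vctk_bob", None, None, "en"],
]
```
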
The next cell will manually load the speaker encoder, save the model metadata to the dataset directory, and then set the speaker IDs for the training data. The next cell will initialize the model from the configuration loaded into memory. If you're continuing a previous training run, you can list all the run folders in the trainer output directory by using the next cell, and then copy and paste the name of the trainer run into the following cell. The next cell will initialize the trainer with the specified run type, and then you'll begin training with the final cell.
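
Loosely, those last few cells amount to something like the sketch below, patterned on the Coqui YourTTS recipe; the exact arguments, file names, and variables may differ from the notebook's:

```python
# Loose sketch of the model/trainer setup, patterned on the Coqui YourTTS recipe.
# `config`, `train_samples`, and `eval_samples` come from earlier cells; RUN_TYPE,
# MODEL_DIR, and TRAINER_OUT are the placeholder settings from the top of the notebook.
from trainer import Trainer, TrainerArgs
from TTS.tts.models.vits import Vits

model = Vits.init_from_config(config)

trainer = Trainer(
    TrainerArgs(
        # "restore" starts a new fine-tune from the base checkpoint,
        # "continue" resumes an interrupted run from its output folder.
        restore_path=f"{MODEL_DIR}/model_file.pth" if RUN_TYPE == "restore" else None,
        continue_path=f"{TRAINER_OUT}/previous_run" if RUN_TYPE == "continue" else None,
    ),
    config,
    output_path=TRAINER_OUT,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```
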

So here's how that Rod Serling model is developing.

Speaker 1: And here are a few samples of a larger dataset model that seemed to be going well until I had a CUDA crash. I don't understand how it could have happened, but when I attempted to reload, it appears the dataset has become corrupted in some way, and somewhere in the mix of 15,000-odd files is a broken audio file that refuses to load. So, that being said, keep a backup of your dataset after processing it.

Speaker 5: To be quite a long time to develop a voice, and now that I have it, but it now couldn't be silent.

Speaker 6: Be a voice, not an echo. I'm sorry, dude. I'm afraid I can't do that.

Speaker 1: This cake is great. It's so delicious and moist. To be quite a long time to develop a voice, and now that I have it, but it now couldn't be silent. Be a voice, not an echo.

Speaker 5: I'm sorry, dude.

Speaker 7: I'm afraid I can't do that. This cake is great. It's so delicious and moist.

Speaker 1: When switching to my local machine, I increased the batch size for the loader. As you can see, this sped things up a bit and seemed to improve the training quality, so a higher batch size may work better for you. Forcing the audio clips into splits of 8 seconds or less helps avoid some of the out-of-memory errors. Another thing that inexplicably helps is correcting any unknown characters in the transcripts; I'm guessing when they're dropped they're being padded, and sometimes the padding causes some overflow.
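
For reference, these are the kinds of knobs being talked about here; the field names follow Coqui's training config, and the values are only examples, not recommendations:

```python
# Example loader/batch settings on the config object from earlier cells;
# values are illustrative, not recommendations.
config.batch_size = 32              # larger batches kept the GPU busier on my local machine
config.eval_batch_size = 16
config.num_loader_workers = 4       # data-loader worker processes feeding the GPU
config.num_eval_loader_workers = 2
```
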

Speaker 4: To be quite a long time to develop a voice, and now that I have it, but it now couldn't be silent. Be a voice, not an echo.

Speaker 1: Well, that's about all I wanted to cover today, since I've rambled long enough. This may get you a little further in your fiddling with YourTTS. I want to try training with a few frozen layers when I have some time, and see if that changes anything. If you want a Colab script with a little more consistency, you may want to look at the VITS fine-tuning script in my other video. Thanks for watching, and thank you to all the subscribers; the channel is well over 600 now, which is just wild. I don't really share these videos anywhere, so shout out to all of you posting them on Reddit or 4chan or wherever else the algorithm turns them up. Thanks for getting them out there. Back soon with another one.
