Offline Text-to-Speech with Coqui TTS and Sonic Visualiser
Explore offline text-to-speech applications and learn how to use the Coqui TTS framework for voice synthesis with tools like Audacity and Sonic Visualiser.
Voice Cloning Tutorial with Coqui TTS and Google Colab: Fine-Tune Your Own VITS Model for Free
Added on 01/29/2025

Speaker 1: I was looking for an offline text-to-speech application a few months ago and was really disappointed with the quality of what was out there. Some of the voices hadn't been updated in nearly a decade, and several of the companies had been acquired by competitors and moved to cloud services. Cloud services are a problem for me when they meet my terrible connection quality and even worse luck, so I try to stick with dedicated applications and self-hosted services when possible. That led me to Thorsten Müller's YouTube channel and his great series of Coqui TTS videos. Coqui TTS is a framework, or library, for text-to-speech generation. It sounds simple, and it both is and isn't: the underlying tech is mind-bogglingly complicated, but the implementation is about as difficult as a modern LEGO set. If you're interested in recording your own clips to fine-tune or train a voice model of your own, take a look at Thorsten's Mimic 3 and Mycroft AI videos. Check the description down below for a link to his channel. Before I dive into the tutorial, I just want to say that most of the code here was written by other people and pieced together from random forum posts. I wish I could credit the authors directly, but in lieu of that, I simply want to say thank you to everyone who has sacrificed their time to work on open-source projects. So I'm going to run through a few things in this video: how to acquire clips to use in your dataset, how to process them, and finally, how to fine-tune a VITS text-to-speech model using the Coqui TTS framework on Google Colab. Before we get started, let's get some software together. Sonic Visualiser is a lightweight open-source application for listening to, and as the name implies, visualizing audio clips. In terms of visualization, it does more than I can possibly understand; I really just need quick loading times and a basic spectrogram, and it does that spectacularly. Audacity is an open-source editor that is particularly useful for its set of plugins. Later on, we'll be using the Noise Reduction plugin to clean up our chosen audio track, and then the Silence Finder plugin to automatically split our audio clip into short segments. If you have trouble getting the Silence Finder plugin to show up in any of the version 3 branches of Audacity, download one of the older version 2 releases instead. It'll work just as well for what we need. yt-dlp is a video downloader that we'll be using to download our video, extract the audio, and grab the subtitles that will serve as the template to speed up the transcription of our audio clips. FFmpeg is a video converter that can also be used to convert subtitles from one format to another. I won't be using it much in this video, but if you need to extract audio or convert videos or subtitle formats after the initial download, FFmpeg will probably have you covered. It's open-source, its code is built into many video editing applications, and there's a standalone build for Windows. For editing the transcript, I'm going to be using Notepad++. It's a fast, customizable, and open-source editor with an autosave feature, so hopefully you won't lose anything if a crash happens. Now let's go find an audio source. This is going to be a shotgun approach to dataset creation: get a large source for sampling, process it all, automatically segment it, and then discard the samples that are too short or too long. After that, we'll process the samples that fit our needs as we work on the transcript.
Ideally, you want samples that cover every possible sound in the language you are training; that is to say, you want as much phoneme coverage as possible. There is a script that comes as part of the Coqui TTS package that can evaluate your dataset for phoneme coverage, but I didn't have any luck getting it to run. To train the model, use 16-bit, mono, 22,050 Hz WAV files between 2 and 14 seconds, and try to get a good distribution of sample lengths for your dataset. The Bill Gates voice was trained on approximately 200 samples ranging from 2 to 14 seconds. Samples that are too long will take too much GPU memory to train on Colab. I've found that too many short samples result in choppy, metallic, or robotic-sounding output, but your mileage may vary; this is all wildly unpredictable. I've trained the same model for the same number of checkpoints on the same dataset and ended up with two distinct-sounding models in the end. So when searching for an audio clip, narrow down your selection by looking for clips over 20 minutes long. To make the next steps easier, filter your search for clips that have subtitles or captions available. These will cut down on transcription time because machine learning has already done half the work for you. One-on-one interviews tend to be good sources of audio because the host will let the subject speak after asking a question. It's also generally fairly easy to distinguish between the two speakers when visualizing the audio clips, so you can quickly cut out the segments that you don't want to sample. Unless you have some fancy-pants AI plug-in suite like iZotope RX, you'll want to spend some time looking for the clearest audio clips you can find. Any reverb or distortion in the voice that you're sampling will probably be carried over to your model. You can use multiple sources, but the samples need to sound very close to one another to get decent results in the end. To download the clip and subtitles, load up a console window and navigate to where you want to save the files, or create a new directory. To download the video, extract the audio, and download the subtitles, enter the download command; there's a sample invocation after this paragraph. It selects the best audio and video stream, extracts the audio at the best quality to a WAV file, downloads auto-generated subtitles in English if available, and downloads the subtitles in ASS, SRT, or VTT, whichever is the best format available. This might be a little antiquated now, since YouTube seems to only serve VTT subtitles these days. yt-dlp will download the video, extract the audio to a WAV file, and download the VTT subtitle file. The subtitles are a bit of a mess, but they can be cleaned up with FFmpeg or an online tool like the one at subtitletools.com. If you upload a VTT subtitle there, you can download a plain text file with all the timestamps removed. This can be used as a draft to speed up your transcription. If you end up with a file full of duplicate lines, Notepad++ has you covered: under the Edit menu, find the Line Operations submenu and then select Remove Duplicate Lines. Open the extracted audio file in Audacity, zoom into the waveform, and scrub through it. Try to find where the person you're interested in is speaking. Trim out the larger sections of audio you're not interested in, but don't worry about being too detailed; you'll need to trim the clips later anyway. Find a section of the audio track that has some ambient noise. Try to find a section with a few seconds to sample, but shorter segments may work as well.
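Here's a rough sketch of that download command. The URL is a placeholder, and exact flag behavior can vary between yt-dlp releases, so double-check against the yt-dlp documentation:

# download the video, extract best-quality WAV audio, and grab English subtitles
yt-dlp -f "bestvideo+bestaudio" --keep-video \
  --extract-audio --audio-format wav --audio-quality 0 \
  --write-subs --write-auto-subs --sub-langs en \
  --sub-format "ass/srt/best" \
  "https://www.youtube.com/watch?v=VIDEO_ID"

The --keep-video flag stops yt-dlp from deleting the original video once the audio has been extracted, and --audio-quality 0 asks for the best quality the extractor can provide.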
Highlight the noise sample, then click the Effect menu and load up the Noise Reduction plugin. Next, click the Get Noise Profile button. After identifying the noise to filter out, select the audio that you want it filtered from; in this case it's the entire file, so you can hit Ctrl+A to select all the audio. Load up the Noise Reduction plugin once again, and then preview the audio with the default parameters. Fiddle with the sliders to see if you can get a better result. The Noise Reduction setting controls how much the noise is reduced by; if this value is set too high, some of the audio that you want to keep will be reduced as well. If the Sensitivity setting is too low, there will be some audio artifacts or distortion in the output; if it's too high, the overall clarity may be reduced. The Frequency Smoothing setting may reduce how harsh the artifacts sound. The Audacity manual notes that the default value is 3, that lower settings might favor music, and that higher settings might favor spoken word. Fiddle around a bit, but don't be too aggressive with the noise reduction. If you have to doctor your audio a lot, you should probably find a better-quality audio source instead. Once satisfied, press OK to apply the noise reduction. Now I'm going to segment the audio using the Silence Finder plugin. Scrub through the audio and try to get a feel for how long your subject breaks between sentences. You can use the Length and End of Selection option in the selection bar down below to make this a little easier. This part takes some trial and error, though. The outcome we're looking for is as many segments between 2 and 14 seconds of audio as possible. There may be some longer ones and some shorter ones, and we'll discard those or process them later. You can add additional markers to segment the audio on your own, or move the generated markers as needed. After the markers are set, you can export the audio segments by opening the File menu and clicking Export Multiple. Select the folder to save the segments to, make sure they are saved as signed 16-bit PCM, split the tracks based on labels, and include the audio before the first label. For file naming, choose Numbering after File name prefix and pick a short, easy-to-type prefix, because you're going to be typing it a lot when writing your transcript. Hit Export and Audacity will save the segmented audio files. If you intend to skip the rest of the editing and processing, convert your audio track to mono in Audacity and resample it to 22,050 Hz before exporting; there's also an FFmpeg sketch for batch conversion after this paragraph. If you install Sonic Visualiser and set it as the default application for WAV files, you can use it to preview and sort your samples. Audacity is a fine editor, but it needs a few too many keystrokes to edit WAV files since it doesn't open them directly. WaveShop is a lightweight editor that can be used to trim up your samples. Open up WaveShop and drag and drop a batch of files into the window. Select the silences at the beginning or end of a sample, and then hit the Delete key to remove them. Play through your samples and delete any stutters, ums, or ahs; keep an eye on the waveform and you'll be able to identify roughly where they are by how they look once you spot a few. Press Ctrl+S to save your file.
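If you'd rather do that conversion outside of Audacity, FFmpeg can batch-convert the exported segments. This is just a sketch, assuming the exported clips sit in the current directory and you're running it from a bash console:

# convert every WAV in the current directory to 16-bit mono 22,050 Hz
mkdir -p converted
for f in *.wav; do
  ffmpeg -i "$f" -ac 1 -ar 22050 -c:a pcm_s16le "converted/$f"
done

The -ac 1 flag downmixes to mono, -ar 22050 resamples to 22,050 Hz, and -c:a pcm_s16le writes signed 16-bit PCM, which matches the format the training expects.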

Speaker 2: Since you're already listening to your samples, you might want to transcribe them as you go.

Speaker 1: Another efficient option is to use Sonic Visualiser to quickly open and play through them later. The transcript is going to be in the LJ Speech dataset format. There are three fields for each entry: the file name, a transcription of the audio sample, and a normalized version of the transcription, with each field separated by a pipe character (there's a small example after this paragraph). To make things easier, you can just make the last two fields the same and things will work fine. In your transcript, you may need to account for mispronunciations or spell things phonetically. You'll want to keep a close eye on punctuation and add semicolons or commas where the speaker pauses. End the transcribed samples with either a comma or a period; I use whichever sounds more natural when spoken. Be sure to double-check for misspellings or typos before moving on. I've stuck with the default of naming the transcript metadata.csv for consistency, and that's the file name used in the training script later on, but you can name your file whatever you like. Now we can head over to Google Colab, get the script set up, finish processing the audio samples, and begin fine-tuning. Google Colab environments are deleted when the browser disconnects, so we'll be storing our work on Google Drive. Open up a new notebook, or use the one linked below. Under the Runtime menu, find the Hardware Accelerator option, select the GPU runtime, and then hit Save. Running the first cell will connect your Google Drive account, which will appear as the directory /content/drive/MyDrive, with capital letters on My and Drive. The next cell installs pyloudnorm and RNNoise. These will be used to normalize the audio samples and perform additional noise reduction. If you want to skip this step, ensure your audio files are encoded as 22,050 Hz mono. Next, we'll install the eSpeak phonemizer and the Coqui TTS framework, and then run a command to list the models available from the Coqui framework to check that it's working correctly. The next cell will download the pre-trained VITS model and generate some sample speech at the path specified by the out_path command-line argument. To process the audio samples, upload the script to your Google Drive and place the samples in a directory named Original. The processed samples will be put in a directory named Converted. Once that's done, rename the Converted folder to wavs, then create a directory for your dataset and your work, and move the wavs folder there. Upload your metadata.csv file to that directory. So, in total, you should have a metadata.csv file and a folder named wavs in whatever your dataset directory is. Change the paths in the next cell to reflect your dataset directory. If you know how to interpret the results, TensorBoard can be used to evaluate the fine-tuning of your model. Edit the paths to reflect your training script output directory and launch the TensorBoard cells before loading the training script. If this is a first run, you won't have any logs for TensorBoard to analyze. Next, open the training script in the text editor and change the output path to reflect your dataset directory. I like to have the training logs and output in a subdirectory called TrainerOutput, but you do you. Do the same for the path in your dataset config section, and then upload the script to your Google Drive. Don't change the name of the dataset, though; leave it as ljspeech, since that specifies the format of the dataset.
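For reference, here's roughly what the dataset directory should end up looking like, along with a couple of made-up metadata.csv entries in the LJ Speech pipe-separated format. The directory and file names are placeholders; the important parts are the metadata.csv file, the wavs folder next to it, and that the first field is the sample's file name without the .wav extension:

MyDataset/
  metadata.csv
  wavs/
    bg001.wav
    bg002.wav

metadata.csv:
bg001|The quick brown fox jumps over the lazy dog.|The quick brown fox jumps over the lazy dog.
bg002|One-on-one interviews tend to be good sources of audio.|One-on-one interviews tend to be good sources of audio.

Since we aren't doing any special text normalization, the second and third fields are simply identical here.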
The training script is going to save a checkpoint of the model after a specified interval; 1,000 steps is probably a good value for the initial fine-tuning run. Each checkpoint is about a gigabyte, and they'll quickly fill your Google Drive if the save interval is too small. You may need to free up space at some point by emptying your Google Drive trash bin if you get a low disk space warning. Google Drive will permanently delete your trash after a month, but if your Drive fills up before then, you'll have to do it manually. Save any changes to the training script and upload it to your Google Drive. Change the path in the next cell to reflect the training script's location if necessary. Run it now to begin fine-tuning, and you should see output similar to what's on the screen. After 10,000 or so steps, you'll probably have a model that's beginning to sound a little like your subject. If you've resumed a training run, you can use the Audio tab in the TensorBoard panel to play some test audio clips. You can also generate synthesized audio by using the console application and directing its output to a .wav file. This is all a bit clunky, but hey, it's new technology. It gets a little easier to use if you spend the time to set up the Coqui TTS framework in a local Linux installation. You can install Linux to a USB flash drive and then use the Coqui TTS server with your model to generate speech, or you could use shell scripts with the console application to generate large blocks of speech to a .wav file. Spend some time reading the Coqui TTS documentation and tutorials; they should be enough to get you started if you want to set up Coqui on your own.
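For reference, a single synthesis call with the console application looks something like this. The checkpoint and config paths are placeholders; point them at the files in your TrainerOutput directory, and note that the exact checkpoint file name depends on the step at which it was saved:

# synthesize a line of text with a fine-tuned checkpoint and write it to a WAV file
tts --text "Testing the fine-tuned voice model." \
    --model_path /content/drive/MyDrive/MyDataset/TrainerOutput/checkpoint_10000.pth \
    --config_path /content/drive/MyDrive/MyDataset/TrainerOutput/config.json \
    --out_path test.wav

Wrapping a call like that in a small shell script is also how you'd batch out larger blocks of speech, as mentioned above.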

Speaker 2: Hopefully this was enough to get you started with voice synthesis using Coqui TTS. This was a very surface-level tutorial and not meant as best practice; for that, refer to the documentation and the resources available through the discussion groups. The VITS model seems to be the most forgiving when it comes to using a small number of samples, but if you want to go further, another great-sounding model is Glow-TTS. Thanks for watching. If you have any questions or comments, get in touch with me here on YouTube and I'll do my best to get back to you.
