Speaker 1: Hey again everyone. I wasn't planning on doing this video, but I ended up wanting to add RNNoise back into the audio processing tools. It ended up being a bit of a pain, and since I had already spent quite a bit of time, I thought I'd add a couple more things to the dataset processing section. Most of the dataset tools are separate now, so processing audio with this notebook is a little different than the others. I'm going to try to just go over the changes, and I'm going to skip covering the training in this one, because the training script instructions are the same as in the last video.

A bit of a disclaimer before I go over the notebook and tools. This is a very large Jupyter notebook. All the tools may not play well with one another, and in particular with the Coqui training section. Some tools may require dependencies that conflict with other tools or scripts, and you may not be able to run through this from top to bottom in one session. Things in this notebook may break unexpectedly. They currently work, but they might not work tomorrow: if packages get upgraded somewhere along the chain of dependencies, sometimes things break. This notebook will probably also make your browser slow to a crawl if you try to run through all the sections and don't clear the cell output in some of the processing sections after you use them. These notebooks are really just ones I'm using for things and sharing with everyone; they're not supposed to be production-ready software. There are a whole bunch of software-as-a-service offerings for voice cloning available if that's what you're interested in. These are just what I'm using to make my life a little bit easier while I make datasets and play around with trainings and models.

With that out of the way, let's go through the notebook. Run the cell to connect your Google Drive account after accepting the pop-ups that appear. The next cell will set some variables and create directories on your Google Drive if they don't already exist. This cell is mainly for setting some training variables, so if you're not training, you might not need to run it. Upload the audio samples you want to process to your Google Drive somewhere. Your Google Drive is mounted in the notebook as /content/drive/MyDrive, with "My" and "Drive" capitalized. Colab notebooks are a Linux environment, and files and paths are case-sensitive. The easiest way to upload things is through the Google Drive web interface or by installing the desktop application and syncing your files; file transfers using the upload function in Colab are painfully slow and often fail. Your samples should be in WAV, MP3, OGG, or FLAC format.

After uploading the samples, Xiph's RNNoise can be used to remove noise from long audio clips before they're segmented. Run the cell to clone the GitHub repository and build the demo tool that's used to denoise the audio. In the next cell, set the file extension and the full path to your directory of audio samples, then run the cell. Run the next cell to write the shell script to your Google Drive, and then run the following cell to run RNNoise. The way this is set up in the notebook is to take your samples in the indicated file format, move the files to a folder named backup-rn, convert them to a raw format compatible with RNNoise, denoise the audio, and convert it back to mono 22 kHz WAV files in the originally indicated sample directory.
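For reference, here is a rough Python sketch of what that RNNoise pass does. It is only a sketch under assumptions: the sample directory, the location of the built rnnoise_demo binary, and the use of ffmpeg for the raw conversions are placeholders, not the exact shell script the notebook writes.

```python
# Hedged sketch of the RNNoise pass: denoise every WAV in a folder, keeping the
# originals in backup-rn and writing mono 22.05 kHz WAVs back in their place.
# Paths and the binary location are assumptions, not the notebook's script.
import subprocess
from pathlib import Path

SAMPLES = Path("/content/drive/MyDrive/samples")      # hypothetical sample directory
BACKUP = SAMPLES / "backup-rn"
RNNOISE = "/content/rnnoise/examples/rnnoise_demo"    # assumed path to the built demo tool

BACKUP.mkdir(exist_ok=True)
for wav in sorted(SAMPLES.glob("*.wav")):             # sorted() snapshots the list up front
    original = BACKUP / wav.name
    wav.rename(original)                              # keep the untouched original
    raw_in = wav.with_suffix(".raw")
    raw_out = wav.with_suffix(".denoised.raw")
    # rnnoise_demo expects raw 48 kHz mono 16-bit PCM
    subprocess.run(["ffmpeg", "-y", "-i", str(original), "-f", "s16le",
                    "-ac", "1", "-ar", "48000", str(raw_in)], check=True)
    subprocess.run([RNNOISE, str(raw_in), str(raw_out)], check=True)
    # convert the denoised raw audio back to a mono 22.05 kHz WAV in place
    subprocess.run(["ffmpeg", "-y", "-f", "s16le", "-ac", "1", "-ar", "48000",
                    "-i", str(raw_out), "-ar", "22050", str(wav)], check=True)
    raw_in.unlink()
    raw_out.unlink()                                  # the raw temp files can be large
```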
So in summary, the original files are replaced with denoised single-channel 22 kHz WAV files. This RNNoise section seems to work on any length of audio from 2 seconds to over an hour, but some of the temporary working files from the conversion may be large, so be conscious of how much free space you have on your Google Drive account. If you're feeling really adventurous, you could always edit the paths so the temporary working files are on your Google Colab session and only the final files get saved to your Google Drive, but I'll leave that up to you.

With a few of the tools here, you could run into out-of-memory errors if you have very long audio files. Long WAV files can be split using FFmpeg from within this notebook. Edit the FFmpeg command line in the notebook: replace the folder name and file name after the -i flag to specify the input file, and change the output directory to where you'd like the segments stored. If your output directory doesn't exist yet, create it the same way as the commented line above it. Edit the beginning of the file name at the end of the FFmpeg command line to specify the output file name prefix. The %03d part of the file name makes FFmpeg automatically append a zero-padded three-digit segment number to each output file.

The next section here is Demucs from Facebook Research. They describe Demucs as a state-of-the-art music source separation model, currently capable of separating drums, bass, and vocals from the rest of the accompaniment. It can be used to attempt to remove background music from your audio samples, but it can also sometimes isolate speech from other background noise as well. There are a few different source separation models to choose from with version 4 of Demucs. Most of them seem to work best on actual music, but the hybrid transformer model being used here can sometimes get decent results with dialogue. Run the first cell to install Demucs. Set the path name and file extension to process in the next cell and then run the cell. Run the next cell to write the script to your Google Drive, and then run the following cell to process your audio clips. This is going to move the original files to a folder named backup-dm and then process the files with Demucs. Demucs will output two files, vocals.wav and no_vocals.wav. vocals.wav is renamed, resampled to 22 kHz, and moved to the original sample directory, replacing the original file. The directory for each sample containing the no-vocals track is deleted. If you want to save it and download the no_vocals file to hear the noise, comment out the rm -rf separated line in the script above before running the cells.

There are two options for normalizing a set of files here. The first, ffmpeg-normalize, will work on long unsegmented audio files. Set the folder to process and the file extension, and then run the cell. Run the next cell to write the script, and then run the final cell to normalize the folder of files. The original files are moved to a folder named backup. Currently this is set to normalize files to -16 dB, but change this in the script before running the cell if you need to. The other option for normalizing files here is the normalize-audio application. This will work on both small segmented files and long unsegmented files. Set the variables and run the cells to normalize your audio files, but you may want to adjust the level in the command line before running the cells.
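To make the FFmpeg splitting step concrete, here is roughly what that command line does, expressed as a Python call. The input path, output folder, and the 600-second segment length are placeholders; only the segment muxer and the %03d numbering pattern come from the narration above.

```python
# Rough equivalent of the FFmpeg split described above; paths and the chunk
# length are placeholders, not values from the notebook.
import subprocess
from pathlib import Path

out_dir = Path("/content/drive/MyDrive/segments")
out_dir.mkdir(parents=True, exist_ok=True)            # create the output directory first

subprocess.run([
    "ffmpeg",
    "-i", "/content/drive/MyDrive/samples/long_recording.wav",  # input file after -i
    "-f", "segment", "-segment_time", "600",          # cut into 10-minute pieces
    "-c", "copy",                                     # copy the stream, no re-encode
    str(out_dir / "part_%03d.wav"),                   # %03d becomes 000, 001, 002, ...
], check=True)
```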
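Here is a minimal sketch of the Demucs step as well, assuming the htdemucs hybrid transformer model and two-stem vocal separation; the sample path and the cleanup handling are placeholders rather than the notebook's exact script.

```python
# Minimal sketch of the Demucs vocal isolation described above, assuming the
# htdemucs model in two-stem mode; paths are placeholders.
import subprocess

sample = "/content/drive/MyDrive/samples/clip.wav"    # hypothetical input

# Separate vocals from everything else; this writes
# separated/htdemucs/clip/vocals.wav and separated/htdemucs/clip/no_vocals.wav.
subprocess.run(["demucs", "-n", "htdemucs", "--two-stems=vocals",
                "-o", "separated", sample], check=True)

# Resample the isolated vocals to mono 22.05 kHz and put them back over the
# original (the notebook moves the original into backup-dm first and deletes
# the "separated" directory afterwards).
subprocess.run(["ffmpeg", "-y", "-i", "separated/htdemucs/clip/vocals.wav",
                "-ac", "1", "-ar", "22050", sample], check=True)
```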
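And here is roughly what the ffmpeg-normalize pass looks like, with the -16 target level mentioned in the narration; the folder, the backup handling, and the explicit 22.05 kHz output rate are assumptions.

```python
# Hedged sketch of the ffmpeg-normalize step: originals go to a backup folder
# and each file is normalized to a -16 target level. Paths are placeholders.
import subprocess
from pathlib import Path

folder = Path("/content/drive/MyDrive/samples")
backup = folder / "backup"
backup.mkdir(exist_ok=True)

for wav in sorted(folder.glob("*.wav")):
    original = backup / wav.name
    wav.rename(original)                              # keep the un-normalized original
    subprocess.run(["ffmpeg-normalize", str(original),
                    "-t", "-16",                      # target level from the video
                    "-ar", "22050",                    # keep 22.05 kHz output
                    "-o", str(wav)], check=True)
```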
This next section can be used to attempt to isolate the speakers in your audio sample and then export audio segments for each speaker, using the speaker diarization model from pyannote.audio. To use this you'll need to sign up for a Hugging Face account, generate an access token, and request access to the gated model. Don't worry, though: this all takes about 60 seconds and involves clicking through the links in the text box for the pyannote section here. Install pyannote.audio using the next cell, then put your Hugging Face access token between the quotes in the next cell and run it. Set the input audio file and the output path for the generated segments. The files are named numerically, so if you have more than one file to process, use a file name prefix such as the original file name.

The script will attempt to diarize the audio file and then save each speaker's audio to a separate folder. pydub is used to attempt to split each speaker's speech on silence and save the segments. This may still result in long files over 9 seconds or short, near-zero-second files of noise. You will need to download and filter your datasets before using them, or you'll end up with poor results when training. After running pyannote and pydub to identify the speakers and segment the audio file, go to your Google Drive and navigate to the segmented files. Download the folders as a zip to your system and sort through the files. The diarization process is far from perfect, but it's infinitely easier than doing it by hand and often does a pretty good job if you feed it clean enough audio.

I ran a couple of classic episodes of the late-night paranormal radio show Coast to Coast AM with Art Bell through the tools in the notebook here. These were AM radio recordings to cassette tape which were digitized, so the audio quality is poor and there's a ton of noise. I've downloaded the split and re-encoded files from my Google Drive, so let's take a look and see how well pyannote.audio is able to isolate the speakers. Here's the first speaker that was identified in this track. And as you can see, when there's some overlapping speech it's difficult for the model to identify the speaker. This is probably something that you'd want to delete anyway if you're making a text-to-speech training dataset. The second speaker identified was the intro announcer, and it did a good job of
Speaker 2: isolating his very distinct voice. And the third
Speaker 1: speaker identified was the host Art Bell.
Speaker 2: These clips still need to be
Speaker 1: filtered for length, discarding any very short or very long clips, but overall the ones generated here don't really need any more processing.

These files can be used to make a VCTK-format dataset for training the VITS voice models with Coqui by generating transcripts using OpenAI's Whisper. First, create a directory for your dataset on your Google Drive. I'm going to call this one vctk-atest-22k-ds. Inside this directory, create a subdirectory called wav48_silence_trimmed. Inside that directory, create another directory with your speaker name; I'll call this one art. Place the mono 22 kHz FLAC files inside the speaker's audio directory. Set your dataset path in the variables cell at the beginning of the notebook and run that cell. Scroll down and run the cell to install OpenAI's Whisper speech-to-text and its requirements. Scroll down the notebook some more to the Whisper section and select the model you'd like to use from the drop-down. Set your speaker name, which is the subdirectory inside the wav48_silence_trimmed directory, and run the cells to begin transcription.

After transcription there are a couple of tools in the notebook that can be used to remove obviously bad transcriptions, if there were any. The first will check for any empty transcripts and move the corresponding audio files to a backup directory called bad files. The second will remove any transcriptions with odd Unicode characters, but this may not be appropriate for all languages.

Hopefully you can use a few of these tools to speed up making datasets. Partially automating the process has certainly made it less agonizing. But these are obviously just basic implementations of the libraries and tools, so I'm sure with a little creativity you can probably go even further. That's all I wanted to cover in this one. If you found this video helpful, hit the like button or subscribe; it helps get the videos out there, because I don't really post them anywhere. Thanks for watching, and I'll be back soon with some non-voice machine learning videos. There's a great large language model that I want to dive into; I just need to be able to get a server with 64 or 128 gigs of RAM to convert the checkpoint. And there have been some huge developments in the text-to-video realm that I want to get into. Thanks for watching, and stay human.
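Going back to the pyannote.audio diarization and pydub splitting described earlier, here is a condensed sketch of what that flow looks like in code. The pipeline name, silence thresholds, file paths, and naming scheme are assumptions for illustration; the notebook's actual script may differ.

```python
# Condensed sketch of the diarize-then-segment flow: pyannote identifies
# speaker turns, pydub splits each turn on silence and exports WAV clips
# into one folder per speaker. Model name, thresholds, and paths are assumed.
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import split_on_silence
from pyannote.audio import Pipeline

AUDIO = "/content/drive/MyDrive/samples/episode.wav"   # hypothetical input file
OUT = Path("/content/drive/MyDrive/diarized")
HF_TOKEN = "hf_..."                                     # your Hugging Face access token

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token=HF_TOKEN)
diarization = pipeline(AUDIO)

audio = AudioSegment.from_file(AUDIO)
counters = {}
for turn, _, speaker in diarization.itertracks(yield_label=True):
    clip = audio[int(turn.start * 1000):int(turn.end * 1000)]   # pydub works in milliseconds
    # split each speaker turn on silence so the exported segments stay short
    for piece in split_on_silence(clip, min_silence_len=500,
                                  silence_thresh=clip.dBFS - 16,
                                  keep_silence=200):
        n = counters[speaker] = counters.get(speaker, 0) + 1
        speaker_dir = OUT / speaker
        speaker_dir.mkdir(parents=True, exist_ok=True)
        piece.export(speaker_dir / f"episode_{n:04d}.wav", format="wav")
```

min_silence_len and silence_thresh are the knobs you would tune if the exported clips come out too long or too choppy.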
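For the Whisper transcription step, here is a minimal sketch of transcribing each speaker clip into a VCTK-style txt folder. The dataset path, speaker name, "medium" model choice, and the txt/speaker layout are assumptions, not necessarily what the notebook's cells do.

```python
# Hedged sketch of the Whisper transcription step: one .txt transcript per
# FLAC clip, mirroring a VCTK-style layout. Paths and model size are assumed.
from pathlib import Path
import whisper

DATASET = Path("/content/drive/MyDrive/vctk-atest-22k-ds")
SPEAKER = "art"
model = whisper.load_model("medium")                  # tiny/base/small/medium/large

wav_dir = DATASET / "wav48_silence_trimmed" / SPEAKER
txt_dir = DATASET / "txt" / SPEAKER
txt_dir.mkdir(parents=True, exist_ok=True)

for clip in sorted(wav_dir.glob("*.flac")):
    result = model.transcribe(str(clip))
    (txt_dir / f"{clip.stem}.txt").write_text(result["text"].strip())
```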
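And here is a rough sketch of the two transcript-cleanup tools mentioned above, one for empty transcripts and one for odd characters. The bad_files folder name and the ASCII-only check are assumptions (and, as noted, the character check is not appropriate for every language).

```python
# Sketch of the transcript filters: move clips with empty or oddly encoded
# transcripts into a backup folder. Folder names and the ASCII test are assumed.
from pathlib import Path

DATASET = Path("/content/drive/MyDrive/vctk-atest-22k-ds")
SPEAKER = "art"
bad = DATASET / "bad_files"
bad.mkdir(exist_ok=True)

txt_dir = DATASET / "txt" / SPEAKER
wav_dir = DATASET / "wav48_silence_trimmed" / SPEAKER
for txt in sorted(txt_dir.glob("*.txt")):
    text = txt.read_text()
    # empty transcript, or odd non-ASCII characters (English-only heuristic)
    if not text.strip() or not text.isascii():
        for f in (txt, *wav_dir.glob(f"{txt.stem}.*")):
            if f.exists():
                f.rename(bad / f.name)
```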