Building a Real-Time Speech Recognition System with Deep Learning
Explore the complexities of speech recognition and how deep learning can address them. Learn to build and train a neural network for accurate transcriptions.
I Built a Personal Speech Recognition System for my AI Assistant
Added on 09/06/2024

Speaker 1: Speech. It's the most natural form of human communication. This is my demo of a real-time speech recognition system using deep learning. Yo, what's up world, Michael here, and today we're going to be talking about speech recognition, why it's hard, and how deep learning can help solve it. Later in this video we're going to build our own neural network and train a speech recognition model from scratch.

Humans are really good at understanding speech, so you'd think it would be easy for computers too, right? Speech recognition is actually really hard for computers. Speech is essentially sound waves, which live in the physical world with their own physical properties. For example, a person's age, gender, style, personality, and accent all affect how they speak and the physical properties of the sound. A computer also has to consider the environmental noise around the speaker and the type of microphone they're using to record. Because there are so many variations and nuances in the physical properties of speech, it's extremely hard to come up with all the possible rules for speech recognition.

Not only do you have to deal with the physical properties of speech, you have to deal with its linguistic properties as well. Consider the sentences "I read a book last night. I like to read." The two instances of "read" are spelled the same but sound different: the past tense sounds like the color "red," so a recognizer could easily write "red" when in this context it needs to be spelled "read." Language itself is really complex, with a lot of nuance and variation, so you'd have to come up with all the possible rules for it as well to have effective speech recognition. So what do you do when you have what seems like an insurmountable number of rules? You use deep learning. Deep learning has changed the game for a lot of complex tasks, and any modern speech recognition system today leverages deep learning in some way.

To build an effective speech recognition system, we need a strategy for tackling the physical properties of speech as well as the linguistic properties. Let's start with the physical properties. To properly deal with the variations and nuances that come with the physicality of speech, like age, gender, microphone, environmental conditions, and so on, we'll build an acoustic model. At a high level, our acoustic model will be a neural network that takes in speech waves as input and outputs the transcribed text. For the neural network to learn how to properly transcribe speech waves to text, we'll have to train it with a ton of speech data.

Let's first consider what model we want to use by looking at the type of problem we're trying to solve. Speech is a naturally occurring time sequence, meaning we need a neural network that can process sequential data. The network also needs to be lightweight in terms of memory and compute, because we want to run it in real time on everyday consumer machines. Recurrent neural networks, or RNNs for short, are a natural fit for this task because they excel at processing sequential data even at smaller network sizes, so we'll use that as our acoustic model (the sketch below shows the basic idea).

Now let's consider the data. What a neural network can learn depends on the data you train it with. If we want our neural network to learn the nuances of speech, we'll need speech data with a lot of variation in gender, age, accent, environmental noise, and types of microphones.
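To make the shape of the problem concrete, here is a minimal conceptual sketch in PyTorch of "spectrogram frames in, character probabilities out" with a single LSTM. The layer sizes and the n_mels/n_chars values are illustrative assumptions, not the actual model from the video; the full architecture is sketched later on.

```python
import torch
import torch.nn as nn

# Conceptual sketch only: a single LSTM mapping spectrogram frames to
# per-time-step character probabilities. All sizes here are illustrative.
n_mels, hidden, n_chars = 81, 256, 29   # e.g. 26 letters + space + apostrophe + CTC blank

rnn = nn.LSTM(input_size=n_mels, hidden_size=hidden, batch_first=True)
classifier = nn.Linear(hidden, n_chars)

spectrogram = torch.randn(1, 200, n_mels)      # (batch, time steps, mel bins)
features, _ = rnn(spectrogram)                 # (batch, time steps, hidden)
char_logits = classifier(features)             # (batch, time steps, n_chars)
char_probs = char_logits.log_softmax(dim=-1)   # one distribution per time step
```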
Common Voice is an open-source speech dataset initiative led by Mozilla that has many of those variations, so it'll be perfect. Let's listen to a couple of samples.
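Before the samples, here's a hedged sketch of how you might load Common Voice clips with torchaudio's COMMONVOICE dataset, assuming you've already downloaded and extracted a Common Voice release locally; the root path and tsv split name are placeholders.

```python
import torchaudio

# Assumes a Common Voice release has already been downloaded and extracted here.
dataset = torchaudio.datasets.COMMONVOICE(root="data/common_voice/en", tsv="train.tsv")

waveform, sample_rate, metadata = dataset[0]
print(waveform.shape, sample_rate)   # raw audio tensor and its sample rate
print(metadata["sentence"])          # the text label for this clip
```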

Speaker 2: Gordon no longer mentions bootlaces and the two engines soon talk about trucks. Fathers granted Jews, Latin colonies continue. It has since been renamed Yasser Arafat International Airport.

Speaker 1: transcribed text. Okay I think we have a solid approach. We have a data set that contains some variation in nuances as well as a lightweight neural network architecture. Now let's talk about the linguistic aspect of speech. To inject linguistic features into the transcriptions we'll use something called a language model alongside a rescoring algorithm. To understand how this works at a high level let's look at the acoustic models output. A speech recognition neural network is probabilistic so for each time step it outputs a probability of possible words. You can naively take the highest probable word of each time step and emit that as your transcript but your network can easily make linguistic mistakes like using the word red for color when it should be red for reading. This is when the language model comes into play. A language model can determine what is a more likely sentence by building a probability distribution over sequences of words it trains on. You can use a language model and rescore the probabilities depending on the context of the sentence. You'll get an idea of how all this works in a minute. For our implementation of a language model we can use an open source project called KenLM which is a rules-based language model. We want to use KenLM because it's lightweight and super fast unlike the much heavier neural network based language models. A neural based language model like transformers from HuggingFace have been proven to produce better results but since our goal is low compute we'll use KenLM which works well enough. For the rescoring algorithm we'll use what's called a CTC beam search. A beam search combined with the language model is how we'll rescore the outputs for better transcriptions. The logic of the beam search algorithm can get pretty complex and very boring and I'm way too busy to attempt an in-depth explanation here but here's the basic premise. The beam search algorithm will traverse the outputs of the acoustic model and use the language model to build hypotheses aka beams for the final output. During the beam search if the language model sees that the word book exists in the transcription it will boost the probability of the beams that contains red like reading instead of red like color because it makes more sense. This process will produce more accurate transcriptions. Thank the programming guides for open source code so we don't have to build the language model and a CTC beam search rescoring algorithm ourselves which will save us a ton of time. So to sum it up using a language model in a CTC beam search algorithm we can inject language information into the acoustic models output which results in more accurate transcriptions. Okay I think we have a pretty solid strategy on how to tackle this speech recognition problem. For the physical properties we'll implement an acoustic model and for the linguistic properties we'll implement a language model with a rescoring algorithm. Let's get to building. First we need to build a data processing pipeline. We'll need to transform the raw audio waves into what's called Melspecialgrams. You can just think of Melspecialgrams as a pictorial representation of sound. We'll also need to process our character labels. Our models will be character based meaning it will output characters instead of word probabilities. Decoding character probabilities is more efficient because we only have to worry about 27 probabilities for each output instead of like a hundred or thousand of possible words. 
We need to augment our data so we effectively have a bigger dataset. We'll use SpecAugment. SpecAugment is an augmentation technique that cuts out information in the time and frequency domains, effectively destroying pieces of the data. This makes the neural network more robust, because it's forced to learn how to make correct predictions from imperfect data, which makes it generalize better to the real world.

Now let's move on to the model (both the model and the training step are sketched just below). The model consists of a convolutional layer, three dense layers, and an LSTM RNN layer. The convolutional layer serves two purposes: it learns to extract better features from the Mel spectrogram, and it reduces the time dimension of the data. In theory the CNN layer will produce features that are more robust, causing the RNN to produce better predictions. We also set the stride of the CNN layer to 2, reducing the time steps of the Mel spectrogram by half; the RNN does less work because there are fewer time steps, which makes the overall network faster. We add two more dense layers in between the CNN and the RNN; their purpose is also to learn to produce a more robust set of features for the RNN. For the RNN layer we're using an LSTM, which takes the features produced by the previous dense layer and, step by step, produces an output that can be used for prediction. We also have a final dense layer with a softmax activation that acts as a classifier: it takes the RNN's output and predicts character probabilities for each time step. We add layer normalization, GELU activation, and dropout between each layer to make the network more generalizable and robust to real-world data. In deep learning, adding more layers can lead to better results, but since we want this to be a lightweight model we stop at five layers: one CNN layer, one LSTM RNN layer, and three dense layers.

For the training script we'll use PyTorch Lightning. PyTorch Lightning is a library for PyTorch that decouples the science code from the engineering code. For the training objective I use the CTC loss function, which makes it super easy to train speech recognition models. It can assign probabilities given an input, making it possible to just pair your audio samples with their corresponding text labels without needing to align the characters to every single frame of the audio.

Oh, hey. Just finishing up the rest of the code here. This code is open source, so if you want to see the full implementation details, make sure you check out the GitHub repo in the description below. Now that we have the code, we need to start training. This is the perfect time to introduce my training rig, War Machine. War Machine is my personal deep learning rig that I use to train models. If you're following along, I recommend using a GPU, and if you don't have one you can use free alternatives like Google Colab or Kaggle Kernels. Okay, let's get to training.

All right, so training's finally finished. It took a couple of days, but I'm pretty happy with the results. Check it out: the loss curves both look pretty good, and it doesn't seem like anything is overfitting. Also, while everything was training, I implemented the language model and the rescoring algorithm from the open-source packages, and I made a little web demo using Flask to demo the speech recognition model. So let me set everything up and let's test this thing out. Okay, so I got the demo prepared.
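Here's a hedged sketch of an acoustic model along those lines: one strided CNN layer over time, dense layers, a single LSTM, and a classifier, with layer normalization, GELU, and dropout in between. The layer sizes and the choice of a 1-D convolution are assumptions for illustration; the GitHub repo linked in the description is the reference implementation.

```python
import torch
import torch.nn as nn

class SpeechRecognition(nn.Module):
    """Illustrative lightweight acoustic model: CNN -> dense -> LSTM -> classifier."""
    def __init__(self, n_mels=81, hidden=256, n_chars=29, dropout=0.1):
        super().__init__()
        # Stride 2 halves the number of time steps so the RNN has less work to do.
        self.cnn = nn.Conv1d(n_mels, hidden, kernel_size=10, stride=2, padding=4)
        self.dense = nn.Sequential(
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(), nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        # Final dense layer acts as a per-time-step classifier over characters.
        self.classifier = nn.Linear(hidden, n_chars)

    def forward(self, mel):               # mel: (batch, n_mels, time)
        x = self.cnn(mel)                 # (batch, hidden, time // 2)
        x = x.transpose(1, 2)             # (batch, time // 2, hidden)
        x = self.dense(x)
        x, _ = self.lstm(x)
        x = self.dropout(self.norm(x))
        return self.classifier(x)         # (batch, time // 2, n_chars) logits

model = SpeechRecognition()
logits = model(torch.randn(4, 81, 200))   # four spectrograms, 200 frames each
print(logits.shape)                       # torch.Size([4, 100, 29])
```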
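And here's a hedged sketch of the training side with PyTorch Lightning and CTC loss. It assumes an acoustic model that outputs (batch, time, n_chars) logits and a batch layout of spectrograms, label indices, and their lengths; the optimizer and learning rate are also assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class SpeechModule(pl.LightningModule):
    """Illustrative LightningModule: the loss logic lives here, the engineering in Trainer."""
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.criterion = nn.CTCLoss(blank=0, zero_infinity=True)

    def training_step(self, batch, batch_idx):
        # Assumed batch layout: spectrograms, label indices, and their lengths.
        # Note: input_lengths must reflect the CNN's stride-2 downsampling in time.
        spectrograms, labels, input_lengths, label_lengths = batch
        logits = self.model(spectrograms)          # (batch, time, n_chars)
        log_probs = F.log_softmax(logits, dim=-1)
        # CTCLoss expects (time, batch, n_chars) and aligns characters to frames for us.
        loss = self.criterion(
            log_probs.transpose(0, 1), labels, input_lengths, label_lengths
        )
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)
```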
The first demo is going to be just the acoustic model, without the language model and the rescoring algorithm, to showcase why they're important. Hi, this is Michael demoing my speech recognition system without a language model and a rescoring algorithm. As you can see, it's shit. It's not very good. Let me set up the second demo.

Speaker 3: Okay, so this is the second demo with the language model and the rescoring algorithm.

Speaker 1: Hi, this is Michael demoing my speech recognition system with a language model and a rescoring algorithm.

Speaker 3: As you can see, it does a lot better but it's not perfect.

Speaker 1: Okay, so the speech recognition system with the language model and the rescoring algorithm works pretty well with me. But let me reveal something I haven't mentioned yet. I collected about an hour of my own speech and trained on it along with the Common Voice dataset, which is about a thousand hours. One thing I forgot to mention when filming this clip was that I also upsampled the one-hour recording of myself to about 50 hours, so it would be better represented in the overall training dataset. Okay, play. So there is a possibility that that extra hour of my voice has biased the model to work really well with me. To test that theory, I want to try this speech recognition system with other people.
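As an aside, that kind of upsampling can be as simple as repeating the small personal dataset when the combined training set is built. A minimal sketch, assuming both sets are ordinary PyTorch Dataset objects (the TensorDataset stand-ins below are placeholders) and using a repeat factor of 50 to mirror the one hour scaled to roughly 50 hours.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

# Placeholder stand-ins: ~1,000 hours of Common Voice versus ~1 hour of my own voice.
common_voice_train = TensorDataset(torch.randn(1000, 81, 100))
personal_recordings = TensorDataset(torch.randn(10, 81, 100))

# Repeating the small personal set ~50x makes it better represented during training.
combined_train_set = ConcatDataset([common_voice_train] + [personal_recordings] * 50)
print(len(combined_train_set))   # 1000 + 10 * 50 = 1500 examples
```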

Speaker 3: Hello, this is my brother-in-law. I need you to test this out for me real quick. All right. Just press that start button and say whatever you want. Hi, baby girl. Say something else. You look fine. Okay, say some more stuff. Hey, you better quit doing around.

Speaker 4: Okay, as you can see, it doesn't work very well on his voice, but when I start

Speaker 3: talking, you can see it start picking up on what I'm saying, right? Oh, dang, that's cool. Okay, thank you.

Speaker 1: Okay, so our next guest is big on privacy, so we'll just skip the introductions and go straight into the demo. And play. Say anything you want.

Speaker 4: I love Charlie, baby, and Oliver. Say something else. This thing sucks. Okay, as you can tell, it only works well for me again. It doesn't work very well for her, so she thinks it sucks.

Speaker 1: All right, so as you can tell from my guests' reactions, the speech recognition system is not that great, at least on them. It works really well for me, and that's because I've biased the model by adding my own data. All of this was expected, though. Deep Speech 2, a very famous speech recognition paper from Baidu, claimed that you need about 11,000 hours of audio data to have an effective speech recognition system, and we used, what, a thousand hours or so? Their model also has 70 million parameters compared to our model, which has 4 million. I chose a small architecture on purpose because I want it to be lightweight and to run in real time on any consumer device, like my laptop. So I think overall the system works pretty well if you can collect your own data.

If you want to train your own speech recognition system, I recommend you collect your own data using something like the Mimic Recording Studio, which is what I used. I'll also open source a pre-trained model that you can download and fine-tune on your own data, so you don't have to go through the trouble of training it on a thousand hours of Common Voice like I did. I have the links to all the goodies in the description, so make sure you check that out.

This video is part of a series on how to build your own AI voice assistant using PyTorch. So far I've done the wake word detection, which is the first video, and now the automatic speech recognition. I still have the natural language understanding part, which maps the transcription to some sort of action, like "What's the weather like?", and then I also have to do the speech synthesis part, which is the synthetic voice of the AI voice assistant. So if you want to be updated when those videos come out, make sure you hit that like and subscribe button. Also, I have a Discord server that's getting pretty active. If you want a community of AI enthusiasts, practitioners, and hackers, make sure you join the Discord server. We want to start planning events in the Discord server, so if you're curious about what they are, make sure you join. Okay, so that's it for this video, and as always, thanks for watching.
