Building Real-Time Speech-to-Text with Kafka
Explore creating a real-time speech-to-text system using AI models and Kafka, focusing on latency reduction, accuracy, and modular post-processing techniques.
Why Local AI is the Future: Real-Time Speech Transcription with Whisper + Kafka
Added on 01/29/2025

Speaker 1: Hello and welcome, everyone. Let's continue our Kafka journey and build this real-time speech-to-text system. Speech-to-text means converting spoken words into text almost instantaneously, and it's achieved with advanced AI models: OpenAI's Whisper, Google's Speech-to-Text API, or similar technologies. The real challenge is minimizing latency while maintaining high accuracy, especially with accents, diverse languages, or noisy environments. As you can see, the system is transcribing my voice on the go, with very low latency and good accuracy.

I'm going to refresh the screen on the right to show you something that happened in the background: this transcriptions.all topic, which did not exist before, got created. And when I pause, you can see the whole aggregated transcription being printed on the console, so there is some voice activity detection going on here too, which is what I want to show you. I'll pause for a few seconds and you'll see the voice activity detection kick in and report that no voice is detected. There we go. Once no voice is detected, an action is taken: if I go into this transcriptions.all topic and open the messages, you'll see messages flowing in right now. These are offsets 0, 1, 2, 3, 4, with offset 4 being the most recent message. As soon as I stop talking, this last message that I'm speaking right now will get transcribed on the fly and flow in here too.

Keep an eye on this GPU and what happens to it as soon as I stop talking. You saw how it spiked to 100% for a moment, did the transcription, and came back down. If I refresh again, a fifth message shows up; expanding it, you can see the actual message that went from the transcription system all the way to Kafka.

So how did this happen? Let me show you the architecture diagram. This is the bigger picture, with a producer, a consumer, and so on. Among all these components, today we'll only talk about the producer and posting the message to Kafka. The producer is our voice activity detection plus voice transcription, plus getting the text ready to be sent to Kafka. So there are two technologies working together here: technology one gets the speech-to-text in action, and technology two sends that data to Kafka. And while we are sending to Kafka, there are a lot of different things you will learn from this small project.
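To peek at the messages landing in the transcriptions.all topic outside of the Kafka UI, a minimal consumer could look like the sketch below. The broker and schema registry URLs and the group id are assumptions (a local setup like the one in the demo); only the topic name comes from the video.

```python
# Minimal sketch: inspect messages in the transcriptions.all topic.
# Assumes a local broker and schema registry; URLs and group id are illustrative.
from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",            # assumption: local Kafka broker
    "schema.registry.url": "http://localhost:8081",   # assumption: local schema registry
    "group.id": "transcription-inspector",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["transcriptions.all"])

try:
    while True:
        msg = consumer.poll(1.0)          # wait up to 1s for a record
        if msg is None or msg.error():
            continue
        # The Avro payload is deserialized into a dict via the schema registry.
        print(msg.offset(), msg.value())
finally:
    consumer.close()
```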
You can incorporate the tools, techniques, and tips from this video into your own projects to decouple the architectures of your AI-based projects, or really any other project you might be working on. You'll also see me pausing here and there throughout the video, because I want you to see how fast the transcription is. There are configurations in the application we're building where you can set how long to wait before detected voice gets transcribed. Right now I have that delay set to two seconds, so after two seconds of no voice being detected, the transcription process starts. There, as you saw, it kicked in after two seconds of silence, took all that data, and dumped it into the Kafka topic.

While we're in the Kafka UI, I also want to show you that a schema has been registered for this topic in the schema registry. If I open it, you can see there is a record named TranscriptionSegment, and the namespace is sts.transcription. All of these are valuable, but the most valuable part is the fields section: there are three aspects to the data coming in, timestamp, text, and user ID, each of type string. This is the schema you pass along with the message, and the schema can evolve over time; we'll also talk about how the whole schema registry works with the messages being sent to the Kafka topic.

Now let's look at the architecture from the speech-to-text point of view. What you'll notice is that there is a producer-and-consumer kind of architecture, and this producer/consumer is different from the Kafka-based producer and consumer we just talked about. Here, producer and consumer are focused only on the voice-to-text conversion: the producer handles recording and voice activity detection, so it detects whether voice is present, captures it, and hands it to the consumer for transcription, while the consumer handles the transcription using faster-whisper. Those are the two main concepts you'll see in the speech-to-text part. And of course there is a Kafka integration at the end, which sends the final text downstream, for instance for analytics, for text-to-speech, or for the intent analysis we'll look at in the next video.

So this is like a mini assembly line for voice data: we record audio, detect when someone is actually talking, break it into smaller files, push them to a queue, and then transcribe them. Simple, right? Let's do a quick walkthrough of the different pieces of code that come together to make this work, starting with how the whole directory is laid out.
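For reference, the value schema shown in the schema registry (a TranscriptionSegment record with three string fields) would look roughly like the sketch below. The field names and namespace are reconstructed from what the UI shows, so treat them as assumptions.

```python
# Rough sketch of the Avro value schema described above; names reconstructed
# from the schema registry view, so treat them as assumptions.
TRANSCRIPTION_VALUE_SCHEMA = """
{
  "type": "record",
  "name": "TranscriptionSegment",
  "namespace": "sts.transcription",
  "fields": [
    {"name": "timestamp", "type": "string"},
    {"name": "text",      "type": "string"},
    {"name": "user_id",   "type": "string"}
  ]
}
"""
```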
You'll see a file called VAD.py, and this is where all the voice activity detection magic happens. Then there's transcriber.py, which takes the audio chunks and uses faster-whisper to transcribe them. Then there is post_processing.py: after we've received the text, this is where we can do some kind of post-processing and optionally send it to Kafka, or wherever else we want. It has been kept modular so you can plug in your own post-processing methods.

We also have helpers.py with utility functions; this is where we load the configuration file. The configuration lives in config.yaml, and it holds things like your preferred microphone and which voice model you're using. You'll see I'm using float32, the full-precision setting, because I have an NVIDIA 3090 running CUDA. If you want to run on CPU, you can set the device to CPU and lower the precision to int8 so your CPU can keep up with real-time processing. In helpers.py there are more methods too: clearing the context, which is something I was implementing (I'll minimize that for now), and saving transcriptions to JSON, so that after your transcriptions are done they get written to a transcription history. Everything you said in a session, from the moment the code started, ends up in a nicely formatted JSON file. There are a few other utility methods in this file as well.

Then you have audio_utils.py, which is an interesting file because it deals with the local audio devices on your hardware. For example, I have multiple microphones hooked up to this machine. Let me let it process this particular stretch of speech; it was going on for a long time. There we go. Having multiple microphones attached creates the need to list the different microphones you have and pick whichever one you want, and that's where audio_utils.py comes into the picture: it lets you pick the correct microphone for your use case.

And then, of course, there is main.py, where execution starts. This is the heart of it all: main.py is where you'll see the producer and consumer threads being started and then joined, so the system waits for them to finish before proceeding. It's a very modular approach, so if you want to reuse the speech-to-text part of the system, you can easily take the bits and pieces from here, modify the post-processing, and keep developing your own application. In post-processing, after your voice is recorded and the recording is done, the latest transcription is handed to whatever you want to do with it; in our case we just send it to Kafka by calling the produce_message method.
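As a rough idea of what the device-selection helper in audio_utils.py mentioned above might do, the sketch below lists input devices with the sounddevice library and picks one by a configured name substring. The library choice and helper names are assumptions; the actual project may use a different audio backend.

```python
# Hypothetical sketch of microphone selection, assuming the sounddevice library;
# the real audio_utils.py may use a different backend (e.g. PyAudio).
import sounddevice as sd

def list_input_devices():
    """Return (index, name) pairs for devices that can record audio."""
    devices = []
    for idx, dev in enumerate(sd.query_devices()):
        if dev["max_input_channels"] > 0:
            devices.append((idx, dev["name"]))
    return devices

def pick_device(preferred_name: str):
    """Pick the first input device whose name contains the configured string."""
    for idx, name in list_input_devices():
        if preferred_name.lower() in name.lower():
            return idx
    raise RuntimeError(f"No input device matching '{preferred_name}' found")

if __name__ == "__main__":
    for idx, name in list_input_devices():
        print(idx, name)
    print("Selected:", pick_device("Speak 410"))  # e.g. the speakerphone from the demo
```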
In the produce_message method we simply send that particular message to Kafka, and you can do whatever else you want there instead. If you don't want to call Kafka at all, you could hit some REST API to do other processing, or send your transcribed text straight to an Ollama server with a REST call; just swap this produce method out for something else and call that instead.

With that said, let's go to main.py. At the start, main.py uses argparse to parse the command-line arguments for the model name, the device, and an optional sampling rate. We also have signal handlers, which are used to gracefully shut down the application when we press Ctrl+C; let me give you a short demo of that so you can see what's going on. I'll let this processing finish, and you can really appreciate how this huge transcription took just about a second to process. I'm still fascinated by the leaps the technology has made: even though I'm using the full-precision 32-bit model, it still processes this fast. The Whisper large-v3-turbo model running on faster-whisper is really amazing. I'll let this finish and then press Ctrl+C so you can see how the system gracefully shuts down. Here's Ctrl+C, and you can see that the system shut down gracefully; it cleaned up temporary files and so on.

One more thing happened: in the transcription history, right here, all the transcriptions we produced got stored. No matter how long each transcription was, it got nicely formatted and saved in this transcription history, so you can also use this as a log: if you just want to keep a record of what you said while doing something, you'll have it per timestamp.

Let me go back, restart the application by running main.py, and continue with the video. When the application starts, it probes for the particular microphone I've configured in config.yaml. Let me show you: in config.yaml I have already configured a microphone, the Jabra Elite, sorry, the Jabra Speak 410. Why? Because the Elgato Wave XLR is the mic I'm using for OBS right now, so I've excluded that one from speech-to-text and I'm using a separate microphone for the transcription.

Back to main.py: we have signal handlers to gracefully shut down the application, and we saw that Ctrl+C did exactly that; it didn't just blow up, it actually cleaned up some of the files it had been creating. And which files was it creating? If I stop talking for a second, you'll see a particular audio file appear. I'll pause here, keep an eye out... there we go, it's right here: tempaudio1.wav. So what is this audio file?
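A minimal sketch of the kind of startup code described above: argparse for model, device, and sample rate, plus a SIGINT handler for graceful shutdown. The flag names, defaults, and cleanup routine are assumptions, not the project's exact code.

```python
# Hypothetical sketch of main.py's startup: CLI args plus graceful Ctrl+C handling.
# Flag names and the cleanup routine are assumptions.
import argparse
import signal
import sys

def parse_args():
    parser = argparse.ArgumentParser(description="Real-time speech-to-text")
    parser.add_argument("--model", default="large-v3-turbo", help="Whisper model name")
    parser.add_argument("--device", default="cuda", choices=["cuda", "cpu"])
    parser.add_argument("--sample-rate", type=int, default=16000)
    return parser.parse_args()

def cleanup_and_exit(signum, frame):
    # e.g. remove temp audio files, flush the Kafka producer, save the JSON history
    print("Shutting down gracefully...")
    sys.exit(0)

if __name__ == "__main__":
    args = parse_args()
    signal.signal(signal.SIGINT, cleanup_and_exit)   # Ctrl+C triggers the cleanup path
    print(f"Starting with model={args.model} on device={args.device}")
    signal.pause()  # placeholder for starting and joining the real producer/consumer threads
```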
Let me pull this window over to the console, and I'll stop talking again; you'll see another file appear in a few seconds. So there are two audio files that get generated. Does that mean a new file is generated every time I talk and stop? It does not, so what happened there? This is actually the flip-flop mechanism that many production-grade applications use.

Here's what's happening: whenever I talk and voice activity detection is recording my voice, it records into temp audio one first. When I stop talking, the transcription of that file kicks in, but while that transcription is happening I don't want to lose whatever I say next. So if I was talking, I stopped, transcription is now running, and I start talking again, I want my voice to be recorded into the second file: the current active file switches to file two while my voice is being recorded. When I stop talking again, audio file two is handed to the transcriber and the active file switches back to audio file one. This is one way to make sure I can keep talking continuously, and even if the system takes a while to process my voice, nothing said during transcription gets lost. That's the flip-flop mechanism: flip is the currently active voice file and flop is the backup file where new voice is recorded while the other is being transcribed.

This really helps on systems where transcription takes a long time. Say you don't have powerful hardware like mine, no 3090 or 4090 lying around, and you're doing CPU-based transcription; if you've let it run as long as I've been talking, the time your CPU needs to process can stretch to 15 or 20 seconds or more. During that window, while the transcription is happening, if you're still talking, the flip-flop mechanism kicks in and nothing you say gets lost. It becomes really handy in those situations.

Okay, you can see that while the transcription was happening, even the 3090 kind of struggles with a huge transcription at full precision, which is interesting in itself. I'll let this work in the background and focus on some other aspects of main.py. We do have device selection, as you can see: we get the list of devices, and then finally main.py starts the two threads, producer and consumer, at the bottom. Let me show you: there is a producer thread that records the audio stream, and a consumer thread that transcribes the audio stream, which is the whole producer/consumer concept. Again, this is separate from Kafka's producers and consumers, so don't confuse the two; I've introduced my own producer/consumer concept for the speech-to-text transcription here.
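The flip-flop mechanism described above could be sketched roughly as below: two temp WAV paths, one active for recording while the other is being transcribed, swapped each time a segment is finalized. The file names match the demo; the class structure and queue hand-off are assumptions.

```python
# Rough sketch of the flip-flop temp-file mechanism: record into one file while
# the other is being transcribed, then swap. Structure is an assumption.
import queue

TEMP_FILES = ["tempaudio1.wav", "tempaudio2.wav"]

class FlipFlopRecorder:
    def __init__(self, audio_queue: queue.Queue):
        self.active = 0                    # index of the file currently being recorded into
        self.audio_queue = audio_queue

    @property
    def active_file(self) -> str:
        return TEMP_FILES[self.active]

    def finalize_segment(self) -> str:
        """Called when silence is detected: hand the finished file to the
        transcriber and flip recording over to the other temp file."""
        finished = self.active_file
        self.audio_queue.put(finished)     # consumer thread will transcribe this file
        self.active = 1 - self.active      # flip: keep recording without losing speech
        return self.active_file            # new file that incoming audio is written to
```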
Alright, so notice how the queue is the link between these two threads; if you don't set up the concurrency the right way, your transcription might lag or your audio might drop out, so watch closely how these pieces work together.

Now we can move on to VAD, the voice activity detection. This is what you've been watching: all these informational messages, "voice detected" and so on, are the VAD at work, and trust me, VAD is a game changer. For voice activity detection we are using the Silero VAD model, which helps us detect when someone is talking versus when there is silence. We record one-second chunks and check whether they contain speech. If there is speech, we accumulate it into a buffer; if there is silence for more than a certain threshold, which can be configured in config.yaml (I'll show those settings in a moment, once I finish my thought), we finalize that chunk, write it to a temporary file, and push it onto the queue. That is how it's connected end to end.

So what are those configuration settings? Let me go to config.yaml and show you. There are three settings related to speech. Setting number one is the time gap threshold, a concept I've introduced. Say a user is talking; let me actually demonstrate it. If I'm talking right now, the time gap threshold is very small, so it won't really matter. But say I'm speaking in short sentences, not long ones: I talk, pause for a second, talk, pause, talk, and so on, and the end-speech delay is also not very high. In that case, we do not want to treat two voice segments as separate segments when they're close to each other in time. Let me demonstrate: I'm going to shut this application down, and to show you the effect, I'll increase the end-speech delay to one second and change the time gap threshold to 10.
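A minimal sketch of the VAD loop described above, using the Silero VAD model from torch.hub: score short audio frames for speech probability, buffer speech, and finalize a segment after a configurable stretch of silence. The frame size (Silero works on 512-sample frames at 16 kHz rather than whole one-second chunks), thresholds, and the sounddevice recording backend are assumptions.

```python
# Hypothetical sketch of the VAD loop: Silero VAD scores short audio frames and a
# segment is finalized after `END_SPEECH_DELAY` seconds of silence.
import queue
import numpy as np
import sounddevice as sd
import torch

SAMPLE_RATE = 16000
FRAME_SAMPLES = 512                  # Silero VAD expects 512-sample frames at 16 kHz
SPEECH_PROB_THRESHOLD = 0.5
END_SPEECH_DELAY = 2.0               # seconds of silence before finalizing a segment

model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

def vad_loop(audio_queue: queue.Queue, write_segment):
    """write_segment is assumed to save the buffered audio (e.g. to a flip-flop
    temp file) and return its path."""
    buffer, silence = [], 0.0
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                        blocksize=FRAME_SAMPLES) as stream:
        while True:
            frame, _ = stream.read(FRAME_SAMPLES)
            frame = frame[:, 0]
            prob = model(torch.from_numpy(frame), SAMPLE_RATE).item()
            if prob >= SPEECH_PROB_THRESHOLD:
                buffer.append(frame)             # speech: accumulate into the buffer
                silence = 0.0
            elif buffer:
                silence += FRAME_SAMPLES / SAMPLE_RATE
                if silence >= END_SPEECH_DELAY:  # enough silence: finalize the chunk
                    path = write_segment(np.concatenate(buffer))
                    audio_queue.put(path)        # hand off to the transcriber thread
                    buffer, silence = [], 0.0
```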
Okay, so what that means is that the end of speech will now be calculated more aggressively: as soon as I pause, even for a second, it starts pushing the captured voice out into audio segments. But when it constructs one particular transcribed text, it checks whether ten seconds or more passed between the time the speech stopped and the time speech started again. Let me demo it, and it will make more sense. The model is loading... waiting for the message... okay, there we go. I've said something, I'll pause, and now I'll continue talking, and because ten seconds have not passed, it is not going to treat this as a separate segment. If this worked the way it should (let's see whether there's a bug in the system or not), this whole stretch will be treated as one segment, which it was. Even though there were three different spoken chunks, the first was "waiting for the message, okay, there we go, so I have said something and I'll pause"; that's where I paused, and we can check that here too. Ten seconds didn't pass before I moved on to the next chunk, and again not before the third one, so because the gaps stayed inside the threshold, they were grouped together. This helps if you want faster transcriptions: if you pause often and don't want the buffer to pile up, you can lower the end-speech delay and increase the time gap threshold, and you'll get longer segments sent downstream for processing. That was the whole idea behind the time gap threshold. I'm going to start the transcription back up and continue talking.

Okay, the next part. We were on the VAD, alright. So why does VAD actually matter so much? It matters because it really is a game-changing piece of technology: up until now, we would just record audio segments and send them to the backend system for processing even if there was no voice in them. And the distinction isn't just voice versus silence; there could be a lot of noise, say a hammer banging in the background, which is sound but not a human voice. VAD actually differentiates between noise and human voice, and that's why you want it in the picture. Without VAD, you'd continuously send audio segments to the downstream systems and only then check whether each segment actually contained any human voice, which puts a lot of load on those systems, when voice activity detection could instead be handled up front by your own application on the edge device. These Silero models are very cheap to deploy and don't require a lot of resources. This is what you've seen in Google Home and Alexa devices, which have voice activity detection built in; they detect voice and also detect a wake word on top of the VAD.
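A small sketch of the grouping rule described above: consecutive transcribed segments are merged into one text when the gap between them is below the time gap threshold. The segment structure and field names are assumptions for illustration.

```python
# Hypothetical sketch of grouping segments by the time gap threshold.
# Each segment is assumed to carry start/end times in seconds plus its text.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    text: str

def group_segments(segments: list[Segment], time_gap_threshold: float = 10.0) -> list[str]:
    """Merge consecutive segments whose silence gap is below the threshold."""
    groups, current = [], []
    for seg in segments:
        if current and seg.start - current[-1].end >= time_gap_threshold:
            groups.append(" ".join(s.text for s in current))
            current = []
        current.append(seg)
    if current:
        groups.append(" ".join(s.text for s in current))
    return groups

# Example: gaps of 3s and 4s stay under a 10s threshold, so all three merge into one text.
print(group_segments([Segment(0, 5, "waiting for the message"),
                      Segment(8, 12, "I have said something and I'll pause"),
                      Segment(16, 20, "and now I'm continuing")]))
```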
Okay, let's go down to the transcriber section and see what has been implemented there. The transcriber is one of my favorite parts of this code. The thread picks up the audio chunks from the queue that VAD created and passes them to a transcription model; in our case we're using the Whisper models, which are free to use. The one configured in config.yaml is Whisper large-v3-turbo, and it's an amazing model, a real game changer. Not only can it transcribe English; if I speak to it in Hindi, it can transcribe Hindi, and a multitude of other languages too. Let me give you a quick demo: I'll speak in Hindi for a bit and we'll see whether it can transcribe it. Well, that wasn't Hindi; I think it decided it was Urdu, because I was being more polite, and Urdu is a slightly more polite language (Hindi is too, but Urdu more so, of course). Let me try something else in Hindi, maybe more traditional Hindi: "Do ladke khaana kha rahe hain?" Perfect. That is definitely Hindi. So it doesn't only transcribe one language, it can transcribe multiple languages. There's an argument you can pass that marks it as multilingual; you can see the multilingual argument has been set to true, and it then automatically detects what language is being spoken and transcribes accordingly.

The transcriber itself does fairly simple things: it loads the model from config.yaml (you can also pass the model via command-line arguments), transcribes each chunk file, and groups small segments if they are close to each other in time, which is the time gap threshold concept we just saw, so your final transcript looks neat. Then it sends the latest transcribed segment to post-processing; let me find that, here it is, invoke after transcription. After that, post-processing can do whatever it wants; it can send the text to Kafka, which is what it's doing in our use case.

Let's see how many messages we've pushed into Kafka by now. The last time we checked there were seven. There's a ton of messages flowing in now; the most recent one is at offset 37. Let's see if the Hindi message made its way in. There it is: "two boys are eating food" also came through.
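A minimal sketch of the transcription step with faster-whisper, roughly matching the config discussed (large-v3-turbo on CUDA with float32, or int8 on CPU). The exact model identifier and parameters are assumptions; leaving the language unset lets Whisper auto-detect it, which is one way the multilingual behaviour shown above can be achieved.

```python
# Hypothetical transcriber sketch using faster-whisper; model name, device, and
# precision mirror the config discussed in the video but are assumptions.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float32")
# On CPU, something like: WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")

def transcribe_chunk(wav_path: str) -> str:
    # language=None lets Whisper auto-detect Hindi, English, etc.
    segments, info = model.transcribe(wav_path, language=None, beam_size=5)
    print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
    return " ".join(seg.text.strip() for seg in segments)

if __name__ == "__main__":
    print(transcribe_chunk("tempaudio1.wav"))
```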
Alright, let's keep going and look at what's happening in the last section, post-processing. If you're a big data fan you'll love this, because after each transcription we validate the text with Pydantic. We take the incoming data and validate that it forms a valid Pydantic model, the TranscriptionSegment. I didn't mention it earlier, but in the models section we've defined a Pydantic model with three attributes: timestamp, text, and the user ID. Those three attributes make up the TranscriptionSegment Pydantic model, or schema, whatever you want to call it. In post-processing we accept this TranscriptionSegment, pass it to the produce_message method, and send it to Kafka using Avro for schema enforcement. This is huge if you have a robust pipeline that expects strongly typed messages. If you're not using Kafka, that's fine too; you can adapt this to, say, a REST API call. But if you want a bulletproof data pipeline, hooking this up is a next-level skill, so definitely try it out; if you are serious about real-time data streaming, Kafka is the way to go.

What I've done here, if you look, is provide a schema while producing the message. Let me scroll up and show you a couple of other things. To understand how the messages are being sent to Kafka, I'd have to explain how Kafka works, and the good news is I've already done that in detail in three other videos in this same series. If you watch those, you'll see how Kafka works end to end and how you produce and consume messages. I'll still go a bit deeper in this video, because I want to cover end to end how Avro is working here.

First we spin up all of our Docker services, which is easy enough: take this docker-compose file, go to that directory, run docker compose up, and once Docker is up you have all these containers from the intent-analysis setup running. You don't need all of them; you do need the Kafka UI, the Confluent schema registry, and the Bitnami Kafka broker. But if you just run docker compose up in this intent-analysis folder, all the services come up as you need them. I've gone into detail about what these services are in my previous video, so I won't repeat that, but that's where all these services come from, and everything is free: it pulls from free repositories and brings the services up for you.

After your services are up, you have to look at the schema. What is a schema? If you look at transcriptionvalue.avsc, this is the format in which we want to send our data, and if you remember, we saw that same format earlier in the schema registry.
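A small sketch of the Pydantic model described above; the class and field names are reconstructed from the video, so treat them as assumptions.

```python
# Hypothetical sketch of the Pydantic model used to validate each transcription
# before it is produced to Kafka; field names are assumptions.
from pydantic import BaseModel

class TranscriptionSegment(BaseModel):
    timestamp: str
    text: str
    user_id: str

# Validation happens simply by constructing the model:
segment = TranscriptionSegment(
    timestamp="2025-01-29T12:00:00Z",
    text="two boys are eating food",
    user_id="speaker-1",
)
print(segment.model_dump())   # dict ready to hand to the Avro producer
```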
In the schema registry, this is the same schema that got registered for us, and the way that happens is that you send the very first message with this particular schema attached to it. Where does that happen? In post-processing: the Avro producer is pointed at a schema registry, and we don't only send the message, we also send the Avro schema associated with it; that's the magic here. When we create the producer, we define the bootstrap server (plus any security configuration if you have it; in our case it's all local, so we just have the broker URL), the schema registry URL, and which schema we're using, the value schema in our case. We aren't sending keys, but if you have a key you can also supply a key schema. For the very first message, it registers that schema with the registry. You can also register the schema directly with a REST API call, but registering it via the first message works too, as long as those things are configured in your docker-compose, which they are in our case.

Once you send the message, it validates that the message abides by the schema, and if it does, it sends the message and registers it against the topic. From then on, all your messages come in with that schema attached, and when your messages are consumed later, the consumer can fetch the schema for each message. You can also evolve the schema over time: for future messages you can change an attribute, or add or remove one, without having to change the topic, which is huge. If your consumers are Avro consumers, consumption can fall back to a previous schema or use the most recent one, and we'll see all of that when we get to the intent analysis section in a future video.

And that's how the end-to-end speech-to-text system works. If I go back to my original diagram: the producer captures the speech, converts it to text, and once you have the text, you send it to Kafka. Sounds very simple, right? But there are a ton of steps involved. Once you've built this end-to-end application, where you can send a message to Kafka using the schema registry, you've achieved one of the most reliable ways of streaming data. I'll leave you with that thought. Thank you, everyone, for watching this video, and I'll see you in the next one.
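A minimal sketch of the produce path described above, using the Avro-aware producer from confluent-kafka: the broker URL, schema registry URL, and value schema are supplied up front, and the first produced message registers the schema. The URLs and topic name are assumptions matching a local setup, and the schema string repeats the sketch from earlier.

```python
# Hypothetical sketch of produce_message with an Avro-aware producer; broker and
# registry URLs are assumptions, and the schema matches the earlier sketch.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

TRANSCRIPTION_VALUE_SCHEMA = """
{ "type": "record", "name": "TranscriptionSegment", "namespace": "sts.transcription",
  "fields": [ {"name": "timestamp", "type": "string"},
              {"name": "text", "type": "string"},
              {"name": "user_id", "type": "string"} ] }
"""

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",           # local Bitnami Kafka broker
        "schema.registry.url": "http://localhost:8081",  # Confluent schema registry
    },
    default_value_schema=avro.loads(TRANSCRIPTION_VALUE_SCHEMA),
)

def produce_message(segment: dict, topic: str = "transcriptions.all") -> None:
    """Send one validated transcription segment; the first send registers the schema."""
    producer.produce(topic=topic, value=segment)
    producer.flush()

produce_message({"timestamp": "2025-01-29T12:00:00Z",
                 "text": "two boys are eating food",
                 "user_id": "speaker-1"})
```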
