Speaker 1: Hi. In this video, we are going to see how we can perform streaming transcription for speech-to-text. Now, what is streaming transcription? In my last video, I discussed batch transcription: you take an audio file, you feed it to a speech-to-text model, it transcribes the entire audio in one shot, and it gives you the output. If you have not seen that video, you can click the link at the top and watch it, or the link is also in the video description below.

Streaming transcription is typically used for real-time transcription. Say you want to upload an audio file and convert it into text, but the audio is around 20 minutes, 30 minutes, or an hour long. The speech-to-text process itself might take 20 or 30 minutes, and you don't want to wait that long. That's where streaming transcription comes into play. You upload the audio, buffer it in chunks, feed each chunk to the speech-to-text model, and transcribe only that part. So, as the audio is being converted, you can see the output in real time. Streaming transcription also applies when you are talking into a microphone and you want the speech converted to text immediately. In a conversational AI system, when you say something like "OK Google, pull up the maps", the system starts typing as you speak, so the transcription has to happen in real time; there, too, streaming transcription comes into play.

So, let's get started with the video. I would recommend you watch the batch transcription video I mentioned earlier; it's in the YouTube description. But if not, that's fine as well: I will cover some of the concepts and show where streaming transcription and batch transcription differ.

For this video, I am first installing some Unix-specific libraries, libasound and PortAudio, which are prerequisites for DeepSpeech, and then I am installing DeepSpeech, the speech-to-text package we are going to use. I have covered DeepSpeech in a separate video, but to give a quick gist: DeepSpeech is an open-source package from Mozilla based on Baidu's Deep Speech research. It uses an end-to-end deep learning model to train an acoustic model, and on top of the acoustic model there is a language model to improve the accuracy of the output. I have already installed DeepSpeech. If you see over here, I am downloading two different files from the DeepSpeech website. One is the PBMM file, which is nothing but the acoustic model, the end-to-end deep learning model I was talking about. The other is the scorer file, which is nothing but your language model; it works on top of the acoustic model output to increase the accuracy of the transcription. I have already downloaded both files.

Now I am importing the DeepSpeech library, specifically the Model class from it, along with NumPy, os, wave, and json, some of which I am not going to use in this particular video. I also have IPython's display Audio to play the audio. Let me run this. Then I am assigning the paths: the model file path for the acoustic model and the scorer file path for the language model. These point to the two files I downloaded at the top, so I am telling it: this is my model file, and this is my language model.
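As a reference, here is a minimal sketch of the setup described above in Python. The package versions, download file names, and path variable names are assumptions based on the DeepSpeech 0.9.x release, not values confirmed in the video.

```python
# Prerequisites (shell commands, run in a terminal or notebook cell):
#   apt-get install libasound2-dev portaudio19-dev   # Unix audio libraries DeepSpeech needs
#   pip install deepspeech
#
# Pre-trained files from the DeepSpeech release page (file names assume the 0.9.3 release):
#   deepspeech-0.9.3-models.pbmm    -> acoustic model (the end-to-end deep learning model)
#   deepspeech-0.9.3-models.scorer  -> scorer, i.e. the language model

from deepspeech import Model
import numpy as np
import wave

from IPython.display import Audio  # only used to play the audio in the notebook

# Paths to the two downloaded files (names are assumptions)
model_file_path = 'deepspeech-0.9.3-models.pbmm'
lm_file_path = 'deepspeech-0.9.3-models.scorer'
```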
Next, I have multiple parameters to set. lm_alpha and lm_beta are for the language model. These values are taken from the DeepSpeech GitHub repo; they are the best parameters published there, based on the hyperparameter search done when the model was trained, but you can tune these parameters yourself and see how your transcription performs. The beam width is another parameter. The beam width basically tells you how many different word sequences will be evaluated by the model. If you have a beam width of 500, up to 500 word sequences will be evaluated to find the most probable one; if I give 100, it's going to evaluate 100 different word sequences. The higher the value, the better the transcription may get, but it also increases the processing time of your transcription.

Then I am calling the Model object I imported at the top and passing in the model file path. Once I have the model object, I am enabling the external scoring component, that is, the language model, and giving it the language model file path. I have run this. Next, I also set the parameters: lm_alpha and lm_beta for the scoring model, that is, the language model, and the beam width for the acoustic model. Let me run this.

Now, this is where it differs from the batch model. In the batch model, we read an audio file, took the entire audio into a buffer, and sent it for transcription. Here, we are going to create a stream, so that as the audio is read, we take a chunk of it and send that chunk for transcription. That's why I am creating a stream object here: I call model.createStream and it gives me a stream object.

After that, I am creating two functions. The first function reads the audio file I am sending. I pass the audio file, get its frame rate, and get the number of frames in it; I want to know how many frames are there so that I can iterate and buffer the audio over those frames. Then I read the frames into a buffer and return the buffer and the rate. So this function simply reads an audio file and returns the byte array of the audio along with its sample rate.

The next thing I am going to do is create one more function called transcribe streaming, because it transcribes the audio as it is getting streamed. I pass the audio file, call the read wave file function from above, and it returns the buffer and the rate, and I set some parameters. Then I check my offset: I start from offset zero, and as long as the offset is less than the length of the buffer, which is nothing but the length of the audio byte array, I keep iterating the loop. In this loop, I take a batch size of 8196 bytes, so I am taking roughly 8 KB of the audio every time and transcribing it. That's what this batch size is.
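To make the steps above concrete, here is a sketch of the model configuration and the wave-reading helper, assuming the standard DeepSpeech 0.9.x Python API. The lm_alpha and lm_beta values are approximately the defaults published with that release; they are assumptions, not values read from the video, and the helper name read_wav_file is my own.

```python
# Language-model weights and beam width (approximate 0.9.3 release defaults; tune as needed)
lm_alpha = 0.93
lm_beta = 1.18
beam_width = 500

# Load the acoustic model, attach the scorer (language model), and apply the parameters
model = Model(model_file_path)
model.enableExternalScorer(lm_file_path)
model.setScorerAlphaBeta(lm_alpha, lm_beta)
model.setBeamWidth(beam_width)

# Streaming differs from batch here: instead of calling model.stt(...) on the whole
# audio at once, we create a stream object and feed it chunks as they arrive.
stream = model.createStream()

def read_wav_file(filename):
    """Read a 16-bit PCM wav file and return (raw byte buffer, sample rate)."""
    with wave.open(filename, 'rb') as w:
        rate = w.getframerate()
        frames = w.getnframes()
        buffer = w.readframes(frames)
    return buffer, rate
```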
So, if you see over here, I take the offset plus the batch size, that is, the first 8 KB, and read it: I take the buffer object, which holds the audio, and slice it from the starting offset, zero in this case, to the end offset, which is 8 KB for me, so I get an 8 KB chunk. Once I have this chunk, I pass it to NumPy's frombuffer. If I were doing batch audio, I would pass the whole buffer directly into frombuffer, but in this case I take only a portion of the audio to transcribe each time. So I send the chunk to np.frombuffer, then I call stream.feedAudioContent with whatever data I got, and then stream.intermediateDecode, which decodes what the stream has seen so far, in this case the 8 KB of buffered audio, and gives me the text. I print the text. This print is something I have commented out in places, because otherwise it keeps printing one result after the other and you see a lot of duplicate data; if you want, you can clear the console instead, but I want to show you how it looks. Once the first 8 KB is read, I set my offset to the end offset, so the next iteration goes from 8 KB to 16 KB, and it keeps buffering the next 8 KB, and the next, until the end of the audio. The function returns True once the transcription is done. So, this is run.

Now I am going to download a wave file from the DeepSpeech repo, man1.wav; I will show you what it contains. I am downloading it as speech.wav. Let me run this, and if I list the directory, you can see I have speech.wav along with the pbmm and scorer files I downloaded earlier. Now, let me play this audio file; that's why I am calling the Audio object from the IPython library. So, let's listen to the audio. This is a very short audio clip.

Now I am going to take this audio file and call the transcribe streaming function, passing this audio file. Once I run it, you can see that rather than waiting for the entire file to be processed, it starts producing output. Just a minute: I think I did not run one of the functions at the top. Let me go and run read wave file. Yeah. And now let me call the transcribe function. You can see that, as the audio is being spoken, it runs and prints the output. If you see over here, at "in the course of December", because it has only taken part of the audio, it is trying to guess the word: it guesses "dice" instead of "this". As it gets more of the buffer, the acoustic model and the language model correct it based on which word is most probable there, and it interprets the words properly. Finally, in the last line, you can see it transcribing the entire sentence that we heard in the audio.

So, basically, streaming transcription is applicable in cases where you want to record from a microphone, or you have a very large audio file and you don't want to wait: you want to see the output as the transcription is happening. In those cases, you can use the streaming transcription function. Thank you very much.
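For reference, here is a sketch of the streaming loop walked through above, again assuming the DeepSpeech 0.9.x Python API and the helpers defined earlier. The chunk size and the speech.wav file name follow the video; the finishStream() call at the end is not shown in the video and is my addition.

```python
def transcribe_streaming(audio_file):
    """Feed the audio to the stream in ~8 KB chunks and print intermediate decodes."""
    buffer, rate = read_wav_file(audio_file)
    offset = 0
    batch_size = 8196          # roughly 8 KB of audio bytes per chunk, as in the video
    text = ''

    while offset < len(buffer):
        end_offset = offset + batch_size
        chunk = buffer[offset:end_offset]

        # Convert the raw bytes to 16-bit samples and feed only this chunk to the stream
        data16 = np.frombuffer(chunk, dtype=np.int16)
        stream.feedAudioContent(data16)

        # Decode everything the stream has seen so far and print the partial result
        text = stream.intermediateDecode()
        print(text)

        offset = end_offset

    return True

# Usage: the sample audio is saved locally as speech.wav, as in the video
transcribe_streaming('speech.wav')

# Not shown in the video: finishStream() returns the final transcript and closes the
# stream, after which a new stream would have to be created for the next audio.
# final_text = stream.finishStream()
```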