Stream Real-Time Audio with Assembly AI in Python
Learn to enhance audio streaming with Assembly AI for real-time transcription in Python. Customize settings to improve transcript delivery.
How to Build a Better User Experience with Customizable Real-Time Speech-to-Text
Added on 01/29/2025

Speaker 1: I would like to get an appointment for next Monday, ideally in the afternoon.

Speaker 2: Of course, the doctor has availability at 1:30pm or 3:45pm.

Speaker 1: That was a bit of a clunky experience, wasn't it? If we just customize the end of utterance detection with Assembly AI, we can make it much better.

Speaker 2: I would like to get an appointment for next Monday, ideally in the afternoon. Of course, the doctor has availability at 1:30pm or 3:45pm.

Speaker 1: Alright, so we can start by installing our dependencies. First I'm going to install Assembly AI's Python SDK. Then we need to install PortAudio, but I'm on a Windows laptop, so PortAudio is already installed for me. If you're on a Mac or a Linux laptop, you have to install it yourself; for Mac, that would be brew install portaudio, but I don't have to do it right now, like I said. The next thing I have to do is pip install the extras package of Assembly AI, which is written, in quotes, assemblyai with extras in square brackets. This package has some additional features that make it easier for us to stream audio through our microphone.

Once that's done, I can close this, and now, not in pip anymore, I'm going to import assemblyai as aai. Then I need to set my Assembly AI API key, which you can easily find on your Assembly AI dashboard once you log in. If you don't have an Assembly AI account yet, you can go to assemblyai.com; it will take you just around a minute to make the account, and you will get your free API key. One caveat with streaming is that right now it is a paid feature. Normally when you make an account with Assembly AI, you get 100 hours of async transcription for free, but to use streaming, you have to add your credit card information to Assembly AI. At the time of making this video that is the case, but if you want to check the latest status, you can go to assemblyai.com/pricing.

Next I'm going to create some functions to handle some events. The first one will be on open, the second one will be on error, and the last one will be on close. These are just functions to handle events from the streaming transcription. On open will take a session opened argument of type Assembly AI real-time session opened, and it will just inform us that the session has been opened: session opened with ID. On error, we will be handling the errors thrown by Assembly AI and simply print the error message that was passed to us. And on close, we just report that the session is closed.

The last function that we're going to create is on data, which is the function that actually handles all of the data passed to us from Assembly AI, so the transcriptions. What gets passed in is the transcript, and its type will be Assembly AI real-time transcript. If there is no transcript text yet, we do not have to do anything. But if there is a transcript, what we're going to check is what kind of transcript it is. With real-time transcription, with streaming, you get two types of transcripts: the first one is a partial transcript, and the other one is a final transcript. We will see what they look like in a second, but basically, partial transcripts are returned by Assembly AI continuously while you're saying something, so periodically we will see a sentence that is getting longer and longer. The final transcript comes once Assembly AI decides that the utterance is finished; then it sends you a formatted sentence, with punctuation and everything, that looks a bit more polished. So if the transcript is an instance of Assembly AI real-time final transcript, we can just print it and, at the end, skip to the next line. If it is a partial transcript, I will just copy this: again, we're going to print it, but we're only going to leave a space and not go to the next line, so that we can see the sentence building up longer and longer.
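Put together, the setup and event handlers described so far look roughly like the sketch below. This is based on the AssemblyAI Python SDK's real-time interface as I understand it, not code shown verbatim in the video, and the YOUR_API_KEY placeholder is something you replace with your own key from the dashboard.

```python
# Dependencies (PortAudio is preinstalled on Windows; install it yourself otherwise):
#   pip install assemblyai
#   pip install "assemblyai[extras]"
#   brew install portaudio   # macOS; on Linux, use your package manager

import assemblyai as aai

# Replace with the API key from your AssemblyAI dashboard.
aai.settings.api_key = "YOUR_API_KEY"

def on_open(session_opened: aai.RealtimeSessionOpened):
    # Informs us that the streaming session has been opened.
    print("Session opened with ID:", session_opened.session_id)

def on_error(error: aai.RealtimeError):
    # Handles errors thrown by AssemblyAI.
    print("An error occurred:", error)

def on_close():
    # Informs us that the session is closed.
    print("Session closed")

def on_data(transcript: aai.RealtimeTranscript):
    # Handles the transcripts streamed back to us.
    if not transcript.text:
        return  # nothing to show yet
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        # Final transcript: punctuated and formatted, so move to the next line.
        print(transcript.text, end="\r\n")
    else:
        # Partial transcript: stay on the same line so we can watch the sentence grow.
        print(transcript.text, end="\r")
```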
And then I'm going to define a transcriber with all the functions that we just created, called a real-time transcriber from Assembly AI. On data, we will call the onData function. On error, we'll call the onError function. On open, not surprisingly, the onOpen function. And on close, we will call the onClose function. Finally, the last parameter that I need to pass to Assembly AI is the sample rate. We recommend a sample rate of 16,000. The higher the sample rate, the higher the quality of the audio, but that also means that you are passing more data over your network. 16,000 is more or less a medium audio quality. If you want high audio quality, you can go as high as 48,000. If you want lower audio quality, because you don't want that much traffic on your network, you can go as low as 8,000.

Let's connect our transcriber. So transcriber connect will call the onOpen function to start the session. Then we're going to connect it to our microphone: the microphone stream will be Assembly AI's microphone stream from the extras package, and again, we need to pass it a sample rate. Just make sure that you're using the same sample rate here as you did above. Now we can start the stream here, transcriber stream, and the stream will be the microphone stream that we just created. And lastly, once the program is done, we can close the transcriber.

All right, so this is all we have to do. In total it's not even 40 lines of code with all the gaps. Let's start running it and see how the streaming runs by default. Hello. This tutorial is very easy to follow, I would say. As you can see, at first we get the partial transcripts returned, and they look very raw. Then, after Assembly AI decides that the utterance is over, it sends us the better looking, finalized version. So as you saw, we connected to our microphone, we were able to stream audio to Assembly AI, and we got the transcript.

One little thing you might have noticed, which we also showed in the example at the beginning of this video, is that it waits a while after I stop talking before it sends us the final transcript. By default, that is 0.7 seconds right now with Assembly AI. But if your use case requires a lower threshold for when people stop talking and when you can return the transcript, you can customize that in Assembly AI very easily. All we have to do is go to our real-time transcriber and pass it a new parameter, which is the end-of-utterance silence threshold. Like I said, by default it is 0.7 seconds, so that is 700 milliseconds. We can make it maybe 300 milliseconds and then see if we notice a difference. Small typo there. Fixed it. All right, let's go back and look at it again. Hello. Are you going to wait a bit less time this time around to return me the final transcript? All right. As you can see, even the gaps that I take within a sentence cause a new sentence to start, so yes, 300 milliseconds is quite short.

If there is another way that you can tell the utterance is over, let's say you are connected to a Zoom stream and you know that someone muted themselves or dropped from the call, you might want to end their utterance immediately at that point without having to wait 300 milliseconds, or 500, or 700 milliseconds, whatever your end-of-utterance threshold is. For that, you would call transcriber force end utterance, and this way the utterance will end immediately without waiting for any time, so no delays.
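Continuing the sketch from above, the transcriber, microphone stream, and the end-of-utterance customizations might look like this. To the best of my knowledge, end_utterance_silence_threshold and force_end_utterance() are how the Python SDK exposes the two features discussed here; the 300 ms value mirrors the one tried in the video.

```python
# Continuing from the handlers above: wire everything into a real-time transcriber.
transcriber = aai.RealtimeTranscriber(
    on_data=on_data,
    on_error=on_error,
    on_open=on_open,
    on_close=on_close,
    sample_rate=16_000,                   # 8_000 = low, 16_000 = medium, 48_000 = high quality
    end_utterance_silence_threshold=300,  # default is 700 ms; lower ends utterances sooner
)

# Open the streaming session (this fires on_open).
transcriber.connect()

# Stream audio from the microphone; use the same sample rate as above.
microphone_stream = aai.extras.MicrophoneStream(sample_rate=16_000)
transcriber.stream(microphone_stream)

# If you already know the speaker is done (muted, dropped from the call, ...),
# you can end the utterance immediately instead of waiting for the threshold:
#     transcriber.force_end_utterance()

# Once the program is done, close the transcriber.
transcriber.close()
```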
And the last customization I wanted to show you: if you do not want the partial transcripts at all, you don't want to handle them, you don't want to receive them, you only want the final transcripts, you can just set disable partial transcripts to true. So let's see how that runs now. The session has started. I expect to receive no partial transcripts, only the final transcripts. That looks great. It's fast, it's snappy, it takes not even a second to get the final transcript, and I customized the threshold to be very low, so it does what I wanted.
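A sketch of that last flag, assuming the disable_partial_transcripts parameter of the same RealtimeTranscriber constructor used above:

```python
# Same transcriber as before, but with partial transcripts turned off entirely:
transcriber = aai.RealtimeTranscriber(
    on_data=on_data,                  # now only receives final transcripts
    on_error=on_error,
    on_open=on_open,
    on_close=on_close,
    sample_rate=16_000,
    end_utterance_silence_threshold=300,
    disable_partial_transcripts=True,
)
```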
And that's it. This has been how to use streaming on Assembly AI with Python. If you have another programming language that you want to use, you can always go to Assembly AI's documentation; there we have getting started guides not only for Python, but also for JavaScript, Go, and Java. If you're interested in building a voice bot using Assembly AI, go and check out Smitha's video on how to build a voice bot using Assembly AI and ElevenLabs. The link will be here, but also in the description. Thanks for watching and I will see you in the next video.