Speaker 1: Transcribing audio in real time can be really hard, especially if people are speaking really fast or really slowly, if they're using a lot of filler words, or if there is noise in the background. But fear not, because Assembly AI has its own real-time transcriber. In this video, I'll show you how to use Assembly AI's real-time transcription endpoint and also use it in a streaming application, just for fun. To follow along, go to the link in the description and create your own Assembly AI account.

Okay, so let's get started. The first thing I want to set up is the Assembly AI side of things: I need an API token that will work for this project. Very simple: just go to assemblyai.com, or use the link we have in the description, to create an account. If you already have an account, you can easily sign in, and immediately after you create an account you have a free API token from Assembly AI, which you'll be able to see on your profile too. Once you've done that, for this project specifically you'll have to upgrade your account to be able to use the real-time transcription capability. For that, simply go to billing and click upgrade. I've already done that, so that option isn't visible to me here, but if you go to billing you'll be able to upgrade your account. Once you go back, you'll see that your API key now specifies it is on a pro plan. And that's all we need to do with Assembly AI.

Next, we want to install the dependencies for this project. We have two main dependencies. One is PyAudio, which will help us get the input from the microphone as a stream. The other is websockets, so we can talk to Assembly AI's API endpoint. So, very easily, you just need to run pip install pyaudio. I already had it on my computer, so this might take a bit longer for you. One note here, though: you might get an error saying PortAudio cannot be found. To solve that problem, all you have to do is brew install portaudio. And lastly, we just pip install websockets.

All right, and that's it. Next I'm going to create a project folder; I'll call this one real-time audio transcription with Assembly AI. Then I'll create my first file here. I can call it anything; I'm going to call it real_time_audio_transcription.py. I'm also going to create a configure.py file and paste my authentication key there.

All right, now that that's ready, let's build our application step by step. The first thing is to set up the input stream from the microphone. For that we're using PyAudio, as I mentioned. We just need to set up some constants: how many frames per buffer we'll read, the sample rate, the number of channels, and a few other things needed to create the stream. Once this is done, we want to create a connection to Assembly AI, of course. The endpoint for Assembly AI's real-time transcription is basically api.assemblyai.com, version 2, realtime, and as you can see, we also specify the sample rate in it.
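Here's a minimal sketch of that setup. The specific constant values (3200 frames per buffer, 16 kHz, mono, 16-bit) are assumptions based on common defaults for this kind of streaming endpoint, so match them to whatever sample rate you actually pass in the URL:

```python
import pyaudio

# Constants for the microphone stream. These exact values are an
# assumption (16-bit mono audio at 16 kHz, roughly 0.2 s per buffer);
# the rate must match the sample_rate passed to the endpoint below.
FRAMES_PER_BUFFER = 3200
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000

p = pyaudio.PyAudio()

# Open a live input stream from the default microphone.
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)

# The real-time endpoint, with the sample rate as a query parameter.
URL = f"wss://api.assemblyai.com/v2/realtime/ws?sample_rate={RATE}"
```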
The actual transcription part of this program is going to be a little trickier than the one we did last time, which was plain audio transcription: there, we just sent a file to Assembly AI and got a transcription back. But with this one, because we're doing it in real time, we're going to have to use an asynchronous function, or rather a group of asynchronous functions, to do the job. What's going to happen is we'll have two functions: one constantly sending the input that goes into the microphone, and one constantly listening for the transcription that comes back. I'll just paste the whole function here, and then we'll go through it step by step.

All right, so let's see what's happening in this function. The main goal is to constantly send whatever is being input to the microphone and constantly listen for the transcription coming in. But of course, for this communication to happen, you first need to create a connection, and we do that using the endpoint from Assembly AI's API. Using websockets, we create a connection, passing our authentication key and a few other headers along with the endpoint URL; the resulting connection is called _ws. Before we forget, we also need to bring in some other dependencies: the authentication key from the configure file, plus json, base64, websockets, and Python's asynchronous API. After that, we're essentially creating the session, the connection between our application and Assembly AI, and we wait for a response from Assembly AI to make sure the connection is solid and has been established.

Next, we have two more asynchronous functions: a send function and a receive function. In the send function, we get the audio data from the stream, the one we set up with PyAudio, which is the input from the microphone. We then base64-encode it, decode the result into a string, and send it to the WebSocket we created; if you remember, _ws is the WebSocket that connects to the Assembly AI API. We also catch some exceptions in case there is an error with the connection. In receive, we get the response from the WebSocket, that is, from Assembly AI. There could be an error with the connection, but if not, we print the response we get. Here, we're only printing the text, but there is other information in what Assembly AI returns; you can find all of it in Assembly AI's documentation. Basically, you get either partial results or final results. What that means is that while I'm talking, Assembly AI will constantly be returning the transcription of what I'm saying; even in the middle of a sentence, it can return transcriptions for more or less every word. But the moment you stop, it analyzes the whole sentence, adds punctuation, and applies casing, for example making the relevant letters uppercase, and then returns that. That one is the final result; before that, you're only getting word after word after word. Besides the text, you're also getting other fields, like the audio start time and the confidence. If it's very loud in the background, if it's very noisy, the confidence might be lower, of course. But we're only going to be using the text field for now, and we print it to the terminal. Inside this bigger send-and-receive wrapper, we run these two inner functions together, so that we'll always be sending and always be listening. And finally, the last thing I need to do to actually run this function is to call it in a while loop; I'll just simply say while True for now.
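A sketch of what that wrapper might look like, reusing stream, FRAMES_PER_BUFFER, and URL from the snippet above, with auth_key coming from the configure file. The header and message field names follow Assembly AI's real-time documentation, but treat the details as an approximation of the code in the video rather than an exact copy:

```python
import asyncio
import base64
import json

import websockets

from configure import auth_key  # your Assembly AI API token

async def send_receive():
    # Open the WebSocket connection, authenticating with the API token.
    # (Newer websockets versions name this parameter additional_headers.)
    async with websockets.connect(
        URL,
        extra_headers=(("Authorization", auth_key),),
        ping_interval=5,
        ping_timeout=20,
    ) as _ws:
        await asyncio.sleep(0.1)
        # Wait for the session-begins message so we know the
        # connection is solid before we start streaming audio.
        session_begins = await _ws.recv()
        print(session_begins)

        async def send():
            while True:
                try:
                    # Read one buffer from the microphone, base64-encode
                    # it, and ship it to Assembly AI as JSON.
                    data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                    encoded = base64.b64encode(data).decode("utf-8")
                    await _ws.send(json.dumps({"audio_data": encoded}))
                except websockets.exceptions.ConnectionClosedError as e:
                    print(e)
                    break
                await asyncio.sleep(0.01)

        async def receive():
            while True:
                try:
                    result = json.loads(await _ws.recv())
                    # Print just the text; partial and final results
                    # both arrive through this field.
                    print(result["text"])
                except websockets.exceptions.ConnectionClosedError as e:
                    print(e)
                    break

        # Run the sender and the listener concurrently.
        await asyncio.gather(send(), receive())

while True:
    asyncio.run(send_receive())
```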
But that's it. This is actually an application that will already work, so let's run it and see what happens. All right, I've navigated into my folder, and now I'm going to run the file I created. Okay, so now it's actually listening to me. As you can see, at first it isn't adding any punctuation or anything, just returning what I'm saying word by word. I'll stop it for a second. As you can see, after I'm done speaking, it capitalizes some of the letters and adds punctuation like commas and periods; for example, it capitalizes the "I" here too, to make it more of a final result. But that only happens after you finish saying a sentence. So this is very nice; I think it's a very nice result, and for now it's definitely super usable. But what I want to do is experiment with showing only the final results. How we're going to do that: if you remember, we saw that we can filter the messages based on whether the message type is a final transcript or not. So I can just write an if condition to check for that, right? Specifically, I want to check whether message_type is FinalTranscript. All right, and this should work.
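The changed receive coroutine from the earlier sketch might look something like this; FinalTranscript is the message-type value Assembly AI's documentation uses for finished, punctuated sentences:

```python
async def receive():
    while True:
        try:
            result = json.loads(await _ws.recv())
            # Only show finalized sentences; in-progress results
            # arrive with message_type == "PartialTranscript" and
            # are skipped here.
            if result["message_type"] == "FinalTranscript":
                print(result["text"])
        except websockets.exceptions.ConnectionClosedError as e:
            print(e)
            break
```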
Let's see. Now I'm only expecting to see full sentences, not partial words. That's perfect, exactly what I wanted. Awesome. So as you can see, it was quite simple to make a terminal application. But you might want to go a little fancier, and if you're interested, I'm going to show you how to turn this into a Streamlit application, where you again get input from the microphone and then show the transcript live on the screen. It's actually not even that hard. First, I need to import streamlit as st, and then I want to add a title, just to see how everything works and whether there are any problems. I have to run this separately, of course, as streamlit run real_time_audio_transcription.py. Okay, so it is going crazy: it's printing None, None, None, None. And... oh yeah, we're not printing anything yet. But let's see. Okay, now we're getting full sentences again, so that's good. A couple of things I want to do. First, of course, I want to stop it from going crazy, and I'm going to stop the application too. So let's go one by one. The first thing I want to do is deal with those Nones. That is happening because of the asynchronous functions we have: we have a bunch of awaits here, await sleep, another await sleep, and an await send, and all of these calls actually return None. What happens with Streamlit is that if an expression returns something that isn't captured in any way, it just gets printed on the screen.

So I'm just going to create a throwaway variable to capture the return values of these awaits. Let's see if I'm missing anything... there's one there, and one there. Okay, looks like I have all of them covered, so let's try running the application again and see what happens. Okay, this looks good; I'm just going to fix the title. The next thing I want to do is display whatever is being transcribed on my screen, and that's actually quite simple: instead of just print, we call st.markdown with the same text. Okay, I'm going to check the terminal to see if the sentences are appearing. They are, and they're also appearing here. That's really nice, but there is one problem: this is just going to keep listening to me endlessly. Yes, that turned out really poetic: "endlessly." For example, if I stop this script from running, my asynchronous functions are still running, so it keeps listening to me. Yes, it agrees. So I'm going to stop it, and I want a way of starting and stopping listening: a way to control when the application listens to me and creates its transcriptions, and when it stops. For that, we need to find a way to stop these asynchronous functions from running. As you can see, right now I'm saying while True, keep running this part, and while True again, keep running that part. But I don't want that; I want to be able to control when these run. One thing we can do is use Streamlit's session state to control when this condition should be true and when it shouldn't. So I'm quickly going to create a session state entry and initially set it to False, because I don't want the application to start listening the moment it first runs. The next thing I want to do is add two buttons, side by side, so I'm going to create columns: first a start button that starts the listening process, and then a stop button. But of course, just having these buttons doesn't mean anything will start or stop; the buttons need to actually do something. For that, I'll create callback functions that change the session state, and the session state will determine when these loops run. Let me do that: the start callback sets it to True, and stop listening sets it back to False, and I wire them up to the buttons. The last thing I want to do is make sure the loops run based on the session state, so I'll go and change while True so it depends on the session state instead of always being true. There is one other thing: because Streamlit reruns the application over and over again anyway, we don't need the outer loop here. I'll save this, and then let's see the change.
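Here's a minimal sketch of those Streamlit controls. The flag name listening and the button labels are my own placeholders, since the video doesn't spell out the exact names used:

```python
import streamlit as st

st.title("Real-time transcription with Assembly AI")

# Initialize the flag once per session; start in the "not listening" state.
if "listening" not in st.session_state:
    st.session_state.listening = False

# Callbacks that flip the flag; Streamlit reruns the script after each click.
def start_listening():
    st.session_state.listening = True

def stop_listening():
    st.session_state.listening = False

# Two side-by-side buttons wired to the callbacks.
col1, col2 = st.columns(2)
col1.button("Start listening", on_click=start_listening)
col2.button("Stop listening", on_click=stop_listening)

# Inside send() and receive(), loop on the flag instead of `while True`:
#     while st.session_state.listening:
#         ...
# display each final transcript with st.markdown(result["text"]) instead
# of print, and capture awaited return values in a throwaway variable
# so Streamlit doesn't echo None to the page:
#     _ = await asyncio.sleep(0.01)
```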
All right, I have my buttons here, and it doesn't look like it's running yet. Here too, it looks like it's just halting, waiting for us to start listening. And then I say start listening... ah, it started listening already; I can see it here too. So this is very nice. The sentences, by the way, can be quite long, so I'd like to read a few from here and show you how long they can be. This is the Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow book by O'Reilly. "Neural networks seem to have entered a virtuous circle of funding and progress. Amazing products based on neural networks regularly make the headline news, which pulls more and more attention and funding towards them, resulting in more and more progress and even more amazing products." Yeah, I guess I left too big a gap here, so it didn't realize those two sentences belong to the same paragraph. Basically, it divides the sentences based on the pauses you leave between them, to decide which ones belong together and which don't. Well, I guess I've already spoken a lot here. So now, if I want to, I can stop listening and it cleans up the workspace for me. And looking here again, I see that the connection is created again but halted, and if I want to, I can even start listening again. Yes. Nice. So now we have an application in our hands that is a little more controlled: it doesn't uncontrollably keep listening to the user, and we can even start it again if we want to. And thanks to Assembly AI, this was very easy to make; we only needed one endpoint to send the audio we get from the microphone. So this is awesome. That was much easier than expected, right? I hope everything was clear, but if you have any questions, do not hesitate to leave a comment and let us know. Apart from that, I hope you enjoyed this video. If you liked it, maybe give us a thumbs up and subscribe, so that you'll be one of the first people to know when we come out with a new video. And before you leave, don't forget to go get your free API token from Assembly AI using the link in the description. Have a nice day, and I'll see you around.