Real-Time Phone Call Transcription with Python
Learn to transcribe phone calls in real time using Flask, Twilio, and Assembly AI for seamless audio stream management and live transcription.
Transcribe a live phone call with Python - Flask tutorial
Added on 01/29/2025

Speaker 1: Hey everyone, in today's video we're going to learn how to transcribe a phone call in real time. Here's a look at the final result. Speak to see your speech transcribed in the console. So you can see my speech being transcribed in real time, and every time I finish a sentence it's automatically punctuated and formatted. At a high level the service works in a straightforward way. We call a Twilio number, and we configure that number to pass the incoming audio stream on to Assembly AI. Assembly AI's real-time transcription service will then transcribe the speech as it comes in and return partial transcripts to us in quick succession, which we'll then print to the terminal. Now looking under the hood, first the user calls a number that we provision with Twilio. Twilio then calls a specific endpoint associated with this number. In our case we configure the endpoint to be an ngrok URL, which provides a tunnel to a port on our local machine. So ngrok allows us to expose our application to Twilio without having to provision a cloud machine or modify our firewall rules. The ngrok tunnel allows Twilio to call a flask application that's running on our local machine, which responds with TwiML that instructs Twilio on how to handle the call. In our case the TwiML will tell Twilio to pass the incoming data stream from the phone call on to a web socket that's running in our flask application. The web socket will receive the incoming audio stream and pass it off to Assembly AI for transcription, printing the transcript to the terminal as it's received in real time. You can find all the code for this tutorial in the repository linked below, and you can also check out our blog if you want to copy and paste the code as we go. To get started you'll need an Assembly AI account with funds added, a Twilio account, and an ngrok account with ngrok installed on your system. And finally you'll need to have Python installed on your system.

Speaker 2: First we're going to want to create and navigate into a project directory.

Speaker 1: We'll be using python-dotenv to manage our credentials. So create a file called .env and add the following lines to it. So first you're going to want to put your ngrok auth token here. And you can find this in the getting started, your authtoken tab on ngrok. And if you haven't yet configured ngrok on your system then you're going to want to run ngrok config add-authtoken and then put in your auth token. Next you're going to want to put in your Twilio account SID. So you can find this in your Twilio console under account, API keys and tokens. Here you can also create an API key for the Twilio API key SID and the Twilio API secret. And finally you're going to want to put in your Assembly AI API key, which you can find on your Assembly AI dashboard. Next we're going to create a .gitignore file. And we're going to add the following lines to it. So we're going to add the .env file. We're going to add venv and we're going to add __pycache__. So this .gitignore file will prevent you from accidentally tracking your .env file with git and potentially uploading it to a website like GitHub. And additionally it will prevent you from tracking and uploading your virtual environment and your __pycache__. Next we're going to want to create and activate a virtual environment for this project. So you run python -m venv venv. So we're creating a virtual environment called venv. And you might have to run python3 here. And then if you're on Mac or Linux you can do . venv/bin/activate, or you can do source venv/bin/activate to activate the virtual environment. And if you're on Windows you're going to want to do .\venv\Scripts\activate.bat. Next we're going to install the dependencies that we need for the project. So we're going to pip install flask. So this is needed to create the flask application. flask-sock. So this is needed to add web sockets to our flask application. assemblyai, which will allow us to easily interact with Assembly AI's API. python-dotenv.
And that will allow us to load our environment variables from that .env file. ngrok. This will allow us to open up an ngrok tunnel. And then Twilio. So this will allow us to configure our Twilio number. All right now we're ready to actually write our application. So create a file called main.py. And first we're just going to create the skeleton of our flask application. So from flask import flask. And then from flask sock import sock. So this will allow us to actually create a flask application that also supports web sockets. Next we're going to add some settings for our flask application. So we'll add port to run it on port 5000. We'll add debug equals false to set the debugging mode. We'll add the incoming call route. And we'll just make this the root endpoint. So this is going to be the endpoint in our application that Twilio calls when we receive a call on our Twilio number. And then finally we're going to add our web socket route. And this is the route for the web socket to which the audio stream will be sent. Next we're going to create our app variable. So we're going to instantiate a flask application. And then we're going to wrap it with sock which enables us to use web sockets with this flask application. Now we actually have to define the endpoint. So we'll define functions that run when these endpoints are hit. So we create an app route at incoming call route. And then we associate the receive call function with that route. And for now we'll just pass. And similarly we're going to add a web socket route. So this actually uses the sock.route decorator. And we're going to make that at the web socket route. And this endpoint is going to call the transcription web socket function. And it takes in the web socket as an argument. And then we're just going to pass for now. And finally we're going to actually run the application. So we write if name equals main app.run port equals port and debug equals debug. 
So we're running our app on the port that we defined above. And we're running it with the debugging mode that we defined above. And we only run this application if it's run as the top level script. So now that the basic structure of our flask application is defined, we can go back and start filling out these endpoints. We'll start by defining the root endpoint that Twilio hits when our phone number is called. So go up and modify your receive call function. And just make it return real-time phone call transcription app. So now we're going to run python main.py in the terminal. And this starts up our flask application. So let's go to this link in the browser. And you can see a simple web page running on port 5000 that just says real-time phone call transcription app. So it's just displaying the string that we return from our receive call function. So by default, the only HTTP request method available for flask routes is GET. And the endpoint will respond with the value returned by the corresponding function. Twilio sends a POST request to the endpoint that we associate with our Twilio number. So we need to modify this Python function accordingly. So first, we're going to go back up to our imports. And after we import flask, we're going to also import request and response. So request will allow us to access the POST request information. And then response will allow us to create a response. So now we can go back down to our receive call function. And what we're going to do is modify this decorator. And we need to specify the allowed methods, which we'll set to be GET and POST. Next, we have to modify our receive call function. And we're going to do if request.method is equal to POST, XML equals this f-string, open a response tag, open a say tag. And then you have connected to the flask application. And then we're just going to close the say and response tags.
And then we're going to strip the extraneous whitespace off of the F string. And then we're going to return a flask response. And we're going to return this twiML. And we're going to set the media type equal to XML. And finally, if the request method is not POST, then it is get because those are only two allowed methods. So we will just return the string we did before, the string as we did before in this case. So we now have a working flask application that will respond with twiML if it's called by Twilio. So the next step is to actually get a Twilio number and configure it to hit this application. So to get a Twilio number, go to your Twilio console and go to phone numbers, manage, buy a number. So you'll see a list of numbers that you can purchase for a small monthly fee. And you can just select one and click buy. And note that we only need voice capabilities for this tutorial. So once you find your number, go ahead and click buy. So now we'll open an ngrok tunnel on port 5000 through which our flask application will be served. So in the terminal, execute the following command. ngrok http and then http colon slash slash localhost and then port 5000. So in the terminal, you'll see some information displayed about the tunnel. And what we need is the public forwarding URL that ends in .ngrok-free.app. So go ahead and copy this value now. And back in your Twilio console, go to phone numbers, manage active numbers, and then select the number you just bought. So in the voice configuration, set a webhook for when a call comes in and make the URL equal to the public URL of the ngrok tunnel that you just opened. And then make sure that the method is HTTP post. And when you've done that, you can just scroll down and hit save configuration. So we bought a Twilio number and we configured it to point to the root endpoint of our flask application when a call comes in. 
And we point to the flask application through the public URL of our ngrok tunnel that goes to port 5000 on our local machine. And you can go to localhost 5000 again to confirm that the application is up and running. You'll see a 200 response in the terminal if you do so. Now go ahead and call your Twilio number and you'll hear a voice say you have connected to the flask application.
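
The endpoint logic described in this turn can be sketched without Flask so that it runs on its own. This is only an illustration under stated assumptions: the function names build_say_twiml and receive_call are hypothetical, the text/xml media type is assumed from "set the media type equal to XML", and in the real application the body would be wrapped in a Flask Response and the method read from request.method.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_say_twiml(message: str) -> str:
    """Build a TwiML document asking Twilio to speak `message`."""
    # escape() guards against &, < and > in the message (an added precaution).
    xml = f"""
<Response>
    <Say>{escape(message)}</Say>
</Response>
"""
    # Strip the extraneous whitespace around the f-string, as in the walkthrough.
    return xml.strip()


def receive_call(method: str) -> tuple[str, str]:
    """Framework-free stand-in for the root endpoint: returns (body, media type)."""
    if method == "POST":
        # Twilio sends a POST, so it gets TwiML back.
        return build_say_twiml("You have connected to the Flask application"), "text/xml"
    # GET (for example, a browser visit) just gets a plain string.
    return "Real-time phone call transcription app", "text/html"


body, media_type = receive_call("POST")
root = ET.fromstring(body)  # parsing confirms the TwiML is well-formed XML
print(media_type, root.tag, root.find("Say").text)
```

Keeping the TwiML construction in its own function makes it easy to check the XML before pointing a real Twilio number at the application.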

Speaker 3: You have connected to the flask application.

Speaker 1: So we've confirmed that calling our Twilio phone number actually sends a POST request to the proper endpoint in our flask application. So now we can go back and set up the websocket that actually receives the audio data from the phone call. So first we're going to modify our receive call function and modify the TwiML that we're returning to Twilio. So first, just for something a little bit more fitting here, we can say speak to see your audio data printed to the console. And then we're going to add connect tags. And use the stream tag to connect to the URL WSS colon slash slash. So this is the websocket protocol. And then we're going to add in request dot host. So we're going to access the host that the request was addressed to, which is just going to be our flask application itself. And then we're going to add our websocket route. And then finally just close out the connect tag. So what we're doing here is configuring the TwiML to tell Twilio to say this line. And then we connect to this websocket to stream the audio data. And so we use the websocket protocol. We access the host of the request. And then we pass in the websocket route. So this is the host that our application is running on. And then the specific endpoint at which the websocket exists is the websocket route. And that's all we need to do for the receive call function. So now we can go down and fill out our transcription websocket function. But first we need to import JSON. And then we can go back down to our transcription websocket function. And we'll write the following code. So while we are receiving messages on the websocket, we're going to use JSON to load the websocket message as a dictionary. And then we're going to match the event key of the message. So our websocket will receive four possible message types from Twilio. The first is connected, when the websocket connection is established. The second is start, for when the data stream actually starts sending data. The third is media.
So this is a sequence of messages that actually contain the raw audio data. And then the final one is stop for when the stream has stopped or when the call has ended. So that message type is stored in the event key. And for each situation, we're going to do something slightly different. So for all the cases besides media, we're just going to print a simple message.
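
The TwiML change described earlier in this turn, a say tag followed by connect and stream tags pointing at the websocket, might look roughly like the sketch below. The route value "/realtime" and the host "example.ngrok-free.app" are placeholders chosen for illustration; in the application the host would come from request.host and the route from the websocket route setting.

```python
import xml.etree.ElementTree as ET

WEBSOCKET_ROUTE = "/realtime"  # assumed value; use whatever route the app defines


def build_stream_twiml(host: str, websocket_route: str = WEBSOCKET_ROUTE) -> str:
    """Build TwiML that speaks a prompt, then streams call audio to a websocket."""
    xml = f"""
<Response>
    <Say>Speak to see your audio data printed to the console</Say>
    <Connect>
        <Stream url="wss://{host}{websocket_route}" />
    </Connect>
</Response>
"""
    return xml.strip()


# Placeholder host standing in for request.host.
twiml = build_stream_twiml("example.ngrok-free.app")
stream = ET.fromstring(twiml).find("Connect/Stream")
print(stream.get("url"))
```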

Speaker 4: And then for the media case, we are going to extract the payload.

Speaker 1: And for now, we'll just print it. So we're going to be printing the raw audio data as it comes in. So now we can go back to the console and restart our Flask application. And go ahead and call your Twilio number again. So you can see Twilio connected, Twilio started,

Speaker 3: and then Twilio started printing the binary audio data. And then you can see the raw audio data for each message that's printed to the console.

Speaker 1: So these are all just messages that correspond to silence because I wasn't speaking into the phone. So now our WebSocket is up and running. And when we receive a call to our Twilio number, Twilio hits the root endpoint of our Flask application. That responds with TwiML that instructs Twilio to say something and connect to the WebSocket in our application and stream the audio data to it. And then in the transcription WebSocket function, what we're doing with that data is we're just printing the binary audio data to the console. So now that we're receiving the audio data in real time, what we need to do is define a transcriber to transcribe it in real time. So go ahead and create a new file in your project directory. And we're going to call it twilio underscore transcriber dot py. So we're going to start with some imports. So we'll start with import os. And then we're going to import assembly AI as AAI. And then from dotenv, we're going to import load_dotenv. And then right below that, we're going to call the load_dotenv function. And that's going to load our environment variables from our .env file. Next, we're going to set our assembly AI API key. So we're going to go AAI.settings.api_key. And we're going to use os.getenv to retrieve our environment variable. And we just write in assembly AI API key. And this is the name of the environment variable that we put in our .env file. And then finally, we're going to add a Twilio sample rate variable. And that's going to be 8,000. So this is 8,000 hertz. So this is the sample rate of the audio stream that will be coming in from Twilio.
Now, just as we define different cases to handle the different types of messages that we could receive from Twilio on our WebSocket, we now need to create several functions that will define how our application interacts with assembly AI's real-time transcription WebSocket. So first, we're going to create an onOpen function. And that's going to take in a session opened argument. And that's going to be of type AAI.realTimeSessionOpened. And for us, we're just going to print the session ID. And we access that through session opened session ID. So this will be called when our application connects to assembly AI's real-time transcription service. Next, we're going to define the onData function. So this is going to take a transcript argument. And that's going to be of type AAI.realTimeTranscript. So the first thing we're going to do is if there's no text in the transcript, meaning that nothing was transcribed and the audio that was sent in was just silence, then we're just going to return. Otherwise, our transcript contains text, and we're going to want to print that. So we're going to want to do two different things here, depending on the type of message that we receive. Assembly AI's real-time transcription service can return two potential types of messages. So there are partial transcripts, and then there are final transcripts. Partial transcripts are sent in real time when someone is speaking, gradually building up the transcript of what's being said. So each time a partial transcript is sent, the entire partial transcript for that utterance is sent, and not just the words that have been spoken since the last partial transcript was sent. When the real-time model detects that an utterance is complete, the entire utterance is sent one final time, punctuated and formatted by default, as a final transcript type rather than a partial transcript type. Once this final transcript is sent, we start this entire process over with a blank slate for the next utterance. 
If I say, hello, my name is Ryan. How are you today? The messages we'll receive from Assembly AI will look like this. We'll receive three partial messages gradually building up the first utterance, hello, hello, my, hello, my name is. And then a final message, hello, my name is Ryan, the complete utterance punctuated and formatted. Then messages are sent for the next utterance, and we get how, how are, how are you? Three partial messages that again gradually build up what's being said. Finally, we get how are you today, punctuated and formatted as a final message type. So how we're going to handle the sequence of incoming messages will depend on whether the messages are partial or final. If the transcript is a final transcript type, we're going to print the transcript text with a carriage return and a new line. Otherwise, it's a partial transcript, and we're again going to print the transcript text, but this time we're just going to end the print statement with a carriage return. So all the messages are printed with carriage returns, which return the cursor to the beginning of the line. So as we receive these partial transcripts that gradually build up the utterance that's being said, we're going to return to the beginning of the line and then print the new message each time. And this is going to give the effect of just printing the delta between messages. Finally, when we receive the final transcript, we're going to do the same thing, except we're going to go to a new line so that when this process starts over with the next utterance, we can retain the last final transcript still printed to the console. So finally, we're going to define two more functions, and these are actually optional. So first, we're going to define onError. And this is going to be called when we receive a real time error. And for now, we'll just print that an error has occurred. And then print off the error. And then finally, we have onClose. 
And this is called when the connection has been closed. So we'll just print closing session. And it looks like I forgot a colon here. And finally, we're going to create a Twilio transcriber class. And that's going to subclass assembly AI real time transcriber. We're going to define the initialization function. And it's going to call the initialization function of the assembly AI real time transcriber.
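
The partial versus final printing behavior described above can be tried out without an Assembly AI connection. In this sketch, PartialTranscript and FinalTranscript are hypothetical stand-ins for the SDK's transcript types, and line_ending is a helper introduced only for illustration; the sample messages follow the "hello, my name is Ryan" example from the walkthrough.

```python
from dataclasses import dataclass


@dataclass
class PartialTranscript:
    """Stand-in for a partial transcript message (not the SDK class)."""
    text: str


@dataclass
class FinalTranscript(PartialTranscript):
    """Stand-in for a final, punctuated transcript message."""


def line_ending(transcript: PartialTranscript) -> str:
    """Finals end with carriage return plus newline; partials with carriage return only."""
    return "\r\n" if isinstance(transcript, FinalTranscript) else "\r"


def on_data(transcript: PartialTranscript) -> None:
    if not transcript.text:
        # Nothing was transcribed (for example, silence), so print nothing.
        return
    # The carriage return moves the cursor to the start of the line, so each
    # partial overwrites the previous one; a final also advances to a new line.
    print(transcript.text, end=line_ending(transcript))


messages = [
    PartialTranscript(""),
    PartialTranscript("hello"),
    PartialTranscript("hello my"),
    PartialTranscript("hello my name is"),
    FinalTranscript("Hello, my name is Ryan."),
    PartialTranscript("how"),
    PartialTranscript("how are"),
    PartialTranscript("how are you"),
    FinalTranscript("How are you today?"),
]
for message in messages:
    on_data(message)
```

Run in a terminal, each partial appears to grow in place, and only the two final transcripts remain on their own lines.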

Speaker 2: And for each of these functions, we're just going to pass in the corresponding function. Next, we're going to add the sample rate.

Speaker 1: So we're just going to set this equal to the sample rate that we defined above. And then we're going to define the encoding. So this is going to be assembly AI dot audio encoding dot PCM mu law. So this is the encoding that Twilio uses when it sends its binary data. And now we're good to go. So we can go back to main.py and incorporate this class into our Flask application to actually transcribe the audio that's incoming from Twilio. So go ahead and go back to main.py. And first, we're going to start with some imports. So at the top, we're going to import base64. So we're going to need this to decode the incoming audio stream from Twilio. And then from our Twilio transcriber file, we're going to import our Twilio transcriber class. And then go back down to your transcription WebSocket function. And we're going to make a few changes. So first, we're going to instantiate a Twilio transcriber as transcriber. And then we're going to use the connect method to connect this transcriber to assembly AI's real-time service. So this basically just creates another WebSocket with assembly AI through which we'll be passing the audio data. So our Flask application has a WebSocket connected to Twilio that will receive the incoming data stream. And then our Flask application connects to an assembly AI WebSocket to go ahead and forward that audio data off to assembly AI for transcription. And we could instantiate this transcriber outside of this transcription WebSocket function. But if we ever wanted multiple concurrent users, then we'd need a separate Twilio transcriber for each user so that they each have their own WebSocket open to assembly AI. Next, we can go down to the media case. So right here, we have our payload coming in from Twilio. And that's base64. So we'll just go ahead and add underscore B64 here. And now we will decode that. So we'll add payload mu law. And we're going to use base64.b64decode. And then we're going to pass in that payload.
So Twilio takes the raw data. It uses the mu law algorithm to do a bytes-to-bytes mapping to put this audio data in the mu law format. And then it encodes these bytes as a base64 string, which it sends across the wire. We receive that base64 payload. And then we decode it into the mu law bytes, which we send off to assembly AI. So now, instead of printing the payload, we can go ahead and call our transcriber.stream method and pass in the mu law payload. And then finally here, in this stop case, we are just going to close the transcriber object and then add a message here, transcriber closed. And then the last thing is we can go back up to our receive call function. And this still says speak to see your audio data printed to the console. So we can just change this to speak to see your speech transcribed in the console. All right. Now we can go back to our terminal and run python main.py. And it looks like our application is up and running. So go ahead and call your Twilio number and start talking. And you'll see your audio transcribed to the console. So let me go ahead and do that. So there you can see the session ID. After I finish saying an utterance, it's punctuated and formatted. And that's it. And when we stop the call, we see Twilio stop there. And then we can go ahead and end the program. So now we have a working application. When somebody calls your Twilio number, they'll go through your ngrok tunnel and hit this endpoint, which will call this function that responds with this TwiML. And this TwiML tells Twilio to say this and then connect the data stream to our WebSocket in this Flask application. This WebSocket will instantiate a Twilio transcriber object and then use its connect method to connect to Assembly AI's real-time transcription service. And then as we receive messages in from Twilio, we will decode them and then stream the mu law data off to Assembly AI. And this Twilio transcriber object will just print off the transcripts as they're received.
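
The decoding step in the media case can be checked in isolation. The sketch below assumes the payload sits under the media and payload keys of the message, uses placeholder bytes rather than real call audio, and adds a small duration helper (not part of the walkthrough) that relies on mu-law storing one byte per sample at the stated 8,000 hertz. In the application, the decoded bytes would be handed to the transcriber's stream method instead of being printed.

```python
import base64
import json

TWILIO_SAMPLE_RATE = 8000  # Hz, the sample rate stated in the walkthrough


def decode_media_payload(raw_message: str) -> bytes:
    """Extract the base64 payload from a media message and decode it to mu-law bytes."""
    data = json.loads(raw_message)
    payload_b64 = data["media"]["payload"]
    return base64.b64decode(payload_b64)


def duration_ms(mulaw_bytes: bytes, sample_rate: int = TWILIO_SAMPLE_RATE) -> float:
    """Mu-law uses one byte per sample, so duration follows from the byte count."""
    return 1000 * len(mulaw_bytes) / sample_rate


# Placeholder bytes standing in for audio; encoded the way the payload arrives.
placeholder_audio = bytes([0xFF]) * 160
message = json.dumps(
    {
        "event": "media",
        "media": {"payload": base64.b64encode(placeholder_audio).decode("ascii")},
    }
)

payload_mulaw = decode_media_payload(message)
print(len(payload_mulaw), "bytes,", duration_ms(payload_mulaw), "ms of audio")
```
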
So our application is working, but we can actually improve it. So currently, what we have to do is open an ngrok tunnel on our system in a separate terminal. And then we have to copy that public URL and go to the Twilio console in the browser and paste it in. And this is kind of laborious, so we can automate that too. So the first thing we have to do is go back to our .env file. And we need to add in the Twilio number environment variable. So you're going to replace this with your Twilio number. And it's just going to be a string of digits with the country code. So for example, for the US, you'd have plus 1 for the country code. And then you would just type in your number as a string of digits. OK, now we're going to go back to main.py. And we're going to modify the top of the file. So we're going to import os. And then we're going to import ngrok. We're going to import the Twilio client from Twilio rest. We're going to, again, import load_dotenv from dotenv. And then we're just going to call it. So this loads our environment variables. And now we're going to add a few more lines. So first, we're going to comment that these are Flask settings. And then we're going to go ahead and add some Twilio authentication lines. So we're going to add our account SID, API key, and API secret. And each of those is in our .env file. So we're going to be getting those from our environment. Twilio account SID, Twilio API key SID, and Twilio API secret. And you can also use getenv here. And then we instantiate a client passing in this authentication information. Next, we're going to add our Twilio number. So again, we're just pulling the Twilio number from the environment. And then we are going to set our ngrok token. So ngrok.set_auth_token, os.getenv, ngrok authtoken. And then we're going to go down and update our script's main block as follows. So first, we're going to wrap this whole thing in a try except block. And then we are going to open an ngrok tunnel. So we'll call ngrok.forward.
And we will forward HTTP colon slash slash localhost. And then the port that we're running our application on. And so our ngrok URL is just going to be the URL that's returned. And then we are going to set the ngrok URL to be the webhook for the Twilio number. So first, we're going to get our Twilio numbers with client incoming phone numbers list. Then we are going to use a list comprehension. So we say for each of the numbers in our Twilio numbers, we're going to add to the list the numbers SID if the numbers phone number is equal to the Twilio number. And then we just extract what should be the only element in that list. And next, we'll run client.incoming phone numbers. And we'll add the Twilio number SID to isolate it. And then we're going to call the update method and pass in our account SID. And we're going to change the voice URL equal to an F string of our ngrok URL. And then our incoming call route. So what we're doing here is we're using our Twilio client to isolate the phone number that has this phone number SID. And then we're using the update method with passing in our account SID to update its voice URL to our ngrok URL. So this is the tunnel we just opened. And then we're going to send that to our incoming call route. So this is exactly what we did through the Twilio console in the browser. But now we're just doing it programmatically. And then again, we're going to run the app. And then finally, we're going to make sure to run ngrok disconnect so that we're sure to always disconnect the ngrok tunnel if there is an error. So now we can go back to the terminal. And you're going to want to make sure to close the ngrok tunnel that you already have open, especially if you have a free account, because you can only have one tunnel open at a time. So go ahead and stop that. And then we can go ahead and restart our Flask application. So Python main.py. And you'll see your Flask application start up. 
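
The number lookup and webhook URL described above can be sketched with stand-in data, so that it runs without a Twilio account. Everything here is illustrative: IncomingNumber is a hypothetical stand-in for the objects the Twilio client returns, the SIDs and phone numbers are made up, and the check that exactly one number matches is an added safeguard rather than something the walkthrough includes.

```python
from dataclasses import dataclass

INCOMING_CALL_ROUTE = "/"  # the root endpoint, as in the walkthrough


@dataclass
class IncomingNumber:
    """Stand-in for a phone number record returned by the Twilio client."""
    sid: str
    phone_number: str


def find_number_sid(numbers: list[IncomingNumber], twilio_number: str) -> str:
    """Return the SID of the number whose phone number matches `twilio_number`."""
    matches = [num.sid for num in numbers if num.phone_number == twilio_number]
    if len(matches) != 1:
        # Added safeguard: the walkthrough simply takes the only element.
        raise LookupError(f"expected one match for {twilio_number}, found {len(matches)}")
    return matches[0]


def build_voice_url(ngrok_url: str, route: str = INCOMING_CALL_ROUTE) -> str:
    """Combine the tunnel's public URL with the incoming call route."""
    return f"{ngrok_url}{route}"


# Made-up records and tunnel URL for illustration only.
numbers = [
    IncomingNumber(sid="PN-example-1", phone_number="+15555550100"),
    IncomingNumber(sid="PN-example-2", phone_number="+15555550101"),
]
print(find_number_sid(numbers, "+15555550101"))
print(build_voice_url("https://example.ngrok-free.app"))
```
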
And now you can go ahead and again give your Twilio number a call. And again, the application is up and running as expected. And then if we go ahead and stop this and run it again, we can make sure that the new tunnel opens up as expected. So again, go ahead and call your number. And the application is working with the new tunnel. Awesome. And that's how you can transcribe a phone call in real time with Python. If you have any questions, feel free to leave them in the comments below. Or you can check out Marco's new video on Graph Neural Networks. I'll see you in the next video.

Speaker 5: Why do Graph Neural Networks matter in 2024?
