Build a Speech-to-Text App Using AssemblyAI and Streamlit
Discover how to create a real-time speech-to-text web app with the AssemblyAI API and Streamlit, using resources like GitHub and Python libraries.
How to Build a Real-Time Transcription Web App in Python using AssemblyAI and Streamlit
Added on 01/29/2025

Speaker 1: Welcome back to the Data Professor YouTube channel. Hi, my name is Chanin, and I'm currently a developer advocate at a tech company in the SF Bay Area, and a former bioinformatics professor. And in this video, I'm going to show you how you can build a real-time speech-to-text web application using the AssemblyAI API, who happens to be the sponsor of this video. And we're going to do that using the Streamlit library in Python. And so if that sounds like fun, then you'll want to watch this video to the end. And so without further ado, let's get started. So the contents of this particular project will be shared on GitHub, and I'll provide you the link in the video description. And so you're going to see here in the RT transcription folder, we're essentially going to have the .streamlit folder, and inside we're going to have a secrets.toml, which will contain the API key. And so you could replace this with your own AssemblyAI API key. And so let me show you before we begin. So let's log in to AssemblyAI to get our key, all right? And so now that we're logged in, you want to copy the API key by clicking here on the right-hand side of the panel. Once you've clicked on it, you'll see that it'll say copied, and then you'll have the API key in memory. And then what you want to do now is go back to the secrets.toml, and then you want to replace it with your API key. And you want to do the same for the configure.py file. You also want to replace the XXX here with the API key. And worth mentioning here is that you should make it a string, where you have a quotation mark at the beginning and at the end of the API key, all right? And now that I have already replaced the X characters with the proper AssemblyAI API key, let's continue further. So here we have the speechrecognition.py and the streamlit_app.py. So let me show you. So on the AssemblyAI YouTube channel, there's a video called real-time speech recognition.
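As an illustration, the secrets.toml file described above might look something like the following. The key name shown here is an assumption; use whatever name the app's code actually reads via st.secrets:

```toml
# .streamlit/secrets.toml
# Hypothetical key name -- match whatever name the app reads from st.secrets.
# Keep the value quoted so it is parsed as a string, as mentioned in the video.
api_key = "your-assemblyai-api-key-here"
```

Since this file contains a credential, it should be excluded from the GitHub repo (for example via .gitignore) rather than committed.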
And this particular video was created by Mısra. And you can see here that it provides the GitHub code here as well. And if you click on it, you'll go to this particular GitHub repo. And so in this tutorial, we're going to repurpose both of these code files, and then we're going to adapt them into our own project. And aside from that, we're going to make use of an additional resource, which is a blog on Towards Data Science. Let me show you. And so this great article by Georgios provides a very good breakdown of the various code blocks and snippets that are provided in this particular repo, and the one that we'll be using today in the tutorial. So you're going to see the description and explanation for each of the code blocks, which I'm going to briefly explain as well. And I'll provide you the links to both of these resources in the video description as well. So let's head on back and let's have a look at this speechrecognition.py file. So this is the code for speech recognition in real time. And this was from Mısra. And in this tutorial, we're not going to focus on this. We're going to focus on the Streamlit app. And so the contents of this file and the Streamlit app will be the same. The only difference is that in the Streamlit app, we're also going to add additional Streamlit code here, using various st functions, to make it into a web application. And so this web application will be easy to use. And so let's continue further. So in streamlit_app.py, you're going to see that the first eight lines import the necessary libraries. So we're going to use Streamlit as the web framework, and that will house all of the code inside the web application. And it provides you with input and output widgets for you to accept input from the user and also to display the output from the transcription. The websocket will allow us to interact with the AssemblyAI API. And asyncio will allow us to perform all of the various audio inputs and outputs in a concurrent manner.
And base64 will allow us to encode and decode the audio signal before it's sent to the AssemblyAI API. JSON will be used to read in the audio output, which is the transcribed text. And PyAudio will essentially be used to handle all of the audio input processing. And that is done via the PortAudio library, which is cross-platform for all of the major operating systems, like macOS, Windows, and Linux. And we'll also be using os and pathlib for navigating through the various folders of this project and also to perform file processing and handling. And so let's continue further. So before we continue with the line-by-line explanation of the various code blocks, which comprise 140-ish lines of code, we're going to run the app and see what it looks like. So the first thing is, I'm going to activate the Conda environment from which I built this particular web app. So you wanna type in conda activate, and then the name of your particular Conda environment. The one on this computer is called streamlit. So I activated it, and notice that the base here changed to streamlit. And now we're going to type in streamlit run streamlit_app.py, which is how to launch the application. streamlit run, and then streamlit_app.py. Hit enter. And now you're gonna see the web app open. So let me divide the screen here between the app and the code. So I'm going to minimize this panel here. And notice here that it defaults to black and white because of the theme of the web app. And you can feel free to go to the settings and modify the color to a light theme if that's what you're interested in. All right, and so let's test the application here. So I'm going to click on start. This is a real-time transcription app built in Python. Data science is so cool. And so before I continue further, let me find something to read. Let's go to this particular blog article and let's see. Let's read it.
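To make the base64-and-JSON step concrete, here is a minimal sketch of how one chunk of raw microphone audio could be encoded and wrapped in a message before being sent over the websocket. The "audio_data" field name reflects AssemblyAI's real-time message format at the time of the video, but treat it as an assumption and check the current API documentation:

```python
import base64
import json

def encode_audio_chunk(raw_bytes):
    """Base64-encode a chunk of raw PCM audio and wrap it in a JSON
    message for the real-time websocket ("audio_data" is assumed from
    the real-time API's message format at the time of the video)."""
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return json.dumps({"audio_data": encoded})

# Round-trip check: decoding the message recovers the original bytes.
chunk = b"\x00\x01\x02\x03" * 800  # stand-in for one buffer of audio
message = encode_audio_chunk(chunk)
recovered = base64.b64decode(json.loads(message)["audio_data"])
```

Base64 is used here because raw audio bytes are binary, while the websocket messages carry JSON text.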
How to perform real-time speech recognition with Python. Introduction. In one of my latest articles, we explored how to perform offline speech recognition with the AssemblyAI API and Python. In other words, we uploaded the desired audio file to a hosting service, and then we used the transcript endpoint of the API in order to perform speech-to-text. So you're gonna notice that this particular application is going to send the audio that we speak into the app. And it's going to do that in chunks of speech. And you're gonna notice that each chunk will be separated by a new line. So if you want all of your speech to go onto the same line here, you're going to have to say it in one long stretch. Otherwise, it will add punctuation to your particular transcribed text. Nevertheless, all of the text that you see here, if you press the stop button, will be saved into a text file. And the text file will concatenate all of the various lines of the text here into a single long paragraph. And then you could format that later. So here we click on stop, and then you can click on download the transcription. And then we have the transcribed text right here. So this is the entire transcription of what we have just spoken into the web application. So let's head back to the code. So a particular note here: lines 10 through 13 describe the session state. And the session state will essentially be kind of like a memory of the web application. If the app is running for the first time, the session state text will have a value of Listening, and it will assign run a value of false because we haven't yet started the app. And if we click on the start button here, the run status will be changed to true. So let's skip a bit to here. We have the start right here, start listening and stop listening.
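The session-state initialization described above (lines 10 through 13 of the app) can be simulated with a plain dictionary, since st.session_state behaves like one. This is a sketch of the logic the video walks through, not the app's exact code:

```python
# A dict standing in for Streamlit's st.session_state, to illustrate the
# initialization logic described in the video (the key names "text" and
# "run" follow the app's description).
session_state = {}

# First run only: seed the state.
if "text" not in session_state:
    session_state["text"] = "Listening..."
    session_state["run"] = False

def start_listening():
    # Triggered by the Start button: begin streaming audio.
    session_state["run"] = True

def stop_listening():
    # Triggered by the Stop button: back to the initial, idle state.
    session_state["run"] = False
```

The "not in" guard matters because Streamlit reruns the script on every interaction; without it, each rerun would wipe the state back to its defaults.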
So the start listening function assigns run a value of true. And stop listening is the function that assigns a value of false to run, which is the same as when the app is starting over from scratch, as on the first run. So here are the buttons: lines 66 and 67 have the start and stop buttons. And upon clicking on start, it will call the start listening function. And upon clicking on stop, it will call the stop listening function. So clicking on start and stop will trigger the two functions that I've mentioned previously. Let's head over back here. So you'll notice that I'll jump around the code, because the various lines are related but may be dispersed throughout the code. All right, and so lines 15 until 31 are the audio parameters, and the audio parameters will be here in the side panel. So st.sidebar allows us to display the text here, with .header for the audio parameters, and lines 18 through 22 create various variables that we're going to be reusing throughout this particular web app. And the frames per buffer will be created using a text input, so that users can modify the parameter if they choose. And the rate here as well; you could play around with this, and it will influence the transcription. And PyAudio here will allow us to initiate the audio stream using the above parameters that we have specified here. So this will be the audio stream. And start listening and stop listening were already explained. And after we have performed the transcription, after we have essentially spoken to the application, we click on the stop button, and upon clicking on the stop button, it will trigger the download transcription function to run. And in doing so, it will display the transcription text as a downloadable file via the st.download_button.
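To give a feel for the audio parameters in the sidebar, here is a quick back-of-the-envelope sketch. The 16,000 Hz rate comes from the sidebar shown in the video; the 3,200 frames-per-buffer default and the 16-bit mono format are assumptions based on the companion tutorial code:

```python
FRAMES_PER_BUFFER = 3200   # samples captured per chunk (assumed tutorial default)
RATE = 16000               # sample rate in Hz (shown in the app's sidebar)
CHANNELS = 1               # assumed: mono microphone input
SAMPLE_WIDTH = 2           # assumed: 2 bytes per sample, i.e. 16-bit PCM

# Each buffer therefore covers 0.2 seconds of audio...
chunk_seconds = FRAMES_PER_BUFFER / RATE

# ...and amounts to 6,400 bytes before base64 encoding.
chunk_bytes = FRAMES_PER_BUFFER * CHANNELS * SAMPLE_WIDTH
```

This is why changing the rate in the sidebar influences the transcription: the API interprets the incoming bytes according to the sample rate declared when the websocket connection is opened.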
As you'll see here, if we click on start and then we say something to the application, it will start transcribing our text, and then I'll hit stop. And then in just a moment, you're gonna see the download button appearing right here. Yeah, right here. Download text, the download transcription button. And so it will be saved into a TXT file. And here, lines 48 until 67 are essentially the top portion of the web application. If I refresh, it's going to be from here, the real-time transcription app header, until the start and stop buttons. Let's take a look at line 49. We use the microphone emoji here, and then Real-Time Transcription App is displayed here in the st.title tag. And we're using st.expander right here as an expandable container box. And it's called about this app, which is displayed here: About this app. And if we click on it, all of the text underneath it will be displayed. So I've already formatted it using Markdown right here. And so it is explaining what all of these individual libraries are doing in this particular web application. And you'll notice here that we're going to define two column variables using st.columns. And then for column one, we're gonna use the start button, which is right here. For column two, we're gonna use the stop button, which is right here. And so you can see that we have formatted the layout of the two buttons. And so the remaining block of code here, from line 69 until line 136, is the audio input and output that is being used to send the audio signal to the AssemblyAI API, receive the transcribed text from AssemblyAI, and then display it on the web application. And so this chunk of code here was taken from the GitHub repo created by Mısra in the video on the AssemblyAI YouTube channel. And so we have essentially modified this code slightly and then added some visuals to the application.
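The download step described above concatenates the per-chunk transcript lines into one long paragraph before handing it to st.download_button. Here is a minimal sketch of that joining logic; the helper name is hypothetical, as the app inlines this step:

```python
def build_download_text(chunks):
    """Join the per-chunk transcript lines into a single paragraph,
    as the app does before offering the TXT download. (Hypothetical
    helper name -- the app performs this inline.)"""
    return " ".join(chunk.strip() for chunk in chunks if chunk.strip())

paragraph = build_download_text([
    "This is a real-time transcription app built in Python.",
    "Data science is so cool.",
    "",  # empty chunks are skipped
])
```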
And so essentially the first segment here will perform, let me expand this a bit. So this is the entire function for sending and receiving the audio signal, the input and output. So it's connecting to AssemblyAI using this API, and we're specifying the rate, which is specified here in the sidebar, 16,000. It will be replaced here. And then it authorizes via the API key, which was provided in the config.py and also in the secrets.toml file. And then it uses asyncio to perform the concurrent input-output of the audio. So this block of code here performs the sending of the signal, and it will be encoding and decoding the audio signal. And then it will be accepting the output here in this function. And then the final transcript, or the transcribed text, will come in the form of JSON, which we read in, and then we selectively take out the transcribed text and print it out here line by line in the web application. And then afterwards, when we decide that we want to stop the transcription, we click on stop, and then it will write everything into a file. We're using asyncio.run in order to perform the concurrent processing of the input-output audio. And so after we have clicked on the stop button, you're gonna notice that the download button will appear, owing to this block of code here. And so after we click on download, it's also going to remove the transcribed text, so that the next run of the application will start fresh again. And so congratulations, you have built this real-time transcription app using AssemblyAI to perform real-time transcription. So there are various use cases for this particular web app; for example, you could create your essay or email just by speaking to the application. And after you're done, click on stop, and then you have access to the underlying transcribed text, and you could copy and paste it into various word processing applications.
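The receive side described above pulls the transcribed text out of each JSON message coming back over the websocket. Here is a minimal sketch of that extraction, assuming the field names ("message_type", "text") and the "FinalTranscript" value used by AssemblyAI's real-time API at the time of the video:

```python
import json

def extract_final_text(raw_message):
    """Return the transcript text if this websocket message is a final
    result, else None. Field names are assumed from the real-time API's
    message format at the time of the video."""
    msg = json.loads(raw_message)
    if msg.get("message_type") == "FinalTranscript":
        return msg.get("text", "")
    return None

final = extract_final_text(
    '{"message_type": "FinalTranscript", "text": "Data science is so cool."}')
partial = extract_final_text(
    '{"message_type": "PartialTranscript", "text": "Data sci"}')
```

Filtering on final results is what produces the one-line-per-chunk behavior seen in the app: each finalized chunk is printed on its own line, while in-progress partial results are ignored or overwritten.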
And so I hope that you enjoyed the video, and let me know how you're gonna modify this particular web application and whether you found it useful. Thank you for watching until the end of this video. And if you've reached this far in the video, please drop a balloon emoji so that I know that you're a real one. And if you enjoyed the video, please also give it a thumbs up, subscribe if you haven't already, and also make sure to hit the notification bell so that you'll be notified of the next video. And as always, the best way to learn data science is to do data science, and please enjoy the journey.
