Build a Local Speech Recognition System
Learn to create a speech recognition and summarization tool using Vosk and Python, running seamlessly on local machines without needing a fancy computer.
Speaker 1: Hi, my name is Vic and today we're going to build a speech recognition and summarization system. And this system will be able to take any audio file, a podcast, the audio track of a video, lecture notes, meeting recordings, and turn them into a short summary. The amazing thing about the system we're going to build is it's all going to run on your local machine and it will be able to run using a CPU. You're not going to need a fancy computer or a GPU to actually make this work. We'll start out using a speech recognition library to transform audio recordings into text. Then we'll pass that into a summarization pipeline to actually generate a text summary of the audio file. When you finish this project you'll have a model that can actually recognize the text in an audio file using speech recognition and generate a transcript. And then another model that can actually summarize that transcript to create a few-paragraph summary. This is the summary of a roughly 60-minute-long NPR Marketplace podcast. The transcription isn't perfect, but you can see that the summary is pretty readable and it gives you a good gist of the podcast. Marketplace is a podcast about current economic news, so this is talking about Elon Musk trying to buy Twitter. You'll be able to use the same model on all sorts of audio files. So whether they're lecture notes, a recording of a meeting, or anything else, you'll be able to transcribe it and summarize it. Let's dive in and learn how to do it. To actually do the speech recognition we're going to use a Python library called Vosk. And Vosk has support for over 20 languages, although we're going to use English. And Vosk is pretty easy to use and the models are pre-trained, so you can just download them and use them on your own machine. So if we click over here on the left on models, we can see all of the English models listed out. We're going to be using this model, which is fairly large but is fairly accurate. And if you want a smaller download size, you can also use one of the other models. And I'll show you in a minute how you can actually download and use the models. What we'll do now is we will jump into JupyterLab, which is a great IDE, highly recommend it. And we'll create a new notebook and we'll start coding. There are a couple of files that you're going to need, and the link to these files is going to be in the description of this video. But there are two sample audio files, marketplacefull.mp3 and marketplace.mp3. And we're going to use these to basically test our transcription model. So if you have different audio files you can definitely use those, but I highly recommend using these if you can download them. So the first thing we're going to do is we're going to install Vosk using pip install vosk. Now we need to import. So from vosk we're going to import Model and KaldiRecognizer. Model is actually going to load the pre-trained model for us and the recognizer is going to use the model to actually recognize speech in audio files. So let's go ahead and run those imports. And then we're going to define a couple of constants. So frame rate, if you're familiar with audio, this is actually just the sampling rate. Speech recognition works best with a sampling rate of 16,000 hertz. Don't worry too much if you don't know what this means. This is just how high quality the audio is. The higher this number, the higher quality the audio is. And then we need to specify channels.
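Here's a minimal sketch of the setup described so far. The constant names FRAME_RATE and CHANNELS follow the narration; the values come from the sampling rate and channel discussion above.

```python
# Install the speech recognition library first:
#   pip install vosk

from vosk import Model, KaldiRecognizer

# Sampling rate the Vosk English models work best with (16 kHz), and mono audio.
FRAME_RATE = 16000
CHANNELS = 1
```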
So if you use headphones, you may notice that sometimes the right headphone has audio that's a little bit different from the left headphone. Like you may have drums that are a little bit louder in one headphone versus another. And what's happening there is you have two-channel audio. And those are essentially two different audio tracks that are both running and playing simultaneously in each ear. Audio recognition works best with one audio channel. So we're going to have to convert our audio to just be one channel. And that's why we're defining this channels constant. And then we can go ahead and load the model. So what we're going to say is model equals Model, and then we have to specify a model name. And as I mentioned, I'm going to use this model, which is pretty large. If you instantiate this, it will trigger a 1.8 gigabyte download. But if you want to use a smaller model, you should instead type this. And this model is only 50 megabytes, so it's a lot smaller. Then what we need to do is initialize our recognizer. So we need to set up a KaldiRecognizer, which is what we imported. We're going to pass in our model, so that's the pre-trained model we've loaded. And we're going to pass in our frame rate, which is the sampling rate of our audio. Then what we're going to do is we're going to say rec.SetWords(True). We don't actually need to do this; it's not 100% necessary. But this will show us both a complete text transcript and the individual words and the model's confidence in those words. So this helps us if we want to correct a mistake or anything else; it gives us some information that we can use to do that. So I'll go ahead and run this. And you'll see some warnings and log output here. Don't worry too much about this. This is just Kaldi loading the model and setting up the recognizer. Okay, so I should have capitalized the S, which is why I got an actual error at the end of all the warnings. But now that'll work. Okay, the next thing we need to do is load an audio file that we can pass into our recognizer to actually recognize the text in the speech. So to load the audio, we're going to use a library called PyDub. So let me jump back over to Chrome and I will show you PyDub. PyDub is a Python library that "lets you do stuff to audio in a way that isn't stupid," is what the creator says. But PyDub is basically a really nice way to work with audio. It's a lot easier than some of the other libraries in Python. And you can load audio really easily, and you can edit it really easily as well. So we are going to use that library to actually load our audio and then pass it into our speech recognition model. Okay, let's go back to JupyterLab and code. Alright, so if you want to install PyDub, it's just pip install pydub. Pretty straightforward. I'll run it. I already installed it, so it'll say requirement already satisfied. You may need to restart the kernel. If you want to restart your kernel, you just click Kernel at the top and then hit Restart Kernel. I found it's not really necessary for these packages. But if you want to be safe, you definitely can do it at the top of Jupyter Notebook or JupyterLab. If you're not using Jupyter Notebook or JupyterLab, the kernel restart may be in a different place. But I don't know what you're using, so I can't advise you in that case. Alright, now what we can do is import from PyDub; we're going to import AudioSegment. And this is a class that enables us to load audio and manipulate it. So I'll go ahead and run that.
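A sketch of loading the model and setting up the recognizer as described above. The exact model names here are taken from the Vosk models page (the large ~1.8 GB English model and the small ~50 MB one); treat the versions, and the automatic download by name, as assumptions and check the page for what's current.

```python
from vosk import Model, KaldiRecognizer

FRAME_RATE = 16000

# Large, more accurate English model (~1.8 GB download)...
model = Model(model_name="vosk-model-en-us-0.22")
# ...or, for a much smaller download (~50 MB), the small model instead:
# model = Model(model_name="vosk-model-small-en-us-0.15")
# If your vosk version doesn't download by name, pass the path to an unzipped model folder.

rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True)  # optional: include per-word confidences and timings in results

# PyDub is used next to load and reshape the audio:
#   pip install pydub
from pydub import AudioSegment
```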
And then we can load in our specific audio. So I'm going to load in audio from a podcast called Marketplace, which is an economics podcast in the US. And if you looked at the readme that I sent out earlier, it lists these files, marketplace and marketplace full, and gives you download links to them towards the bottom. But this marketplace.mp3 is a 45 second segment from a Marketplace podcast. So this is a nice baseline for our speech recognition. So we're first going to load the file. Then what we're going to do is set the number of channels in our audio. By default, this audio is two channels, but we're going to use PyDub to reduce it to one channel. And then we're going to set the frame rate. By default, most podcasts are around 44,100 hertz; that's what their frame rate is. But we're going to lower that to 16,000. So let's go ahead and run that. Okay, great. So what we've done now is we've initialized our model that can recognize speech and we've loaded in the audio file that we're actually going to do speech recognition on. Now all that's left to do is pass the audio file into the speech recognition model. And to do that we call the recognizer we instantiated earlier, this KaldiRecognizer, and we're going to call the AcceptWaveform method, which is a method that lets you pass in audio data. And we're going to pass in mp3.raw_data. So this is just, actually, I'll show you what it is. mp3.raw_data. This is a binary representation of the actual data in the mp3 file. So that's what we're going to pass into our recognizer. And then what we're going to say is result equals rec.Result(). And this will give us the result of the speech recognition. Now we have our result. So let's check what our result is. So this is actually a JSON string. I don't know why Vosk returns JSON, but it does. So we actually have to load the JSON before we can use it. So I'm going to import the json library. And I'm going to say text equals json.loads(result)["text"]. And this will convert it into a Python dictionary and pull out the text. So let's check this out. All right. And you can see this is now a text transcript of the 45 second audio clip. And actually, let's just take a look at what the whole result looks like. So you can see that because we passed in the SetWords(True) parameter here, we now have the individual confidences for each word and the timings. So if what you wanted to do is adjust the transcript, you could look for words that the model had low confidence on. And you could see if you needed to change those. Okay, so let's hit X here. All right, there's one big problem with this, which you probably have seen. There's no punctuation. It's all one long run-on sentence, which is really, really hard to read. Don't worry, we will fix that shortly. We will add punctuation in. Okay, so we now have our transcript and it's not punctuated at all. So we need to fix that. And that's the next thing we'll do. All right, now we're going to use another library to actually add in the punctuation. So I'll go back to the browser window, and I'll show you what that library is. So that library is called recasepunc. And what this does is you pass in text that doesn't have any punctuation, and it adds the punctuation. And it has a pre-trained model that helps you do that. And in fact, Vosk has actually trained their own models using recasepunc to convert Vosk output into output with punctuation.
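Putting that together, here's a sketch of loading the 45-second clip, converting it to mono 16 kHz, and running recognition. The file name is the tutorial's sample file; it assumes the model and recognizer from the previous sketch are already set up.

```python
import json
from pydub import AudioSegment

mp3 = AudioSegment.from_mp3("marketplace.mp3")
mp3 = mp3.set_channels(CHANNELS)      # stereo -> mono
mp3 = mp3.set_frame_rate(FRAME_RATE)  # e.g. 44,100 Hz -> 16,000 Hz

rec.AcceptWaveform(mp3.raw_data)  # raw_data is the binary audio payload
result = rec.Result()             # JSON string with "text", per-word "result", etc.

text = json.loads(result)["text"]
print(text)  # unpunctuated transcript of the clip
```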
So if we scroll down on this Vosk models list, and we go down towards the bottom, we can see these punctuation models. So what we need to do is actually download this model. And it's a big file. So if this file is too large for you right now, don't worry about downloading it; you can actually skip this piece and things will still be okay, and you can continue following along. But if you do want to add punctuation, you'll need to download this. And this will download a zip file. And then you'll need to unzip that file. Let me jump back to JupyterLab. And I'll show you how I've unzipped it. So I've unzipped it into this folder on the left called recasepunc. And if you download and unzip it, you should see a few files here. So you'll have this recasepunc.py file, you'll have an example, and you should have this checkpoint file. So this checkpoint file is actually the pre-trained model that will add the punctuation. And you definitely need this file, because this is the file we will actually call to do our inference. So you'll need this checkpoint file and this script, but the zip file will have all of this stuff. So the next thing we're going to do is go ahead and use that recasepunc model to add punctuation. And the easiest way to do this is actually to call the script that has already been written. What we could do is try to write our own Python code or import the script and run it. But that is a little bit more complex and would require more code. So what I'm going to do is I'm going to use subprocess. And what subprocess does is it basically uses Python to run a terminal command. So in this case, the terminal command I'm running is calling the Python interpreter to run another Python file, which is this recasepunc.py file that was in that zip of the model. And then we're going to say predict, and then we're going to pass in the checkpoint, which is the pre-trained model. OK, so what we could do is we could import this recasepunc.py file as a Python module, like you could say import recasepunc. The problem with that is you need to write a good amount of code to actually get it to work properly. And instead of writing that code, we can just call the script they've already created. And then we're going to need to pass in shell equals True, which means run this in the user's shell. We're going to pass in text equals True, which means we're going to pass in some text input. And then we're going to pass in our input, which is going to be this transcript that we just created. OK, and then we'll assign this to a variable called cased. OK, yeah, so if you're running your Jupyter notebook in a virtual environment, then what happened is you installed these packages like pydub into a virtual environment. And if that's what happened, then you need to use the same virtual environment to run this command. You will also need to install a couple of packages to get this to work. So you'll need to install a package called transformers. So pip install transformers, which you can run. You'll also need to install PyTorch, so torch. And transformers is basically a package that will let you access a lot of different pre-trained models. It connects to the Hugging Face Hub, which I'll show you a little bit of later. There's a lot of cool pre-trained models there. And then torch is PyTorch. So PyTorch is a library created by Facebook that basically makes training deep learning models a lot easier. So this is the PyTorch page.
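A sketch of calling the recasepunc script via subprocess, as described above. The paths assume the zip was unzipped into a folder named recasepunc next to the notebook; adjust them to wherever your recasepunc.py and checkpoint actually live, and make sure transformers and torch are installed in the same environment.

```python
import subprocess

# pip install transformers torch   (both are needed by recasepunc)

cased = subprocess.check_output(
    "python recasepunc/recasepunc.py predict recasepunc/checkpoint",
    shell=True,   # run the command through the shell
    text=True,    # treat input/output as text, not bytes
    input=text,   # the unpunctuated transcript from the recognizer
)
print(cased)  # transcript with casing and punctuation restored
```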
And in order to run the recasepunc model, we need PyTorch because it uses PyTorch code. All right. Let's jump back to coding. All right. So you run pip install torch. You don't actually need pip install regex. So pip install torch and pip install transformers. And then you can actually run this command. I'll change the order here. Then you can run this command to add casing to your text. And then if you type in cased, you will get the transcript with full punctuation. So you can see, for example, Fed is capitalized. Fed is the Federal Reserve; it is a proper noun. You can see commas have been added and periods have been added to break up sentences. And it actually looks pretty good. This transcript is pretty readable, and it's fairly accurate, which is great. The next piece is I'm going to write a function that can actually do all of this in one shot. So you pass a file name into the function, and it will create the transcript, and it will go ahead and add in the right casing. So this function is going to be called voice_recognition. All right. So first, we're going to define the model, which we did earlier. So I'm going to copy and paste some of this code. So this code specifically. We first have our code to define our model. Then what we're going to do is, again, load our mp3 file. Okay. So I'll copy and paste that code down. But instead of marketplace.mp3, we're going to load the file name. So this function will take a file name, and then automatically do speech recognition on that file. Now, what if you pass in a file that's longer than 45 seconds? Vosk doesn't work really well on audio segments that are longer than a minute or so. Memory usage goes up. Inference gets really slow. So what we're going to do is actually batch longer audio files up into little pieces. So let's create a variable called transcript and a variable called step. And then what we'll say is for i in range(0, len(mp3), step). So the step will basically do voice recognition on about 45 seconds of the audio file at a time, and then we'll stitch it all back together. So we're going to print. We'll say i divided by len(mp3). So this will just give us a progress percentage, because transcription can be really slow for long files. Then we'll take a segment of our mp3 file. So we'll say i to i plus step. So this will take 45 seconds at a time. The first time through the loop, i will be zero, and i plus step will be 45,000, and then so on. So it'll break the file into small segments. And then we'll do what we did before. We'll do rec.AcceptWaveform, which I'll go copy from above. So we're going to do this. Instead of passing in mp3, though, we're going to pass in segment.raw_data. We'll get the result, and then we'll go ahead and actually load the text out of the result. And then we'll add that text to our transcript. Okay. So this loop is, as I mentioned, just stepping through the file, not exactly 45 seconds, but 45 seconds or so at a time. And then it is transcribing that segment and adding it to the overall transcript. And then when we're finished with that, we will go ahead and add in casing to our transcript. We need to change the input to transcript. And then we'll go ahead and return cased. So this function will transcribe an audio file, add punctuation, and return it. Oops, a typo in my range call there. Okay. So let's try this. Let's do voice_recognition("marketplace.mp3") and see what happens here. So first we are loading and running the Vosk model.
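Here's a sketch of the full voice_recognition function described above, batching the audio roughly 45 seconds at a time (pydub slices are indexed in milliseconds, so the step is 45,000). The model name and the recasepunc paths are the same assumptions as in the earlier sketches.

```python
import json
import subprocess
from vosk import Model, KaldiRecognizer
from pydub import AudioSegment

FRAME_RATE = 16000
CHANNELS = 1

def voice_recognition(filename):
    model = Model(model_name="vosk-model-en-us-0.22")
    rec = KaldiRecognizer(model, FRAME_RATE)
    rec.SetWords(True)

    mp3 = AudioSegment.from_mp3(filename)
    mp3 = mp3.set_channels(CHANNELS)
    mp3 = mp3.set_frame_rate(FRAME_RATE)

    step = 45000  # 45 seconds, since pydub indexes audio in milliseconds
    transcript = ""
    for i in range(0, len(mp3), step):
        print(f"Progress: {i / len(mp3):.0%}")  # transcription is slow on long files
        segment = mp3[i:i + step]
        rec.AcceptWaveform(segment.raw_data)
        result = rec.Result()
        transcript += json.loads(result)["text"] + " "

    # Add casing and punctuation back with the recasepunc script (paths are assumptions).
    cased = subprocess.check_output(
        "python recasepunc/recasepunc.py predict recasepunc/checkpoint",
        shell=True, text=True, input=transcript,
    )
    return cased

# Usage:
# print(voice_recognition("marketplace.mp3"))
```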
We actually probably don't need to reload it. If it's already loaded, we could just pass it in. That's an optimization we can make. Then it's going to do the actual recognition, and then it's going to return the text. So now it's doing the actual casing piece of it, and we should get the result back. So perfect. So this function can actually work with audio files of any length. So you could pass in a longer file. If I had time, I would pass in marketplace full, but that's going to take five minutes or so, and you probably don't want to watch for five minutes. Okay. So we have our function now that does voice recognition. The next thing I'm going to do is actually write the summarization code. So earlier we installed transformers with pip install transformers, and we're actually going to use the same library to do our summarization. So let me jump back over to Chrome and I'll show you Hugging Face. So Hugging Face is a site that has lots of pre-trained models that have been uploaded, and they also publish the transformers library. And through that library, you can use a lot of these pre-trained models. So they have several pre-trained models for summarization, and we're going to use one of those models to actually summarize our text. And if you ever want to explore Hugging Face, there are tons of models for all sorts of different NLP tasks you can check out. It's a great site and a great tool. Okay. So let's go back over here and go ahead and do that. So in order to do that, first we're going to say from transformers import pipeline, and then we're going to initialize our pipeline: summarizer equals pipeline("summarization"). And by default, this will use a large model, which is about, I think, one and a half gigs to download. If you want to use a smaller model, just type in this: specify that the model is t5-small, and I think that's around 50 megabytes. So that should be a much faster download. Of course, it's going to be less accurate, but it will be faster to download. So let's go ahead and initialize that. And then what we're going to do is we need to split up our transcript into pieces. So it's not a big concern here, because I didn't transcribe a long audio file, but if you did, you would want to split your transcript up into smaller pieces, because the summarization models in Hugging Face have a length limit, and that limit is 1024 tokens. So a token could be a word, it could be a digit, but you don't want to go over that limit. Otherwise the model won't work. So what we're going to do is we're going to split our transcript on spaces. This isn't a hundred percent analogous to how Hugging Face tokenizes the input, but it's somewhat close. So what we'll say is for i in range(0, len(split_tokens), 850). So we're going to step through this transcript 850, not quite words, but things that are separated with a space, at a time and pass those into the Hugging Face summarization model. So what we'll say is selection equals split_tokens[i:i + 850]. Okay. So this is basically the same thing we did here: we processed the audio file in batches, and we're doing the same thing with our summarization; we're basically going to create a list. I didn't create a transcript of the full episode just now, so I actually transcribed a file earlier, transcript.txt. This is in the Git repo, but this is a transcript of an entire Marketplace episode. So I'm going to actually go ahead and open this file and say transcript equals f.read().
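A sketch of the summarization setup and the naive 850-word chunking described above. The default summarization model and the t5-small alternative are as mentioned in the tutorial; the 850-word chunk size is just headroom under the 1024-token limit.

```python
from transformers import pipeline

summarizer = pipeline("summarization")                        # large default model
# summarizer = pipeline("summarization", model="t5-small")    # much smaller alternative

# Read the full-episode transcript produced earlier.
with open("transcript.txt") as f:
    transcript = f.read()

# Naive chunking: split on spaces and take ~850 "words" per chunk,
# leaving a buffer under the model's 1024-token limit.
split_tokens = transcript.split(" ")
docs = []
for i in range(0, len(split_tokens), 850):
    selection = " ".join(split_tokens[i:i + 850])
    docs.append(selection)
```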
Okay. So I'm going to go ahead and read that in. And then what we can do is split it up. So if we look at docs, we see we now have a list where the text has been split up. It's not the smartest way to split up text. Ideally we would actually use the Hugging Face tokenizer to tokenize the input and then split based on something logical, like sentence markers. But this is a fast way to split and it works. It'll work reasonably well for our purposes, but there's a lot you can do to improve accuracy here. And then what we'd do is we would say summaries equals summarizer(docs). And it will take this list of text with about 850 words in each piece of text, and it's going to summarize each one. This is going to take a little bit. All right. So now we have summaries. So this is a list of dictionaries, a little bit hard to read. So we're just going to clean this up, and we're going to say summary equals "\n\n".join(d["summary_text"] for d in summaries). So this is going to loop through this list, and it's going to grab the inside text from each dictionary, and then it's going to concatenate all of them together. So let's take a look at summary. Actually, let's print summary. And we now have a summary. And this is a summary of about a 60-minute podcast episode. So we've really condensed it down. The summary isn't perfect because of how we tokenized and kind of how we split the sentences up. But it's pretty good. It's pretty readable, and you can tell what the podcast was about. And I just love this sentence that's in the summary. My favorite. This episode, if you're not familiar, was about Elon Musk's attempted takeover of Twitter. It also had something about inflation in the US and some of the other things that are going on economically in the US. So yeah, we now have a summary. So we did a lot here. We went from an audio file. We built a recognizer that actually recognized that audio and converted it to text. Unfortunately, that text didn't have any punctuation, so we added in the punctuation using another model. And then we ended by actually building a function that can transcribe long audio files. And once we had that function, we were able to build a way to summarize what came out of that function. So what you could now do is build a system that can automatically summarize lecture notes, recordings, podcasts, whatever else you want to summarize. One thing that I did want to talk about, but I'm not going to have time to talk about, is actually doing this live. So creating a button you can hit to record audio and transcribe on the fly. But hopefully I'll have a video about doing that next week. Autumn, can we install these local setups in Visual Studio Code? Yes, you can. I'm using a Jupyter Notebook inside JupyterLab, but you can use Jupyter Notebooks in VS Code also. Can you talk about the trade-off of the size of the model versus the accuracy? Yeah, if you look at the Vosk page, it actually lists the benchmark values for each model. So you can actually see the trade-off you'll be making. It has different datasets they've benchmarked each model on. Will this work in Google Colab? Yes, this should work in Google Colab, although there may be memory limits that you'll hit there. I'm not sure what the memory limit on Colab is. What audio format does Vosk ingest? It generally will ingest WAV data. So I'll show you how to make that.
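Going back to the summarization step for a moment, here's a sketch of the final piece: summarizing each chunk and joining the results into one readable summary. It assumes the summarizer and docs list from the previous sketch.

```python
# Summarize each ~850-word chunk; this can take a while on CPU.
summaries = summarizer(docs)

# Each item is a dict like {"summary_text": "..."}; join them into one summary.
summary = "\n\n".join(d["summary_text"] for d in summaries)
print(summary)
```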
The audio files never actually leave my local machine. So that's a great question. The model is local. All of this is local. We're not using the cloud at all, other than to download models. But once they're downloaded, we're running them on our computer. Raja, yes, you can use larger audio files. I'll show you a large audio file later. You actually have to batch it and run it in batches. So basically 45 seconds at a time is what you have to do to actually do recognition across those files. Why do we reduce the channels to one? So if you think about a speech recognition model, it's a lot easier to train a model on one channel than two channels of audio, because two channels is basically two separate audio files. And really, you just need one audio file to train speech recognition on. It's not really two files, it's two tracks, but they're two different tracks with slightly different timings and everything. It may add a little bit of accuracy to a model if you have two channels, but it's a lot more complex to train the model. So most pre-trained audio recognition will be done on one-channel audio. And if the pre-trained model was trained on one channel, you obviously can't do inference with two channels. So that's why we're cutting it down to one channel: because the model was trained on one channel, we also have to pass in data that's only one channel to actually use it. Can I set Vosk to use GPU resources instead of CPU? Yeah, I'm using CPU here because it's just a lot easier, but you should be able to use GPU with Vosk. You probably will get some speedup on inference, and you can definitely check that in the Vosk documentation. So if you're not familiar with that question, you can train deep learning models and do inference on CPUs. It's a lot slower than if you're using a GPU, but not everyone has a GPU, and configuring machine learning or deep learning frameworks to use a GPU can be really tough depending on what platform you're on. So it's usually easier to use the CPU if you can, if performance isn't super important. And if you're doing inference, usually CPU is okay. But if you're doing any training or fine tuning or anything, you definitely should use a GPU. Autumn, what do you suggest beginners start with to get to this level? So I would recommend you have a good understanding of Python and web APIs. You actually don't need to understand the internals of machine learning and deep learning to do this, because we're using pre-trained models, but it is very nice to actually be able to understand what's going on under the hood. But don't let that stop you, right? You can call all these libraries once you have a basic understanding of pandas and how to install packages and how to use requests and call different web resources. Does Vosk support live stream connectors? So no is the short answer. You need to download audio and then transcribe it. But what you can do, and what I'll probably show in the next video, is you can record a segment and then transcribe it. And those segments could be 30 seconds. So you can kind of fake doing it live. How would you host these? So you would need to run a web server, and you would need to essentially have a backend that runs all the stuff that you're calling from a front-end web service, which is a little bit out of scope for today, but maybe I'll do it in a future video. Rabia, Vosk has a lot of different models. I don't know if they have an Urdu model specifically. Let me check.
They may, but if not, there's probably a pre-trained Urdu model somewhere in some library you can use. Or you can train your own, which you can do with Vosk as well. So it will work on different languages. If you go to the Vosk models page, which I'll share in chat, there are several different languages listed there. You can also train or fine tune a model yourself. Peter, so yeah, two things: one, learning how to do this stuff on your own is interesting, and you can build projects that can go into your portfolio using this. Two, Google recognition tools, so Google Cloud, or AWS: if you're going to use those, they're actually pretty expensive. And if you just want to build this for a hobby project and transcribe podcasts, it actually gets really expensive to do that. So doing this on your own machine can be a lot cheaper and a lot more flexible depending on which languages you want to support, et cetera. I'm not familiar with the fast.ai libraries, so I don't know if they wrap Vosk or anything else. How can you train your own language model? Yeah, so there are ways to train your own language model using Vosk or other frameworks. It's a little bit out of scope of what I'm going to talk about today, but I may talk about that in a future session. Can this functionality be wrapped in a tiny AI sort of device? So Vosk does have a small model that you can use on embedded devices like a Raspberry Pi or something like that. And some of the other models we'll show today do as well. All right. If not Vosk, what library can do online transcription? So I'm not sure if there are Python libraries to do online transcription. I don't think you'll get great accuracy doing online transcription. Typically most speech recognition is done after the fact, right? Like you say something and then the recognition happens. If you wanted to fake live transcription, you could do it by transcribing 30 seconds at a time. Peter, I'm not familiar with tiny AI. I think you would need at least four gigs of RAM to run this, but I haven't tested that exhaustively. The recasepunc script is from the zip file download. So let's go back to Chrome and let's go over here to the Vosk page. So in Vosk models, if you scroll down on this models page, you'll see towards the bottom there are these punctuation models. You want to download this. And when you unzip this file, the recasepunc script will be in there. On the second-to-last line of the transcript, it says "LH dealing." Okay, let's go back and take a look. So I showed you before, you can see the word confidence. So if you go up, if you look at json.loads(result), that is what I want to do. So if we load the result with JSON, you can see the confidences for each word. So if you go down and you look through this, you can start to see there are some words where confidence is pretty low. And what you could do is go through and correct these manually or highlight them in some way when you display the transcript. So that's what these confidences are. I don't know if Vosk works for multilingual data. I assume you're talking about an audio file with multiple languages in it. I think you would need to train a custom model in that case. What happens to the words at the cuts? It depends on whether they finished or not. If they've been truncated, then you will get poor recognition at the cuts.
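For the question about word confidences, here's a sketch of pulling the per-word confidences out of a recognizer result (available because SetWords(True) was set earlier) and listing the low-confidence ones. The 0.9 threshold is an arbitrary assumption you'd tune for your audio.

```python
import json

parsed = json.loads(result)  # result is the JSON string returned by rec.Result()

# With SetWords(True), "result" holds one dict per word: word, conf, start, end.
for word in parsed.get("result", []):
    if word["conf"] < 0.9:  # threshold is an assumption; tune it for your audio
        print(f'{word["word"]:>15}  conf={word["conf"]:.2f}  at {word["start"]:.1f}s')
```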
So if you wanted to improve this, what you would do is build a script to actually search for silence in the audio file and cut at the silence. Utpal, so what you would need to do is extract the audio track from your video file. And I don't know if you can do that with pydub, but there are Python libraries that let you do that; see the sketch after this answer for the silence-splitting idea. Can we use languages other than English? Yes. Vosk actually has pre-trained models for over 20 languages. There are also other Python speech recognition libraries that have other languages. Cloud? Yeah, so there are ways to do this in the cloud if you want to pay, like Google Cloud. Hosting this in the cloud could be complex, but yeah, that could be a good idea and an interesting future video. Is there a word like TTYL? Does Vosk convert it to "talk to you later"? So it depends on how the model was trained. So if we go up to Vosk, we loaded a pre-trained model here called vosk-model-en-us. It depends how that model was trained. Those models are typically trained by passing in audio files and transcripts of those files, and those are used to train the model. So if those transcripts converted TTYL into "talk to you later" when someone said it, then the same thing would happen when we call the model; it just really depends how it was trained. Thanks, Peter. Appreciate that. Why 850 again? So the maximum token length for this summarization pipeline that we're using is 1024, and a token is approximately equivalent to a word, but not a hundred percent equivalent to a word. So we can take about 1024 words in each chunk that we pass into our summarization pipeline. Now we're using a really naive way to split the transcript into words: we're just splitting based on a space. So this split is different from what Hugging Face is doing. So 850 just gives us a little bit of a buffer between the 1024 limit and how we're actually splitting. Ideally, we would split the same way Hugging Face does, and you can actually do that. You can use the transformers library to actually tokenize your input, but that's a lot more code, and I wanted to show you something a little easier to work with. Can you turn text to voice? Rosaline, you can do that. There are lots of libraries to do that. You can actually do that using Hugging Face transformers, but using a different pipeline. So if we go back to Hugging Face over here, the name always gets me, Hugging Face, and you click up here on models, you can see models that do a lot of different tasks. So you can do things like answering questions, automatic speech recognition, etc. And I think you have text to speech. Yeah, text to speech is a category here. So there are a lot of different models that can convert text to speech. And if you click on one of these models, it tells you how to use them. So that can help you. If you have a large one, yeah, you absolutely can, Utpal. Of course, you will need to break the text up and batch it in order to pass it through the pipeline. But yeah, you can do that. Will this work for an audio file in other languages? So yes, let's go back to Vosk over here. And you can see these Vosk models. So there are models in a lot of different languages, including Indian English. So that's one; Chinese, Russian, French, and I think Hindi is here as well. And if you can't find a model in Vosk that you want to use, there are other frameworks, other packages out there that can do voice recognition as well. Or you can train your own.
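Here's the silence-splitting idea mentioned above as a sketch: splitting on silence with pydub instead of fixed 45-second steps, so cuts land in pauses rather than mid-word. The file name is the tutorial's long sample, and the thresholds are assumptions you would tune for your audio.

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_mp3("marketplacefull.mp3")
audio = audio.set_channels(1).set_frame_rate(16000)

chunks = split_on_silence(
    audio,
    min_silence_len=500,             # a pause of at least 500 ms counts as a break
    silence_thresh=audio.dBFS - 16,  # 16 dB below average loudness counts as silence
    keep_silence=250,                # keep a little silence so words aren't clipped
)
# Each chunk can then be fed to rec.AcceptWaveform(chunk.raw_data) as before.
```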
Soumya, yes, you can fine tune the summarization model from Hugging Face. I don't know exactly the format of the data, but if you go to Hugging Face, you can look at the documentation on how to do that. What's the easiest way to highlight the words with conf less than one? So I did this for the COVID genome, where I was analyzing COVID genomic data. There's a certain way you can output data in here. So you can basically use IPython display functions to color certain characters, or certain letters or words or whatever. So you'd want to use this display function and this type of color print function to actually color the actual words. And then you would want to loop through that dictionary that came back from Vosk, identify any words with low confidence, and actually color them, so you could see them really clearly in the output. How would you add a feature to this which actually recognizes the person who is speaking? Yeah, so these are called speaker identification models. Vosk, for example, has a speaker identification model, so that is what you'd want to use there. Why Vosk? So the reason I use Vosk is because Vosk is fairly high level. So what I showed you, speech recognition, you can do with a lot of different frameworks. The nice thing about Vosk is there's not a lot of code you have to write. Okay, so with Vosk, you can actually do all the speech recognition in very little code. So you'll notice we were basically able to create the model really easily and run it really easily. With other frameworks, you have to write a lot more code. And for a live tutorial, we want to write as little code as possible. Can I change from one language to another? So typically, most voice recognition models are trained only on a single language. So if you were switching languages in your audio, you would need to switch models. Is there a way to use this as an app? You absolutely could create this as a web backend and then pass data to that backend, like pass audio files to the backend. Running this at scale on a server is going to be a lot more complex than running it on your own machine. But I might do a future video about that. Is there a model to cancel the noise? Yes, there are. One popular one is called RNNoise, and it's used to filter out background noise. You can also just try something simple like a noise gate or a noise floor to see if you can cut out certain frequencies. How many people can the speaker identification system support? It really depends how many it's trained on. I'm not sure about the Vosk one. I haven't used it, but you could try it out and see how many speakers it can identify. How would you modify the model to recognize who is talking? So you'd want to use a speaker identification model to do that. There is a speaker identification model in Vosk, which I showed you, and there are also other speaker identification models. JRNA? Absolutely, you can do that. So what you would need to do is extract the text from the PDF; there are Python libraries that do that. Then you would need to run TTS on the extracted text. So yeah, you can absolutely do that. Typically, Ashish, you're not going to get a lot of benefit by using stereo audio for speech recognition, because one channel has most of the information that you're going to get in terms of speech recognition.
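For the highlighting question, here's a sketch using IPython's display utilities to render low-confidence words in red inside the notebook. The 1.0 threshold follows the question as asked, and the red styling is just one option; it assumes a result string from rec.Result() with SetWords(True) enabled.

```python
import json
from IPython.display import HTML, display

parsed = json.loads(result)  # JSON string from rec.Result(), with SetWords(True) enabled

pieces = []
for word in parsed.get("result", []):
    if word["conf"] < 1.0:
        pieces.append(f'<span style="color:red">{word["word"]}</span>')
    else:
        pieces.append(word["word"])

display(HTML(" ".join(pieces)))  # low-confidence words show up in red
```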
It might be useful in a speaker identification model, though, to have two channels, because you can tell which speaker is which based on which channel the audio is coming from. Sonchon, you can find all of our previous webinars on YouTube. So let me give you a link there. Yeah, please subscribe to our channel. Love getting more subscribers. Nooktana, I'm referring to the Google and AWS speech recognition APIs. So they do cost you money to use. And if you're transcribing long audio like a podcast, it's going to get really expensive. So I wouldn't recommend using them. But they are pretty accurate, and they obviously do everything for you. Are there any libraries that may be able to summarize the output even further? Justin, what you can do if you want to summarize the summary is run the summarizer again on the summary to get something even shorter. All right, well, thank you, everyone. Have a great rest of your day, wherever you are in the world. Have a good evening, morning, night.
