Speaker 1: With the rise of use cases such as chatbots, virtual assistants, and increasing accessibility, applications using Automatic Speech Recognition, or ASR, are becoming more and more popular. In this video, we're going to be talking about automated speech recognition on Google Cloud with our very own Speech-to-Text API. Speech recognition models are great for generic, broad use cases, but as you get into specifics, certain nuances with domain-specific knowledge and proper nouns are not always captured by out-of-the-box models.

Hi everyone, I'm Anu Srivastava, and in this video I'll walk through how you can boost ASR accuracy without having to train your own custom model. We'll be using the Speech-to-Text API, which allows you to accurately convert audio to text. Some of the highlights of this service are multi-speaker detection if you have a recording of several people conversing, streaming speech recognition, profanity filtering, and support for over 125 languages. Fun fact: it's available on-prem as well.

To start, let's run a sample with the API to see how it works. Here in my notebook, I have a code sample set up that will transcribe the audio, then print out the results. First, I import the client library and create a client to call the API from. I grab the sample audio file, which is stored in Google Cloud Storage, from the bucket URI. Then I set a few configuration variables and call the API. Let me play you the sample audio file: "Hi everyone, my name is Anu, I live in New York, and I work at Google Cloud." Now I'll run it in my notebook, and we have the transcription.

Now that we have the transcription, how do I know the quality of the output? What if I need to automate bulk transcripts for a business-critical need? This is where we need to define speech accuracy, which is especially important once we start working with audio transcription at scale. One standard way to measure it is called Word Error Rate, often abbreviated as WER. Word Error Rate measures the percentage of incorrectly transcribed words in the entire set, so the lower the Word Error Rate, the more accurate the system. It combines the three types of transcription errors that can occur: WER = (S + D + I) / N, where N is the number of words in the ground truth. First, we have substitution errors (S): words that are present in both the hypothesis and the ground truth but are not transcribed correctly. Then deletion errors (D): words that are missing from the hypothesis but present in the ground truth. Finally, insertion errors (I): words present in the hypothesis transcript that are not present in the ground truth, a.k.a. the correct answer as defined by us.

When we're doing this at scale, there could be several imperfections in our transcription, which is why we have some handy tips and tools to help you improve accuracy. First, customize the model to your domain by providing contextual information. Let's say you are creating a bot that allows people to order pizza; you might want to increase the probability that words like pepperoni, olives, and mozzarella are recognized. Second, tweak the weights to address specific word or phrase issues. Say you're trying to recognize proper nouns, rare words, or even made-up words. It's unlikely that these will be transcribed correctly initially, so biasing towards them can fix individual terms. Third, use context to bias towards specific types of information or words. Let's say we have an IVR telephone system and you have just asked someone for their order number; you can bias specifically towards an alphanumeric entry.

Let's check this out in my notebook. Here I have some code that pulls a longer recording of me talking about my favorite restaurants nearby. I've recorded this and uploaded it to a Google Cloud Storage bucket. Let me play the audio sample for you: "Hey everyone. My name is Anu and I live in New York City. My favorite restaurant is Estella. It's on Houston Street in the Lower East Side neighborhood, which we like to call LES. I also love cuisines from around the world, such as at places like Sichuan Mountain House and Balad." Here we will run the sample through the API. Then we implement a helper to calculate and display the WER for us (a minimal sketch of such a helper appears at the end of this transcript). In this sample, we actually have the ground truth, the correct transcript for our accuracy calculation, written out up here. Let's see how we did. Pretty good, but I think we can boost the accuracy and lower the word error rate.

How can we lower the word error rate? We can do this by using the Speech Adaptation API. The Speech Adaptation API allows users to pass phrases and associated weights directly to the Speech API. These phrases can be changed with every request, which allows for both quick iteration and on-the-fly adaptation. All you have to do is include the terms in the request itself as part of the recognition config. Here's an example of what a simple adaptation would look like.
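A minimal sketch of what such a request might look like with the google-cloud-speech Python client is below. The bucket URI, file name, and boost value are illustrative assumptions, not taken from the video; the phrase spellings follow this transcript.

```python
# Sketch: phrase biasing (speech adaptation) with the Python client.
# The bucket URI, file name, and boost value are illustrative assumptions.
from google.cloud import speech

client = speech.SpeechClient()

# Longer audio stored in Cloud Storage, as in the demo.
audio = speech.RecognitionAudio(uri="gs://my-bucket/restaurants.wav")

config = speech.RecognitionConfig(
    language_code="en-US",
    # Bias recognition toward the proper nouns we expect in the recording.
    # For the IVR case above, a class token such as
    # "$OOV_CLASS_ALPHANUMERIC_SEQUENCE" could be added to the phrases.
    speech_contexts=[
        speech.SpeechContext(
            phrases=[
                "Estella",
                "Houston Street",
                "LES",
                "Sichuan Mountain House",
                "Balad",
            ],
            boost=15.0,  # strong context: we're sure these words occur
        )
    ],
)

# long_running_recognize handles audio longer than about a minute.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

for result in response.results:
    print(result.alternatives[0].transcript)
```

Because the phrases ride along with each request, they can change call by call, which is what enables the quick iteration described above.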
How should you form your adaptations? What are the right terms to provide? There are a few things to consider. First, what am I doing with this transcript? Is there a downstream system that will be sensitive to particular words or phrases? Those words should get a bias towards them. Are there rare words or proper nouns? These should also get a bias boost. What contextual information can I use? Do you know what words someone might say, or what they said in the past? These can be biased towards to help increase accuracy, even on commonly occurring words, if you are sure they will be present. Do you have strong or weak context? You can bias heavily with strong context if you are sure the user is about to mention some specific words. You should bias less if you have weak context, meaning you know that certain words will occur but not exactly when or where.

Let's go back to our demo sample. We're talking about restaurants, and we have some proper nouns for the locations. Let's try computing the WER score again with these boosted phrases. Voila! We were able to get the word error rate down from our previous score. With a longer transcript, you can keep iterating on this to fine-tune your results.

There you go. That's one way you can boost your accuracy on pre-trained models without having to build a model from scratch. We hope this was useful, and we can't wait to see what you do with the Speech-to-Text API. Thanks for joining, and check out the lab linked below to try boosting speech accuracy yourself.
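For reference, the WER helper used in the demo isn't shown on screen; here is a minimal sketch of one way to write it, assuming the standard definition above: compute a word-level edit distance between ground truth and hypothesis, then divide by the number of ground-truth words.

```python
# Sketch of a WER helper: WER = (substitutions + deletions + insertions) / N,
# where N is the number of words in the ground truth.
def wer(ground_truth: str, hypothesis: str) -> float:
    ref = ground_truth.lower().split()
    hyp = hypothesis.lower().split()
    # Word-level edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[-1][-1] / len(ref)

# Example: one substitution in a four-word reference -> 25% WER.
print(f"{wer('my name is Anu', 'my name is Anna'):.0%}")
```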