Enhancing Google Cloud Speech-to-Text Accuracy
Callum Barnes discusses strategies to measure and improve speech-to-text accuracy using Google Cloud tools, focusing on word error rate.

Speaker 1: Hi, my name is Callum Barnes from Google Cloud. I'm the product manager for Cloud Speech. And today, I'm going to be talking about measuring and improving speech-to-text accuracy with the Google Cloud Speech product. First, I'm going to give you a little bit of an overview of Cloud Speech-to-Text. Then I'm going to talk about how you can measure the accuracy of speech-to-text on your own data, then what you can do using our tools to improve that accuracy once you've measured it. And finally, I will do a quick worked example at the end and link you out to some of the companion tools and examples that we've created to go with this Next OnAir presentation. So Google Cloud Speech-to-Text is an API which accepts audio, identifies speech within that audio, and returns the text representation of that speech. This happens in real time or in batch mode, and we support both across all 71 languages and 127 local variants that we offer. We also have, for each one of these languages, a massive vocabulary that goes far beyond the dictionary and covers local vernacular, proper nouns, et cetera. Today, we're going to be especially focused on the tools that we have for customizing the API and improving recognition of specific terms, as well as doing things like changing the output formatting. So let's jump in and start talking about speech accuracy. Accuracy is pretty much the most important thing when it comes to speech-to-text. Whether you're doing post-processing on the text to determine someone's intent and extract entities, or displaying the text directly to users, like for captions, a more accurate transcript is going to result in a better user experience either way. There are a whole host of factors that can affect speech accuracy, from background noise to audio quality to different accents or unusual verbalizations, and all of these things can have a big impact. At Google Cloud and Google as a whole, we have tried to make our models as robust as possible, and that means they should work really well right out of the box on a wide variety of types of speech and scenarios. But all speech recognition systems are sensitive to input data. So when you think about accuracy, you really need to be asking: what is the accuracy of this system on my specific data, in my recording environment, with the types of speakers I have and the kind of speech they produce? And to answer that, you have to measure it. And to measure it, we need a common language for talking about quality. In speech-to-text, that's usually word error rate, or WER. WER is not the only way to measure speech accuracy, but it is certainly the most common, and even if you're also looking at other metrics, which can be valuable, they're usually looked at in conjunction with word error rate. Word error rate is composed of three different types of errors that can happen during transcription. The first is insertions: words which are present in the hypothesis (in this case, the machine-generated transcript) but not present in the ground truth transcript. Substitutions are words which are present but are mistranscribed as a different word or formatted in an incorrect way. And finally, deletions are words which are completely missing from the hypothesis transcript but are present in the source audio or ground truth transcript.
You add up all three of these numbers and divide by the total number of words in the ground truth reference, and that gives you the overall word error rate. This means it is actually possible for the word error rate to be greater than 100% in situations where the quality is very poor. To measure word error rate, there are a few steps you can follow, and we've also created a bunch of tools to make it as easy as possible. But it all starts with getting your audio files. Like I said before, if you want to measure how the system is going to perform the way you're using it, you must have in-domain audio that is similar to what you're going to be transcribing. Generally, we recommend having about three hours in this test set, but you can still get statistically meaningful results with as little as 30 minutes of in-domain audio. It's more important that the audio is representative of what you're trying to transcribe than it is to have a ton of it. The second step is probably the hardest of all: getting the ground truth for this audio. This means you need a human-transcribed version of the audio that is 100% accurate, or as close to 100% as possible, to compare the various hypotheses against. The third step is easy: you get your machine transcription by sending the audio to the Cloud Speech API. And finally, you compute the word error rate with our simple WER script, where you feed in the hypothesis and the ground truth, and it will show you what the word error rate is (the computation itself is sketched below). That script, in addition to telling you the number of insertions, deletions, and substitutions, will also give you a pretty-printed HTML report that shows you exactly where each error is and what type of error it is. You can see a sample of that here. So now that you've measured the accuracy on your audio, it's time to think about what you can do to improve that accuracy on your test set and on your wider data. We do this with a set of tools that help customize the model, or the system as a whole, towards the type of audio that we're sending. Broadly speaking, there are three different ways to think about improving speech accuracy through model customization. The first is to customize the model to your domain by providing contextual information. An example of this would be, if you knew that people were going to be talking about pizza or ordering pizza, you could bias towards different types of crust, different types of cheese, different toppings, et cetera. The second is to tweak weights to address specific word or phrase issues. Commonly this comes up with proper nouns, people's names, or individual product names: words which, generally speaking, occur very rarely in everyday speech. The third is to use context that you have to bias towards specific types of information or specific situations. For example, if you had a phone IVR system, you might know that a user is about to tell you an account number or a phone number, and you can tell the ASR system to look only for a digit sequence or an alphanumeric sequence. We support all three types of customization in Google Cloud Speech through our speech adaptation tools. These tools, used correctly, can be extremely powerful and can shift quality very significantly, depending on the situation and the type of biasing you're doing.
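Before moving on to adaptation, here is a minimal sketch of the WER computation described above, in plain Python. This is not the script from the talk, just an illustration of the same arithmetic; the example strings are made up.

```python
def word_error_rate(reference: str, hypothesis: str):
    """Toy WER: count substitutions, deletions, and insertions via edit
    distance, then divide by the number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],   # substitution
                                   dp[i - 1][j],       # deletion
                                   dp[i][j - 1])       # insertion

    # Walk back through the table to attribute each edit to S, D, or I.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    wer = (subs + dels + ins) / max(len(ref), 1)
    return wer, subs, dels, ins


# Made-up example: "a" is dropped and two words are mis-heard, so
# (2 substitutions + 1 deletion) / 9 reference words ~= 0.33 WER.
print(word_error_rate("order a large thin crust pizza with extra cheese",
                      "order large thin crust pizzas with extra teas"))
```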
Before we get into the features of speech adaptation and exactly how to use them, though, I want to talk a little bit about how speech adaptation works and how it differs from other types of model customization. In order to talk about how we customize the model and the results, we need to understand a little bit more about how ASR, or automatic speech recognition, systems work. Now, this is an extremely simplified diagram, and not all speech recognition systems work in exactly this way, but it will help us illustrate the point here. A very simple system takes audio input from a user into what's called an acoustic model. The acoustic model looks at the audio waveforms, or the spectrogram, and converts them into the sounds it thinks are there, also known as phonemes. The language model then looks at the groupings and series of phonemes and tries to work out which words or phrases are being said. Most language models also take context and other factors into account when determining the text which the phonemes represent. Some systems will produce a single output, or perhaps n-best alternatives that could be determined from that grouping of phonemes. Now, let's take a look at the simplified version of the Google Speech-to-Text API. The audio comes into an acoustic model and is converted to phonemes that are sent to the language model, just like we talked about before. There are two differences. One is that instead of producing just a single highest-confidence hypothesis or a series of n-best alternatives, the language model produces an entire lattice of potential word alternatives. The other is the Speech Adaptation API, which allows users to send context and word hints about what the audio might contain. This information is then used to operate on the word lattice and determine, based on these hints as well as the confidences from the language model, what the highest-confidence hypothesis text is, along with n-best alternatives if those are of interest. This differs from other types of model customization, such as a custom language model, where the entire language model itself is changed to account for specific proper nouns or other sequences that might be expected. The great part about speech adaptation is that there's no training or retraining required, so it's much easier to experiment and much cheaper to adapt the model to a variety of different needs. Next, it happens completely in real time, so we're able to compute these changes without adding any latency to the end-to-end ASR pipeline. And because there's no training and it happens in real time, you can have a different context with every single request that you send: one request can bias towards a digit sequence, and the next request can bias towards somebody's name, for example. All of this can be accomplished with the Speech Adaptation API that I was talking about before. The Speech Adaptation API today has three different components or options. First is Phrase Hints. This is the ability to send words or longer phrases which you think may be present in the speech, and the system will do its best to bias towards those words and combinations of words. In some cases, though, that might not be enough, or might not give you granular enough control.
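As a rough sketch of what sending phrase hints can look like with the Python client library: the file name and the pizza-ordering phrases here are illustrative, and a recent version of the google-cloud-speech package is assumed.

```python
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

# Hypothetical audio file: 16 kHz linear PCM, US English.
with open("pizza_order.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    # Phrase hints: words and phrases we expect in this domain.
    speech_contexts=[
        speech.SpeechContext(
            phrases=["thin crust", "deep dish", "pepperoni", "extra mozzarella"]
        )
    ],
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```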
That finer-grained control is what the Boost API is for. The Boost API allows you to specify an actual weight value for a specific word or bigram, and that weight then takes effect on that word or bigram in the word lattice. This is especially useful for things like proper nouns or rare words, because you can significantly boost the likelihood of them being recognized. Finally, we have the Classes feature, which is essentially pre-built phrase hints and boost values for common scenarios, for example an alphanumeric sequence or a digit sequence, so you can recognize those without having to set the values yourself. Coming soon, we'll have some additions to the Speech Adaptation API, namely Custom Classes, which will allow users to create and share their own pre-made classes that can then be used with Boost or Speech Adaptation, the same way the built-in classes can today. This doesn't necessarily make the system any more accurate, but it gives users much greater composability in the way they specify the phrases or words that they want to bias towards. Next is Saved Context, which allows you to save one or many known-good biasing configurations and specify that entire phrase list or boost list with just an ID on subsequent calls. This can be very useful in saving bandwidth overhead for users that need to send thousands or tens of thousands of words in each context. So that gives you an overview of the biasing and model customization tools that we have available within the Google Speech API. We're always trying to improve these, but ultimately, these tools are just how you send the biasing information. The really hard part is figuring out what the right biasing information to send to the API is, and that's what we're going to talk about next. Broadly, some things to consider are: what am I doing with this transcript? When you are transcribing the audio, is the result going into some NLU system where you need to extract an entity? Is it being displayed directly to a user? Are there specific things that the downstream systems are going to be sensitive to? If your goal is to capture a phone number, then you have to be absolutely positive that you're getting that digit sequence right every single time. The next thing to consider is: are there rare words or proper nouns? Rare words and proper nouns are very difficult for ASR systems because, statistically speaking, they occur very uncommonly in everyday speech, so it's less likely that a speech-to-text system will decide that one of them is the word behind a series of phonemes, as opposed to a more common word that sounds similar. This can be complicated, but it's why biasing towards these words very heavily can get them recognized at the rate you want. The next is: what contextual information can I use? That is, what external information outside of the audio itself can I use to figure out what the person might be talking about? Examples of this would be context or state in a chatbot application, or maybe you have a user history, so you know the types of queries the user usually makes and can use that to help increase accuracy on future queries. And that feeds into the final point, which is: do I have strong or weak context?
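To make the per-request idea concrete, here is a sketch of two biasing configurations built with the Python client, one constrained to a digit sequence via a built-in class token and one boosting a rare product name. The boost values and the product name "Fabrikam" are illustrative stand-ins, not values from the talk.

```python
from google.cloud import speech_v1p1beta1 as speech

# Strong context: an IVR turn where the caller is about to read an account number.
# The built-in digit-sequence class token stands in for "any run of digits".
account_number_context = speech.SpeechContext(
    phrases=["$OOV_CLASS_DIGIT_SEQUENCE"],
    boost=15.0,  # illustrative value; tune against your own test set
)

# Rare proper noun: boost it so it can win against common, similar-sounding words.
# "Fabrikam" is a made-up product name used only for this sketch.
product_name_context = speech.SpeechContext(
    phrases=["Fabrikam"],
    boost=20.0,
)

# Because there is no training step, every request can carry a different context:
# this turn biases towards digits, the next one towards the product name.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[account_number_context],  # swap per request as needed
)
```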
So when you think about the contextual information that you have, the question is: do I know exactly what the user is going to say, and am I pretty sure how they're going to verbalize it, or do I just know the broad categories of what this is about? Some examples of strong context would be an IVR, a phone-answering bot scenario, where you know the user is about to give a phone number, or say yes or no, or something like that. Another would be a system for giving commands, or something like a voice assistant, where you know the user is going to say, change the channel, or play songs by some artist. In these systems, you have a very constrained vocabulary, and that can help you bias and increase accuracy on that specific vocabulary. And finally, as I talked about before, there are important words or entities: if you have a proper noun that's very important to transcribe correctly, you're almost always going to need to bias towards it very strongly. Then we have weak context, which is a situation like captions, or dictation, or perhaps a conversation between two or more people in a meeting. These are situations where you don't know exactly what somebody is going to say at any one moment, but you know broadly what they are talking about. For example, this recording is all about speech-to-text technology, and that might give you hints about which words are likely to be said that you can use to increase accuracy on those words. So now that we've looked at the biasing tools that are available, as well as what to think about when you start on biasing and model customization, I'm going to take you through a demo, or really more of a worked example, of what it looks like to do biasing for real. Now, this is going to be a very simple example. We've also created a bunch of content to go along with this Next OnAir session, where you'll be able to try all of this for yourself in a Qwiklab. For the purposes of this demo, I'm going to be focusing on improvement type 2 that I talked about, where we tweak weights to address specific word and phrase issues, especially focused on rare words and proper nouns, as you'll see. So to get started here, I'm going to do basically exactly what I told you not to do before, which is to base this example on just one phrase, one sentence. I don't have a whole corpus of audio here, but the goal is not necessarily to vastly improve this one phrase; it's to show you how I think about the biasing problem and the signals that you can use and apply to a larger corpus when you're doing this yourself or trying it out in the Qwiklab. So I've recorded a single sentence of me speaking here that I'll play for you.

Speaker 2: Hi, this is Callum talking about the Speakotron.

Speaker 1: So you can hear I say my name, a proper noun, as well as a totally made-up word, Speakotron. Now, even with these rare words, like I said, the Google recognizer works very well out of the box, so I actually didn't have any issues recognizing this audio originally. I had to make the problem a little bit harder for our speech recognition system, and to do that, I added some noise to this same audio recording. This is the recording I'll be using for all the remaining examples in the demo, so I'll play that for you now.

Speaker 2: Hi, this is Callum talking about the Speakotron.

Speaker 1: So as you can hear, it's the exact same sentence with some white noise added in the background, which made it so that the ASR system wasn't recognizing it right out of the box. On the right-hand side, I've set up some Python code to try sending this to the Cloud Speech recognizer. I've specified my input file. I'm using US English, since I'm an American English speaker. The file is a basic linear PCM WAV file recorded at 16 kilohertz. I'm not sending any speech context yet, because I just want to see what the bare usage of the system gives us. Next, I need to set up my simple WER script. I'm putting in my ground truth reference here: Hi, this is Callum talking about the Speakotron. I haven't included any punctuation in this case, but you could include punctuation if you're expecting punctuation in the result as well. The hypothesis will come directly from the Google speech transcript that I just set up. And then we will compute the error rate and pretty-print the HTML. So let's take a look at what the accuracy looks like right out of the box on this noisy file. I had a word error rate of 37.5%, driven entirely by substitution errors. We can see it got my name wrong. Not that wrong, but it is spelled incorrectly. But more concerning, instead of "the Speakotron," it was transcribed as "this Ecotron," which is not good. If we really cared about the Speakotron product here, we would not even have captured that this phrase was about the Speakotron. So let's look at what we can do. Using the Phrase Hints API, I could do something very simple, like put the exact right transcription in as a phrase: Hi, this is Callum talking about the Speakotron. Now, this did work. It resulted in a 0% word error rate and the phrase being transcribed perfectly. This is always a good thing to try if you have a single phrase; it basically just shows you that the speech adaptation tools work, and that we really are operating on the lattice and changing things. But it isn't that helpful as you think about improving recognition across an entire corpus, because if you knew what the correct transcription of the phrase was, you wouldn't be sending it to the Speech API to begin with. And sending the whole phrase is also not going to help with other utterances where somebody says, hey, this is some other name, or verbalizes the phrase a little bit differently. So let's think about what we can do to increase recognition without just pushing this one full phrase to be accurate. We could use the Boost API and look at the rare words, like I talked about doing before. Callum and Speakotron here are the two rare words, and by boosting them fairly significantly, I think we can get pretty good results on those words. So let's take a look at how that performed on our noisy version of the audio. I was able to decrease the word error rate substantially just by doing this: it now spells my name correctly, and instead of Ecotron, it has correctly recognized Speakotron. But there is still an error: it thinks I'm saying "this Speakotron" and not "the Speakotron." Now, this is an easy fix on the phrase side, too, but it got me thinking more broadly about what you can do to improve this not just for this one phrase, but for all phrases that talk about the Speakotron, or where somebody says their name is Callum. And one thing we can do with Boost is to boost not just the unigrams that are important to us, but also the bigrams.
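A rough reconstruction of that rare-word boost request with the Python client; the file name and the boost value are assumptions for illustration, since the talk doesn't show the exact numbers used at this step.

```python
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

# The noisy 16 kHz linear PCM recording from the demo (file name is illustrative).
with open("speakotron_noisy.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    # Boost only the two rare words rather than the whole expected phrase.
    speech_contexts=[
        speech.SpeechContext(phrases=["Callum", "Speakotron"], boost=10.0)
    ],
)

response = client.recognize(config=config, audio=audio)
hypothesis = " ".join(r.alternatives[0].transcript for r in response.results)
print(hypothesis)  # compare this against the ground truth with the WER script
```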
Coming back to bigrams: we wouldn't want to boost anything longer than a bigram, because you're just going to significantly confuse the lattice, since longer phrases are never going to exist like that in the lattice. But by thinking about the bigrams in which people are actually going to use these rare words, we can greatly increase recognition. So when somebody is saying their name, they're usually going to say "is" followed by the name: my name is Callum, or this is Callum. And you could also think of other verbalizations. When I thought about Speakotron, I figured people are probably going to say "the Speakotron" or "a Speakotron," basically the article with it. We can use that to boost those bigrams, as well as boosting the unigrams even more strongly, and that's going to get us even better recognition in some of those scenarios. So in this case, I've boosted the bigram and the unigram, and I've chosen boost values of 5 and 10. You'll have to play with exactly what boost value works best for you, but generally speaking, the bigram with the article should be boosted at a lower rate than the unigram, because you wouldn't want every single word after "the" to be transcribed as "the Speakotron." And with that, I was able to get down to 0% word error rate, and we have an accurate transcript. Like I said, this is just working on one single phrase, and you would never get to 0% word error rate on an entire corpus or across multiple variations of this same phrase. But hopefully, this gives you a helpful idea of how to think about these problems and what you can do to decrease word error rate across a whole corpus. So if you are interested in trying this out for yourself, or finding out more information about improving speech accuracy on your data, you can check out our docs. We've recently updated the speech adaptation documentation to include a number of the best practices that I've just talked about here. If you want to try it out for real, we've created a Qwiklab specifically to go along with this content, called Measuring and Improving Speech Accuracy, where you can try out the simple WER script, tune the biasing with boost and phrase hints and classes, and try to achieve the greatest word error rate reduction for yourself. If you're interested in the features I talked about before, Custom Classes or Saved Context, you can apply for alpha access. Calabarnes.google.com is how to get in touch with me. Thanks very much, and I wish you the best of luck with all of your speech accuracy.
