Unlocking Speech Recognition with Google AI
Explore Google's advancements in speech recognition, speech analytics, and new AI features for diverse applications including retail and digital operations.
Recognize Speech Like Google Does: Cloud Speech-to-Text Advanced Features (Cloud Next '18)

Speaker 1: We have a lot to cover with you today. We have a bunch of really cool guest speakers who are going to bring stories, and we're going to focus today on speech-to-text and speech recognition. Speech-to-Text is part of our conversation group of products in our building blocks as part of Cloud AI. We have a very large portfolio of AI solutions, and pretty much all of them are leading in their space, so I would definitely recommend, if you have some time, checking out some of the other sessions about the other products. A lot of cool stuff has been announced yesterday and today at Next. So how many people here were at our Google Cloud Next session last year on speech-to-text? OK, a few of you. It's hard to believe, but last year when we were here, the product was still in beta. We actually only announced GA, along with long-form recognition, just over a year ago, in April 2017. The product has gone through so much since then. We've added 30 new languages, so we're now at 120 languages, which is mind-boggling, and a bunch of other features like timestamps, punctuation, and the new models, which we'll talk about later. There are two main sets of use cases that we address: speech for human-computer interaction, and speech analytics. Human-computer interaction is typically one person talking to a computer and the computer reacting to it, whether that's voice search or voice actions or things like that. Speech analytics is analyzing conversations between different humans; it could be a phone call, it could be a meeting. So we're going to structure the conversation today around those two sets of use cases. Today we're introducing four new beta features, or actually we introduced them yesterday: language ID, multi-channel recognition, speaker diarization, and word-level confidence. We're going to talk more about each of them, so we'll come back to them, but they all make Speech-to-Text much more useful and practical for the various use cases it serves. So let's focus on the first set of use cases first: human-computer interaction, when you want to use speech recognition to power different apps or solutions. There, the new feature we're introducing is language ID. As you can see on the right in that screenshot, for Google Search, if you use voice search, you can select multiple different languages. It's not a radio button; it's more like checkboxes, and you can check as many as you want. Then when you speak to it, it can figure out which language you're speaking, and it returns the transcription in that language. We're now making that available in the API. You just send multiple language codes, we figure out which one is the right one, and we return the transcript in that language. It works pretty well, but it's still not perfect. So if you actually know the language code in advance, you're always better off just supplying it. But if you don't know, and you have multiple candidates, it's going to do a best-effort job of figuring out which language it was. It's currently optimized for voice search and voice commands, so we recommend using it for those kinds of use cases for now.
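For reference, a minimal sketch of requesting the language ID feature described above, assuming the google-cloud-speech v1p1beta1 Python client; the bucket path is a placeholder.

```python
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    # Primary language plus up to three alternatives to pick from.
    language_code="en-US",
    alternative_language_codes=["es-ES", "fr-FR", "de-DE"],
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/voice-command.wav")  # placeholder

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # The language that was actually detected is reported per result.
    print(result.language_code, result.alternatives[0].transcript)
```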
And then for speech analytics, the second set of use cases, that's where the other three features we're introducing today are focused. So let's talk a little bit more about those. Speech recognition, for people who are new to the space, sounds pretty uniform: you're trying to recognize what someone said. It sounds like a consistent set of problems you're solving, but the problems can actually be quite different. Look at recognizing a video versus recognizing a phone call versus recognizing a command. An example of a video could be a basketball game, where you have the sounds of the different players playing, the sound of the actual basketball bouncing on the ground, the TV hosts who keep interrupting, and advertisements that are playing. The number of speakers could be four or more; it could even be 20 or 30 or 40, depending on how many advertisements there are. There could be lots of background noise, and it could be many hours long. If you look at phone calls, it's usually going to be a handful of speakers, and the background there is going to be more static from the phone line, or the compression that's used, or a narrower sampling rate, like 8 kilohertz. And it's usually going to be a few minutes long. Then if you look at a command, it's usually just one speaker and a few seconds long. There could be a lot of background noise, but it's very different in nature: it's going to be other people talking, or music, or other things playing in the background. A lot of Google's speech recognition efforts were originally focused on the right-hand side, on voice searches and voice commands, but one of the top requests we got from cloud customers was to solve more of the problems on the left. That's where, in April, we introduced our new models. We now have four models that you can choose from: phone call, video, voice command, and default, and each is tuned to perform really well for that set of use cases. So the voice command model is tuned to filter out all of the other background speakers, because it's built to recognize one person, and it assumes everything else that's not the primary speaker should be filtered out. Our phone call model, on the other hand, is built to recognize everyone talking in the phone call who sounds like they're part of the conversation. So they're built with different goals. Now, our default model has been around for a year or two, and we've gotten consistent customer feedback that it's the best in the market, better than any other vendor out there. But that was never our aspiration. We want it to be not just better than others, but actually good. And that's where we came out with these new models, and here's what we were able to do. The video model has 64% fewer errors than our default model, which was already considered best in class in the industry, so you could argue that it's three times as good as anything else out there. And the new phone call model has 54% fewer errors than our previous model, so twice as good as what existed before. We're very excited about these new models and all of the new use cases they open up, because now that you can do speech recognition for these mediums at good quality, there's a whole set of analytics you can do that you couldn't do before.
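A minimal sketch of selecting one of the models described above, again assuming the v1p1beta1 Python client; the audio URI is a placeholder.

```python
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,           # narrow-band audio, typical of phone lines
    language_code="en-US",
    model="phone_call",               # or "video", "command_and_search", "default"
    use_enhanced=True,                # opt in to the enhanced model where available
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/support-call.wav")  # placeholder

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```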
OK, so today we're introducing two new features in this space: speaker diarization and multi-channel recognition. When you have a conversation between multiple people, our recommendation is to record them separately, in separate channels. There are two reasons for that. One is that the accuracy of the transcription is going to be better, because even for humans it's easier to understand each speaker separately than when they're talking on top of each other. The other reason is that it's also easier to separate the speakers, because you don't need to do anything algorithmically; you can just look at who spoke on each channel. So if you do have the speakers separated, that's where we're introducing multi-channel recognition, which can take an audio file that has multiple channels and separate the different channels, and therefore the speakers. Unfortunately, not everyone has the luxury of having the audio separated into channels. So if you only have a mono channel and you still want to recognize multiple speakers, that's where we're now offering, in beta, our new speaker diarization. It can separate multiple speakers in an audio file, but it's obviously not going to be perfect, right? It's going to do a best-effort job, because it's separating speakers with machine learning in software rather than through external separation. Either way, we don't recognize the identity of the person, and we don't store the information beyond the API call, so there's nothing here that's related to identity. I'm going to show you an example architecture of how this could be used in speech analytics, and then we'll do a little bit of a code lab. You'll also see another example of it later from one of our guest speakers. You could imagine a setup where you have audio coming in from a phone gateway. A lot of these phone gateways were built many years ago and don't really have the ability to separate audio into different channels, so it comes out as mono. Then it goes to Cloud Speech-to-Text, which does the transcription and the diarization. And then if you want to do cool things like, for example, analyzing sentiment, you really want to analyze each speaker separately, right? Because what matters is not the average of both, but each one as they were talking. So you can analyze the customer and the agent separately, then feed each one into an analytics engine like BigQuery and run different analytics on that data. Last but not least, one other feature we're introducing is word-level confidence. It's been one of the top requests from our ISV customers, our more sophisticated customers. What it does is give you a confidence score per word. Up until now, as you can see at the top, we had confidence scores that were based on segments; now you get confidence scores for individual words, so you can take more action on them, like re-prompting the user. OK. So I'm going to show you how to actually do this in Python. Bear in mind, I'm not a very good Python programmer, so let's do this together. Hopefully it will work; if not, we'll solve it together. Actually, before we do that, let me just show you the sample file we're going to use. So this is the sample file. Let me play it. I hope the audio is connected. Hi. I'd like to buy a Chromecast. I was wondering whether you could help me with that.

Speaker 2: Certainly. Which color would you like? We have blue, black, and red.

Speaker 1: Let's go with the black one.

Speaker 2: Would you like the new Chromecast Ultra model or the regular Chromecast?

Speaker 1: Regular Chromecast is fine. Thank you.

Speaker 2: OK, sure. Would you like to ship it regular or express?

Speaker 1: Express, please.

Speaker 2: Terrific. It's on the way. Thank you.

Speaker 1: Thank you very much. Bye. OK. So that's the audio file. It's transcribing it now; it should be done in a second, and then we're going to see the results in four different models: the default model, the phone model, the command model, and the video model, and the video model is the best one for this file. While it's working, what I'm going to do next is go to our docs, and I'm going to go to the speaker diarization example. So it's this one here, in Python, and let's view it on GitHub. OK, there it is. OK, so let's look at the video model. You can see this is the transcription, and the transcription is reasonably accurate. Most of the time, it's good. Like here, you can see: "Or let's go with the black one. Would you like new Chromecast or regular? New Chromecast or regular Chromecast? Regular. Chromecast is fine. Thank you. OK, sure." So you can see it's working most of the time. So let's go to our code editor, and we'll open this code snippet we saw on GitHub, and let's try to run it: python beta_snippets.py diarization. Let's see if this works. What this should be doing is the same thing we saw on the web: transcribing the file and showing the results, as well as the diarization of the results. And then, OK, here it is. You can see there are all of the words, and each word has a speaker tag. OK, so what I want to show you is how to actually get a stream of text that's different for each speaker, then take that stream of text, do some sentiment analysis, and print the scores. So before we do that, let's replace the speech file with a better one: dhan, documents, code... no, I think it's audio. This is the name of it, .wav. OK, then it's going to open the file. Oh, sorry, this is the wrong one; this is the transcribe-file-with-metadata one. Here it is, diarization. OK, we're going to use this file. It's going to read the file, then the recognition config. We don't need the sample rate, it'll recognize it automatically; it enables diarization and sets the speaker count. Let's also add automatic punctuation. All we need to do is just add this, and it will start punctuating. And let's also use the video model. Here there's an example with a model, so let's take this and add the video model here. OK, great. And now, instead of printing all the words, let's do this. Let's have a speaker 1 transcript and a speaker 2 transcript, and initialize them. Then let's check: if the speaker tag equals 1, we take the speaker 1 transcript and add this word to it, plus a space. And if it's the second speaker tag, we do the same thing with the speaker 2 transcript. Then let's print transcript 1 with the first speaker's transcript, and do it again with the second speaker's transcript, transcript 2. OK, let me save. There are probably going to be a bunch of errors, but let's work through them together. Hopefully I edited the right file; we'll see in a second. What we're expecting is to see transcript 1 with all of the words that the first person said, and transcript 2 with all of the words that the second person said. OK, looks like it's working. So far, so good, guys.
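A sketch of roughly what the live-coded snippet above ends up looking like, assuming the v1p1beta1 Python client; the local file name is a placeholder.

```python
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

with open("chromecast-call.wav", "rb") as audio_file:   # placeholder file name
    audio = speech.RecognitionAudio(content=audio_file.read())

config = speech.RecognitionConfig(
    language_code="en-US",
    model="video",                       # the model that worked best for this sample
    enable_automatic_punctuation=True,
    enable_speaker_diarization=True,
    diarization_speaker_count=2,
    # If the call had been recorded with one speaker per channel, multi-channel
    # recognition (audio_channel_count=2,
    # enable_separate_recognition_per_channel=True) would be the better option.
)

response = client.recognize(config=config, audio=audio)

# With diarization enabled, the last result carries the full word list,
# and each word carries a speaker tag.
words = response.results[-1].alternatives[0].words

speaker1_transcript = ""
speaker2_transcript = ""
for word_info in words:
    if word_info.speaker_tag == 1:
        speaker1_transcript += word_info.word + " "
    elif word_info.speaker_tag == 2:
        speaker2_transcript += word_info.word + " "

print("Transcript 1:", speaker1_transcript)
print("Transcript 2:", speaker2_transcript)
```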
OK, so now let's look at the example for analyzing sentiment. This is using the Cloud Natural Language API. I'm going to go to this example on GitHub and copy some code lines from it. So first, in terms of imports, it needs all of this import stuff, so let's copy it over and put it here at the top. And we'll see argparse is a duplicate, so we don't need that one. I think we still need the other one. Oh, sys and six, yeah, they're different. OK, so we have the imports. And now, sorry, this is probably too small. OK, so let's copy this piece that does the sentiment analysis and paste it in here. We already have a client, so let's call this one nl_client, and nl_client here too. And then we don't need this instance thing: document, content. OK, so we have two different ones. Let's call this one document1, and content equals, how did we call it, speaker 1 transcript. And then document2 equals the speaker 2 transcript. Then we'll have two sentiments: sentiment1 will be nl_client analyze sentiment on document1, and sentiment2 will be nl_client analyze sentiment on document2. OK, and then let's print the sentiment for the first speaker, score and magnitude, and then let's add the sentiment magnitude, and I think this one would be sentiment1. And then we have another one for sentiment2, the second speaker. OK, let's take all of this stuff from here and paste it here. OK, so now you can see I'm not faking it; this is real code. OK, transcript 1, transcript 2. And there are the scores, guys. Cool, so this is a good example. If I can do it, really anyone can do it. I really welcome you all to try it and combine it with whatever other APIs you want. And we definitely look forward to hearing your feedback about all these new features. We want to make them better over time, and the more feedback you can share, the better. OK, so with that, let me hand over to Nikolay from LogMeIn, who's going to take it from here.
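A sketch of the sentiment step shown above, assuming the google-cloud-language client; the two transcript strings stand in for the per-speaker transcripts built in the previous snippet.

```python
from google.cloud import language_v1

# Per-speaker transcripts, as produced by the diarization snippet above.
speaker1_transcript = "Hi, I'd like to buy a Chromecast. Let's go with the black one."
speaker2_transcript = "Certainly. Which color would you like? We have blue, black, and red."

nl_client = language_v1.LanguageServiceClient()

document1 = language_v1.Document(
    content=speaker1_transcript, type_=language_v1.Document.Type.PLAIN_TEXT)
document2 = language_v1.Document(
    content=speaker2_transcript, type_=language_v1.Document.Type.PLAIN_TEXT)

sentiment1 = nl_client.analyze_sentiment(document=document1).document_sentiment
sentiment2 = nl_client.analyze_sentiment(document=document2).document_sentiment

print("Sentiment for first speaker:  score={:+.2f}, magnitude={:.2f}".format(
    sentiment1.score, sentiment1.magnitude))
print("Sentiment for second speaker: score={:+.2f}, magnitude={:.2f}".format(
    sentiment2.score, sentiment2.magnitude))
```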

Speaker 3: My name is Nikolay Avrionov. I'm the lead architect for GoToMeeting and GoToWebinar, the collaboration products at LogMeIn. Every month, millions of customers rely on our products to do their daily jobs. As you can see, we have a lot of audio minutes, a lot of meetings. And early on, we realized that the meetings have value before and after the meeting itself. Customers need these meetings to write notes, to reference them later, to run their business. So what we did is introduce recordings. And after we introduced recordings, we started to look at how to make these recordings even more useful, and the obvious step was to add transcription. So we started to look for a transcription provider, and we selected Google Speech-to-Text. I'm going to talk about the process we went through to select Google, but first, I'm going to show you the UI. So after the meeting, you get this page where you can share the results of your meeting. And we have a fully searchable UI where you can search every participant, everything they said, and when they said it, and you can click on specific pieces. So you don't have to wait for the entire meeting and watch it anymore. So let's move to a demo. This was a meeting that Dan and I had a few weeks ago, just to go over a few things. So let's say we discussed things like a conference, web conferencing. You can see I can find every instance of the word "conference" in the meeting, and I can just replay it from here, so I don't have to wait for that portion. It's only in English, and there is also a translation. Yeah, so we were discussing this exact presentation. So what was the process for us to select a speech-to-text provider? We were not a Google client at that time. We looked at all the major speech-to-text providers. We built our own test cases specifically for what we do, because we wanted to have the best provider for our use case. And what we selected is the Google video model. We actually started to work with it when it was still in beta, and we released our feature one month after the video model went into production. So the takeaway here is: really test and find the best provider for yourself. Providers usually quote accuracy numbers, but what's important is what works for you. How is it implemented? Before this, we were mostly an Amazon shop, so we have our real-time infrastructure on Amazon and on-premises, and we had to connect our own infrastructure with Google. So currently, we have a VPC between us and Google, and we combine several data sources. We take all the events that happen in the meeting, when somebody speaks, when they stop and start, and the video streams, and we send the audio to Google for transcription. The combined result is what you saw on the screen: we can identify every speaker down to the word, and we can click on that word and play from there. I'm showing you a few code samples, probably nothing surprising for you. One thing we decided is to build a set of services, with transcription and the parsing of the transcripts as services, so we can change the transcription model or even change the provider, and support multiple transcription results. So we have this layer between us and the providers, and we can switch them at any time if we have to. The code samples are pretty easy. One thing that I want to emphasize is that it's really important to handle all the error cases.
Because sometimes, for something like speech-to-text, you can see way more errors than with a regular product. You'll have a recording where nobody speaks, or where they start and stop really quickly. So initially, you'll see a lot of problems that are not problems with the Speech-to-Text API, but just with how customers use the product. OK. The results were really positive. We are getting a lot of positive feedback from customers on Twitter and on other channels. We saw that usage of the product is actually going up, because customers are relying on the meeting being recorded and transcribed. Sometimes they tell us that they don't even need to take notes anymore; they know this is recorded for them, and they just go and review what people said. What we're looking at in the future: we're continuing to work with Dan and his team on better accuracy for accents. When we started to test it, it worked very well for native speakers, but it really had problems with English spoken with different accents. Even accents like Irish were not transcribed as well. So we're looking forward to that. Multi-language support, that's obvious: we support 16 languages, and we want to bring all of those languages to the product. And speaker diarization, this is very important for us, because a lot of meetings happen from a conference room, and we cannot separate the speakers using just our infrastructure; we have to use diarization. We started to experiment; we've been using the beta. And again, we are applying the same approach: we are going to build our own test cases, test it, and understand what works better for us. Thank you. Thank you, guys.
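The talk doesn't show LogMeIn's actual code, so the following is only an illustration of the kind of provider-agnostic transcription layer and tolerant handling of empty results described above; all names here are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class TranscriptionProvider(ABC):
    """Thin abstraction so the transcription vendor can be swapped out."""

    @abstractmethod
    def transcribe(self, gcs_uri: str) -> List[Dict]:
        """Return a list of {'text': ..., 'start': ...} segments."""


class GoogleSpeechProvider(TranscriptionProvider):
    def transcribe(self, gcs_uri: str) -> List[Dict]:
        from google.cloud import speech_v1p1beta1 as speech

        client = speech.SpeechClient()
        config = speech.RecognitionConfig(
            language_code="en-US",
            model="video",
            enable_word_time_offsets=True,   # needed to align words with meeting events
        )
        audio = speech.RecognitionAudio(uri=gcs_uri)
        operation = client.long_running_recognize(config=config, audio=audio)
        response = operation.result(timeout=3600)

        segments = []
        for result in response.results:
            if not result.alternatives:
                continue
            alternative = result.alternatives[0]
            # Empty transcripts are common (silence, recordings that start and
            # stop immediately) and are not provider errors.
            if not alternative.transcript.strip():
                continue
            start = (alternative.words[0].start_time.total_seconds()
                     if alternative.words else None)
            segments.append({"text": alternative.transcript, "start": start})
        return segments
```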

Speaker 4: How are you doing? My name is Bob McKinney. I'm a speech analytics expert, and I'm going to talk to you today about a really exciting project that I worked on for the last year. It's around retail. Basically, if you've ever worked in retail, you understand that retail stores get a lot of phone calls, from price and availability to just checking warranties, things like that. So one of the projects that I worked on over the last year was for two national brands, a year-long project. We analyzed over 360,000 retail store calls, which translates to about a million minutes processed. The goal of the project was to focus on phone call conversion, and this is a big deal, because if you use AdWords or Bing Ads or any kind of ads out there that have location extensions on them, or click-to-call, or Google My Business ads, you really want to be able to track the performance of those ads. That was the focus of the project: to say, hey, what is the conversion of those calls? So we used the Speech-to-Text APIs from Google, speaker diarization, the enhanced models, the phone model (the model we ended up picking), natural language processing for sentiment analysis and entity analysis, and then ultimately connected it all back to the internal systems of the two brands using their internal messaging bus. Google has a pretty good version of this, Pub/Sub, but in this case we actually used Microsoft BizTalk. Just a little bit about the architecture here. The partner we decided to go with for encapsulation of the calls was DialogTech, a company out of Chicago, great company. Basically what happened is a customer would call the store, and DialogTech would make a POST call to a Cloud Function that would trigger off a number of events, one of those events being downloading the WAV file from DialogTech into Google Cloud Storage, which allowed you to do all the things that Dan showed you in the demo. You can call into that Cloud Storage bucket from Speech-to-Text, Natural Language processing, and a bunch of other places. Working through that, some of the things we were looking for in the output in BigQuery were fuzzy string matching and n-grams, just the frequency of words, so demand over time. For a particular product, you can watch the demand over a year period to say, hey, is it ramping during this season or that season, or is it declining? And that's a really big deal. Using BigQuery and Python and some of the packages in Python allowed us to do that. The biggest thing, though, that helped out the most was definitely speaker diarization, which allowed us to separate out the customer from the associate at the store. This allowed us to do a lot of things, from A/B testing on the associate side for training, to n-grams and fuzzy text matching on the customer side. So we could say, hey, in this particular call we're looking for phone calls where they mentioned the iPhone 7 and whether it was a broken screen or a needed battery replacement.
In this case, we wanted to say, hey, the broken screen, we add that to our list for n-gram frequency. Then we can also go through that using the call's caller ID and match it against the internal system using that message bus to say, hey, did that phone call actually convert into a sale? And if you're getting thousands and thousands of calls a day, you can actually figure out the value of those phone calls based on that conversion rate, which I think is pretty revolutionary. So I encourage anybody that has this kind of problem to just keep working through these APIs, and you'll find a solution. Some of the other things that we could do around that are entity and sentiment analysis. In this case, you can see the consumer good being an iPhone 7, and you can see the sentiment score is negative 0.4, which is negative, right, because they broke the screen, they're probably unhappy about that, and they're just trying to get it repaired. You can see that in the entity and sentiment analysis. Why did we choose Google Cloud Speech-to-Text? The accuracy of the transcription with the phone model: we benchmarked it against some of the major competitors out there, Amazon, IBM Watson, and the Microsoft products, and we just found consistently that the transcription accuracy from Google Speech-to-Text was significantly better than the other options out there. The diarization reliability, just being able to distinguish between the users, was a big deal for us, right, between the associate and the customer, so you could do training for the associate and conversion analysis on the customer. Ease of use: we had a lot of services already built using Google Cloud services, so it was really easy to connect them all, using a virtual private network between Google Cloud Platform and the internal network, and connecting to BigQuery and Cloud Functions. All that combined made it a no-brainer to go with the Google Cloud Speech-to-Text product. So I'm going to hand it over to Mahesh, and he's going to talk a little bit about what he's working on.
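The pipeline itself isn't shown in the talk; as a small illustration of the n-gram counting on the customer side described above, with made-up example transcripts:

```python
from collections import Counter


def ngrams(text, n=2):
    tokens = text.lower().split()
    return zip(*(tokens[i:] for i in range(n)))


# Hypothetical customer-side transcripts pulled from diarized calls.
customer_transcripts = [
    "hi my iphone 7 has a broken screen can you repair it today",
    "do you carry the iphone 7 and how much is a battery replacement",
]

bigram_counts = Counter()
for transcript in customer_transcripts:
    bigram_counts.update(ngrams(transcript, 2))

# Bucketing these counts by call date (e.g. in BigQuery) gives demand over time.
for gram, count in bigram_counts.most_common(5):
    print(" ".join(gram), count)
```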

Speaker 5: Thanks, Bob. I'm Mahesh Balaji. I lead the cognitive computing and data sciences lab at Cognizant. Cognizant is a leading IT services provider that helps clients across three lines of service, one of which is digital operations. Among other things that we do in digital operations, we also maintain and manage service desks for customers, which brings us to the case I'm going to talk about today. Late in Q2 of last year, I got associated with this engagement. The client is a retail chain based out of the Australia-New Zealand region. The typical challenges: there were pretty high call volumes and long wait times. It's primarily a 24/7/365 setup where concurrent calls averaged around 20, daily peaks would hit the high 25s, and seasonal peaks would hit anywhere from 45 to 50. So we were creating on average 75,000 unique tickets, and if you add the repeat calls and the overlaps, we were hitting 100k easily. We had around 1,500 scenarios, or cases if you want to call them that, that we covered as part of this engagement, because of which it took a lot of time to onboard agents, even though we bucketed common scenarios and trained agents accordingly. It did take a lot of time for us, and attrition was not helping the situation. It was fully human-dependent, so there was a very high cost to it, and because of all this, our CSAT score was pretty average. So, to give a view of the process flow itself: users across 3,000 stores call in for issues like a printer cartridge not working, or "the freezer is broken, please tell me what to do." That gets to the IVR system. The IVR system forwards the call to whichever agent is free. The agent listens to it, validates the user's request first, then understands the context and provides the response. So the user and the agent did most of the work there. Fast-forwarding to what we did: we still have the human, but he's actually handling only the exceptions. We replaced the human handling all of it with a bunch of boxes. So now the user calls in. From a user perspective there's absolutely no change: they still call a toll-free number, and that still hits the same IVR. Instead of going to the agent, it goes to the integration framework, abbreviated as IFX there, which does the orchestration going forward. So we send the audio to Google services, get the voice clip transcribed to text, forward that to Dialogflow, match the intent, get the response, turn it back into speech, and give it back to the caller. And this flow keeps going back and forth until the system feels it has to either escalate it out or the issue is resolved. Once we put this in place, we immediately saw significant benefits. The average call time dropped from four minutes to one minute, and because of that, the call wait time came down by a straight 20%. The human is still there in the loop, but he's handling the exceptions; he's not doing the bulk of the work. Because of that, the overall operational costs came down, the quality of service was consistent and better, and the CSAT score naturally jumped up. So, if I was going at 2x speed, it's because there's a clock running here and I don't have time; I still have to give you guys some Q&A time. So that's all Dan's problem, not mine. And what did we learn from all this? I thought this is more important; this is where I want to spend time, rather than on the previous context. So: don't bother pre-processing the audio. We went through the exercise of doing that.
As engineers, we jump to the conclusion that we have to apply filters for noise reduction, we have to do this and that. We don't have to do that. As a matter of fact, it's counterproductive. If you apply a lot of digital signal processing algorithms up front, the output confidence score is lower; that's what we figured out. So if you have a proper audio clip, just send it, because the API already does most of that work for you. And of course, the second one was phrase hints. This helped us a lot in improving the overall score, given the fact that it can take up to 500 phrases, and if they're domain-specific, it did a phenomenal job of making sure the text gets transcribed as accurately as possible. And format, sampling, and encoding, and I would like to add the model here as well, now that there are three new models being introduced: this is very important. When we started off, again, the engineering hat kind of comes on and we say MP4 is a better format, it's compressed, we can move data around quicker, and all that. But that's a lossy format, right? You're better off compromising on certain items and picking a lossless format and lossless encoding, so that the output quality, the final intention of having the audio clip transformed into text, is better. That's what I would actually suggest. And I'm super excited about the next one, which is word-level confidence, specifically for the case I was talking to you about. It helps big time, right? Most of the conversations are short, and you don't have to go by a full-sentence or phrase confidence; you can go by the confidence given for each word, or pick and choose the combination, to make sure the dialogue or the conversation goes forward. Of course, all the new things that got introduced give a very, very natural interaction with the user. It's not IVR-ish anymore, and that's a big difference if you're counting CSAT scores. If you've ever been on the other side and you get an IVR-ish response, I'm sure all of us hate that, right? So that has definitely come down with the new set of models that's been introduced by Google.
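A minimal sketch of the lessons above (lossless encoding, phrase hints, and word-level confidence) on a single request, assuming the v1p1beta1 Python client; the phrases and the URI are placeholders.

```python
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,   # lossless
    sample_rate_hertz=16000,
    language_code="en-US",
    model="phone_call",
    # Domain-specific phrase hints (up to 500 phrases per request).
    speech_contexts=[speech.SpeechContext(
        phrases=["printer cartridge", "walk-in freezer", "point of sale"])],
    enable_word_time_offsets=True,
    enable_word_confidence=True,
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/service-desk-call.wav")  # placeholder

response = client.recognize(config=config, audio=audio)
for result in response.results:
    for word in result.alternatives[0].words:
        # Per-word confidence lets a short reply be accepted or re-prompted
        # without relying on a whole-utterance score.
        print(word.word, round(word.confidence, 2))
```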
