Innovation in Speech Recognition and AI Technology
Explore advancements in Google's speech recognition, AI applications, Text-to-Speech, and their integration in various products and business solutions.
Optimizing Voice Commands and IVRs to Speech Analytics (Cloud Next 19)

Speaker 1: So speech at Google, it's been a long journey. We originally developed voice search for Google Search, I think over 15 years ago. And then we took that technology and brought it to other products, like Google Maps for search and navigation, text-to-speech for navigation, captions in YouTube, and Google Assistant. And so we invested all that effort into building a robust speech engine that can work across a large number of languages and doesn't need as much tuning as the technology that existed at the time, all based on machine learning. And then we took that technology three years ago and brought it to the cloud for the first time with Cloud Speech-to-Text as part of our developer products. Can we go back to slides? Yeah, it's OK. So Speech-to-Text and Text-to-Speech are our developer products. You can use them to power all the applications you build with AI technology. And then last year, we introduced our business solutions, which are a higher level of abstraction: Dialogflow and Contact Center AI. There's going to be a session about those tomorrow, so you're welcome to check it out. They take some of the technologies we expose in Speech-to-Text and Text-to-Speech and apply them to specific use cases, like chatbots and voice bots and the contact center, so you can use them in more places. So as I mentioned, we have two products in cloud with the raw speech capabilities that are leveraged everywhere else. Speech-to-Text is available in 120 languages, far more than any other speech service available today. And it provides recognition in real-time streaming as well as batch. One of the things that's special is that it performs really well in noisy environments. We'll see if I regret saying that when we actually do some demos later. But typically, it's trained on a lot of noisy data, so it performs well even if there's traffic in the background or people speaking or whatever. And Text-to-Speech is the only production service out there with WaveNet. We'll see later why that actually matters. It's available in 21 languages, with more coming later, and it accepts input as plain text or SSML. So since we first came out with Speech-to-Text, there's been a lot of new development in this space. As you can see here, we took it to GA. We improved accuracy. We improved the speed. And then we introduced a bunch of new features like language auto-detect, word-level confidence, and timestamps. Then in February, two months ago, we came out with a bunch of new launches, and I'm going to talk a little bit more about them next. So we wanted to make our speech products more accessible, and we're doing that in three ways. One, we're making the premium models that were only available to a select group now available to everyone. Two, we're reducing prices on some of our products by up to 50%. And three, we're increasing language support, especially for Text-to-Speech. So a little bit on the premium models. As I mentioned, the core speech recognition technology in cloud was based on the technology that Google developed for other products. But what we quickly found out is that cloud developers weren't just applying it to the same use cases as Google Search, Google Assistant, or other Google products, but actually to a lot of enterprise applications: transcribing phone calls for IVRs, indexing videos, adding subtitles to videos, transcribing meetings, and all these other use cases.
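As a concrete companion to the developer products just described, here is a minimal, hedged sketch of a batch transcription call using the Speech-to-Text Python client library. The talk's own demos use curl and Node.js; the file name, encoding, sample rate, and language code below are illustrative assumptions, not values from the session.

```python
# Minimal sketch: batch transcription with Cloud Speech-to-Text.
# Assumes the google-cloud-speech package is installed and credentials are set up.
from google.cloud import speech

client = speech.SpeechClient()

# Illustrative inputs; replace with your own audio and settings.
with open("sample.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    best = result.alternatives[0]
    print(f"{best.confidence:.2f}  {best.transcript}")
```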
And so we invested specific effort in improving these cloud models. And this resulted in our premium models, which were launched in beta in, I think it was, April last year. So exactly one year ago. These premium models improve performance by 64% and 62%. This is a percentage reduction in error, so you could argue it's 2 to 3x better than what existed before for these specific media, because there was a concerted effort in these areas. When they launched, the enhanced phone call model was based on data from opt-in customers in cloud that gave us permission to use their data to improve the models, so we only made it available to others that also contributed to that program. And now we're opening it up further. I'll talk a little bit more about that. But first, let's switch to the demo so I can show you guys. We get a lot of questions from people about... oh, I think it's showing the wrong screen. Do I need to change something? Actually, maybe I can drag it.

Speaker 2: That's the other demo.

Speaker 1: No, that's the wrong one. Switch back. Yeah, maybe if I drag my screen over here, does that work? Sorry, guys. Let me change my display settings so that it's mirrored. OK, I think it's showing now. So a lot of people ask us, they tell us, we already use speech recognition, and we know that Google's speech recognition is better, but what we have is good enough for us. What's happening is speech recognition is used in more and more use cases today, and the bar is actually getting higher. So you can see here, this is an example of a conversation with a voice bot where you're getting 100% accuracy. And you can see, you can build a very delightful experience for users when everything is working properly, right? So this is an example of someone ordering a burger, and the voice bot responding, and everything works great. But then if you go down to a lower level of accuracy, even at 90%, which for harsh conditions is actually considered pretty good, that means in an utterance with 40 words, you'll get four of them wrong. And there's a very high likelihood that some of those four words will encompass things that you actually really care about. And the bot will fail, which will leave the users a little frustrated, because they actually need to repeat stuff that they've already said before. So it's not a great user experience. And this is even at 90%. But then imagine what would happen if this goes down a little bit further to 80%, right? You'll see that in a second. It becomes sort of inoperable, right? So if you had delivered 80% or 90% accuracy for recognizing speech a few years ago, people would have been pretty happy. But now there's a higher bar for this. By the way, this also applies to Google products, right? With Google Assistant, the bar for voice recognition is much higher than it used to be when we were doing voice search. OK, let's go back to the slides. OK, so the premium models for better speech recognition: the phone one was only available for people that shared audio. Now we've made it available for everyone. And then people that enable data logging as an opt-in feature are now all getting discounted usage. So everyone's getting a benefit, whether they share data or don't share data. We're here to provide the best service either way. Cool. Let's talk a little bit about Text-to-Speech. So we introduced Cloud Text-to-Speech, which is based on WaveNet, last year. You can see here a survey we ran on the quality of the voices. What you see here is four different voices. The left column shows the standard voice, the middle is the WaveNet, and the right one is human. So if you look at the yellow one, for example, which is the third, the standard is 3.65, the human is 4.59, and you can see WaveNet is 4.28. So it's really, really close. Let's switch to the demos. So if we go here and we type something like, hi to the audience at the session of Google Cloud Next. And now let me play this to you guys with the standard voice. Hi to the audience at the session of Google Cloud Next. So you can hear what it says very clearly. You can understand every single word, but it sounds a little robotic. Hi to the audience at the session of Google Cloud Next. And now if we switch to the WaveNet equivalent. Hi to the audience at the session of Google Cloud Next. So you can see it's worlds apart.
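As a rough companion to the standard-versus-WaveNet comparison in the demo, here is a hedged Python sketch that synthesizes the same greeting with a WaveNet voice through Cloud Text-to-Speech. The voice name and output file are assumptions; swapping in a "Standard" voice name would produce the more robotic-sounding baseline for comparison.

```python
# Sketch: synthesize the demo greeting with a WaveNet voice via Cloud Text-to-Speech.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(
    text="Hi to the audience at the session of Google Cloud Next."
)

# "en-US-Wavenet-D" is one of the published WaveNet voices (assumed here);
# a "Standard" voice name would give the more robotic-sounding baseline.
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D",
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("greeting.mp3", "wb") as out:
    out.write(response.audio_content)  # play this file to hear the result
```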
And when you use this for long text and play it, I sometimes can't tell the difference. It sounds to me like an NPR news broadcaster, right? It's getting pretty close to what you would hear on the radio. Back to the slides. So WaveNet is really cool, but the problem is it's hugely compute intensive. In order to generate a WaveNet sample, for every single second of audio, we need to run 22,000 inferences on a very heavy neural network. And that was a huge engineering challenge. Originally, it took a long time to do. But then our team spent a pretty significant effort figuring out how to optimize it, and the Google team came up with a way to do it 1,000 times faster. And this is building on the enormous compute power we have with TPUs, Tensor Processing Units, and other ML-specialized hardware. It lets you run much faster and much better, and we were able to do it at a scale that is not possible elsewhere in the industry. And then just a few weeks ago, we announced a huge expansion of our voices. We've now doubled the number of WaveNet voices. If you look back to April last year, we had six WaveNet voices, only in US English. We now have 57 WaveNet voices, I think, in pretty much all of the languages you see here, and there are going to be more coming later in the year. So we have full coverage; it's by far more than anything else that's out there. We're very excited about WaveNet and its future. And then to wrap it up, the other announcement we had was that a bunch of features went GA, which means general availability. That means they're covered under our SLA and are considered production ready. So Text-to-Speech as a product went GA last year. And then audio profiles, which let you tune the text-to-speech audio to the medium, and a bunch of these Speech-to-Text features: model selection, our premium models, multi-channel recognition, audio logging. So to make this interesting, we wanted to talk to you a little bit about new features that are about to come soon. Now, this is not something we typically do in all of our sessions here at Next, but we wanted to keep it interesting for you guys and give you a sense of what's coming next. The flip side is that some of these things might change before you actually see them officially launched in production. So we would love to still hear your feedback if you have any ideas of how to improve these, but we're going to show you what they look like right now. So we're going to talk about the two main groups of use cases that we serve with speech recognition. The first is human-computer interaction. That's the traditional investment in speech at Google. And the second is human-to-human interaction. So you could think about analyzing phone calls between a customer and an agent, or analyzing meetings, or things like that, where you're talking about speech between multiple people. We're going to have new developments to talk about for each of these. Let's start with the first one. So we've had something called Phrase Hints in our API for a long while now. The problem is it's really hard to get it right. Actually, I'm sorry, let me first describe what it does. What Phrase Hints does is it allows you to customize the API to recognize specific phrases.
So if you have specific product names or company names or other things that you're looking for that you expect users might say, you can add that to the recognition command, and then it will improve recognition quality. The trouble is it could also increase false positives. If you add a string, it might start recognizing that string even if it wasn't said. So by default, it's actually set to be pretty weak. So that's a problem. The other problem is you need to come up with different phrases for everything you want to send. So we're going to come out with a few features that address this. First is what we call strength. So as you can see in blue over there, we're going to give you the opportunity to set the strength for each speech context. And so you can set it much stronger than what we've made available up until now. And it makes it much more useful. You'll be able to see that in a demo in a few minutes. The second is classes. So for things like month, for example, if you just tell us that you're looking for a month, we can figure out what's behind it. You don't need to spell out all of the different months, January, February, et cetera. So here's an example list of classes. And we'll share the final one when this is ready to go. But these can help you with your different recognition tasks. OK, let's switch to the demo. OK, here it is. So here, just as an example, there's a sequence of digits that we recorded. Let me play it to you.
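Before the recorded digit samples that are played back next, here is a hedged Python sketch of what phrases, a class, and a per-context strength look like in a recognition request. The feature previewed in this session later surfaced publicly as class tokens plus a numeric boost on each speech context, so the field names and values below are assumptions based on that shipped API rather than the exact preview syntax shown on stage.

```python
# Sketch: bias recognition toward expected phrases and a class token,
# with a per-context boost (the shipped analogue of the previewed "strength").
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    speech_contexts=[
        # Product or company names you expect callers to say.
        speech.SpeechContext(phrases=["Cloud Next", "Dialogflow"], boost=15.0),
        # A class token stands in for a whole family of values (months here),
        # so you don't have to enumerate January, February, and so on.
        speech.SpeechContext(phrases=["$MONTH"], boost=10.0),
    ],
)

with open("order.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
print(response.results[0].alternatives[0].transcript if response.results else "(no speech)")
```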

Speaker 3: 3, 1, 3, 7, 8, 6, 5, 4.

Speaker 1: So you can kind of understand what I'm saying in this recording. Then if you look at this other one, you can see it's very noisy. So it's actually really hard to tell what the digits are.

Speaker 3: 3, 1, 3, 7, 8, 6, 5, 4.

Speaker 1: OK, so if you take the original one, let me just change this. If you take the original one and you send it to the API, then we'll see it recognizes it properly with 100% accuracy. Recognizing a digit sequence is pretty hard, by the way, because even if you get one of them wrong, the whole string is wrong. So in a way, it's like a full sentence. Now let's try the same on the noisy one. So I just sent a curl command, like you see here. And you can see it recognizes it incorrectly. It says, "we went 3, 7, 8, 6, 5, 4," because it doesn't really know that it should expect a sequence of digits. So here, what I'm going to do is use the operand class, which means a digit. I'm going to tell it it should expect a sequence of eight consecutive digits. I'm going to send it over. And you can see it still didn't change the result. The reason is it's still using a relatively weak speech context. So now when I also add strength high, like you can see here in blue, we can try it again. And you can see now it actually recognizes it with 100% accuracy. So that's speech context, strength, and classes, guys. Just as a disclaimer, it doesn't always work right. Speech recognition is a little bit of a science and a little bit of an art. There are going to be cases where you want to force it to recognize digits, but it'll recognize something else. And there are going to be cases where it recognizes too strongly, right? So it falsely recognizes digits when you don't want it to. So these are the kinds of things that allow you a little more control, to play with the API and reach the settings that are optimal for you. For each use case, it might be a little different. OK. Let's talk about the second major use case, which is human-to-human interaction. We've had streaming for a long period of time, but streaming is limited to one minute. We do have long-running recognize, which can handle any length you want, but it doesn't work in real time, right? The problem is speech analytics usually requires long audio files. If it's a phone call recording, it could be five minutes or 10 minutes. If it's a meeting, it could be multiple hours. If it's a movie, it could be multiple hours. So what we're coming out with is a new feature called result end time. What it does is tell you how far the recognition has gotten so far, and then you can restart a new session from where the last one ended. So you can restart a session whenever you want and still get 100% of the same results as if you had kept one continuous stream. So let me show you what this actually looks like in real life. OK. So what you see here in this demo, I can make this maybe a bit bigger. Here, we ask it to restart every 10 seconds. So every 10 seconds, it basically closes the streaming connection and starts a new streaming connection. Right now, the API supports up to 60 seconds, so if I didn't use any timer, it would just go up until 60. But in order to make the demo more interesting, we set it to 10. I'm going to click Start, and now you can see it's listening to me, and it's showing these red boxes. So the red box, you can see, turns into green once it detects the end of an utterance. It's not a good combination with me talking in the echo, I see. But you can see it turns green once it detects an end of utterance. And you can see the yellow boxes show where it restarts. And so you can see the actual audio is not impacted by when it restarts.
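As an aside before the code walkthrough later in this demo, here is a hedged sketch, written in Python purely for readability, of the bookkeeping the restart trick relies on: offsetting each session's result end time by the audio already covered, and carrying over the tail of the previous session's audio. It is not the demo's actual code, which is in Node.js.

```python
# Sketch of the "endless streaming" bookkeeping: each streaming session is
# capped (10 s in the demo, roughly 60 s in the API), so sessions are restarted
# and every result's timestamp is corrected by the audio already consumed.

STREAMING_LIMIT_SECONDS = 10  # demo value; the real cap is about a minute


class EndlessStreamClock:
    """Tracks how much audio earlier sessions covered, so restarted sessions
    report timestamps relative to the very first start."""

    def __init__(self):
        self.restarts = 0
        self.last_result_end = 0.0  # seconds into the *current* session

    def corrected_time(self, result_end_time: float) -> float:
        """Offset a per-session result end time into 'global' demo time."""
        self.last_result_end = result_end_time
        return self.restarts * STREAMING_LIMIT_SECONDS + result_end_time

    def restart(self) -> float:
        """Close the session and report how many seconds of buffered audio from
        the old session should be replayed into the new one (anything spoken
        after the last final result)."""
        carry_over = max(0.0, STREAMING_LIMIT_SECONDS - self.last_result_end)
        self.restarts += 1
        self.last_result_end = 0.0
        return carry_over


clock = EndlessStreamClock()
print(f"{clock.corrected_time(7.4):.1f}")  # 7.4  (first session, global time)
print(f"{clock.restart():.1f}")            # 2.6  seconds of tail audio to re-send
print(f"{clock.corrected_time(3.1):.1f}")  # 13.1 (second session, global time)
```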
So if I keep on talking now, you can see that it keeps on with the recognition, even though it's a whole new streaming session. It keeps all of the same data that it had, so you can see it carries over. With this method, you can basically do endless streaming for as long as you want. Let me show you what that looks like in command-line mode. So I'm going to talk now, and now I'm going to talk again. And let's see. OK, you can see it did a restart, and it didn't impact the recognition. By the way, the mic is on my laptop. I was standing a little further from the laptop, so that probably reduces results a little bit. Oh, wow. That was a really long utterance. OK, so thank you. So that's endless streaming. How many developers do we have in the audience? Who wants to see the code that's powering this demo? OK, a lot of hands up there. OK, let's do it. The only caveat is, I didn't write this code, so I'll do my best to do it justice. Please forgive me if I don't. So this is the code, in Node.js. Can you guys see it? Should I make it bigger? Is that better? Let me scroll down to the bottom, because that's where the action is. So here at the bottom, you can see what it does. It starts a recording session. It uses a program called rec, which is from the SoX family, to capture audio from the mic, and then it sends it into the audio stream transform, which is a transform stream, something that's readable and writable. And then it calls something called startStream. So let's scroll up to startStream. What startStream does is it starts an audio array, and then it starts a new streaming recognize session. So this calls the Speech-to-Text API. And then it sends all the data that's coming in to speechCallback. And then here, you set a timeout, so you restart the stream. This is the 10 seconds that we had in our demo. So let's take a look at speechCallback. What speechCallback does is it gets the result end time from the API. This is what we talked about earlier. And then it calculates the corrected time. Here in the demo, you can see there's a corrected time: you need to know the time from when we started the demo, not just the offset within the current stream. So it calculates that and prints it out. And then it looks for an isFinal. If it gets isFinal, that's the flag that says the end of utterance was detected, so you can mark it as green. Otherwise, it keeps using the same audio buffer and marks it as red. And then let's also look at the callback. That's another important one. Here, what it does is it calculates what we call the chunk time, which is basically the size of an audio array element. You take the full streaming limit divided by the length of the audio array, and then you get the size of each element in the array, or a chunk. And then with that chunk size, you calculate how many chunks you need to grab from the previous array and how many chunks you need to grab from the current array. So the last audio array length tells you, it's like here: you can see you need this part, right? When you restart after the yellow marker, you need to grab the audio that came before the yellow. So that's what's being done here. You take it from the last audio array, and then you recognize it. And then the last important thing to watch out for is this restart function. It's really important to run removeListener.
If you don't do that, you'll have two listeners running in the background. So every time you restart the stream, you need to call removeListener, not just set it to null. And that's it. There's a restart counter, so it knows how many 10-second rounds it has done. And that's it. So that's endless streaming, guys. Thanks. So just to talk a little bit about the limitations of the demo you just saw: it handles two arrays, right? Two audio arrays. And we set each one to 10 seconds. You can set them longer if you want, but in this demo, if you talk for more than 20 seconds at a time, it won't be able to capture that, because it's only set up to capture 20 seconds. The other thing is that it does arithmetic on the audio array based on the length of time that has passed, so it works well for constant bitrate. Linear16 is fine. Mu-law is also fine; even though it's compressed, it's still constant bitrate. But if you try MP3 and things like that, it won't work. Luckily, there's very little real-time streaming that people do with MP3, right? Because you'd have to compress it in real time as well. The other thing to note: diarization context is actually important, and if you restart a stream, it resets that context. So if you want to do diarization, this will impact those results. The other thing is that this is available in the Node.js client library. We're going to add it to all of our client libraries eventually, but they're not all available yet. So this is something you'll need to check before you start using it. Cool. So let's transition to talk a little bit about speech analytics. There are a lot of cool things you can do with speech analytics, especially in call centers: understanding customer sentiment, understanding why users are calling, figuring out how to reduce call volume, or how to upsell. Like if a customer is asking for a product that's out of inventory and you want to follow up, there are all these cool analyses you want to do. But it's actually non-trivial to set up. Actually, could we switch to the other slide? OK. Let me go forward. So there are a lot of technical steps involved. And very typically, we talk to customers, and they love our technology, they love our AI, but in order to actually take advantage of it, they need to go to the IT department and ask them to build something for them, or they need to hire an SI to build a visualization platform. Just having access to the API is not enough. So we want to help with that. And a way to do that is with what we're calling the speech analytics framework. This is kind of a design pattern for how to do analysis on speech. We're showing an example of how it could apply to the call center, but you could also apply it to any other use case. So you can see the audio comes into Google Cloud Storage, and then it gets processed with Cloud Functions. Cloud Functions then uses our different AI tools, the Speech API, the Natural Language API, AutoML, to understand what's actually in those audio files. And then that data gets sent through Cloud Pub/Sub and Dataflow into BigQuery. Dataflow handles the streaming information, and then it gets stored in BigQuery, which can do the analysis. It can help us run queries to better understand what's in those audio files. And then we have a visualization layer. The one we're showing here was built in React, but you could build whatever you want in whatever tool you want.
And Identity-Aware Proxy for security, which kind of leads you to the final result in the visualization. And you're going to see a demo in a little bit. It's pretty cool. What we want is for any cloud customer to be able to get from having audio files to this in two weeks. And we're setting up a program where you can ask for help with SIs or whatever to help you sort of get there. So to elaborate a little bit more about that, let me invite Kevin. He's going to talk lots more.
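Before Kevin's section, here is a hedged sketch of what the framework's first hop might look like: a Cloud Function triggered by a new recording in Cloud Storage that transcribes the call, scores sentiment, and publishes a message for the Dataflow-to-BigQuery stage downstream. The project, bucket, topic, and field names are illustrative assumptions, not the framework's actual code.

```python
# Sketch: GCS-triggered Cloud Function that transcribes a call recording,
# scores sentiment, and publishes the result to Pub/Sub for downstream
# Dataflow -> BigQuery processing. All resource names are illustrative.
import json

from google.cloud import language_v1, pubsub_v1, speech

speech_client = speech.SpeechClient()
language_client = language_v1.LanguageServiceClient()
publisher = pubsub_v1.PublisherClient()
TOPIC = publisher.topic_path("my-project", "call-transcripts")  # assumed names


def process_recording(event, context):
    """Background Cloud Function entry point for storage.objects.finalize."""
    gcs_uri = f"gs://{event['bucket']}/{event['name']}"

    # Long-running recognition suits call recordings longer than a minute.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
        model="phone_call",   # enhanced phone model, if enabled for the project
        use_enhanced=True,
        enable_word_time_offsets=True,
    )
    operation = speech_client.long_running_recognize(
        config=config, audio=speech.RecognitionAudio(uri=gcs_uri)
    )
    transcript = " ".join(
        r.alternatives[0].transcript for r in operation.result(timeout=600).results
    )

    # Document-level sentiment from the Natural Language API.
    doc = language_v1.Document(
        content=transcript, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    sentiment = language_client.analyze_sentiment(request={"document": doc})

    publisher.publish(
        TOPIC,
        json.dumps(
            {
                "audio_uri": gcs_uri,
                "transcript": transcript,
                "sentiment_score": sentiment.document_sentiment.score,
            }
        ).encode("utf-8"),
    )
```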

Speaker 2: OK. Well, thank you, Dan. It's great to be here and to be able to talk to you about what we've been doing at Delta Dental with the tools Dan has been telling us about. Good. Yeah, get the notes back up. All right, so we're a health insurance payer. That means we have two large communities of users: the members, like you or I, who get dental treatment covered by a plan that's either employer-funded or, increasingly, one that we've bought ourselves on the individual market; and the dentists themselves. We cover about 1.7 million members. The bulk of those, obviously, are in New Jersey and Connecticut, but we cover people all over the country. For example, we have 100,000 members here in California. And we're part of a larger network of Delta Dental companies. There are nearly 30 in all, which together cover 73 million Americans. And the association also provides some central national services; the obvious one is simply a directory of practicing dentists that we all need to process claims. So why the call center? Somebody said in the keynote this morning that the call center is still really the primary point of contact for many companies and many of their users, and it's particularly true for us with members and dentists. We do a lot of calls, about 4,300 hours a month. That's about 4,000 calls a day. And three quarters of our calls come from dentist offices. Usually, those are questions about benefits eligibility. Does this person who's just walked in the door have insurance? Has their insurance expired? Or it can get down to the procedures. Is their children's retainer covered? When can I get my crown replaced? All of these kinds of questions come up. In other words, what's the coverage in the plans? And as we all know from our medical and dental experience, plans vary. So that's one thing we know from talking to the customer center. The other thing we know from looking at our phone logs is that most of our calls are short. That spike you see in the histogram on the right is centered on 80 seconds, and it's actually 65 to 95 seconds wide. The interesting thing about that is that in that length of time, you only have time to cover one subject. So from the point of view of someone doing data science and modeling analysis, that's a very nice, clean data set to work with. And the analytics results I'm going to show you are all based on analyzing what's going on there in detail. We figured if we understand single subjects well, we'll be able to get into the more complicated subjects in the longer calls. So the first thing we did was take the speech-to-text from Dan's tools and run it through natural language processing. The natural language sentiment analysis will give you back a sentiment score for every sentence in the transcript. And if you really want to test Dan's stuff, try saying that sentence fast. What we then did was marry that back to the word timings that we get from the speech-to-text and plot the distribution of sentiment as it tracks through a call. And what you can see is that these are not movie reviews; there's no one number that characterizes the call. What we have, though, is a great tool for pulling apart the structure of a conversation. And we can see all the patterns that you would like to see in any kind of customer service call show up immediately in the sentiment. But it's not very useful for getting quantitative results. So what do you do? You have to get a bit more subtle. What we did was split the calls out by type of caller.
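Before the caller-type breakdown that follows, here is a hedged sketch of the alignment step Kevin just described: the Natural Language API returns per-sentence sentiment keyed by character offset, while Speech-to-Text returns word-level timings, so each sentence's score can be placed on the call's timeline. The inputs are simplified stand-ins for the two APIs' real responses.

```python
# Sketch: place per-sentence sentiment scores on the call timeline by
# cross-referencing NL character offsets with Speech-to-Text word timings.
# Inputs are simplified stand-ins for the two APIs' real responses.

transcript = "Thanks for calling. The crown is not covered. Sorry about that."

# (word, start_seconds, end_seconds), as in Speech-to-Text word time offsets.
word_timings = [
    ("Thanks", 0.0, 0.4), ("for", 0.4, 0.5), ("calling.", 0.5, 1.0),
    ("The", 2.0, 2.1), ("crown", 2.1, 2.5), ("is", 2.5, 2.6),
    ("not", 2.6, 2.9), ("covered.", 2.9, 3.4),
    ("Sorry", 4.0, 4.3), ("about", 4.3, 4.5), ("that.", 4.5, 4.9),
]

# (begin_offset, sentence_text, sentiment_score), as from analyze_sentiment.
sentences = [
    (0, "Thanks for calling.", 0.6),
    (20, "The crown is not covered.", -0.7),
    (46, "Sorry about that.", -0.2),
]


def sentiment_timeline(transcript, word_timings, sentences):
    """Yield (start_time, end_time, score) for each sentence in the call."""
    # Recover each word's character offset in the transcript.
    spans, cursor = [], 0
    for word, start, end in word_timings:
        idx = transcript.index(word, cursor)
        spans.append((idx, start, end))
        cursor = idx + len(word)

    for begin, text, score in sentences:
        end_char = begin + len(text)
        covered = [(s, e) for (b, s, e) in spans if begin <= b < end_char]
        if covered:
            yield covered[0][0], covered[-1][1], score


for start, end, score in sentiment_timeline(transcript, word_timings, sentences):
    print(f"{start:4.1f}s -> {end:4.1f}s  sentiment {score:+.1f}")
```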
So look at the top two rows first. In the left column, we have the dental offices and members we talked about earlier, with the same sentiment plots for each of those. These are actually three-month averages. On the right-hand side, we plot the residuals: we subtracted the average sentiment profile from the individual classes. And you can see now we actually have something that we can model. We can fit a straight line to this data, and it's a pretty good fit. The pale bars are 95% confidence limits. So two things are obvious. One is that members and dentists are different. Either dentists are a gloomy lot, or we're continually giving them bad news about how much of the procedure they want to do is actually covered. But we're doing quite well with members, and the sentiment is generally rising through the call. For the bottom row, it turns out that some of our major accounts have a specific phone line, so we can identify a major account; we give the same number to the dentists and the members. So you'd expect those calls to split the difference, and they do very nicely. So what this is telling us is not necessarily anything we can immediately use directly, but it's telling us that there is real structure in the calls, and it is worth analyzing them in more depth. One of the things we've started recently is that we've turned on the premium phone model, and we're now running speaker diarization. So the next generation of these plots will actually separate out the sentiment profile for the caller and the sentiment profile for the agent. Talking of agents, you can do the same thing for agents. The first time I showed this plot to our customer service team leaders, I hadn't anonymized the agent names. And the immediate reaction of one of the team leaders to the top one was, "She's no longer with the company." That was a huge win, because it gives you instant credibility with the business users: the kinds of measurements you're making on this data, which seem very abstract, are actually measuring real things they're seeing when dealing with their employees. So, nice one. The next metric we looked at was something that isn't there: the dead air in the calls. All conversation has natural rhythms in it, and there are pauses; they vary in length. This is a standard metric in call centers. Obviously, you want your people talking. We got asked to do this by the call center, because they had an agent who they noticed had a lot of dead air, a lot of pauses in the conversation. And you can see that in the plot on the right. The other thing you can also see is that the call structure shows up in exactly the same way in the dead air. For the familiar phrases, the opening greeting, and the end of the call, we don't pause very much. When we're thinking about working an issue, we do. This has been an enormously valuable metric for us. We've used it in customer service for coaching agents. We have done tests instrumenting the agent's desktop and been able to show that our agents, in the long pauses when they're working an issue, are looking at applications like sticky notes and Outlook. They're looking up information in their personal knowledge repositories. That tells us what we have to build next for them. We also instrumented it to test whether dual monitors would help. So we can quantitatively see impacts in the customer center just from these very simple measurements.
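A hedged sketch of the dead-air metric just described: given word time offsets from Speech-to-Text, pauses are simply the gaps between one word's end and the next word's start. The pause threshold and the sample timings below are assumptions.

```python
# Sketch: compute "dead air" (long pauses) in a call from word time offsets.
# word_timings: list of (word, start_seconds, end_seconds) for one call.

PAUSE_THRESHOLD = 2.0  # assumed: only count silences longer than this


def dead_air(word_timings, threshold=PAUSE_THRESHOLD):
    """Return total pause time plus each pause over the threshold."""
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(word_timings, word_timings[1:]):
        gap = next_start - prev_end
        if gap >= threshold:
            pauses.append((prev_end, gap))
    return sum(gap for _, gap in pauses), pauses


words = [("does", 0.0, 0.2), ("this", 0.2, 0.4), ("person", 0.4, 0.9),
         ("have", 0.9, 1.1), ("coverage", 1.1, 1.7),
         ("yes", 6.2, 6.5), ("they", 6.5, 6.7), ("do", 6.7, 6.9)]

total, pauses = dead_air(words)
print(f"total dead air: {total:.1f}s")              # total dead air: 4.5s
for at, gap in pauses:
    print(f"  {gap:.1f}s pause starting at {at:.1f}s")
```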
And that's one of the points that I hope comes out of this: once you have all the sophisticated technology of speech-to-text and natural language, you're able to ask very simple business-oriented questions and get good answers. Another metric: we've worked on topic modeling, but we haven't yet got any very good results, and we're about to go through another iteration of that. Topic modeling for us is really hard because, as you saw, we had benefits eligibility earlier on. That's a huge subject. We need to be able to get to the second tier and do it reliably. But there are some things you can do just by counting words. In this particular case, we had a change that had been talked about in the organization for a long time. We wanted to change some procedures, and we could never make the business case. By just counting those two words across a sample of transcripts, we were able to make the business case to change the IVR system. That change will roll out at the end of this month. We expect it to cut the number of calls we have by about 20%. In a year, that will pay for our investment in this project. Now, this situation may not apply in your business, but we're pretty sure from what we've been finding looking at the transcripts in depth that everybody has some kind of problem like this. There's something there, but it's buried at the moment. And the last one, tying back to IVR systems again: how many people here like going through long IVR menus? Right. Well, it shows up when you talk to the agent at the end of it, and we can measure that. We tied back the sentiment scores, and particularly the negative sentences in the call, to the path through the IVR system. So keep them short. OK, those are some of the examples of analytics we've been able to do on this data. Now I'd like to get a little bit technical and talk about how we built this. We built it in a framework that's very like the one that Dan just outlined to you. We upload our recordings into the cloud, and when you upload a recording to the storage bucket, it triggers a cloud function. This is an entirely serverless pipeline; it's scalable horizontally and vertically. Once that cloud function finishes processing, the transcribe interface writes the output back as a transcript record into the same storage bucket, which triggers another instance of the cloud function. And we just cascade through those instances until we've got all our processing done. We've found that this is a very reliable way of doing it. It's straightforward to deploy, and it runs very well. We can scale it vertically, as it scales very well with increasing volumes in a single project. It's also easy to scale horizontally: if we've got a slightly different problem we want to work on, or we want to process a data set for somebody else, we just spin up another project, install all the software, and go. And because it's all cloud-function based, it's a quick deployment. We can spin up a project in three hours, and it only takes that long because I haven't finished writing the automation script. So again, a little bit about how we build this. We develop in Python. We wrap all the Google APIs in our own Python classes, and we use those exclusively for the cloud and the on-premise applications. This really reduces the amount of effort we have to put in on building out a new application to handle configuration, exceptions, retries, all those kinds of things. It also turns out to be really essential for handling the results.
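As a hedged illustration of the thin wrapper classes Kevin mentions (he returns to the result formats next), here is one way such a wrapper might add consistent retry and backoff around a Google client. The class name, retry policy, and error choices are invented for the example, not Delta Dental's actual code.

```python
# Sketch: a thin wrapper that adds consistent retry and backoff around a
# Google API client, so application code never handles transient failures itself.
import time

from google.api_core import exceptions as gexc
from google.cloud import language_v1


class SentimentService:
    """Hypothetical wrapper shared by the cloud and on-premise applications."""

    def __init__(self, max_attempts=4, backoff_seconds=2.0):
        self._client = language_v1.LanguageServiceClient()
        self._max_attempts = max_attempts
        self._backoff = backoff_seconds

    def analyze(self, text: str):
        doc = language_v1.Document(
            content=text, type_=language_v1.Document.Type.PLAIN_TEXT
        )
        for attempt in range(1, self._max_attempts + 1):
            try:
                return self._client.analyze_sentiment(request={"document": doc})
            except (gexc.ServiceUnavailable, gexc.DeadlineExceeded):
                if attempt == self._max_attempts:
                    raise
                # Exponential backoff before the next attempt.
                time.sleep(self._backoff * 2 ** (attempt - 1))


# Usage: one place to tune retries for every application that needs sentiment.
# service = SentimentService()
# print(service.analyze("Only thirty dollars were covered.").document_sentiment.score)
```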
Speech-to-text gives the results back in a format that's very natural for speech-to-text: word timings and words. Natural language gives us back the sentiment analysis in its natural format: sentences and character offsets, because when they built that, they didn't have word timing information. We have to cross-reference all of those. And we've found, as we've looked into other Google APIs, that there's a lot of work in just integrating all of this together. OK. What I'd like to do now is run a short demo. Dan talked about the visualizations. OK, if you can cut it over. Thank you. So this is running on a data set in the cloud, in one of our cloud projects. And you can see the call volume here. I pulled a week of call data to build this top-level template. So this is a real week from about three weeks ago. And you can see we've got an average sentiment over all the calls. We've got the average agent talk time, client talk time, silence, and so on. We can go down. We can plot this. You can see the timelines. You can see that our customer service center is open from 7 AM to 9 PM, but very few people call after 6 PM. And below that, we have the call log. We can pick out a particular call. Obviously, we built it on the Google Cloud Platform, so there's got to be a search engine in there somewhere, and we have the ability to do search. This, at the moment, is just a dummy; we haven't hooked up the full-text search on the database. And it wouldn't actually be that useful, because for this demo, I had to use redacted transcripts. We're a HIPAA company, and I've only got two of them redacted in here. So the word cloud works. You can see the evidence of the redaction, how the tags come in. If you click on a word, it will rearrange for you, but with only two recordings in there, it doesn't change much. And then finally, the idea is that a customer service team leader will be able to use this kind of dashboard and drill into a particular call. And we can do that as well. It takes a moment to load. You can see we've got the same kind of parameter sets up at the top; they're just the measurements for the individual call. And then we have the sentiment timeline here. So this is the individual-call version of the aggregate graph I showed you earlier, and it's a useful tool. So here, we've got a large negative sentiment. What went on? It brings up the sentence down at the bottom. And it's not really emotional negative sentiment here; this is factual. One of the problems about doing this only on text is you don't get to see it.

Speaker 4: Only $30 were covered, and that's why we sent you that check for $30. So yeah. I think that was the only thing Delta paid. Exactly.

Speaker 2: So our agent wasn't being emotionally negative, but it was negative information. That's where we run into the limits of dealing with text, and we would need another method. There are other tools we've got in here, and I've got just over two minutes left. I'm sorry. So we're looking at the entity transcript: we can pull up things like the organizations and highlight those in the whole text. And below that, obviously, at some point you want to look at the listing of the call. We have the diarized listing, which is tagged by agent and client. And obviously, we can identify the agent, because for us, the initial greeting is fairly standard. So, one final thing.

Speaker 4: You're very welcome. You have a great day. Thank you for calling. You have a happy New Year.

Speaker 2: Thank you. And with that, it's back to you, Dan.

Speaker 1: Thank you, Kevin. Can we go back to the slides? So I see that, and I get excited. It's pretty amazing that you could do that in two weeks, basically go from audio to a complete dashboard with visualization of everything that you want. It's clearly not enough to do everything you would need in a large call center, but we have a lot of partners that can help turn it into whatever you need. So if you're interested in more information about the speech analytics framework, there's a link at the top; just go to bit.ly/saf. All of the components that we talked about today are available in our larger solution for contact centers, called Contact Center AI. That's the URL for that, and there's also a session about it tomorrow, so please go there if you want more information. And then if you want to learn more about endless streaming and the new speech context features we demoed at the beginning, go to bit.ly/gcp-next19-speech. If you sign up there, we'll send you information when it's available to test. So we have half a minute. Do you guys want to try one more bonus demo? OK. Let's try and do it quickly. Can we switch back to the demos, please? OK. So does someone here speak a language other than English? Whoops. Yeah. Which language? OK. Let's test this with Spanish. Welcome, everyone, to Google Cloud Next. And I hope you're having a great week here in San Francisco.

Speaker 5: Les damos la bienvenida a todos a Google Cloud y espero que tengan una excelente semana aquí en San Francisco. [In English: We welcome everyone to Google Cloud, and I hope you have an excellent week here in San Francisco.]

Speaker 1: How was it? Was it accurate? OK. So what you're seeing here is three things: Speech-to-Text together with Cloud Translation, and then Text-to-Speech, all coming together to be your real-time translator. And what I want to show you now: there's this cool device. You can't really see it if you're far away, but I'm holding a small device in my hand, and it's from a company called Sourcenext. They released this device, and I'm going to talk to it right now. I'm holding this cool device in my hand, and I'm wondering what it can do. Anyone speak Russian here?

Speaker 6: Yeah. OK. Hopefully it was that accurate.

Speaker 1: I don't think so, but that's probably coming soon. Yeah. We haven't released text-to-speech in India, but stay tuned. But anyway, if you want to see more of this, you can go to the Google Cloud Next website, and you'll see that there's a lot of cool stuff. If you want more information about this device, they launched it a while back. Whoops. Sorry. And 300,000 have already been sold, so it's pretty exciting. Let me go back to the links, and you can see them there. So while we're doing that... I don't think we have time. Oh. Please leave feedback. The organizers were very clear: it would be great if we can get as much feedback as possible from as many of you as possible. Tell us if you enjoyed the session, tell us if you didn't, what you want to hear more of, and what you want to hear less of. Thank you.
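For readers who want to recreate the bonus demo's pipeline, here is a hedged sketch of the three-API chain it describes: Speech-to-Text, then Cloud Translation, then Text-to-Speech. It is not the demo's or the device's actual code; the language codes and the Spanish voice name are assumptions.

```python
# Sketch: chain Speech-to-Text -> Translation -> Text-to-Speech to build a
# simple English-to-Spanish "interpreter", as in the bonus demo.
from google.cloud import speech, texttospeech, translate_v2 as translate

speech_client = speech.SpeechClient()
translate_client = translate.Client()
tts_client = texttospeech.TextToSpeechClient()


def interpret(wav_bytes: bytes) -> bytes:
    """Take English LINEAR16 audio, return Spanish MP3 audio."""
    # 1. Recognize the English speech.
    response = speech_client.recognize(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        ),
        audio=speech.RecognitionAudio(content=wav_bytes),
    )
    english = " ".join(r.alternatives[0].transcript for r in response.results)

    # 2. Translate the transcript into Spanish.
    spanish = translate_client.translate(english, target_language="es")["translatedText"]

    # 3. Synthesize the translation with a Spanish WaveNet voice (name assumed).
    synthesis = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=spanish),
        voice=texttospeech.VoiceSelectionParams(
            language_code="es-ES", name="es-ES-Wavenet-A"
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    return synthesis.audio_content
```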
