Scaling Modern Voice Technologies for Every Language
Explore modern voice solutions and their scalability. Learn to integrate speech-to-text and text-to-speech technologies effectively.
Open Source Speech Technology for Everyone (L3-AI 2021)

Speaker 1: Yeah, as Alan said, I'm going to talk today about voice technologies, specifically modern voice technologies and building solutions that scale. So let's dive right in. I'm going to try to go for about 20 minutes and leave some time for questions at the end. I've also mentioned to the organizers that if folks are interested in hanging around a little longer afterwards for more questions, I'd be happy to do that. So here we go. Quick intro: you heard something about me already from Alan. My name is Josh. I did a PhD in speech recognition, I worked at an entertainment startup, Artie, doing voice technology for mobile games, and most recently I was a fellow at Mozilla working on speech technologies for a few under-resourced languages in East Africa. Now I'm a co-founder at Coqui, and I spend most of my time thinking about data, product, and the core technologies of text-to-speech and speech-to-text. I've got about 10 years of experience in language technology, but I'm always learning new things. So what is Coqui? Coqui is a company, but it's also the name of a frog, a very small frog from Puerto Rico and other places. We're creating developer-friendly speech solutions. What does that mean? We're creating speech-to-text and text-to-speech, with a very strong core focus on making these technologies available for every language. We want to make it as easy as possible: you've got data in your language, you want to train something, you want to ship an application, and it should be as easy as the click of a button. Our code is open source; we're at the core an open source company, and we've been working on this project for years now. The code is released under the MPL 2.0, which is commercial-friendly. Most recently, one of the big points is that we've opened up a community-driven model zoo. We have lots of folks from academia and from different enterprises, and we just got a model from Translators Without Borders. We want to make it possible to share the work that people have done, and for other people to deploy that work in their own applications. We've got speech-to-text models for 48 languages in the model zoo, and more than 48 models, because some languages have multiple models from different teams or contributors. Text-to-speech is the same: eight languages, but more than eight models. You can find us on GitHub, where our organization is coqui-ai, and that's where all of our projects live for speech-to-text and text-to-speech. We've also got a bunch of example repositories for getting started and example integrations, like how to put our models on an iPhone or Android or into other kinds of applications. On Twitter we're coqui_ai, and our website is coqui.ai. So what is voice technology? This might seem like a very basic question, but it's only basic if you take a very thin sliver of what you mean by possible voice technologies. Since we're at the Rasa conference today, I thought I would focus on speech technologies in the context of conversational bots, conversational applications.
And so if we look at it like this, we've got speech-to-text, which takes audio in the world, probably from a certain person, translates it into text, and gives that text to the application, to the bot. It can be a real IoT speaker, it can be a robot like this, it can be your phone, it can be anything. But you're talking, your speech goes through a microphone, it gets to the computer, and the computer has a really hard time understanding raw audio, so it's better to use text. And once you get to text, you get to the brain of the application, and this is where Rasa or some other kind of NLP processing application comes in. That's separate from what I would consider voice technology; they're two parts of the same puzzle, but the brain is this NLP processing part. And then if you have some communication going back to your user, you want to communicate that with text-to-speech. This is very high level, very simple, but it gets to the point of what we're talking about today. So I told you what voice technology is in this context. Now, the name of the talk includes modern voice technology, and this is important. Modern is just my way of saying neural network technologies. Every part of this cycle, the speech-to-text, the NLP side, and the text-to-speech, is, if it's modern, at its core going to be some kind of neural model, some kind of machine learning model, especially on the voice side. For our technology in particular, which is pretty indicative of most modern voice technologies, you've got an architecture that looks like this: at the bottom you've got audio going in, and at the top you have text coming out. That's it. It's pretty simple. There are bubbles in there that are different layers of the neural network, but at its core, you just put in audio and you get text out. And this is important, because historically you had multiple steps in here, and you really needed a specialist, somebody with a master's or a PhD in this technology, an engineer who's been working on it for a long time, to go in and really understand all these different parts. Now you can just put a black box over all of it: you put in audio and you get out text. And so it changes the problem of deploying these applications from an "I need a specialist" problem to an "I need my own data" problem. You can pull something off the shelf, you can use some kind of API, and it'll work on the same principle. It's a black box, but you can't really update that black box. With modern technologies, if you want to update the black box, you need your own data, you need your own audio, and you need to have it transcribed in lots of contexts. So that's what I'm talking about when I'm talking about modern voice technology.
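That whole loop, audio in, the Rasa brain in the middle, speech back out, is mostly glue code around the two speech models. Here is a minimal sketch, assuming a Rasa server running locally with its REST channel enabled; transcribe and synthesize are placeholders for whatever speech-to-text and text-to-speech you plug in:

```python
import requests

RASA_URL = "http://localhost:5005/webhooks/rest/webhook"  # Rasa's built-in REST channel

def handle_turn(audio, sender_id, transcribe, synthesize):
    """One turn of the voice loop: audio -> text -> bot brain -> text -> audio."""
    text = transcribe(audio)                      # speech-to-text (e.g. a Coqui STT model)
    response = requests.post(
        RASA_URL,
        json={"sender": sender_id, "message": text},
        timeout=10,
    )
    replies = [m["text"] for m in response.json() if "text" in m]   # the bot's answers
    return [synthesize(reply) for reply in replies]                 # text-to-speech audio
```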
So digging deeper, what are the implications of this new way of doing voice technology? Voice technology is at its core a neural network, like I said. But really, that neural network is a generated asset of this machine learning cycle. This cycle is pretty common: everywhere you have machine learning models, whether you're working with NLP, computer vision, or voice, it's pretty much the same. Starting at the very top here, we've got the model training. Lots of people consider this to be the holy grail of machine learning. This is where people get really hyped up about new architectures, new neural networks, training on thousands of GPUs. There's a lot of interesting stuff going on here. But I want to drive home the point that it's just one part of the cycle, and without the rest of the cycle it would be absolutely meaningless; you would have an application that people are not happy with. Everything leading up to the model training, on the left side of this diagram, is data. And everything coming after the model training is model ops, ML ops. Both of these are core parts; they're the pillars that hold up the cycle, and I'd say they're some of the most overlooked parts, and cracks in these pillars show up everywhere. So data collection, data annotation, data QA: this is really the juice that you're feeding the model training. If you don't have good data, you will have a bad model. Garbage in, garbage out. And this is not a glorious section. A lot of times, if you've hired a machine learning engineer or researcher to build and integrate these models for you, they might not want to spend a lot of time on this. But I always say: if you've got 10 hours to spend improving the performance of your model, spend that time cleaning your data. Don't spend it looking around for newer architectures and tweaking hyperparameters, as tempting as that is for a lot of folks. And then after you've trained the model, you want to make sure it's actually the best model you've got, so you need to do some really interesting QA there. Then there's deploying the model and monitoring the model. These are just good engineering practices to make sure that if you've deployed something, it's acting the way you think it's going to act. So yeah, this is a cycle, and it's ongoing. If you've got a voice application, it's not the case that you train a model, you deploy it, and then you're done. Your users are constantly changing, your interactions are constantly changing, people are adapting to the model, people are changing the way they speak. This is something that just keeps going and keeps going. So with that in mind, what does it mean to scale? Another part of the title of today's talk is building solutions that scale. There are a lot of different ways to think about scaling. What I want to talk about right now is what it means to scale as your user base grows, as you put these voice technologies, these voice experiences, in front of new people. How do you know that you're scaling well, that the model is going to perform for these new people? We need to scale to new users, new accents, new languages, new topics, and more data. This all becomes part and parcel of the pipeline I was showing you before, and of the fact that this is modern voice technology. I'm talking right here about modern speech-to-text: you put something out there in the world, you start collecting data, and you use that data to get better for the people you're going to see in the future. Something that's very particular about these kinds of models is that they learn what you have given them, what they've been trained on.
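Because the model only learns what it's fed, the unglamorous data QA step above is worth automating a little. Here's a tiny sketch of a pre-training sanity pass, assuming a manifest CSV with wav_filename and transcript columns (roughly the convention the Coqui STT trainer uses); the thresholds are made-up illustrations, not recommendations:

```python
import csv
import wave

def audio_seconds(path):
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def flag_suspect_rows(manifest_csv):
    """Yield (row, reason) pairs that deserve a human listen before training."""
    with open(manifest_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            seconds = audio_seconds(row["wav_filename"])
            text = row["transcript"].strip()
            chars_per_second = len(text) / max(seconds, 1e-6)
            if not text:
                yield row, "empty transcript"
            elif seconds < 0.5 or seconds > 30.0:
                yield row, f"unusual duration: {seconds:.1f}s"
            elif chars_per_second < 2 or chars_per_second > 25:
                yield row, f"audio/transcript mismatch? {chars_per_second:.0f} chars/s"

for row, reason in flag_suspect_rows("train.csv"):
    print(reason, "->", row["wav_filename"])
```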
So if you train a model on a new language (I see somebody's asking about Telugu), and you've only collected male voices, only men in their 40s, your model is only going to be really good for those kinds of voices. It's going to be much worse for women and for other people outside of that original database. And that's one of the great things about these technologies: you deploy them, you put the model out there, and you can start identifying where you're having issues, which accents are giving you trouble, where the model is underperforming, whether there's a gender gap. As you identify those problems, you collect more data and retrain the model, and it just keeps going and gets better. The more people use it, the better it gets. That's something you really miss when you've got a generic web API. Think of it this way: you plug your application into a generic API sitting on somebody's servers in the cloud, from some big company. The more you use that model, the more your customers interact with it, that doesn't mean the model is getting better for your customers. But if you take the technology into your own hands, you can make sure your model is best for you and your customers. That's what I mean by scaling. And it actually comes down to just fine-tuning. This is a very popular topic, and not only in NLP, where I'm sure you've heard about it. For any kind of neural technology, like these voice technologies, you've got a neural network, and all you have to do is feed it more data. It's fine-tuning; you don't have to start from scratch. So it becomes much more resource efficient, and you don't pay as much for GPUs. Okay, I think I've got a couple more minutes before I open up for questions, and I wanted to go over some common pitfalls that I have definitely hit in the past and that I see people hitting all the time: common pitfalls in deploying something and setting up a good engineering pipeline, something that can scale. First, like I mentioned, don't assume a one-size-fits-all model. Your users are not one size fits all, so the model should not be one size fits all. Accents are the easiest example here. A lot of the voice technology that's been commercially available in the past, and even presently, to be honest, is very much focused on American English. Take that technology and put it in front of Scottish English speakers, and it's going to fail miserably, because it wasn't made for them. Your customers aren't one size fits all, so your speech solutions should not be one size fits all. Second, know your models. This comes down to good practice in deployment and MLOps. Once you've sent a model out into the world, it's different from a lot of other kinds of code, where you can just read logs and see where things crash or where you run out of memory. You need to know how your model is performing, and that means looking into user experiences yourself. That's hopefully something folks here are pretty familiar with, but you need to know what's going on after you've launched a model.
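One way to know what's going on after launch, and to check the one-size-fits-all assumption, is to break your error rate down by user group. A rough sketch, assuming you log model output next to human reference transcripts along with some metadata such as accent; the field names and the jiwer dependency are illustrative assumptions, not part of any particular stack:

```python
from collections import defaultdict

import jiwer  # pip install jiwer; a commonly used word error rate implementation

def wer_by_group(samples, group_key="accent"):
    """samples: dicts with 'reference', 'hypothesis', and metadata like 'accent'."""
    references, hypotheses = defaultdict(list), defaultdict(list)
    for sample in samples:
        references[sample[group_key]].append(sample["reference"])
        hypotheses[sample[group_key]].append(sample["hypothesis"])
    # A per-group WER makes under-served accents (or genders, or topics) visible.
    return {group: jiwer.wer(references[group], hypotheses[group]) for group in references}

# Something like {"scottish": 0.31, "american": 0.09} tells you where to collect data next.
```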
Here's the third point, and it's actually a really sneaky one that has come back to bite me: whatever technologies you're using are a small part of a larger product. If you've got a chatbot for a medical domain, a transcription application for lawyers, a transcription application for bedside notes, or an IoT smart light bulb that you want to turn on and off, there's the entire product, and then there's the voice technology. The voice technology is an important part of a lot of these, but it's not the only part. There's a whole conversational UX, conversation flow, dialogue creation that has to happen to make a successful application. And the core technologies you're using, the speech-to-text and the text-to-speech, need to be part of that creative process, because these are machine learning models. Sometimes they do things you weren't expecting; they can act in ways you wouldn't have predicted. You need the creative people on your team to know what to expect from these models. With speech-to-text, that can be as simple as giving a creative on your team microphone access so they can try things out in real time. That's a nice first pass, but it's very anecdotal, so if you're going to do something like this, you need to keep doing more extensive user testing. But yes, the creatives need to have access to these tools. Also, NLP and voice teams need to work together. In any kind of chat application you've got speech-to-text, NLP, and text-to-speech, which means the inputs and outputs of these systems interact, and you can't have isolated teams. It's a very simple thing: save yourself the headaches and make sure everybody's on the same page. Next, know your data. Knowing your data, knowing your users, knowing your customers: these are different things. The data has been generated by your customers interacting with your application, and then you take that data, you process it, you label it, you transcribe it, and you put it in front of these tools for training models. And then at the end of the day your model is doing weird stuff, you don't know why, and you're thinking, I know who my users are, this isn't what I expect. You need to look at the data right before it goes into the training pipeline. You need to listen to the audio files, you need to read the transcripts yourself, and you need to understand what's going on in the system. This is important. Okay, know what you're measuring. I'll try to be a little faster here, but this is important. A lot of people compare and evaluate models for voice technology by benchmarking on standard academic datasets. This is really useful in academia for comparing model architectures, but you should not assume it's going to generalize to you and your application. Case in point: word error rate, WER, is the most common metric people use to compare speech-to-text models. It's basically accuracy: what percent of words did the model get wrong? If you're judging your model's performance by its word error rate on audiobooks, like LibriSpeech, that's not an indicator of how it's going to generalize to your users, because audiobooks are spoken in a very specific way, by very specific people, without a lot of background noise. So know what you're measuring, and make sure it matters for you.
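For reference, word error rate itself is only a few lines of code: the number of word-level substitutions, insertions, and deletions needed to turn the model's output into the reference transcript, divided by the length of the reference. A minimal version looks like this, and the point above is that you should run it on transcripts from your own users, not only on a benchmark set:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("turn on the kitchen light", "turn on kitchen lights") -> 0.4
```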
Next point: define your use case and solve for it. This goes along with measuring. You need to know what you care about. What's your application? What's the actual metric you care about? It's probably not going to match the metrics you're used to seeing in research, but you need to find what it is for you, measure it, and track it. Remember that you're working on a team. It can be very tempting to use very specific academic tools for certain parts of the pipeline, but you need to think about the entire pipeline and everybody interacting with it, the creative folks, the voice folks, the NLP folks, and make sure that everybody's on board with the tools you choose. And lastly, know when your model breaks versus when your NLP or NLU breaks. You basically need to log everything, because it can be very tricky to find where a breakdown in the system happened. Who's at fault? Is it the speech-to-text? Is it the NLP? Or is it the text-to-speech that said something unexpected? You need to have a handle on this; they're not all the same. Getting started. Very quickly: understand your goals. Why do you want speech? Do you really need speech in your application? There are some applications where speech might not be the most appropriate thing, so you need to know whether this is something your users want, and what exactly that is. Try it out. Do some Wizard of Oz testing. Do people actually want to talk to, I don't know, their refrigerators? I guess people do. But there are other things where it just doesn't make sense for your application, and you should know that ahead of time. Make a solid data plan. This is really important. Before you even start deploying something, make sure you've got all of the groundwork laid out so that you can collect data that's going to be useful for you. You will thank yourself later. And when you're ready, you can try out our tools as easily as pip install TTS and pip install STT, and hopefully just hit the ground running.
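As a hedged quick-start sketch of those two packages: the model paths below are placeholders for files you'd grab from the model zoo, the STT method names follow the DeepSpeech-style API that Coqui STT inherited, and the TTS Python API shown here is from more recent Coqui releases (older ones expose a tts command-line tool instead), so double-check against the versions you install:

```python
import wave

import numpy as np
from stt import Model        # pip install STT
from TTS.api import TTS      # pip install TTS

# Speech-to-text: load an acoustic model from the model zoo and transcribe a clip.
stt_model = Model("english_model.tflite")            # placeholder path
with wave.open("utterance.wav", "rb") as wav:        # expects 16 kHz, 16-bit mono PCM
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
print(stt_model.stt(audio))

# Text-to-speech: synthesize a reply with a pre-trained model from the TTS zoo.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="The kitchen light is now on.", file_path="reply.wav")
```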
