Speaker 1: Hey my lady. How are you? I am doing great. I am a little jealous that you're able to sit in a t-shirt though. I'm just getting over a cold, not COVID, but I'm definitely jealous of the warm weather. Yeah, you're in Russia, right? Yeah, I'm in St. Petersburg, basically one of the colder parts, farther north. It's pretty, but it's cold.
Speaker 2: Yeah, I just got back from Kyiv and it was moderately cold. It was full-scale autumn. It was raining a lot and it was definitely cold, and it's quite hot in here. It looks nice. So today I invited you to speak about speech tech in general. But we can talk about Coqui as well, and maybe you can start by talking about Coqui a little.
Speaker 1: Yeah, for sure. I think it's part of the entire landscape of open speech tech. When I think about speech tech, there are lots of different parts to it. I think about it from this view: you've got an application, some kind of app, a connected device, a bunch of audio you want to analyze, and for that you want speech-to-text or you want text-to-speech. Maybe you want both. Maybe you want something more specific like keyword spotting, hot words, speaker identification. There's a whole list of what speech tech is. And when you're looking at where you can get it, there's this range from off-the-shelf to custom. Off-the-shelf is usually easy. It's something that you can just download, or you can get an API for it. You just plug it in and it works, hopefully. If you go custom, you're more likely to get better performance on your application. For every application, there are different acoustic characteristics that are important. If we're thinking about analyzing speech as opposed to producing speech, things like echo matter. In the room I'm in right now, there's a good amount of echo because there are just flat walls, right? If you've got an application where you have speakers with a certain accent relative to the language, then that also needs to get taken into account. There are demographic factors; there are lots of things that make your speech special. So when you're thinking about grabbing something and plugging it into your application, you think off-the-shelf versus custom. Off-the-shelf is easy, like I said. But the more your speech application varies from the norm, the more headaches you're going to get from using something off-the-shelf, and the more you want to do something custom. So where your application lies on that spectrum is one thing, and what solution makes sense for your application is another thing, because usually custom is more of a headache. So I think about these dimensions when I'm looking for speech tech to integrate. And yeah, there are a lot of options.
Speaker 2: What I think is that if your core product is the speech tech itself, say a conversational agent, then you probably have the space for the engineering effort and you should go for a custom one. But if it's just a small part of your product ecosystem, maybe you're an unrelated company, like, I don't know, an airline, and you want to build something to automate your processes, then you can use something off-the-shelf.
Speaker 1: Yeah, so that's a really important dimension: how much is your product built around speech working, right? Right now, this is a very important dimension. I hope that in the future it becomes less important. What you're getting at is: if speech is important enough to you that you're willing to spend the engineering resources on it, that means it's a core part of your product, which means you're able to launch something custom, right? I think that's true, and it's unfortunate. I think it should be easier to launch custom speech solutions like speech-to-text and text-to-speech. And to do that, basically what you have to do is lower the barrier to entry. You have to lower the assumptions, the requirements about how much engineering resources you need. Because usually it's not even just the number of engineers. It's: do you have a single engineer who's got a master's or a PhD in speech technology? Can they put this together? And this is a nice segue to what we're doing at Coqui, because that's what we're trying to do. We're trying to lower the barrier to entry so that normal devs, everyday devs, full-stack folks, can just pick up speech and make a custom solution. I think about it as an analog to databases. You find databases everywhere, right? There are tons of different kinds of databases. There's MongoDB and MySQL, and there's a bunch of them. And you don't need a special degree, right, to pick up a database and make a custom solution. If that were the case, that would be crazy, right? You can't imagine a modern web app without a database in it somewhere, at least one, if not lots. And if you had a barrier to entry where you needed a master's in databases to launch a database, that would be crazy. That's what we're trying to do at Coqui in the near future: make it so that you don't need a master's in speech tech to launch a custom solution. And if we get this right, you're going to see speech showing up in a lot of places where you don't see it now, because of exactly what you said. If speech is a core part of a product, then the company can devote resources to it. But I think there are lots of places where speech isn't necessarily a core part of the product, but it would be really awesome to launch, right? It would add a lot of value to the overall product. And so lowering the barrier to entry is something that I'm really interested in doing.
Speaker 2: I get it. Okay. What kind of speech tech do you offer? Is it like speech-to-text, text-to-speech, is it chatbots, voice assistants, is it all in one?
Speaker 1: So, right now, we are offering the two foundations, I would say, of speech tech: speech-to-text and text-to-speech. As we grow, we're going to grow to more offerings, but that's the core of what we have right now. And there are a lot of things you can plug these solutions into and get something like a chatbot, right? So, for a voice chatbot, let's say you're calling a bank and you want to open a new account. I don't know about you, but when I call my bank, I don't remember the last time in, I don't know, five or more years that I actually talked to somebody right away. It used to be the case that I tried to figure out how to navigate their voicebot to get to a human as fast as possible. But it's gotten to the point where it's actually pretty solid, and a lot of times I can solve my problems without talking to a human. But that's a little bit of a digression. So, for that application, let's say opening a new bank account: I call the bank, I say something in audio, it gets transcribed into text, and then they've got this chatbot. I don't know what they have under the hood, but let's say they have some intent classification. They figure out that I want to open an account, and then from there they start identifying things that are important, you know, who I am, my name, my existing account number, all this stuff. And then when they repeat something back to me, they're going to use text-to-speech. So right now, at Coqui, we are offering the input and the output for voicebots.
Speaker 2: Oh, I get that. It's like you're connecting various endpoints: from the speech of the person to text input, like automatic speech recognition, and then there's a chatbot, and outside of the chatbot there is this text-to-speech.
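A minimal sketch of the voicebot loop just described: speech in, intent out, speech back. Every helper here is a placeholder stub, not a Coqui API; in a real system each one would call an actual STT engine, an NLU model, and a TTS engine.

```python
def transcribe(audio_bytes: bytes) -> str:
    """Placeholder: call your speech-to-text engine here (e.g. Coqui STT)."""
    return "i want to open a new account"

def classify_intent(text: str) -> tuple[str, dict]:
    """Placeholder: a real voicebot would use a trained NLU / intent model."""
    if "open" in text and "account" in text:
        return "open_account", {}
    return "fallback", {}

def render_reply(intent: str, slots: dict) -> str:
    """Placeholder dialogue logic / dialogue manager."""
    replies = {
        "open_account": "Sure, let's open a new account. What is your full name?",
        "fallback": "Sorry, could you rephrase that?",
    }
    return replies[intent]

def synthesize(text: str) -> bytes:
    """Placeholder: call your text-to-speech engine here (e.g. Coqui TTS)."""
    return text.encode("utf-8")  # stand-in for synthesized audio

def handle_turn(audio_bytes: bytes) -> bytes:
    text = transcribe(audio_bytes)          # speech -> text
    intent, slots = classify_intent(text)   # text -> intent + slots
    reply = render_reply(intent, slots)     # business logic
    return synthesize(reply)                # text -> speech

print(handle_turn(b"fake-audio"))
```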
Speaker 1: Yeah, exactly. Anytime you want to convert text into speech, or speech into text, that's where we come in. Currently, we're expanding more, right? But for these chatbot applications, it makes a lot of sense when you have both offerings, text-to-speech and speech-to-text, in the same product. But there are lots of applications where you only want one or the other, right? For example, there are some train stations that use our text-to-speech for making announcements. And this is a really cool application, because if you were to just have a human do this, you'd have humans sitting in the office at the train station reading all the time: this train is arriving, this train is going. And you couldn't necessarily just record, you know, 100 or 1,000 sentences ahead of time, right? Because if you say, this train is delayed by this many minutes, and it's going to now arrive at this time, anytime you have changing variables, text-to-speech can make a lot of sense. So we've seen people deploy text-to-speech just by itself in train stations and public transport, which is pretty cool.
Speaker 2: It's like they're generating this markdown and letting Coqui read it out loud, sort of.
Speaker 1: Yeah, exactly, exactly. Right now there's some parsing of the markdown, but yeah, basically that's what it is. And on the other side, there are applications that just want speech-to-text, where there's no conversational aspect, where it's just, let's say, voice search. Okay, so here's an example. I've been doing some collaboration with folks in Uganda who are interested in making a product to help the Ministry of Health understand when people are talking on public radio about COVID-related things, like masks and vaccines and so on. And they're just using straight speech-to-text: radio is coming in, text is coming out, and it's constant. It's real-time, it's going all the time. And that's another application where you've got a lot of audio, where you want to understand what's going on inside of it, or you want to find bits and pieces, and you couldn't just pay people to listen because it would be infinite, right? Just never ending. So, yeah, there are a lot of applications outside of the chatbot scenario which are really interesting for either speech-to-text or text-to-speech independently, which is pretty cool, I think.
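A rough sketch of the announcement use case: fill a template with the changing variables and hand the string to a TTS engine. It assumes the Coqui TTS Python package's high-level `TTS.api.TTS` wrapper and a public English model name; treat both as assumptions and check the project's docs for current names.

```python
# pip install TTS   (assumption: the Coqui TTS package exposes a TTS.api.TTS wrapper)
from TTS.api import TTS

# Model name is an assumption; `tts --list_models` shows what is actually available.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

def announce(train: str, minutes_late: int, new_time: str) -> str:
    # The variables change constantly, which is why pre-recorded clips don't scale.
    text = (f"Attention please. Train {train} is delayed by {minutes_late} minutes "
            f"and is now expected to arrive at {new_time}.")
    out_path = f"announcement_{train.replace(' ', '_')}.wav"
    tts.tts_to_file(text=text, file_path=out_path)
    return out_path

print(announce("IC 204", 12, "14:35"))
```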
Speaker 2: How do you, I mean, that can be a know-how, but how do you handle... what are the challenges in ASR, automatic speech recognition, or, you know, text-to-speech? There are so many languages, and, you know, the accents and everything are quite hard to handle, I guess.
Speaker 1: Yeah, so ASR, automatic speech recognition, speech-to-text, STT, it's all the same thing in my mind, at least. So, there are a lot of challenges when you think about scaling a core technology to all of these languages and accents. A ballpark estimate of how many languages there are in the world is 7,000, and I think it's probably at least 7,000. In terms of accents, it's even more fluid, because accents and languages are really political concepts. And it changes with time, too, right? Like, if you watch movies that were filmed in the same city you grew up in, but they were filmed, I don't know, 50 or 100 years ago, you're like, well, they talk funny. So language changes with time, too. So it's not the case that you can just have a really good English speech recognition model and it just works all the time for everybody. That would be great. That's kind of the holy grail, and there are people doing research on this. But practically speaking, what really makes sense in most production settings is you acknowledge that, hey, different people talk differently, so I'm going to make different models for these people. So when you're thinking about this, these questions start to come to mind: okay, if I've got lots of models, how do I keep tabs on them? How do I know that they're performing well? How do I have accurate test sets for each accent, for each demographic group, to understand how the model's performing? Because otherwise, what you're getting is, unfortunately, just a general picture. Let's say you ignore demographics, which is a bad idea, and you just have one really big test set and hope that it's going to be accurate. Well, then what happens is you have bias towards whatever is in your test set. And historically, what that means is bias towards groups that have usually had more money, so, like, white men speaking American English. This has been a big problem. Everybody doing speech recognition reports metrics on at least one shared dataset, it's called LibriSpeech, and it's just a bunch of people reading audiobooks, basically just Americans. I'm diverging here, but there are really important things to keep in mind about acknowledging who your technology is for, and then taking steps to understand: how does my model perform for that group of people? And once you start doing that, you realize, okay, one model doesn't usually fit all. It's not a one-size-fits-all scenario. I need lots of models for all these different people. And it's not that you need tons and tons more data. You just need to fine-tune to every one of your groups.
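A small sketch of that per-group evaluation idea: compute word error rate separately for each demographic or accent group instead of one aggregate number. It uses the `jiwer` package for WER; the evaluation records themselves are made up.

```python
# pip install jiwer
from collections import defaultdict
from jiwer import wer

# Made-up evaluation records: (demographic group, reference transcript, hypothesis).
results = [
    ("us_english",     "turn the lights off", "turn the lights off"),
    ("us_english",     "what is the weather", "what is the weather"),
    ("indian_english", "turn the lights off", "turn the light of"),
    ("indian_english", "what is the weather", "what is whether"),
]

refs, hyps = defaultdict(list), defaultdict(list)
for group, ref, hyp in results:
    refs[group].append(ref)
    hyps[group].append(hyp)

# One aggregate number hides the gap between groups; per-group WER exposes it.
print("overall WER:", wer([r for _, r, _ in results], [h for _, _, h in results]))
for group in refs:
    print(f"{group} WER:", wer(refs[group], hyps[group]))
```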
Speaker 2: I wonder how it works for speech, by the way. Do you have, like, pre-trained models as well?
Speaker 1: Yeah, so we've got a model zoo. It's at coqui.ai/models. They're open? Yes, they're open, and we've got a bunch of different languages, a lot of models from our community. We've got researchers, and developers, and enterprise folks, and DIY folks in our community. It's pretty cool. But, yeah, what I always do is, if I've got a new application that I need to make a model for, I'm not just going to start from scratch. I'm going to fine-tune, and I think speech, in this regard, is basically similar to most of deep learning and machine learning, in that transfer learning is really starting to be king, and it saves you GPU time.
Speaker 2: It saves you a lot; you already have this gigantic model anyway, so why bother yourself? But do you, for instance, for one language, fine-tune on different accents, or do you have one gigantic model that you fine-tune? Because, I mean, the way people say vowels, you know, I don't know what they call them, syllables? Is it syllables? They vary a lot from language to language as well. That's why I wanted to ask.
Speaker 1: Yeah, so, okay, I did a dissertation, basically, on this, trying to figure out how you can learn linguistic facts, linguistic information, from different languages, so that that information is helpful for speech recognition. I spent a lot of time on this. My conclusion is that it's better to have one model per language. It's possible you can combine languages into the same model, but... Same-sounding languages, maybe? I don't know. Yes, people have found, researchers have found, that if you combine languages that have similar sounds, similar phonemes, you can get some improvement on both languages, or all languages. However, in my experience, that improvement is smaller than I would hope it to be. I mean, it's a really cool idea. Let's just throw all the languages at this model and see what happens, and maybe in the future we'll get something closer to that. But right now, practically, what I've seen is one model per language, and then you can fine-tune to different accents. So you have a single model for American English, a single model for British English, for Indian English, Australian English, and that goes for all languages. And it's definitely not just geography and political divisions. For accents within the United States, even, you can have a New York model, you can have a Boston model. It depends on how much data you have from those environments, but usually, the more you know about your user and the more data you can use to help create a model that works better for that user, there's no reason not to just fine-tune. Yeah.
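A generic transfer-learning sketch of that idea: start from the base-language checkpoint and fine-tune on accent-specific data at a small learning rate. This is plain PyTorch for illustration, not Coqui's training pipeline; the frozen layer name and the hyperparameters are assumptions.

```python
# Generic transfer-learning sketch, not Coqui's actual training pipeline.
import torch

def finetune_for_accent(model: torch.nn.Module,
                        base_checkpoint: str,   # path to the base-language state_dict
                        accent_loader,          # DataLoader over accent-specific (features, targets)
                        loss_fn,
                        epochs: int = 5,
                        lr: float = 1e-4):
    # Start from the base-language model instead of random weights.
    model.load_state_dict(torch.load(base_checkpoint, map_location="cpu"))

    # Freeze early feature-extraction layers; adapt only the upper layers.
    for name, param in model.named_parameters():
        if name.startswith("encoder.lower"):   # hypothetical layer name
            param.requires_grad = False

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)

    model.train()
    for _ in range(epochs):
        for features, targets in accent_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), targets)
            loss.backward()
            optimizer.step()
    return model
```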
Speaker 2: Yeah, exactly, I guess. So, coming to what you just said about data scarcity: how do you handle low-resource languages? I remember we joined the Hugging Face automatic speech recognition dataset sprint for Mozilla's Common Voice, and we were trying to fine-tune models, but the problem with Turkish was that it was really a low-resource language, and I remember scraping YouTube videos and, you know, matching them with their subtitles, and it's really not scalable, in my opinion. It's hard. How do you handle this?
Speaker 1: That's a great question. Yeah, I've spent a lot of time looking for as much open data as I can find, and we've actually got a list of open datasets for speech on Coqui's GitHub. And, yeah, the approach of scraping is hard. Scraping is very hard for a lot of reasons. The first one that comes to my mind is licensing, because with YouTube, for example, there are videos that are licensed under a Creative Commons attribution license, and then there are videos that are licensed under...
Speaker 2: I mean, there is a lot, but we see that companies make paid APIs with, you know, licensed code nowadays.
Speaker 1: Yeah, I know, I know. I mean, what you're talking about right now is super interesting. And NLP, I think NLP has always been kind of pushing the limits on this. Yeah. And I don't know if it's a good or a bad thing. I mean, if you're recreating somebody's code that's under the GPL, there are issues there, obviously. But, anyway, another thing with speech data is that people are much more attached to their voice than they are to their text, right? Generally speaking. I mean, if I wrote a book, I might be attached to my book, but if I just sent a text or a tweet, versus I recorded a message and sent that out into the world, I'm more attached to my voice. And I think most people are, because there's personal information there. You can understand how somebody's feeling, you can understand a lot of things, demographics about the person, that you don't get from text, right? So speech is a hard one. And I know I'm not really answering your question, because it's honestly very case-by-case. I firmly believe that the best data for an application is data that came from that application. Okay, let's talk about NLP a little bit here. Let's say I want to make a language model just for autocorrect, something really basic. I want a model that's small enough to fit on my phone, and so on. And let's say I want to make it for Finnish, and I don't have a lot of Finnish data from people sending texts. So I just go to Wikipedia, I scrape Wikipedia, and I get a lot of text. I make an autocorrect model for Finnish from Wikipedia, and then I deploy it on people's phones. You start getting issues from day one, because people don't text the way they write Wikipedia. So the best data you can get is data from people typing on their phones. So really, the best case scenario is you can get that data, and you can get it in an ethical way, and then make models for that application that the users actually want to use. The difference between an autocorrect model trained on Wikipedia and an autocorrect model trained on me typing on my phone is that I'm not going to use the one trained on Wikipedia. I'm just going to turn it off, right?
Speaker 2: Exactly.
Speaker 1: So this is the same problem that you get with NLP. It's the same problem in computer vision. And I think, at the end of the day, if you have some data from your application, that's the best data. That's the data you want. Not just because of how people talk, and not just because of the kind of people talking. It goes even deeper than that. Let's say I've got an application, some mobile game on my phone, and my users usually have this kind of interaction where they're touching, they're playing the game, and they're also talking. And a lot of the users do this when they talk, right?
Speaker 2: Yeah.
Speaker 1: Touching, playing, talking. Just the acoustics of holding a phone this close are going to make a big difference compared to scraping data from YouTube, because people on YouTube are recording with lapel mics, they're recording with boom mics. And the kind of microphone, the kind of signal processing that's maybe happening before my audio goes anywhere, just with noise reduction, that all has an effect on the... Your job is done. It's the same everywhere. It's just that I think about speech a lot more than other people do.
Speaker 2: Text is more normalized somehow, and you have normalizers and such. You need to do noise reduction in speech recognition, and that's probably a hard thing to do, I guess.
Speaker 1: Yeah, it is. Speech is a very noisy signal, let's put it that way. There are lots of reasons it's noisy, but that's what makes it fun, right? I got into speech because I really like languages. I actually came from a linguistics background and found computational linguistics, and then I found speech technology and really fell in love with it and decided to write a dissertation on it, for better or for worse. And there's something that's really just cool about listening to speech, and I get to work with different languages all the time, which is fun. Yeah, text is... I like to read, but I'd probably rather listen to an audiobook. Let's put it that way.
Speaker 2: I mean, also, aside from finding more data in NLP, as I said, it's more normalized, and the problems in NLP are mostly document classification and stuff, where I feel like it's less of a real-life application with a really wide domain. It's like comparing a simple computer vision problem, like a security equipment controller, to autonomous driving. In autonomous driving, it's a very open world, and it's a really hard-to-solve problem, and it feels like that, in a sense. In NLP, you have document classification, you're training with data from the internet and so on, and then you have speech, where people speak differently. I know for a fact that there are open conversational speech datasets. I saw one in the survey paper from Henderson, I guess, but there's not much from real life, and the same goes for speech. You can scrape news, I guess: the host speaking about something, and then you have a person trying to speak outside. It's a challenging problem to solve, but it's cool. You just find challenging things fun.
Speaker 1: I think also, if you've ever learned a second language and had to use it in different scenarios, you understand much better than anybody else how speech is difficult. I mean, I've studied different languages. I know you've at least studied English, and you're much better.
Speaker 2: I studied German for four years, and English since kindergarten, and I studied French for one year, and Russian for one year, and those are like... Okay, so you like languages too. Yeah, I like languages. That's why I'm working in NLP. I'm also doing my master's thesis on morphological language models, sort of. Oh, very cool. Turkish is a very morphologically rich language. You have tons of suffixes and prefixes, and it's agglutinative, and the way you morphologically parse a word changes the context and semantics a lot, so I'm trying to observe whether it actually works that way. So, that's my master's thesis.
Speaker 1: That is super cool. After the show, I'm going to have to send you a link. I've got a very good friend who works on this for Turkic languages. Oh, awesome. Thank you. Shoot, I love thinking about Turkic languages; they're very cool to work on, I've got to say. I spent a lot of time in my dissertation working on Kyrgyz, which is a Turkic language. But, okay, so if you've learned a second language, which you have, you've learned a few second languages, right? You understand that there are a lot of factors in speech comprehension which correlate to speech-to-text, factors that you only really appreciate if you're learning or speaking or using the language as a second language. So, talking on the phone is hard. It's harder than talking to somebody face-to-face, because there are issues with the signal. Usually, phones have a lower sampling rate, so it's just not as high fidelity. There are also visual signals you get when you're understanding somebody, right? Not just lip-reading, which happens at some level, but context, just the context of what's going on.
Speaker 2: So, I want to note something. During COVID, I realized that people have been falling back on this. I observed people talking with masks, and with masks a lot of information is lost when you're having a conversation. So people take two types of fallback: they either pretend as if they have understood something, or they ask, can you rephrase this question? So, when there are no visemes, meaning the lip movements matching a sound, you are only left with phonemes, which is really not sufficient when it comes to conversations. So, I just wanted to add that.
Speaker 1: Yeah, and when we're doing speech-to-text, we're just using audio, right? Usually. I mean, there are some researchers who put in extra information, but at Coqui, we're doing speech-to-text from audio alone, which is what basically everybody's doing, and you're stripping out all of this context information from the world that helps humans understand what's going on, and it's hard. So, if you speak a second language and you go to meet somebody at a noisy cafe, it's just more difficult, right? If you speak a second language and you're talking on the phone, it's more difficult. And it's the same with speech-to-text. Any time you can think, okay, this is a situation where my understanding is going to take a hit, that's also more difficult for speech-to-text, which is something I find super interesting. Putting yourself in that mindset, speech-to-text is still limited in the same way that humans are limited, right? There's a signal, and we're trying to map the signal onto something, and it's a very noisy signal, and it's a many-to-one mapping, where many is basically infinite. So, it's a hard problem, which is why it's fun to work on. And it's also just amazing when you see it work well. I think about connected devices, for instance. We were talking about big models a little while back, having a big model that you can fine-tune, and actually, at Coqui, we work with mostly smaller models in the open. We've got models that are, let's say, 180 megabytes or less, 180 down to 40 megabytes, depending on whether we've exported with TensorFlow Lite to optimize and so on. And even with a model that's just 40 megabytes, sitting on a Raspberry Pi in the corner of a room that's got echo, understanding what I'm saying from the other side of the room, that's just very cool, because this is a super noisy problem and the model isn't huge, right? There are lots of really interesting things you can do with big models. There are also problems with big models that we know about from Emily Bender and Timnit Gebru's paper. Yes, the Bender rule. And, yeah, there are technically cool things you can do with big models, and there are lots of caveats with big models. But what a lot of people don't spend as much time on is working with small models that you can put on small devices. And this is something we've been working on at Coqui for a while. So, all of the acoustic models in our model zoo are 180 megabytes maximum. The language models can be bigger, because those are more specific to your domain and easy to swap out. But, yeah, we've got stuff that runs on Raspberry Pis, and even in a noisy environment it's still performant, which is fun to see. That's another thing I like about speech: you can actually hook it up to a microphone, and then you can use it, right? Which is fun.
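A minimal sketch of the on-device inference described for the Raspberry Pi, using the lightweight `tflite_runtime` interpreter on a TensorFlow Lite export. The model path and feature shapes are placeholders, and a real deployment (for example the Coqui STT runtime) streams microphone audio and carries decoder state between chunks.

```python
# pip install tflite-runtime   (prebuilt wheels exist for Raspberry Pi)
import numpy as np
from tflite_runtime.interpreter import Interpreter

# Placeholder path: a TensorFlow Lite export of a small acoustic model (~40 MB).
interpreter = Interpreter(model_path="stt_model.tflite")
interpreter.allocate_tensors()

# Feed zeros with whatever shapes the model expects; a real deployment streams
# audio, computes acoustic features, and carries RNN state between chunks.
for detail in interpreter.get_input_details():
    interpreter.set_tensor(detail["index"],
                           np.zeros(detail["shape"], dtype=detail["dtype"]))

interpreter.invoke()

output_details = interpreter.get_output_details()
logits = interpreter.get_tensor(output_details[0]["index"])
print("output shape (e.g. per-frame character logits):", logits.shape)
```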
Speaker 2: Yeah. And also, people think, what if you just make bigger and bigger models? I feel like that's a very brute-force kind of approach, because the real trick is to make smaller, more efficient, better models.
Speaker 1: Yeah, and smaller models are better for a lot of reasons. I mean, they take less time to train. Also, the inference speed. That's exactly what I was going to say. The inference speed. You can really feel the difference with speech technology if you've got the model sitting on your phone, on your Raspberry Pi, or on your whatever, your television, versus when you've got to make an API call up to somebody's cloud, right? Yeah. It's impressive, and I think that kind of instantaneous feedback is also very cool.
Speaker 2: It's also more scalable, for so many reasons it's more scalable. We have a question, by the way, from Tanmoy: how are you handling the sentiment of the speech, anger, interrogation, or other emotions?
Speaker 1: Yes, this is a hard one. It's a hard one, and I have a simple answer right now. With the models that I'm training, we just ignore it. So, right now, like I said, we're training the two foundational parts of the pipeline, text-to-speech and speech-to-text, and with speech-to-text, usually you don't want to know so much about the sentiment. You just want the transcript, right? The same sentence could be said by a thousand different people, half of them angry, half of them sad. We don't care. We just want the transcript, right? So, that's for the speech-to-text. For text-to-speech, on the other hand, it is actually super important to get this right, and it's hard. It's very hard. So we've been working on models that in some ways can infer the prosody, the intonation of how the sentence should come out, which conveys emotions and sentiments like anger and frustration and questions and all of this stuff. And the model is able to, it's kind of magical, really, the model is able to learn prosody just from plain text. So it understands in some ways the sentiment of the sentence, and it makes a guess at how it should read it. Besides that, something we're also very interested in is how you can make speech synthesis, text-to-speech, more controllable, because it might be the case that there's one sentence, you know, "Josh walks the dog," and I want to synthesize it as a question, "Josh walks the dog?", or you want to synthesize it as an exclamation, "Josh walks the dog!", whatever. And there's currently no way, without some kind of extra knobs, to do that. The model is going to basically do a one-to-one mapping, right? So this more controllable synthesis is something we're also interested in. And I mean, I could talk about this for a while, but, yes, emotion is important for speech synthesis. For speech-to-text, usually, we just ignore it.
Speaker 2: Okay. That makes a lot of sense. But it's also a hard problem to solve. We also have another question: what models are used for speech recognition and text-to-speech on edge devices?
Speaker 1: Yep. So, for speech-to-text, we've been using a modified version of the DeepSpeech model, which was published maybe five years ago and has some recurrent layers in it. It's a relatively small model. It's like six layers, and it's got a single LSTM, unidirectional, because with bidirectional you can't do streaming inference, and we really want to make streaming inference possible. So, that's what we've been using for speech-to-text. We're very much interested in expanding to new architectures. And for text-to-speech, on the other hand, we've got a bunch of architectures. I don't know all of them off the top of my head. They've got funny names like Speedy Speech and FastSpeech. And Eren, our text-to-speech expert, head of text-to-speech, knows a lot about that. But if you're interested in looking, you can definitely get information on this from our release notes. So, if you go to github.com/coqui-ai, you can see the different repos. Go to the one for speech synthesis, and you can see all of the architectures that are released. And they all work on-device, which is pretty cool.
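A rough PyTorch sketch of the architecture family described here: a few dense layers, a single unidirectional LSTM so the model can run in streaming fashion, and per-frame character logits that would be trained with a CTC loss. Layer sizes are illustrative, not Coqui STT's exact configuration.

```python
import torch
import torch.nn as nn

class DeepSpeechStyleSTT(nn.Module):
    """Dense layers -> one unidirectional LSTM -> per-frame character logits (CTC).
    Sizes are illustrative only."""

    def __init__(self, n_features: int = 26, n_hidden: int = 2048, n_chars: int = 29):
        super().__init__()
        self.dense = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(), nn.Dropout(0.05),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(), nn.Dropout(0.05),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(), nn.Dropout(0.05),
        )
        # Unidirectional on purpose: a bidirectional LSTM would need the whole
        # utterance before producing output, which rules out streaming inference.
        self.lstm = nn.LSTM(n_hidden, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_chars)  # characters plus the CTC blank

    def forward(self, feats, state=None):
        # feats: (batch, time, n_features); state carries (h, c) between chunks
        # so audio can be fed in small pieces as it arrives from the microphone.
        x = self.dense(feats)
        x, state = self.lstm(x, state)
        return self.out(x), state

# One streaming step on a fake 50-frame chunk of 26-dimensional features.
model = DeepSpeechStyleSTT()
logits, state = model(torch.zeros(1, 50, 26))
print(logits.shape)  # torch.Size([1, 50, 29])
```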
Speaker 2: Those attention-based models, have you ever tried such models?
Speaker 1: Yeah, we've got attention-based models for speech-to-text. Sorry, we've got attention-based models for text-to-speech. I say TTS and STT so often, I often mix them up. Yeah, we've got attention. I think our best models for speech synthesis are attention-based models.
Speaker 2: For speech recognition, because the sequences are, you know, people speak for short durations, maybe attention-based models are a bit overkill. Is that the reason you're not using them?
Speaker 1: No, I mean, attention-based models for speech recognition have recently been shown to be very good, to be honest. The reason we haven't gone to them yet is that they're on the to-do list, let's put it that way. We've got a toolkit that is not made just for researchers. We've got a toolkit that is supposed to get you from data, to training a model, to deployment, as easily as possible. So we've spent a lot of time building all of these things around the training pipeline. We've got a native client that's really efficient, and we've got bindings to a bunch of different languages. So we've spent a lot of time working on the what-do-you-do-when-you-have-a-model part, compared to making something that's really easy for researchers to mix and match different layers with, right? But we are expanding to attention architectures. We want to make it easier so that if people think of an architecture, they can code it up. One issue I've seen, though, is that the models can be really big, which doesn't make a lot of sense for on-device. You know, if you've got a model that's a gigabyte, it's not something that people are going to put in their app, right? But besides that, I mean, they're interesting. Besides the size and making it streamable, which I think is a technical challenge that's not impossible, and there are some people I've seen working on this, yeah, they're interesting architectures for sure.
Speaker 2: Is there any other challenge you think is more important than the ones we have previously spoken about?
Speaker 1: So we talked about scaling to different languages. We talked about scaling to different accents and different demographics. We talked about data. I mean, there are a lot of things we talked about before you hit deployment, when you get into the MLOps world, right? And that's where things also get really interesting. If you really want to have, let's say, a different model for every user, let's say you've got a user-facing application, for instance a language learning app. In that case, it's really important to get accents right, because if somebody's learning a second language, they're going to have an accent that's shaped by their first language, right? So ideally, in the best scenario, you've got a different model for each user that's fine-tuned to that user. And we've seen people do this with our tech. What gets really interesting is: what do you do when you've got, you know, 100,000 or more users? You've got all these models that you have to juggle, right? So, yeah, I think that's a really interesting problem. Here's a quick little shout-out: we're hiring for MLOps roles at Coqui. So if this is something that's interesting to you, we're very interested in it, and we've got three open positions for MLOps. But, yeah, there are a lot of really interesting things that happen after you've trained a model, right? Training the model, maybe it's because I spent so much time doing it, but it's the part that people talk about a lot, like, oh, I've got this new architecture.
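The "one fine-tuned model per user, 100,000 users" problem is largely a serving and caching question. A toy sketch: keep only the most recently used models in memory and load the rest on demand. The paths and the loader are placeholders.

```python
from functools import lru_cache

MODEL_DIR = "/models"  # placeholder: one fine-tuned checkpoint per user

def load_model_from_disk(path: str):
    """Placeholder: load an STT/TTS checkpoint from disk or object storage."""
    return f"<model loaded from {path}>"

@lru_cache(maxsize=64)  # keep only the 64 most recently used models in memory
def model_for_user(user_id: str):
    return load_model_from_disk(f"{MODEL_DIR}/{user_id}.ckpt")

# Requests for the same user reuse the cached model; cold users trigger a load.
print(model_for_user("user_0042"))
print(model_for_user("user_0042"))
print(model_for_user.cache_info())  # hits=1, misses=1, maxsize=64, ...
```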
Speaker 2: Yes, exactly. I feel like, if you are making a POC or the initial version, 80% of the work is the data itself and 20% is training the model. But after you're in production, it's mostly the production side, the data drift, the model monitoring, and it's really hard. Yeah, I know. How do you handle the data drift, by the way? Do you record things?
Speaker 1: I'm sorry. Yeah, data drift is important. I think the only way to stay up to date with data drift is to keep having a pipeline of data coming in and to keep updating the model. This is another point where it's not like you train a model and go, okay, I'm done, I'm going to deploy this, I'm going to send it to my users, and then six months later it's still going to be a great model. No. Once you send a model out into the world, you need to keep updating it. So this is also a hard problem, and it's also related to what we're hiring MLOps people for. But yeah, I think the middle part, training a model, is honestly what people publish research papers about. It's the research-sexy part and all of that. But where the real hard work is, and where you get the real benefit, is spending time on your data and spending time on your deployment, and by deployment I mean monitoring as well.
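A toy sketch of the monitoring half of that: score a small, freshly labeled sample of production audio every week and flag when the word error rate drifts past a threshold. It reuses `jiwer` for WER; the numbers are made up.

```python
# pip install jiwer
from jiwer import wer

BASELINE_WER = 0.12      # WER measured right after deployment (made-up number)
ALERT_MARGIN = 0.03      # how much degradation we tolerate before retraining

def check_drift(weekly_refs: list[str], weekly_hyps: list[str]) -> bool:
    current = wer(weekly_refs, weekly_hyps)
    print(f"this week's WER: {current:.3f} (baseline {BASELINE_WER:.3f})")
    return current > BASELINE_WER + ALERT_MARGIN

if check_drift(["play some jazz"], ["play some jets"]):
    print("WER drifted: pull fresh data and schedule a fine-tuning run")
```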
Speaker 2: There is not much difference from model to model if you have quality data anyway. It's, I don't know, at most like 10% or something if you have changed embeddings or something else; that's what I have observed. And, you know, tracking the fallback cases. I'm also a chatbot maker myself. But tracking the fallbacks, tracking whether people give you feedback or something, that's more important, I feel, because you should see whether your model is matching the outside world versus, you know, your training data, basically.
Speaker 1: Yeah. And I agree with a lot of exactly what you're saying: if the data is good data, two different architectures are going to perform pretty similarly on it, right? So if I've got 10 hours to make a better model, I'm going to spend those 10 hours cleaning the data. And then what happens is, a lot of the time the question of which architecture to use isn't driven by how to get a tiny increase on my accuracy metric, right? It's more driven by where I'm deploying this model. Do I want something that's convolutional? Do I want something that's recurrent? Does it make sense for my hardware, right? That's something you miss a lot when you only look at the research papers, and there's lots of cool stuff coming out of research, obviously. But I think it's unfortunate that there's so much fixation on accuracy when there are so many more interesting metrics that you can track, and metrics that are important, right? Like, a lot of people don't think about how much it's going to cost to train a model. If I've got one model that costs me $20,000 to train on AWS, and another model that costs me $100,000 to train on AWS, and the accuracy difference is like half a percent, I'm just going to train the cheaper one, right? It's important, and it's unfortunate that it's not tracked as much in the literature, because you see people, and I've done this too, just start training a model and then realize, oh, wait a second, the loss is going down a lot slower than I was hoping. How much is this going to cost? It's a real concern, one that you feel, and it's painful when you get out of the research world. But it's cool. It broadens, it opens up the things that you can optimize, right?
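The cost comparison is back-of-the-envelope arithmetic that rarely makes it into papers: GPUs times hours times hourly price. All numbers below are made up purely for illustration.

```python
def training_cost(num_gpus: int, hours: float, usd_per_gpu_hour: float) -> float:
    # Rough cloud training cost: GPUs x wall-clock hours x hourly price.
    return num_gpus * hours * usd_per_gpu_hour

cheap  = training_cost(num_gpus=8,  hours=200, usd_per_gpu_hour=3.0)   # $4,800
pricey = training_cost(num_gpus=64, hours=400, usd_per_gpu_hour=3.0)   # $76,800

print(f"cheap run:  ${cheap:,.0f}")
print(f"pricey run: ${pricey:,.0f}")
print(f"is half a percent of accuracy worth ${pricey - cheap:,.0f}?")
```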
Speaker 2: Yeah, exactly. And I feel like that's where the real work lies, because you are trying to make things smaller and more efficient. It's also damaging the environment, and I don't like it, to be honest. People train trillion- or billion-parameter models because they have money. I just don't like it. It has tons of carbon emissions due to the cloud. And I completely agree, by the way, if it's not going to matter so much. With the chatbot I was working on, I wasn't using anything fancy, just simple embeddings of the training data itself. Then I introduced some embeddings, and it increased the accuracy, or the F1 score, by 20% or something, and that was a dramatic difference. But it changes how your API is launched: you have to load those embeddings, you have to keep them on the cloud, and it takes way longer to train your model.
Speaker 1: Yeah. And if you're talking about word2vec, then those embeddings are big, right?
Speaker 2: Yeah, I was using something even bigger. I mean, the architecture matters a lot when you have something really simple, but if you already have something advanced, then switching architectures and training again is overkill. It's like cutting bread with a lightsaber or something. Cutting bread with a lightsaber, I like that. Yeah. I really agree with you on most of these points, actually. And I feel like the valuable work lies in low-resource languages. That's the hard work, and if you can accomplish something there with smaller models, less inference time, and less training time, then it's a breakthrough, in my opinion. Yeah. So, yeah, I don't have any more questions, and I don't see any more questions. If you would like to talk about something else, we can go on.
Speaker 1: Oh, I think we covered a lot of ground. This was fun.
Speaker 2: It's been almost half an hour. I like to talk in a, you know, podcast-ish way, but I don't like asking too many questions and stuff.
Speaker 1: Yeah, yeah, for sure. I gotcha. I mean, I think we covered a ton of ground. This has been really interesting and fun. I appreciate the invitation. I'm honored to be the first one for this season, right? This is a new season.
Speaker 2: Yeah, this is a new season. And also, I would like to have this again in one or two years or so, since, you know, the state of the art will change and I really wonder where it's going to go. Maybe we can have it again. It's a really interesting area to work in.
Speaker 1: Yeah, for sure. It changes very fast.
Speaker 2: So, thank you so much for coming today. Yeah, it's been nice being here. For the people who couldn't watch it, it's going to be available on my YouTube channel, so you can watch it again. So, yeah.
Speaker 1: Awesome.
Speaker 2: See ya.
Speaker 1: Yeah.
Speaker 2: Bye-bye.