Exploring Mozilla's Open Source Community Initiatives
Join Mozilla's community call to learn about open source projects like Common Voice. Engage with community managers and discover new ways to contribute.
Community call: Offline voice message transcription in Signal Desktop - July 7, 2022

Speaker 1: Hello, hello everyone. Welcome to our community call. This is the first time that we have a community call in this format, where we are trying to welcome everyone in the Mozilla community, or everyone who is interested in the Mozilla community. So this is an experiment, and we are happy to have your feedback and your suggestions after this call.

Before we start, I just want to bring you through some points of order, and I will start sharing my screen. Just a moment. Can you see my slides? Can you see the slideshow? Yeah. Great. So, as I said, welcome to our community call. Before we start, just a few pieces of information. If you're watching this on YouTube, please know you can also watch it on AirMo. Both of those links are on the event page, which you can find in the community portal, community.mozilla.org. If you have a question, the event page also has a link to a tool that will allow you to ask any question you want, and we will ask them live here. You can also ask questions, both now and afterwards if you're watching this as a recording, in our Matrix room; that's on the Mozilla instance of Matrix, and it's the #community room. The link is there as well. And just a quick reminder that all Mozilla spaces, both physical and virtual, are governed by our code of conduct, which you can find on the mozilla.org website.

That said, let me introduce you to the people who are going to talk on this call. There is me: I am Francesca, a community manager for Mozilla, on the community programs team that works to foster community engagement. Then there is Hillary. Hillary is also a community manager here at Mozilla, and she works on the awesome Common Voice project. Then we have Alessio, who is the star of this call. Alessio is on the Mozilla data team, and he likes working on engineering, science, and problems in general. And then we have Josh Meyer. Hi, Josh. Sorry, I didn't have a picture of you. Josh works at Coqui, and he's here to answer questions about machine learning, any question about machine learning, clearly.

Before I pass the word to the other people on this call, let me just give you a quick welcome to the Mozilla community, especially if you are new to this space. In the Mozilla community, everyone who loves the internet is welcome, and there is something to do for everyone. We have many, many projects you can participate in, and I'm sure you can find something that you would like. These are just some projects, not all of them; I didn't forget the others. For example, you can help us test features and products during our Foxfooding campaigns. You can help localize various Mozilla products; our international community is the reason why Firefox is localized in so many languages. You can also give feedback and ideas on Mozilla Connect. There are so many projects, and I know it can get a bit overwhelming, but I would advise you to just start exploring. Also, if you've been here for a while, there is always something new to discover. As a quick place to start, you can go to our newsletter at community.mozilla.org/newsletter. There, we will give you information, for example, about when this community call is, about campaigns, and about things you can do to contribute. You can also check out our contribute page, mozilla.org/contribute, where you can see various projects you can participate in.
If you want to talk with other community members, new community members, and old-timers who can show you around, you can go to our community room on the Mozilla instance of Matrix. Now that this is done, I am going to pass the word to Hillary, who is going to talk about Common Voice.

Speaker 2: Thank you so much for the introduction, Francesca, and also for planning this with us. As Francesca mentioned, my name is Hillary. I'm on the Common Voice team at Mozilla. Common Voice is an initiative to crowdsource voice data into a dataset which is free and available to download on the Common Voice website. In fact, this week we launched our 10th release. The dataset aims to support technologists in creating voice-enabled tools, so that people can build tools in a variety of languages spoken across the world, from Kiswahili to Kinyarwanda and more. There are multiple ways in which you can contribute, build, and collaborate. One of the ones I want to highlight today is our Our Voices diversity model and methods competition, which includes a 20k prize pot. The competition features four categories. The first focuses on gender: for example, speech-to-text models for an under-resourced language that perform equally well for female speakers. Then there is accent, such as accent classifiers; methodologies, such as benchmark corpora and dataset audit methodologies; and an open call. Josh is also one of the judges on the competition, so Josh, if you would like to say a few words about what you do.

Speaker 3: There we go. There's the mute button. Can you hear me okay? Okay, great. Yeah, my name is Josh. I previously worked at Mozilla with the Common Voice folks, the DeepSpeech folks, the language technology folks. I really like working with language technologies, especially for under-resourced, marginalized language communities, and democratizing those technologies. Currently, I'm working at Coqui, which is a startup whose core technologies, built around voice, are open source, specifically text-to-speech and speech-to-text. I'm happy to add to the conversation in any way I can.

Speaker 1: Great. Thank you very much, Josh. I think now we should finally pass the word to Alessio, so he can start his presentation.

Speaker 4: All right. Hi. Can you see my slides? They should be showing now. Yes. Okay, awesome. You know, as a human being, I like spending time with other human beings, most of the time, and especially over dinner. So, some weeks ago, I organized a dinner and asked my friends, what do you want for dinner? The response to that was a barrage of voice messages, and that really surprised me. I'm not joking, like 20-plus messages just with, I want this, I want that, would you kindly buy this? It was an awful experience, because I had to listen to them all and compile a list, which I forgot to do, by the way, on that day. So, a week later, on the day of the dinner, I had to do it again, go through all the messages and compile a list. And I honestly would not wish anyone to go through that experience: asking a question in a group where people communicate with voice messages. So I ended up, as the gracious dinner host, deciding by myself and randomly picking stuff. Sorry, friends.

Now, I really felt that my friends were kind of special. I don't know, maybe because of my area they were just using voice messages, maybe they were lazy, or maybe there were other reasons. But it turns out that they were not special. I stumbled upon this article on TechCrunch, and it turns out that people are sending 7 billion voice messages on WhatsApp every day. So, there's a thing there. And I get it. Voice allows for more expressive messages. It grants you some expressivity bandwidth that you just don't have with text. It allows you to sing in voice messages, to tell jokes that people can get, because you can do accents, make voices. So, there's something to that. But I personally believe that there's a trade-off. We are trading the sender's time, the time that they save, and the emotional bandwidth, against the time that the receiver has to invest in listening to the messages. And we're not even considering re-listening, in case there's important information that you need to search for within the voice messages. Digging a bit deeper and following links around the internet, I stumbled upon a Wall Street Journal article from 2017 that even signals how important voice messages are in what they call emerging markets, because voice, as they say, can potentially overcome some cultural barriers between people communicating on instant messaging applications. So, nope, it was not my friends. There's actually something there with voice messages.

But yeah, maybe it's time for a proper hi. I'm Alessio. As Francesca said, I work at Mozilla on the data collection tools, but this is a passion project of mine that has nothing to do with my daily job. My academic background is in machine learning, specifically natural language processing, and some computer vision. After experiencing first-hand the frustration of searching twice and listening twice to the voice messages, I tried to think of ways, okay, can I actually solve this problem with engineering? And it really intrigued me: what if I could transcribe the messages and then search them without ever listening to them again? I mean, trusting that the thing would work.
So, given that this was a shared problem, I wondered: can I modify a widely used instant messaging application to support transcribing voice messages and then enable this feature I talked about? And what should my tech stack look like? It was an easy call for me. I'm a Signal user, so Signal Desktop was the most natural choice. Technically speaking, the desktop application is open source; it's an Electron-based application written in TypeScript, which means it uses web technology, which is a huge plus. It makes it easier to just change it. So I went for that. Moreover, I had the privilege to attend and listen to many Common Voice talks and presentations by the Mozilla Italia community, so I knew that there was a fairly decent model trained on Italian Common Voice data, and Italian was the language I was targeting, because my friends are Italian, most of them. And after some pre-investigation, I found that there is a really easy way to use the Italian model trained on Common Voice data using the libraries from Coqui. As Josh said, Coqui is a Berlin startup that produces text-to-speech and speech-to-text technologies. And the amazing thing about them is that everything, or at least the core of what they do, is open source. So I could just go in there, take a peek, and then try to use that in the Electron app. Because the really neat thing is that their speech-to-text library already has out-of-the-box Node.js bindings, and, very conveniently for my use case, Electron bindings.

So, enough talk. This is what it really looks like at the end of the day. This is Signal Desktop; it's an animation I captured. You click on the conversation on the left after you load the application, and it automatically takes the voice message, passes it to the Coqui APIs, and transcribes it. Then, under the voice message, it shows the transcription. Now, of course, the accuracy of that varies depending on the speaker, the model, and the noise, which is highly common in voice messages in such an environment. But I have to say that the initial results were quite promising. And one thing that is really worth highlighting here is that everything is completely offline. Everything was processed locally on my machine; nothing was sent to external servers. The whole transcription happened on my machine, and I believe that that's really, really cool.

Now, with that said, the rest of this presentation will showcase how we can get from downloading the code to having the demo working. It's really easy. You would be surprised. Or actually, you wouldn't be, because I'm showing you. So, step one: get hold of the source code, which either means cloning the Signal Desktop source code from GitHub, or you can use my fork of their GitHub repo, because it already contains all the code changes that are required for this proof of concept to work. After you download the source code locally on your machine, there's a file called CONTRIBUTING.md in the root of the repository. It contains really, really detailed instructions on how to build Signal Desktop on each platform. And depending on your environment, it has some specific frequently asked questions to get you running in case you stumble on problems.
Really, the first objective you have is to download the source code and just compile it without any changes, and to make sure that you can get access to a voice message, which you will need in the next steps. To get a voice message after you download and build from source, you have two options. You can test in what they call standalone mode, which is basically Signal Desktop running after you compile it and connecting to their staging servers, not the production ones. So you have to create a new account, register, record a voice message, and then you can use that voice message to test your integration. Or you can test in what's called the staging environment mode: if you are a Signal Desktop user like me, you can point your development profile to your existing profile and just use that data for testing, which is the option I followed for this proof of concept. It doesn't really make any difference, as long as you have access to at least one voice message.

Okay, at this point we should have a build of Signal Desktop ready, and we should be ready to change things. What we should do next is add the Coqui speech-to-text dependency to Signal Desktop, which, because they use Yarn, is as easy as running yarn add stt with the version of the Coqui speech-to-text library you want to use, which in my case was 1.3.0. And that's it; it does a bunch of things, it will show a few things in the terminal, and you'll be good to go. After that, you just need to download the model to use with your voice messages, which depends on the language used in the voice messages. The Coqui website has a very convenient model zoo, if I recall correctly, which you can browse to search for your language and download the model. It's important here to download the model under the models directory of your repository, and that's it. With that ready and done, your model on your local machine, you should now be able to use it within your build of Signal.

But hey, I mentioned "model" a few times already without ever specifying what a model even is. I will give just a quick and dirty, super high-level overview of this, because we have experts in the room to answer other questions if there are any. In the context of this presentation, a model generically refers to two models, actually. One is called the acoustic model, which is the thing that takes the audio samples and transforms them into text. And then there's a scorer, also known as a language model, which is basically a list of likelihoods of words appearing after each other. For example, it tells the APIs that "pie" is likely to appear after "cherry", so that you can have cherry pie, but it's unlikely to appear after "car". So, the models we refer to here are generated, most likely on Common Voice data, by using either the Mozilla DeepSpeech tools or the Coqui ones.

Okay, so now to the code part. All it takes to have a proof of concept of this is to create one file, which I conveniently created in the ts subdirectory of the repository and called speech-to-text.ts, because it's in TypeScript; import the Coqui speech-to-text APIs at the top of the file, as shown in the second point here; define one function to load the model and create some preprocessing utilities that we need to preprocess the voice messages; and finally, call this function when Signal Desktop starts.
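To make that structure concrete before the walkthrough of the real code, here is a minimal sketch of what the model-loading half of speech-to-text.ts could look like. The file placement follows the talk, but the model and scorer file names are placeholders, and the import assumes the stt package exposes a DeepSpeech-style Model class with enableExternalScorer and sampleRate methods; the actual code in the fork may differ.

```ts
// ts/speech-to-text.ts (sketch). Model and scorer file names are placeholders.
import * as STT from 'stt';

const MODEL_PATH = 'models/italian-model.tflite';
const SCORER_PATH = 'models/italian.scorer';

let model: STT.Model | undefined;
let audioContext: AudioContext | undefined;

// Called once when Signal Desktop starts, as described in the talk.
export function loadSpeechToTextModel(): void {
  try {
    // The acoustic model maps audio samples to text hypotheses.
    model = new STT.Model(MODEL_PATH);
    // The scorer (language model) re-weights word sequences, e.g. making
    // "cherry pie" more likely than "car pie".
    model.enableExternalScorer(SCORER_PATH);
    // An AudioContext created at the model's sample rate: decodeAudioData
    // will then resample any voice message to that rate for us.
    audioContext = new AudioContext({ sampleRate: model.sampleRate() });
  } catch (error) {
    console.error('Could not load the speech-to-text model', error);
  }
}
```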
Now, all of this boils down to this chunk of code. The initial lines, as you can see in the try block, instantiate a Coqui speech-to-text model, passing it the path to the model we downloaded earlier. Then I'm also enabling the scorer the same way, by passing it the path to the scorer file that was downloaded. At this point, as far as the Coqui STT APIs are concerned, we're ready to use the model. There's one caveat, though. We need to make sure that the samples we provide to the model are in the same shape as what the model expects. So, if the model was trained with samples at a certain sample rate, we need to make sure that we process our voice messages to match that. Given the powers of the web platform, what we can do here is just use what's called an AudioContext. It gives us a component that we can initialize with the sample rate from the model we loaded, and it will automatically do the conversion for us. No third-party APIs, no new dependencies, no anything. We just use the power of the web platform to achieve our goal.

So, that takes care of loading the model and creating the preprocessing utilities. The next step is actually preprocessing the voice messages. I would say again that the really great thing about this is that we are on the web platform within the Electron app. So we can use URLs to refer to messages. We can use fetch, a standard API, to download the raw voice file and get a buffer out of it, and then finally pipe that into the AudioContext we created in the previous slide. This sounds complicated, maybe, but it really isn't, and it's even fewer lines of code than the previous slide. At the beginning of this function, we call the fetch function with the URL that Signal gives us, we get the buffer out of the response, and we let the AudioContext we created earlier decode the audio buffer. The magic of this is that it will automatically perform the resampling as needed to match the model, without any effort from us; we're just calling the APIs. There are two more steps here. The output of this decoding step gives us a set of floating point values, while the Coqui APIs expect 16-bit integers, so we do a bit of conversion. The other caveat is that we only take the data from the left channel of the voice message. I could have added another step to the decoding layer, still using web tech, to do the down-mixing of the channels. I tried that; it worked, but it didn't improve the accuracy, so I just dropped that complexity. And finally, we call the Coqui APIs. Here I used the stream, but they have a much more convenient API, I found out later, which is just called speech-to-text: you pass in the processed data buffer that we created, and that's it. It will eventually return the text transcribed from the voice message.
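Continuing the sketch above, the preprocessing and transcription function described here could look roughly like this. The getText name and the Buffer-based stt call are assumptions based on the DeepSpeech-style Node bindings, so the real implementation in the fork may differ in the details.

```ts
// Continuation of the ts/speech-to-text.ts sketch.

// Convert decoded Float32 samples in [-1, 1] to the 16-bit integers
// that the Coqui STT API expects.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const output = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i += 1) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    output[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return output;
}

// Fetch the voice message from the URL Signal gives us, decode and resample it
// with the AudioContext, keep only the left channel, and run it through the model.
export async function getText(attachmentUrl: string): Promise<string> {
  if (!model || !audioContext) {
    throw new Error('Speech-to-text model is not loaded');
  }
  const response = await fetch(attachmentUrl);
  const rawBuffer = await response.arrayBuffer();
  const decoded = await audioContext.decodeAudioData(rawBuffer);
  const leftChannel = decoded.getChannelData(0);
  const pcm = floatTo16BitPCM(leftChannel);
  // model.stt() is the "call it once" API mentioned in the talk; it expects
  // 16-bit mono samples at the model's sample rate.
  return model.stt(Buffer.from(pcm.buffer, pcm.byteOffset, pcm.byteLength));
}
```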
The last step of all this effort is really just wiring the module we created into the UI of Signal Desktop. I have to admit that I'm not very familiar with React, which is the framework that Signal Desktop uses, so I basically used whatever that section of code was already using and looked at MessageAudio.tsx, which is the implementation of that little bubble with the play button that plays the voice message in Signal. Looking at that, I found that I could just use what they call effects to automatically apply the transcription when loading the messages, which basically means calling the APIs we just created. And while this might seem like a bit more code than the previous slides, it's just boilerplate. The really important part, as you can see, is that we are only calling, around line five or six here, coquiStt.getText, and passing it the URL of the message attachment, which is actually the URL to the voice message. And that's it. After we have the transcription, we call the setAudioTranscription function to set the text in the UI. Now, the thing that really surprised me was that all it took was about 30 lines of code to get this wired into Signal Desktop, which is basically nothing. So, huge kudos to the Coqui folks and the Signal folks, and to the Common Voice community, who made the voice samples and the models available.
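As a rough illustration of that effect-based wiring, here is a simplified sketch. The component, prop, class, and import names below are hypothetical; the real change hooks into Signal Desktop's existing MessageAudio.tsx component rather than adding a standalone component like this.

```tsx
// Hypothetical, simplified version of the wiring described for MessageAudio.tsx.
import React, { useEffect, useState } from 'react';
import { getText } from '../speech-to-text';

type Props = {
  // URL of the voice message attachment, as provided by Signal Desktop.
  attachmentUrl: string;
};

export function AudioTranscription({ attachmentUrl }: Props): JSX.Element {
  const [audioTranscription, setAudioTranscription] = useState('');

  useEffect(() => {
    let cancelled = false;
    // Transcribe the voice message once we know its attachment URL.
    getText(attachmentUrl)
      .then(text => {
        if (!cancelled) {
          setAudioTranscription(text);
        }
      })
      .catch(error => console.error('Transcription failed', error));
    return () => {
      cancelled = true;
    };
  }, [attachmentUrl]);

  // Shown underneath the audio player bubble.
  return <div className="module-message__audio-transcription">{audioTranscription}</div>;
}
```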
Okay, but I did that. Cool. Now I have a custom-built Signal Desktop that does stuff with a model. And I thought, okay, Signal has a certain reach, so what if I wanted to apply the same technology to, for example, Telegram? I thought about this before I knew that Telegram Premium would allow for that, which is now actually available. But back then, the problem was that Telegram on the web is a web application, a progressive web application. It does not run within a standalone Electron environment; it runs on the web platform. It's a website. And as such, I couldn't use the bindings offered by the Coqui speech-to-text package. So, in general, this boils down to: how can we use this technology within a simple web page? After joining the Coqui community, which, by the way, is super friendly and welcoming, I joined a group of like-minded volunteers who were attempting to port the Coqui speech-to-text APIs to WebAssembly. And some magic happened after that. Together, we were able to build a working proof-of-concept version of the WebAssembly-powered Coqui speech-to-text API. What this means is that, if it gets merged, you could use the Coqui speech-to-text APIs in a web page of your own and do transcription or other things. All of this is public on my GitHub profile, so you can just look at the code.

But now we are basically challenging the live demo gods and trying to do a live demo. Let me stop sharing my screen and try to share my other screen. Okay. This is just a test file, which I hope you could hear. "Prova." It just says "prova", which means "test" in Italian, and it's my recorded voice. And this is a sample test page that we put together to try the WebAssembly version. Let me refresh. This showcases that it's not loading anything from outside my machine, just from my local web server. It's basically just downloading the WASM worker and some WebAssembly files. So, back to the console. Let me load the model; I will load the Italian one. It's loading the model, it's loading the Coqui speech-to-text APIs, and it's showing them here. So, they're working. Let me load a scorer, because we have that working too. This is a bit of a bigger file, so it will take a moment. This is a proof of concept, it's not at all optimized, so it takes a bit of time to do the initial loading. Hopefully, it will tell us that the scorer is loaded. Oh yes, it worked. And the last, but not least important, part is actually loading the test file, which is the prova file. It loaded, and the transcription is done in 1.5 seconds, which is roughly the length of the audio file. Yay, live demo success.

So, let me share the other slide deck again. This was really, I think, the most beautiful part of it: getting together with people I don't really know at all and contributing together towards a common goal, the speech-to-text porting. It was just a blast. Open source is great. Common Voice is great. And that's it. Thank you. My Twitter profile is dexterp37, as is my GitHub profile. Please, please, please contribute to Common Voice, because that's super helpful; having the models and the voice samples ready was just great. Huge thanks to the Coqui folks and the Signal folks for making their products great. So, that's it. Thank you so much for watching.

Speaker 1: Great presentation. Yeah, that was great. We actually have quite a few questions; we just put them there in the chat, but I will keep an eye on the various channels where you can ask questions. Hillary, do you want to go ahead and ask them, and then either Alessio or Josh can answer?

Speaker 2: So, yeah, Alessio, that was an amazing presentation, and thank you so much for creating this. It's so exciting to see how you've been able to utilize all of these open source tools to create such a cool thing. We've got a few questions from Slido and on the community channel. So, if there are any questions that we've missed, and you're watching this, please feel free to submit them, and if you're watching this afterwards, you can submit them there as well. So, there's a question for Josh. What's the progress on Coqui so far since the departure from Mozilla, and do the new versions of the Common Voice dataset help, for example, with more languages?

Speaker 3: Yeah, so the second part of that first: do new versions of Common Voice help? Hugely. It's hard to overstate how much they help. The Common Voice dataset, for all the different languages, is really unique in terms of voice datasets because of the diversity of voices, the diversity of ages, genders, accents. It's a real dataset. Lots of other speech recognition models are trained on a small set of people who have a certain accent, sound like they come from a certain place, and talk about certain things. Common Voice makes it possible to create speech recognition models that really anybody can use. And as the dataset gets bigger, more voices come in. It's huge. That's kind of the importance of Common Voice. And also, when a new language gets added to Common Voice, that's basically the entry point for creating a new speech recognition model for that language; otherwise, we have nothing to go off of. In terms of general progress on Coqui, we're doing both speech-to-text and text-to-speech and building those out. People like Alessio are working really hard on just picking up the code, running with it, going in directions that weren't on our radar, going into Wasm, for example. I've popped in and out of that kind of sub-community, and it's really impressive. Just a group of probably fewer than 10 people, hacking over a weekend, got this to work. It's very cool. I know the project's still going, but it seems, from what I saw in the Telegram group, that just in a weekend there was already huge progress. So, yeah, that's what I would say for that question.

Speaker 2: Cool, thank you. Alessio, the second question is: besides Signal, have you tried the speech-to-text in other environments, such as Firefox extensions, or apps, or bots, or on Telegram or Discord?

Speaker 4: That's really a great question. So, I have another weekend project of mine, or night project, depending on my weekends, which is live captioning of YouTube videos using a Firefox extension. I am trying to build that, and that's why I was really pushing hard on the Wasm thing, because I wanted to do that. I don't have anything that's worth sharing right now, nothing concrete, just a bunch of things hacked together on my computer. But yeah, that's a direction I'm exploring, just for the sake of trying to understand what's possible, rather than building something specific.

Speaker 2: Oh, just another follow-up question. Is it possible to clarify what Wasm is? Because then you could plug the project, so that you're working with other people as well, or it's an opportunity, yeah.

Speaker 4: Yeah, so, the whole WebAssembly port... sorry, can you repeat your question, please?

Speaker 2: Sorry if I spoke too fast. I was saying, can you clarify what Wasm is, and how people can join? As you mentioned, you're still working on it at the moment, so you could plug in other people. Sorry, I realize that in asking that question I'm using colloquial slang, so, yeah.

Speaker 4: No problem. Actually, my fault for going ahead and trying to answer and then realizing I didn't get the question. So, yeah, that's a great question. There's a PR open, I think on the Coqui STT repository, about merging the WebAssembly changes, and I think that's just the first step; it's an exploratory thing that the great Coqui folks agreed to take on. There's much more work to do, for example on optimization, because if you have longer audio files it might take a lot of time, so there's plenty of opportunity for the community to engage with that, if they want. There are a bunch of limitations that we are currently trying to figure out, and one of them is, sadly, kind of not fixable, which is the amount of memory you can address within the browser when you use WebAssembly, which is limited to four gigabytes. But there's a WebAssembly community effort to fix that in the long run, so that's going to be solved. But, yeah, there are plenty of opportunities to join, and the current progress is listed in the Coqui speech-to-text repository in the active PRs.

Speaker 2: Cool. So, a third question that we have is: how good is the current speech-to-text model out of Common Voice? I do want to clarify something before this question gets answered. Common Voice, the team, does not create or generate models. We only create the space and steward the dataset, and the dataset is crowdsourced, so people can go onto the platform and contribute their voice by reading out sentences that are also crowdsourced. They're crowdsourced by people writing them in the Sentence Collector, or through collaborations, from radio stations to authors who dedicate text to Creative Commons Zero, so it could be excerpts of their work, et cetera. If you're interested in learning more about this, definitely check out the Common Voice About page. So, maybe a way to edit that question is: I'd love to hear from both Alessio and Josh on this, especially reflecting on the 10th release this week, what would you like to see in the future, for the next 10 releases, for Common Voice to support your work? It's okay if Alessio goes first, and then Josh. Yeah.

Speaker 4: Okay, yeah. So, I, for example, would really like...

Speaker 1: Can I just interrupt a second, just to say that when we put messages in the chat, people can see them, and I think they disrupt the image a bit, so I'm going to put them in the moderator chat, and then Hillary can ask them from there. Just letting you know.

Speaker 4: Yeah. So, one thing that I noticed is that, for example, the community-made Italian model sometimes performs great if you don't have any accent, and sometimes performs in a suboptimal way if the environment is noisy or if you have a specific regional accent, which in Italy is a thing, because there are many regions, and you would be surprised at how the accent changes from one town to another town 10 kilometers away. So, I would really encourage folks to provide more data with specific regional input. That would be great. Parsing the voice messages from people from other towns would be much simpler, for dinner, in my case.

Speaker 3: Yeah, I'll add to that and say I think Italy is a really great example of linguistic diversity like that, but that same problem is everywhere. Even in the United States, people in California versus people in Texas speak very differently. Usually we can all understand each other, but not always, honestly. So the more speech we get from different locations, the better the model performs, and I would say that. So, for this question, there's the data side of it, like Hillary's addressing: the Common Voice team is creating a space and creating the data that are necessary to create the models, and then the community, different folks, are out there taking tools from Coqui, taking tools from different speech toolkits, and creating their own models. And this question is very case and language specific, because there are times where you can take a language that doesn't have a lot of data on Common Voice, let's say there's 30 minutes of data on Common Voice for some language, and you can still use that to create a speech recognition experience that is really engaging and useful if you constrain the kind of application. We had a kind of Coqui Common Voice sprint, and folks were creating a chess application where the vocabulary, the space of possible things you can say, is very limited. You can say all the pieces of the chessboard, like king, queen, rook, whatever, and you can tell them where to move, but that's it. But for a chess application, that's all you need, right? So, long-form transcription, like what Alessio is doing here for Signal, is a hard thing to do, because anybody could say anything with any accent. But if you narrow the domain, you narrow the application, you can do really cool things, even if you think, oh, this model was only trained on, whatever, 25 minutes of Kyrgyz, so it's not going to work very well. You can actually build some cool experiences with that.

Speaker 2: Sorry, I lost the mute button. But anyway, that is a really good response, Josh, and thank you both for sharing your thoughts and reflections. Just to add a small thing: on Common Voice, if you create a profile, you can choose an accent or write in an accent, so I definitely encourage you to share your accents. And also, Josh, I noticed you also responded to the fourth question, which was how many hours of recording per language are needed to build up a good model, so I feel like we don't necessarily have to answer that, because you already have. Okay, I'm going to quickly jump to some of the questions towards the end, and then come back again. So, thinking about porting to mobile devices, what are the computing requirements for running inference? Okay, this is a big question. So yeah, I probably should have read it first.

Speaker 4: Yeah, I can give my uninformed opinion on this, which is that, due to the way models are currently built, and due to the way the APIs are currently shaped, it is a resource-intensive task. I've seen that there are models specifically built for the mobile environment, for example, that are smaller, because you have both memory constraints and CPU power constraints, so you need to be very mindful about that. For example, if you were to use the same WebAssembly thing that I'm using here in a browser on mobile, it would probably not work the same way it works on desktop. It would have much different performance.

Speaker 3: Yeah, and I'll add to that on the model side. So Alessio is doing this kind of open-domain transcription task, and there are two models that are used: there's this acoustic model, and there's this language model. I won't get into them, but they are two parts that go together to do speech transcription. And if you have a very large vocabulary, if you have an open domain where anybody can say anything, then those models get bigger, right? So if you constrain the domain, you can get models that are tiny. There's a model in the Coqui model zoo that's like a few kilobytes, but it only recognizes yes and no. That's the extreme example of constraining the vocabulary to only two words, but once you do that, it's tiny; it's smaller than a Word document. But yeah, if you want to do open transcription, you need some bigger models.

Speaker 2: Cool, thank you both for responding. So I probably want to do about four more questions, just to make sure we have enough time to wrap up, depending on how long the responses are, anyway. So another question is: could you map recordings to users, like in Signal, and then would it be feasible to make a speech-to-text model per speaker? I'd like to hear Alessio's response to this. Yeah.

Speaker 4: Yeah, this is something I've been thinking about while implementing this. Yes, you could maybe overfit a model to a single speaker, I guess. But where would you train that? If you were on a mobile device, would you train the model on the device? Then it would run into the same resource problems that we talked about earlier. But what if we approached this in a different way? Thinking about this problem, I thought, okay, we integrate this in an instant messaging application, and then we give people the opportunity to rate or actually fix the transcription themselves. If they go and say, yeah, this was wrong, the real transcription was this other sentence, maybe there's a way to do one-shot training and actually send some representation of the model difference to a server, and kind of integrate all of this feedback into a newer version of the model. I've read some papers about this, but I've not really dug much into how this would work; it was just forward thinking. I'm really curious about what Josh will say about this.

Speaker 3: Yeah. So there's a whole field of machine learning, called federated learning, that looks into this. And it's exactly what Alessio just said: you can kind of calculate how the model should be updated on somebody's phone, and then you actually do the updating on a server. There are a lot of really interesting ways you could do it, but it's definitely possible; people do it in practice. It's something that would require some kind of orchestration of model updates and rolling out models and stuff, but it could be very cool. This is something that I've thought about: let's say everybody who was using Signal or Telegram or whatever had their own model on their device, and the transcription is happening only on their device, and then the transcript gets sent to their friends in the group, so that everybody only has one model on their own device and everybody is able to update their transcripts in real time. I think that could be a cool thing. That's a great approach. I didn't think of that.
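As a toy illustration of the federated averaging idea Josh describes, where clients compute updates locally and a server only aggregates them, here is a small sketch. It is purely conceptual: the two-weight "model" and all names are made up for illustration and are not tied to any of the speech models discussed in the call.

```ts
// Toy federated averaging (FedAvg-style) sketch: each client computes how its
// local copy of the model changed, and only those deltas are sent to the server,
// which averages them into the shared model. Purely illustrative.
type Weights = number[];

// On the client: train locally (e.g. on a corrected transcription), then return
// the difference between the locally updated weights and the global weights.
function computeClientDelta(globalWeights: Weights, locallyTrained: Weights): Weights {
  return locallyTrained.map((w, i) => w - globalWeights[i]);
}

// On the server: average the deltas from all participating clients and apply
// the averaged delta to the global model.
function applyFederatedUpdate(globalWeights: Weights, clientDeltas: Weights[]): Weights {
  return globalWeights.map((w, i) => {
    const avgDelta =
      clientDeltas.reduce((sum, delta) => sum + delta[i], 0) / clientDeltas.length;
    return w + avgDelta;
  });
}

// Example: two clients nudge a tiny two-weight "model" in different directions.
const globalModel: Weights = [0.5, -0.2];
const updated = applyFederatedUpdate(globalModel, [
  computeClientDelta(globalModel, [0.6, -0.25]),
  computeClientDelta(globalModel, [0.45, -0.1]),
]);
console.log(updated); // approximately [0.525, -0.175]
```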

Speaker 2: Cool. So another question we have is actually related to the previous one: can you sync transcriptions from desktop to mobile? You kind of touched on it a bit before, but if you could clarify, that'd be great.

Speaker 4: Yeah. In the context of Signal, my understanding is that you can add metadata to the messages, and I think that to some extent it gets synced. So to that extent, yes: as long as you save the transcribed text in the message metadata, you can get that. Now, this actually uncovers my bluff, because I talked about searching messages, but I haven't actually saved the text into any message on Signal. So yeah, good catch, I was bluffing, but it's easily achievable. Well, it still helps as it is: visual search, just scrolling, is a lot easier than scrolling through a lot of waveforms of audio files. So it's still easier.

Speaker 3: Definitely. And plus, the project's going to develop, and the code is open source, so people can help out with what you're trying to do.

Speaker 2: I want to ask a few more questions, and this one is quite similar to the previous point. What should we do to make it easier for people to access the audio files? I can say that this Italian model was made possible by the Italian Mozilla Common Voice community.

Speaker 3: There's also a similar French Common Voice community out there that's very active in making models and pushing them out publicly. Honestly, the only thing that needs to happen is for people to get engaged, because the code's out there and the data's out there. The one thing that's not, that takes a little more time to find, is the compute resources. Usually you want to have a GPU. But for a lot of languages where you don't have a ton of data, you can use Google Colab. So yeah, as long as you can find a GPU, and they're out there, they exist, you can get them for free, everything else is out there. You just need initiative.

Speaker 4: Yeah, I just want to add one thing. I should double-check before I say this, but I believe that the computing resources, the GPUs that Josh mentioned, were made available for the Italian model by the University of Turin, if I'm not mistaken. So there was some involvement from a university to actually grant the processing power to make this possible.

Speaker 2: Thanks both for the responses. I think it's definitely important to highlight that equity in computational resources can be a barrier to entry, even if you do have the data and have the models, and I think it's very important that we acknowledge that. And also, I think it's cool that you highlight the collaboration, because seeing how a university is partnering with contributors who are developing resources for their language is a good thing to see.

Speaker 1: Just following up on this, I think I actually see a question that touches on GPU resources. It says: given the variety of models available and the GPU resources needed to train, would it make sense to create shared pre-trained models for later transfer learning?

Speaker 3: Yes, 100%. I think that's one great thing with... I'm not sure if the model weights for the Italian models are open, but for French, I know that they do that, and we do that for English. We've got a really nice base model for English that's good for transfer learning to other languages. Practically speaking, whenever I'm working with a new language, I just take the English model and fine-tune it to the new language, and that works really well. That's on Coqui's GitHub, so you can find the model weights under the releases.

Speaker 2: And a quick question: Alessio, do you want to come in?

Speaker 4: Oh, no, no, no, no. I was just double-checking whether the weights were available. I'm not sure; I couldn't find them for the Italian model.

Speaker 2: Cool. Francesca, I probably have two or three more questions; I just want to check time-wise.

Speaker 1: Yeah, I would say we have just six minutes more. I can also put some of the other questions in the Matrix room, and maybe you can answer them there later, Alessio, if you want to have a look and you have the time, obviously.

Speaker 2: Cool. I'll probably then just ask two more questions here. One is, I think, a quick one: is it possible to use Common Voice and Coqui technologies in Python projects? The answer is yes. Josh, is it possible to explain a bit? I know you have the Read the Docs documentation for Coqui as well.

Speaker 3: Yeah. pip install stt is the way to get started.

Speaker 2: Cool. And one final question, a personal one, because I realized I didn't ask it: if you had all the time in the world and resources, what would you do with this project, Alessio?

Speaker 4: I'm a Star Trek fan, so I would do a tricorder-like thing: live translation of speech. I actually tried to stitch something like this together last week, piping the speech-to-text into the text-to-speech from the Coqui APIs, and then using another thing that Mozilla made, which is WebAssembly-based machine translation. So I tried to put all these things together to speak in Italian and have the machine speak in English. I think something like this would be amazing, especially when traveling, if you don't know the language, which is something I face frequently.

Speaker 2: Boom. I wish I could speak more languages. And with that, thank you so much, Alessio, Josh, Francesca, Andy, Constantino, who's in the channel, and everyone who supported this event. We want to thank everyone who's watching, and we want to encourage you to definitely check out the documentation for the things that we've mentioned, from Common Voice, to Coqui, to Alessio's project. We hope that this has inspired you. Francesca, do you want to say any last words as well?

Speaker 1: Yeah. Just to note, I added some of the questions that I think were not answered to the Matrix chat, so Alessio, you can go and answer them. And Josh also, if you want; I can send you the link to that Matrix room. And the last thing: as I said, we are thinking about making this a more regular thing, so if you have feedback, or ideas about what we could do next, please share them with us in the Matrix room. We are very happy to hear from you. And yeah, I don't have anything else to add. Thank you very much, Alessio and Josh, for coming. This was great, and let's hope we can do something like this again soon.

Speaker 3: Yeah. Hats off to Alessio. This is an awesome project. I think it's something that needed to get done. Very cool. Thanks, folks. Bye, everyone.

Speaker 1: Bye. Bye.
