Speaker 1: So, in this video, I wanted to introduce this speech-to-code program I've been working on. I'm calling it Sylveax. It's a fork of the Silvius project, which uses Kaldi on the back end. I'm going to do a demo and go over some of the things I added to Silvius. Before I get started, I want to point out some things that are going on in this top bar here. This is on Ubuntu 18 with GNOME, the default Ubuntu desktop, and I added a GNOME extension for displaying the Kaldi transcripts in the top bar, and this word on the left is the current mode that Sylveax is in. I added the ability to change which decoder Kaldi is using on the back end. Basically, what that means is I can have the speech recognition bias a certain subset of words and sequences of words, and you get a big boost in recognition accuracy when you do that. So a mode pairs a decoder on the Kaldi back end with a different parser on the front end, and the same words can have different meanings and produce different outputs in different modes. The words here are the transcripts that are getting sent back. When a transcript turns blue, that means Kaldi marked it as a final transcript. Kaldi has to pick some point to stop transcribing the text, otherwise it would just send back one giant document covering everything since you started the program. In Silvius, it waits for the final transcript before it produces any output. I thought that was a bit wasteful, so I tried to make use of the partial transcripts coming back. At least in some contexts, you can write out the keyboard output as soon as the words come back, and use backspaces to keep it in sync. I call that optimistic mode. You'll know it's on because the mode word over here will turn green, and you'll also see the cursor jumping around as it makes edits on the fly. When optimistic mode's not on, this word will be pink.
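To make the backspace bookkeeping concrete, here is a minimal sketch (my illustration, not Sylveax's actual code) of how keyboard output can track an evolving partial transcript: keep the common prefix, erase the rest with backspaces, and type the new tail.

```python
def sync_edits(typed: str, partial: str):
    """Given the text already typed and the latest partial transcript,
    return (backspaces, tail): press Backspace that many times, then
    type `tail`, and the on-screen text matches the transcript."""
    i = 0
    while i < len(typed) and i < len(partial) and typed[i] == partial[i]:
        i += 1
    return len(typed) - i, partial[i:]

# As partials arrive, apply the edits and track what is on screen.
typed = ""
for partial in ["imp", "import", "import spaces", "import space"]:
    backspaces, tail = sync_edits(typed, partial)
    typed = typed[:len(typed) - backspaces] + tail
```

The same routine covers the final transcript, since it is just the last revision of the partials.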
And other than blue, the transcript over here can turn red when there's a parse error. It also sometimes shows the output from another program I'm running called SOPARE, which is short for Sound Pattern Recognition. I trained it on a few sounds that I use for switching between modes and toggling the program off and on, so you'll hear me making whistling noises, clicking, and blowing into the microphone; that's what I'm doing. So I guess I'm going to get started. Jumpy. Timex 6. Guy gum, oak, spike, import spay, brace, spay, camel use callback, spay may, spay from, spay quote, react quote, sem. Notice I'm using the word "spay" a lot to insert spaces. You could use snippets and things to get around that, but I like being forced to use a word to insert a space, because in programming a lot of the time you're just making edits to code, not filling in a template. Spike, spike. Yoda export default function. Spay camel use on enter. Sue callback comma spay inputs. She spay brace. Spike. Spay, camel use callback, sue event spay, quill grain, spay brace, brace, spike, if spay sue, event dot key, spay quill, quill, quill, spay quote, rock enter, quote, she spay brace. As you might have noticed there, it accidentally popped into alphabet mode. The problem with SOPARE, at least when I'm speaking loudly, is that it sometimes has false positives. I wrote another program with TensorFlow and Keras which has much better recognition accuracy and fewer false positives, but it's in a half-broken state right now, so I'm not using it. It's also a lot more annoying to set up: you need to make something like 300 samples to train it. SOPARE works well enough, and it only takes about 10 samples, because it's using a fairly unsophisticated method of sound recognition.
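As a rough illustration of the kind of simple template matching a SOPARE-style recognizer can get away with on so few samples (this is my sketch, not SOPARE's code; the feature extraction, e.g. averaged FFT bins, is assumed to happen elsewhere), you can compare an incoming sound's feature vector against a handful of recorded templates per label:

```python
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def classify(features, templates, threshold=0.9):
    """Match a feature vector against per-sound templates; return the
    best label, or None if nothing clears the threshold (the threshold
    is what keeps false positives down)."""
    best, score = None, threshold
    for label, temps in templates.items():
        for t in temps:
            s = cosine(features, t)
            if s > score:
                best, score = label, s
    return best
```

With only a few templates per label, this is cheap enough to run on every audio frame, which is roughly why a handful of samples suffices.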
So I guess I'll show the alphabet mode. If a word isn't in the lexicon that you trained the language model on, it'll have trouble outputting that word, so you can always switch into alphabet mode and just type something out. Spike. Event dot camel prevent default. Sushi. Spike. Callback sue, ch, E-V-E-N-T, she, cola, cola, scratch sam, skate, sock she, sock, air sam, escape, sky gum, oak, slash slash, I spay, C-A-N spay, S-P-E-L-L spay, M-U-C-H spay, F-A-S-T-E-R spay, T-H-I-S spay, W-A-Y, escape. So yeah, I prefer being able to spell things out like this. As for how I trained it: for programming mode, I took the 30 most popular GitHub projects for JavaScript and converted the code to the equivalent spoken text that would produce that output within Sylveax. I have that loaded up, I think. So this is Angular, which is one of the repos, converted to its spoken equivalent, and I train the language model on this. All 30 projects combined, all the code mapped to my language, came to a giant file, I think around 170 megabytes, and there are a lot of projects in there that give it good generality. There's three.js for graphics programming, and I think Node.js and Electron were in there, so some systems-level stuff seeped in, like fork and spawn process, words like that. Then all the web stuff, like Angular and I think Vue or something like that. One of the projects was algorithms in JavaScript, so all sorts of algorithm and data structure words also got mixed into this language model. So, even outside JavaScript, it works pretty well. I tried using it to do some C++ with Unreal Engine, though there are some domain-specific words it has trouble with.
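The code-to-spoken-text conversion might look something like the sketch below. The symbol words here are guesses at the Sylveax grammar based on the commands heard in the demo, not the real mapping, and real tokenization would need to handle strings, comments, and more operators:

```python
import re

# Hypothetical symbol-to-word table; Sylveax's actual grammar differs.
SPOKEN = {"{": "brace", "(": "sue", ")": "she", ";": "sem",
          "=": "quill", ".": "dot", '"': "quote"}

def to_spoken(code: str) -> str:
    """Convert a line of code into the words that would dictate it,
    splitting camelCase identifiers the way 'camel use callback' does."""
    out = []
    for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[^\sA-Za-z0-9]",
                          code):
        if tok in SPOKEN:
            out.append(SPOKEN[tok])
        else:
            # Split camelCase into its component words.
            parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", tok)
            if len(parts) > 1:
                out.append("camel " + " ".join(p.lower() for p in parts))
            else:
                out.append(tok.lower())
    return " ".join(out)
```

Running every file of a repo through something like this yields the "spoken corpus" the n-gram language model is trained on.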
I did the same thing for alphabet mode, except instead of mapping the code to words, I split it into letters. Sequences of letters aren't evenly distributed in English, so having the language model trained on that helps a lot with being able to just spell things out. And if it's overcorrecting, you can slow down so that it's only using the unigrams, and then you can depend on the acoustic model to get a specific letter. Another thing you can do: I created kind of an equivalent of registers in Vim, so you can map a subset of words to whatever you want. I guess I'll show that now. Spike, oak, scratch dose, regi sun pittsburgh, chi, sky H, A-S-H, sky R-O-U-T-E-R, there, escape, jumpy, pittsburgh, dot, camel some function, sue, pittsburgh, she. So you can see I used the register for hash router; I'm looking at the top bar now and it's actually showing hash router, but the other day it kept outputting things like dash router or cache router for some reason. So you can fall back to registers for words that you're having trouble with, or you can spell them out. I also added the ability to undo by backspaces, not just using Vim's undo functionality. You can undo a full transcript at once by saying the word "shank", or you can undo by token: you say "shank" and then a number, and it undoes that many words. So I'll show that. Semspike. Camel, some really long function name that I got wrong. Camel is some really long function name that I got wrong. I should probably explain that: if the final transcript is a parse error, it just undoes the whole thing. That's because in normal mode, when optimistic mode isn't on, a final transcript with a parse error doesn't produce any output, and to keep the two consistent I have it set up that way.
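The transcript-level undo can be sketched as a stack of the output each final transcript produced: "shank" erases the last transcript wholesale, and "shank N" erases its last N tokens. Again, this is a hedged sketch rather than the actual Sylveax code:

```python
class UndoHistory:
    """Tracks keyboard output per final transcript so it can be
    undone with backspaces instead of relying on the editor's undo."""

    def __init__(self):
        self.outputs = []

    def record(self, text: str):
        """Call once per final transcript that produced output."""
        self.outputs.append(text)

    def undo(self) -> int:
        """'shank': number of backspaces to erase the last transcript."""
        return len(self.outputs.pop()) if self.outputs else 0

    def undo_words(self, n: int) -> int:
        """'shank N': backspaces to erase the last n space-separated words."""
        if not self.outputs:
            return 0
        text = self.outputs[-1]
        kept = " ".join(text.split(" ")[:-n])
        self.outputs[-1] = kept
        return len(text) - len(kept)
```

The parse-error behavior described above falls out naturally: a final transcript that produced no output simply never gets recorded.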
That's something I'm considering changing, but I kind of like it this way, because in practice it doesn't happen all that often and it makes the code a bit cleaner. And you can pull out the optimistic mode stuff if you don't like it. Camel, some long annoying function. Shank one, shank dos, shank. So that's that. I think I'll show how optimistic mode is toggling off and on without me having to explicitly make any sounds or anything like that. So, this is my vimrc. Vim has this idea of autocommands: you can register a callback on an event, and among the events are entering and leaving different modes. When I go into insert mode, I have it call this bash command, which sends a UDP packet to port 5005. I'm listening on the Sylveax end and read that in the main loop, so I'm able to control whether optimistic mode is off or on. And you don't have to use Vim for this; I set it up for VS Code too, although I do use the Vim extension there. For VS Code, what I do is add a task for turning optimistic mode off and on, and then from the VS Code settings, your settings.json, I map the keys that would put you into insert mode: I have them call that task, then enter insert mode, then replay the original letter. That works pretty well. "c" doesn't work perfectly, because as soon as you hit the letter c it jumps into optimistic mode; it doesn't wait for you to actually enter. So if you hit "ci" and then wait a moment to figure out whether you want to change within a parenthesis or square bracket or whatever, it might jump into optimistic mode and start spamming different characters, so that's something to be aware of. Another thing it does that I'm not crazy about is that running the task outputs the task name in the lower status bar, overwriting what would be telling you which Vim mode you're in. I did track that down.
It's in this file, I think it's the source config, remapper.ts, around line 500. It would be a pretty easy fix. All it does is loop over the config that you have set up, and for each command that it runs, it calls this statusBar setText method. All you would have to do is add an extra key to the command object, maybe call it something like "silent", and then check it: if silent is set, skip the setText call. So you can use VS Code; you're not forced to use Vim. What else is there to talk about? I also implemented some things like holding down a key until it hears a noise, or repeating a key for each noise; then to jump out, you just make some long extended noise. So I guess I'll show that. Ink, joker air, spike, pete air. So you see, to get out you just make some long extended noise. What else is there? I implemented a basic mouse functionality, kind of like EasyMotion. Within Vim itself, to get around, I just use a plugin called EasyMotion, and it works well with any mode where you can output letters and numbers, and the same goes for jumping around within files. I like using NERDTree and Buffer Explorer, and you can make use of the EasyMotion plugin from within them too. So what was I going to do? Oh yeah, the mouse. Moose. Air vic. Biz. So you see it moved the mouse there. At any time when that overlay is showing, you can say a number, and that number controls whether you click, double-click, or click the second or third mouse button. That's okay for a one-off; it's definitely nowhere near the ability of the real eye-tracking software, but not so bad for a one-off type of thing. What else do I want to talk about? I guess I'll show that also within this Vim instance.
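The insert-mode signalling from a couple of paragraphs back can be sketched like this: the editor fires a command on entering or leaving insert mode, and a non-blocking UDP socket polled from the main loop flips optimistic mode. The port matches the video; the payloads and code are my illustration, not Sylveax's actual protocol:

```python
import socket

# In ~/.vimrc, something along these lines (payloads 'i'/'n' are my guesses):
#   autocmd InsertEnter * call system('bash -c "echo -n i > /dev/udp/127.0.0.1/5005"')
#   autocmd InsertLeave * call system('bash -c "echo -n n > /dev/udp/127.0.0.1/5005"')

def make_mode_listener(port: int = 5005) -> socket.socket:
    """Non-blocking UDP socket the main loop can poll every tick."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    sock.setblocking(False)
    return sock

def poll_optimistic(sock: socket.socket, current: bool) -> bool:
    """Drain any pending packets; return the updated optimistic-mode flag."""
    while True:
        try:
            data, _ = sock.recvfrom(16)
        except BlockingIOError:
            return current
        if data == b"i":
            current = True
        elif data == b"n":
            current = False
```

UDP fits nicely here because a lost or duplicated toggle packet only mis-colors one mode change, and the next keystroke corrects it.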
So this is actually a separate terminal that opens when you start the program, and I use it as kind of a makeshift user interface. I have it set up so you can send messages to the Sylveax main loop. If you ever get stuck in some weird state you can use this, or for debugging, which is mostly what I use it for. You can actually turn off the Kaldi back end and the front end still works; it just keeps trying to connect, and in the meantime you can replay the messages that Kaldi would send, to put it into weird states. I also use it for modes that I haven't added to the actual code yet. So I'm going to show this one called phones. Basically, I made a language model where each word is just what Kaldi expects for the phonemes, and you can use that to figure out what the acoustic model is up to. So I'll show that. Air. This is a test. I made this thing a long time ago, maybe two years ago, and I might have goofed up the language model on this one. I think I only fed it bigrams or something, so it biases toward those, but at least for individual sounds it will show the phonemes. This helped me a lot with the alphabet mode, because the programs that guess what the lexicon should be for your vocabulary aren't always perfect for your style of speech and how Kaldi is interpreting what you're saying. For the letter A, for example, when I say "a", the phoneme it expects is EY, but for some reason the grapheme-to-phoneme training program I was using was outputting some other phonemes. So this thing can come in handy, especially if there's some word it's having trouble recognizing and you're not sure why: it's good to see whether what the lexicon has matches the phonemes that come out in this mode.
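For context, a Kaldi lexicon is just a plain-text file mapping each word to a phoneme sequence, so once the phones mode shows what the acoustic model actually heard, a mismatch is easy to spot. A hedged example of the format, with CMUdict-style phones; the entries are illustrative, not taken from my actual lexicon:

```
a        EY1
hash     HH AE1 SH
router   R AW1 T ER0
```

If the phones mode keeps printing something other than EY1 when you say "a", the fix is to edit that line rather than retrain anything.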
And I think I've covered about everything I wanted to talk about. I'm probably going to write up some more documentation on how you actually get this stuff installed. I wrote a Dockerfile for running Kaldi and the back-end server, so it should be much easier to work with. And I have scripts that create the actual decoders: you only have to feed a script a language model and a lexicon, and it will create the decoder for you, and even move it into the right directory and modify the configuration. Silvius actually had some of that stuff set up; I just modified it slightly. I also have a bunch of scripts for creating the language models. It's not perfectly streamlined, but I have a file that's pretty well documented, because it's something I only revisit every couple of months and I have to refresh myself on how to create them. So you can check out the notes in there; they walk through the steps you have to take to create the language model, map over the code files with your specific vocabulary, and train it, and then you can mix language models and do different things. All that stuff is documented. So, you know, if you want to try it out, go ahead, and let me know how it goes. Good luck.