Speaker 1: Hey, Jay, I know that we've been working on your robots, and you want to have the chat bot, and we have the wake word, hey, Jorvan, as well as controlling some hardware. And I think the next thing that we were talking about is wanting to get it so that it's completely offline, right? Yeah. Okay. And I know that a lot of smart speakers, like our Alexa devices or Siri, they require back-end components. And a lot of it has to do with doing that speech-to-text and text-to-speech machine learning to interpret your voice, as well as give some sort of audio response. And right now, a lot of that's done online. My hope is that we can take this to something that's either local, running on the Pi, or more likely going to be running on a laptop. Are you okay doing something like that?
Speaker 2: Of course. I don't normally wear robots casually outside anyway, so this wouldn't bother me at all. It would actually make it more fun, especially seeing how it reacts with people outside.
Speaker 1: Excellent. So I think I can get speech-to-text and text-to-speech running on, say, a local laptop. You would just connect via an Ethernet cable or Wi-Fi, and that way you could just throw a laptop in your backpack and walk around, or be up on stage and have the laptop right there. I don't know about getting the Mycroft backend stuff, because there's a bunch of things where it's like, oh, it has to connect to the backend to get weather data or to verify your device. I don't know about that quite yet. But I think I can get speech-to-text and text-to-speech. That's the first step. Are you good with at least getting those right now?
Speaker 2: Of course. Of course. As we both know, I am not the best programmer. So the fact that you've been helping me with this is 100% appreciated.
Speaker 1: Excellent. I'm excited to get a chatbot up and running. I want to see one on stage or out in the wild, with you wearing it and having people talk to it. So let's get started. Here's a basic look at how Mycroft works. There's a wake word listener running on the Raspberry Pi. From the last episode, we have the wake word set to "hey, Jorvan." When that wake word is heard, Mycroft streams raw audio to a speech-to-text server across the internet. By default, Mycroft uses Google's STT engine. The audio is parsed and converted to a string. Mycroft compares that string, or intent, to the trigger phrases in the available skills. If there's a match, the skill runs its code. Often, this involves fetching information from the internet, like the time or local weather. A response is provided to the skill. The skill parses that response and sends a string to a text-to-speech server, which converts that string to audio. That audio is streamed out over the speaker as a response to the user. As you can see, most smart speakers rely heavily on remote servers to process information. In an attempt to make Mycroft less dependent on the internet and possibly a little more secure, we're going to run speech-to-text and text-to-speech on a separate laptop. There is a TTS service called Mimic 1 that can be run on the Pi, but it's a bit slow and robotic. We're going to use a fork of Mozilla TTS to give us a more natural-sounding voice. Speech-to-text is notoriously difficult to perform. While it might run on the Pi, everything I've read says that it makes things incredibly slow, and one reason to run things on a local network is to make response times faster. Finally, Mycroft is hopelessly attached to a number of back-end services at home.mycroft.ai. This online service manages skills and accounts, and it helps provide responses to skills that need information. Mycroft is open source, including the back end, which is called Selene. It's possible to run it locally, but it requires a lot of effort and is honestly a bit overkill for what we're trying to do. A few projects, like this one from OpenVoiceOS, have tried to create a dummy back end for Mycroft that prevents it from phoning home. I've tinkered with it, but I've not had much luck getting it to work completely offline. It's likely that new versions of Mycroft circumvent some of the hacks here. If you're able to get a completely offline Mycroft or other smart speaker working, please let us know in the comments. It's something I'd like to keep working toward for future videos. If you head to mycroft-ai.gitbook.io/docs, you can get a lot of information about how to customize Mycroft. Let's head to Customizations, which can be found right here, and we'll go to Speech-to-Text. Here you can see all the different STT engines that Mycroft supports. We'll be using Mozilla DeepSpeech, as it's open source and relatively easy to set up. Next, head to Text-to-Speech. You can see that Mycroft defaults to Mimic 1 or Mimic 2, depending on the voice you select. However, we'll be using Coqui TTS, which is a fork of the Mozilla TTS project. It comes with its own server, which is perfect for our needs. To start, you'll want to install Ubuntu 18.04 on some computer. This will run our STT and TTS servers. I highly recommend using something with an NVIDIA graphics card, as we can enable the machine learning engines with CUDA to make them much faster. I'll put a link in the description that points you to a guide showing you how to configure CUDA on Ubuntu.
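Roughly, the round trip described above looks like this once the local servers are in place (the port numbers are the ones chosen later in this walkthrough):

    wake word "hey Jorvan" detected by the listener on the Pi
      -> Mycroft streams the raw audio to the STT server on the laptop (DeepSpeech, port 5008)
      -> the STT server returns a text string; Mycroft matches it against skill trigger phrases
      -> the matching skill runs, possibly fetching data, and produces a response string
      -> Mycroft sends that string to the TTS server on the laptop (Coqui TTS, port 5009)
      -> the TTS server returns audio, which plays out of the speaker attached to the Pi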
I'm going to SSH into my laptop. It just makes it easier for me to screen capture what's going on, and we don't need a GUI anyway. I've actually disabled X Windows, or X, on my laptop just to save some processing power. I'm going to remote in with SSH. If you installed CUDA, which I'll make sure there's a link in the description to walk you through that process, you should be able to run nvcc --version to get an idea of what CUDA version is installed. Note that for these exercises, you really want CUDA 10.1. The STT and TTS servers seem to really like that version; otherwise you'll be fighting with versions and dependencies, which is why I'm specifically sticking with Ubuntu 18.04. Also note the version of Python: Python 3.6.9 seems to work along with CUDA 10.1. There's a whole dependency chain that you have to take care of and make sure everything is happy and matches. I'll put that in the written guide because that's a really boring walkthrough. I want to get to the interesting stuff, which is installing the servers for speech-to-text and text-to-speech. We'll be using Mozilla DeepSpeech, which is an open source project to perform speech-to-text. So go ahead and create a folder to hold our DeepSpeech project. In this case, I'm just going to make a projects directory to do that. If you don't have virtualenv, you might have to install it, but I can run it to create a virtual environment inside of this directory. That's going to keep all of my dependencies, Python versions, package versions, everything separate from my main system so that I can have different versions of different packages running for my different servers. So I'll use virtualenv -p python3 to create my new virtual environment. I'll then call source deepspeech-venv/bin/activate to activate that virtual environment. You should see my virtual environment name in parentheses here. Mozilla maintains a version of DeepSpeech for CUDA, which is called deepspeech-gpu. That will allow you to use CUDA on your graphics card, and DeepSpeech will use the acceleration in your GPU to make STT a lot faster. If you're not using the CUDA version, this would just be deepspeech without the -gpu. We also want deepspeech-server so that we can create a server that runs as a service on our computer, and then we're going to have Mycroft connect to it over Wi-Fi. Press enter and let those run. This may take a moment if you don't already have them installed like I do. Once it's done, we can download a model into this deepspeech directory. I'm going to paste it in because it's quite a long URL. We're getting it from the DeepSpeech GitHub repo, and it's the .pbmm model. We're going to be using 0.9.3. When that's done, we need the scorer from that same location. So instead of that .pbmm, we're going to get the .scorer. The .scorer is a language model (I suppose you'd have different ones for different languages) that works in conjunction with the acoustic model, which is the .pbmm. When it's done, let's download a WAV file to use as a test. Once again, that DeepSpeech repository has some test files for us. They come zipped up in a tarball, so we'll download that and untar it, and you can see it's just some WAV files of people speaking. You'll notice that since we've installed DeepSpeech, we now have this deepspeech command that we can use. We'll feed it a model. That model is the .pbmm. We'll give it the scorer, which is that language model.
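For reference, the setup just described looks roughly like this as shell commands; the folder and virtual environment names are my own choices, and the download URLs are the 0.9.3 release assets from the DeepSpeech GitHub repo:

    # project folder and virtual environment for DeepSpeech
    mkdir -p ~/projects/deepspeech && cd ~/projects/deepspeech
    virtualenv -p python3 deepspeech-venv
    source deepspeech-venv/bin/activate

    # DeepSpeech with CUDA support, plus the server package
    # (use plain "deepspeech" instead of "deepspeech-gpu" if you have no NVIDIA GPU)
    python3 -m pip install deepspeech-gpu deepspeech-server

    # acoustic model (.pbmm) and language model (.scorer) for the 0.9.3 release
    wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
    wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

    # sample WAV files for testing
    wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
    tar -xvzf audio-0.9.3.tar.gz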
Both of those we downloaded into that folder, and we'll give it a test audio file, which should be in this audio directory we just unzipped. And let's give it that 2830 something something something .wav. If all goes well, it should fire up and you should see the speech-to-text result. That WAV file gave us this string as an output: "experience proves this." If you listen to that WAV file, you'll hear somebody saying that. And if you're using CUDA with your graphics card, you should see your graphics card being used during inference, and take a look at the time to make sure that looks pretty good. Here it took less time than the actual audio file itself. That's promising. What I've noticed is that it has a habit of running faster after you fire up the TensorFlow engine, or whatever inference engine you're using, especially on graphics cards, because I think some data is cached and the second time you run it, it's going to be a little faster. I'm going to use the example configuration settings from the MainRo deepspeech-server repository. This is an example configuration file, so we're going to be using it as our starting point. In fact, I'm just going to copy it as it is. If we look in this directory, we can see we have our models, the audio example files, and the virtual environment. I'm going to create a config.json file right in this directory and paste in that example we just saw. We're going to create a systemd service that's going to run this, and one thing I've learned about systemd stuff is that it does not like relative paths. So I'm going to give it absolute paths to the model files. In this case, it's going to be my home directory, then projects, then deepspeech, and then the name of the model. And if you remember, we downloaded 0.9.3, not 0.7.1. I'm going to do the same for the scorer model. I don't like their default port, so I'm going to put it on 5008 so that I can put my TTS server on 5009. It's just easier for me to remember where those are. Everything else, I believe, you can leave the same. So Control-X to exit, yes to save, and save that file. Next we're going to run deepspeech-server. With that config file set up, we can call deepspeech-server and feed it that config file, which should fire up a server for us that we can connect to from, let's say, another computer. I'm going to open up a browser on my laptop and point it to my other laptop, which is at 192.168.1.209 for me. How do I get that address? In fact, let me cancel this server. If you do ifconfig, you should be able to see it. I recommend setting a static IP, whether that's on your laptop or by assigning a preferred IP address to your MAC address on your router, which is my preferred way to do it. But here you can see where I'm getting that .1.209 from on the laptop that I'm SSH'd into. So let me fire up that server again, and I'm going to refresh this page. You can see 404 Not Found, which means I need to interact with the server in a different way. It's not serving up web pages, and that's perfectly fine. In fact, what we're going to be doing is sending commands to this address at /stt. That gives Method Not Allowed, meaning it really doesn't want to communicate with my browser. That's fine. We're going to use other methods to send it audio data, but this means that it is working. In fact, what I can do is open up another SSH window into that same laptop. I'm going to go into that deepspeech folder. You can see it's the same laptop I'm working on. I'm going to send it that WAV file we just tried.
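A sketch of this step as commands and a config file; the sample file name is one of the WAVs from the audio bundle, the home directory path is a placeholder, and the config keys are based on the example in the deepspeech-server repository, so double-check them against the current example there:

    # quick speech-to-text test against one of the sample files
    deepspeech --model deepspeech-0.9.3-models.pbmm \
               --scorer deepspeech-0.9.3-models.scorer \
               --audio audio/2830-3980-0043.wav

    # contents of config.json (absolute paths to the models, port changed to 5008)
    {
      "deepspeech": {
        "model": "/home/<user>/projects/deepspeech/deepspeech-0.9.3-models.pbmm",
        "scorer": "/home/<user>/projects/deepspeech/deepspeech-0.9.3-models.scorer",
        "beam_width": 500
      },
      "server": {
        "http": {
          "host": "0.0.0.0",
          "port": 5008,
          "request_max_size": 1048576
        }
      },
      "log": {
        "level": [
          { "logger": "deepspeech_server", "level": "DEBUG" }
        ]
      }
    }

    # run the server in the foreground
    deepspeech-server --config config.json

    # from another shell or another machine on the network: POST a WAV file to /stt
    curl -X POST --data-binary @audio/2830-3980-0043.wav http://192.168.1.209:5008/stt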
And even though it's sending it to itself, you can see how you would interact with it across a network. And if you open up the window showing the output of the server, sure enough, you can see the STT result: "experience proves this." That's how we're going to send audio files, or rather stream audio data, directly to that server. Let's close down that session, and we can exit that server with Control-C. I'm going to call deactivate to get out of my virtual environment, and you can see that string has gone away from in front of my command prompt. I'm going to create a systemd service known as stt-server.service, STT for speech-to-text, and I'm putting that in /etc/systemd/system. I'm going to create my unit file and just fill out the basics (there's a sketch of it below). I won't go into creating unit files here, but know that this allows you to run software or services as soon as Linux boots up. And as far as I know, this is probably the preferred method of doing things on boot for Ubuntu. There are other methods of running programs or applications when you start Linux, but this seems to work well for my purposes. I'm going to do this on two separate lines. Remember, I have to use absolute paths for everything here, and I'm going into that virtual environment's bin folder to run the deepspeech-server tool. And I'm going to pass it the config file that we created. And this WantedBy=multi-user.target basically says run this pretty much after everything else, when you're waiting for a user to log in. This ensures that things like networking are up before we attempt to run this service. Control-X to save, and save that unit file. We need to call systemctl daemon-reload so that systemd knows where to find that unit file. It just rereads all the unit files, says "oh, there's a new one," and loads it in. Then we're going to enable it so that it will run on boot. We're going to start the service right now, which means the server should be running, and we can call service to see if it's running. It looks like it is, which means we can bring this up. Let's refresh. And sure enough, these pages are working. If you try to go to, say, the wrong port, your browser shouldn't be able to connect to anything. So that's how I know that the server is running. We can get out of that, and we can use journalctl to follow the output of that server in real time. So I'm going to take this window and hide it. But first I'm going to bring up a new SSH session, because we have one server running at the moment, and I'm going to log into that laptop again so that we have a new interactive shell. This old window I'm going to put away for the moment, and we'll come back to it when we want to see how Mycroft is interacting with our server. Now let's do the same thing, but for our text-to-speech server. These are two different servers that are running on one laptop. We'll be using Mozilla TTS, but it's going to be encapsulated in that forked Coqui TTS server. This should look familiar. We're creating a tts folder inside of projects. We're going to create a virtual environment for this project, and we're going to activate that virtual environment, which we're calling tts-venv. Lucky for us, this TTS project comes as a pip package, so we just need to do python3 -m pip install TTS. And this is the Coqui TTS that we saw earlier. When it's done, there is an example server you can use by calling python3 -m TTS.server.server.
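Putting that together, the stt-server.service unit and the text-to-speech setup look roughly like this; the user name and paths are placeholders that need to match your own system, and the unit file is only a minimal sketch:

    # contents of /etc/systemd/system/stt-server.service (minimal sketch)
    [Unit]
    Description=DeepSpeech speech-to-text server
    After=network.target

    [Service]
    User=<user>
    ExecStart=/home/<user>/projects/deepspeech/deepspeech-venv/bin/deepspeech-server \
        --config /home/<user>/projects/deepspeech/config.json
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

    # make systemd reread the unit files, enable the service on boot, and start it now
    sudo systemctl daemon-reload
    sudo systemctl enable stt-server.service
    sudo systemctl start stt-server.service

    # follow the server's output in real time (Ctrl-C to stop following)
    sudo journalctl -u stt-server.service -f

    # text-to-speech side: new folder, new virtual environment, Coqui TTS from pip
    mkdir -p ~/projects/tts && cd ~/projects/tts
    virtualenv -p python3 tts-venv
    source tts-venv/bin/activate
    python3 -m pip install TTS

    # the example server that ships with the package
    python3 -m TTS.server.server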
If you have CUDA enabled, you can use the parameter --use_cuda true. Let's fire that up. It should download a couple of default models for us. When it's done, you should see a URL given to you, assuming everything fired up correctly. This is an IPv6 address; you can use IPv6 if you want, but I'm going to use IPv4. And let's refresh this page and hopefully everything runs. It does not, and that's because I'm using the incorrect port number. It should be 5002, which we'll change in a moment. So if we go to 5002, you should see the Coqui TTS page. And I'm connected to my laptop to get this. I can type in some message, and let's play it.
Speaker 3: Hello, this is a test of my text-to-speech server.
Speaker 1: It's pretty close. It's definitely less robotic than the default text-to-speech on Mycroft, so this will be fun to use. But you're welcome to use either text-to-speech engine. I'm just showing you how to set up one that's fairly easy to instantiate on a separate server somewhere, and we'll connect Mycroft to it. Let's Control-C to exit that server. We'll do tts --list_models. You can see all of the speech models and the vocoder models. The speech model takes text and converts it to essentially a spectrogram, which isn't really speech yet, and then the vocoder takes that spectrogram and creates the actual sound from it. So you need a combination of both. If you only list the TTS model, it should automatically pick the best vocoder to use, based on whatever the Coqui developers, or the Mozilla developers, found. You can also head to the Mozilla TTS wiki's released models page to learn about some of the models used, at least for the Mozilla project, not necessarily the Coqui fork, but you can read about some of the models being used there. If you scroll up, you can see that Tacotron2-DDC is the TTS model that's being used, with English language support, and it was trained on the LJSpeech dataset. The vocoder model is HiFiGAN v2, which is a generative adversarial network, also trained on the same dataset for the English language, and it's usually good to match the dataset and the language. I can't guarantee that it'll work even if you match those, but these two models seem to work very well together. Let's create the tts-server service, and we'll do that by writing a unit file. Just as we did before, we will run the actual command from the virtual environment directory, which is tts-venv/bin, and we'll feed it some parameters on multiple lines (there's a sketch of this unit file just after the upcoming test). For my purposes, I am going to use it with CUDA, but feel free to leave this off if you're not using CUDA. Instead of 5002, I want this running on port 5009, because it lines up with my 5008 from earlier. If you'd like to specify a particular model, you can do that with the --model_name parameter. Here, I'm going to use that same model; it seems to work pretty well. You specify whether it's a TTS model or a vocoder, the language, then the dataset it was trained on, and finally the name of the model, all separated by slashes. Again, we will have this wait for that login prompt so that we know networking is running, and both this and the stt-server should start up on reboot. Feel free to test that, but I'm going to manually start the service now. I'll call systemctl daemon-reload so that systemd rereads the unit files. I'll enable the service we just created, and then I'll start it. Finally, I'll follow it so that we can get updates in real time. The good news is that this shows me we have problems. It's telling me that it failed to execute the command: no such file or directory. It cannot find tts-server, so I must have messed up a name here. So let's get out of this. Yeah, there is a tts-server; let's see if we misspelled something. We'll go back into our unit file, and sure enough, this should not have capital letters. Let's get out of this, and I'll call systemctl restart on that service. Oops, I forgot to call daemon-reload first, because that needs to happen if we change those unit files. Then we will run restart, and then we will use journalctl to see what's going on. There we go. It is indeed running. Let's go back to my browser. If I try port 5002, I shouldn't get anything.
So if I try 5009, I get Coqui TTS again.
Speaker 3: Hello, this is another test.
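For reference, the tts-server.service unit described a moment ago looks roughly like this; tts-server is the console script installed by the TTS package, and the user name, paths, and model string here are the ones used in this walkthrough, so adjust them for your own system:

    # contents of /etc/systemd/system/tts-server.service (minimal sketch)
    [Unit]
    Description=Coqui TTS server
    After=network.target

    [Service]
    User=<user>
    ExecStart=/home/<user>/projects/tts/tts-venv/bin/tts-server \
        --use_cuda true \
        --port 5009 \
        --model_name tts_models/en/ljspeech/tacotron2-DDC
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

    # reload the unit files, enable on boot, start now, and follow the output
    sudo systemctl daemon-reload
    sudo systemctl enable tts-server.service
    sudo systemctl start tts-server.service
    sudo journalctl -u tts-server.service -f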
Speaker 1: It works. All right. And then we are going to log into the Raspberry Pi running Mycroft, which I've called Picroft because that's the name of the distribution. I should get a login prompt. Remember, the default username is pi and the default password is mycroft, and that should bring us into the little CLI GUI that they've created. I don't know how graphical it is, but I'm going to take this TTS server output and put this window aside for now. We'll come back to this and the STT once we're done setting up Mycroft. So from here, Control-C to exit that CLI client, and I'm going to call mycroft-stop just for good measure. Once that's done, let's clear the console and we'll start from here. We want to run mycroft-config edit user. That's going to bring us into this temporary mycroft.json configuration file. By using that wrapper command, it will also check the file to make sure we've appropriately formatted the JSON. I actually find it very helpful. This should look familiar from the last episode, where we configured the wake word to be "hey Jorvan" and we're using the Precise wake word engine. We're going to add a couple of entries down here. The first one is the STT module, which is speech-to-text. This is telling Mycroft where to go for speech-to-text; we're going to override whatever is default by using this configuration file. We're going to set up that DeepSpeech server, which we're defining at the address 192.168.1.209. Just like we saw earlier, remember we need that /stt, because that's the endpoint that expects audio data to be sent to the server, and it then gives us text strings back as a result. And I believe this is a Mycroft thing, but you have to specify the module. DeepSpeech is something that's supported in Mycroft; you just have to tell it where that server is located, and it will no longer use the Google STT. Instead, it will use its built-in DeepSpeech support. Mycroft knows how to work with DeepSpeech; it just needs to know where the server is in order to send that audio data, and that's what we're doing here. As it turns out, Mycroft doesn't have a specific Coqui TTS module that we can use. Instead, Coqui TTS should accept the same settings as Mozilla TTS, since those back ends are essentially the same. So we're just going to use the Mozilla TTS JSON key values here. Note that the STT entry uses uri and this one uses url. I'm not exactly sure why; maybe url would work for both. I have not tested it. If you try it and it works, let me know. And remember, there's no /tts or anything for this; the server is just running at port 5009 on our laptop. And I probably didn't mention it before, but the laptop and your Raspberry Pi must be on the same network. You can run an ad hoc network if you really want to make this somewhat portable, but you still need something of an internet connection for the back end of Mycroft. It needs to reach out for whatever skills and configuration it needs. With our STT and TTS modules set up in Mycroft, things should just point to those servers on our laptop. Let's Control-X and save this. If you'd messed up some of the JSON, you'd get a notification here that there was an error with the JSON syntax. We're going to run mycroft-start all since we stopped those services, and we're going to start up that CLI client. I'm also going to turn on the speaker that I have plugged into the Pi, and we're going to wait while this loads up.
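The entries added to the user configuration at this point look roughly like this, sitting alongside the existing wake word settings; the address is my laptop's, and note that this first attempt uses mozilla_tts as the module name, which gets corrected a little further on:

    "stt": {
      "module": "deepspeech_server",
      "deepspeech_server": {
        "uri": "http://192.168.1.209:5008/stt"
      }
    },
    "tts": {
      "module": "mozilla_tts",
      "mozilla_tts": {
        "url": "http://192.168.1.209:5009"
      }
    }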
I'm going to bring in my windows to show what's going on on my laptop. The top is DeepSpeech, which is my STT server. The bottom is my TTS server. Whenever a request comes in for one of those servers, you should see the output update in those windows. I apologize if the text is somewhat small, but this way you get an idea of what's going on, and you can see that STT is running and TTS is running whenever Mycroft makes requests. Let's give it a shot. Hey, Jorvan. What's the weather like? My speaker takes a moment to power up because it's a...
Speaker 4: Right now, it's overcast clouds and 23 degrees. Today's forecast is for a high of 27 and a low of 19.
Speaker 1: So something isn't quite right. You can see that the STT worked, because you saw the request come in here, "what's the weather like," and go back out. However, nothing came in for TTS, and it sounds like Mycroft is still using the default Mimic, because we don't have the female voice and there are no requests coming in to TTS. So let's figure out what's going on. I'm going to stop this server, and let's go into that config file to see if anything has been misspelled. My first instinct is that I misspelled something here, but a quick look doesn't show that. What's nice is that I can look at some of the log files. If you go to /var/log/mycroft, you can see the log files here. The audio one usually has things like TTS and STT, so I'm going to look at that. If we scroll up, yeah, you can see it spitting out errors here. If you look through the errors, we find something related to TTS: the back end couldn't be loaded, falling back to Mimic. I suppose the better way to do this would be to grep for something like TTS, which is not giving me matches anyway; we found it by looking at the raw output. So it has something to do with that. If I go back to Coqui TTS, it says that configuring it for Mycroft is basically the same as using the Mozilla TTS configuration. Oh, we put mozilla_tts because I got that information from a different site. The module name should just be mozilla. So let's go in and edit this. We're going to update it to mozilla instead of mozilla_tts and save that buffer. We're going to call mycroft-start all restart, which should restart all of these services, and we can start that CLI client once more. Give it a moment here to load all of the skills, or at least the ones that can load. I know that a few of the skills probably don't work 100%. This is where some of that back end needs to be connected to the internet: not only is it trying to update skills and some configuration, you also need things like Wolfram Alpha, weather information, and so forth. So some of those things I'll have to figure out if I want to run this completely offline. But for now, we can test our STT and TTS servers. Hey Jorvan, what's the weather like?
Speaker 3: Overcast clouds and 23 degrees. Today's forecast is for a high of 27 and a low of 19.
Speaker 1: And sure enough, you can see the speech-to-text server gave me the result, "what's the weather like," just as I asked it. And my TTS server gave the response encoded here; well, two responses, which were spoken out through Mycroft. And it's my speaker that takes a moment to kick into gear because it's essentially going to sleep; it's a battery-powered speaker. But with this, you can see it working. We have an STT server running locally and a TTS server running locally. And when I say locally, I mean on a separate computer on my own network, but at least I have control of where my voice and words are going instead of sending them across the internet. I hope this gives you a start running your own speech-to-text and text-to-speech servers. I can imagine there are a lot of fun things to do with these outside of just using them for Mycroft. I plan to continue trying to get Mycroft, or maybe some other smart speaker application, running fully offline. Until then, happy hacking.