Speaker 1: Hello there, I hope you're having a wonderful day. My name is Kevin Lewis and I'm a developer advocate here at Deepgram. A couple of weeks ago I built a small project and published it on Twitter thinking a couple of my friends would like it, and it turns out a whole bunch of you were super interested in this live transcription and translation badge that I built. So of course that made me want to build more and more features into it, and in this video I'm going to run through what the parts are, how the software works, and how you can build your own and use it day to day. If you have any questions at all, please feel free to reach out to us. We love helping you build cool projects with voice. With that, let's get started.

These are all the parts you'll need for this project. At the heart of it is the Raspberry Pi. This is a fully featured desktop computer, similar to the larger desktops you may be familiar with, except it's tiny and quite inexpensive too. This specific model is the Raspberry Pi Zero 2 W. It's quite important that you get the W model because it has Wi-Fi and Bluetooth on board, and as this project requires an internet connection, it's preferable to have that built into the board. The "Zero" denotes the size of the board, which is really light and small, and because this is a wearable device, it's the one I would recommend. It's a fully featured computer, but it has no onboard storage, so you will also need a microSD card. This microSD card has Raspberry Pi OS version 10, Buster, on it. At the time of recording at least, because of the screen's compatibility, Buster is the version I need, even though there is a newer version of Raspberry Pi OS available. That just pops in there like so.

The next part of this project is the HyperPixel 4. This is a four-inch touchscreen by Pimoroni. It's really nice, again quite light, and big enough that you have a decent touch surface. What you'll notice are the holes on the screen's connector and these pins here on my Raspberry Pi. Not every Raspberry Pi comes with the pins pre-installed; they may come as a separate piece that you have to solder, and I'm not good at soldering, so I bought a Raspberry Pi with pre-soldered header pins. You just marry these up and, while being mindful of the screen and not wanting to put too much pressure on it, push it down like so until it doesn't go any further. That's now a fully featured computer with an operating system and a touchscreen. Just as a note, Pimoroni have an amazing setup guide for this screen, but you will have to install some drivers, and to do that you're going to need to plug the Raspberry Pi into a more traditional screen like a TV or a monitor for that first-time setup. Once you've installed the drivers, it's basically plug and play. To plug it into a more traditional screen you'll use this mini HDMI port here, and ideally you'll get a mini HDMI to HDMI cable or adapter.

This also requires power, so I have my battery pack here and a cable for it. I do also have a smaller one that's sized like a credit card, about the size of this device, but while developing I wanted a bigger battery, so I've been using this one. The final part is a microphone, a little lapel mic, because the Raspberry Pi does not have a microphone on board.
I'll link it in the description, and I have a little USB-C to micro USB adapter on it because this mic is actually a USB-C mic. So you put power into one port and the mic into the other, and then ideally use a wireless keyboard, which you can configure with a wired keyboard if needed, to type into the device. But for most things, the touchscreen is just fine.

So now we're going to talk about the software that runs on this device. The application that runs on the Raspberry Pi's screen is actually a web application running in a full-screen browser, and that gives us the advantage of being able to develop, test, and run it on any device. So here I am on my desktop computer and I'm going to show you how the application works. For the developers out there, I'll talk you through the code and how it's put together, and then when we regroup after the run-through, I'll tell you how to deploy your own version of this project.

So here it is in this emulator, at roughly the dimensions of the screen. We have this badge mode here, which is just static information that you can wear on you. Then we have transcribe mode, with two variants. The first is wearer-only transcription: even if it detects a second voice, it will only show you the first. Group transcription is a little harder to demonstrate with just me here, but if a second voice was detected it would be shown in a different color, a third voice in another color, and so on. Then we have translation mode. You pick a target language from this list and it will transcribe your voice, translate it, and show you the translated version. There we go. And then, yes, we have the badge mode.

Before I talk you through how this is put together, I want to cover the third-party services we're going to use for this project. The first is Deepgram. Deepgram is a speech recognition API that can return fast and accurate transcripts in real time, and I'll show you how that works in a moment. Then we use iTranslate for the translation API. They have a whole set of supported languages, which is where the list on the badge comes from, and reasonable documentation that makes it clear how to make API calls. Ahead of time, you will need a Deepgram API key. Sign up for a Deepgram account, which comes with quite a lot of free credit, and create a new API key that has admin rights and doesn't expire; you'll need that key. You'll also need your project ID here.

I forgot to say this when I was originally recording, but this next big chunk of the video is going to be talking through the code that runs this software. So if you're not a software developer and you're just interested in getting this project up and running on your own device, skip ahead to the timestamp shown on screen.

So let's talk about this Glitch app here. It's a Node.js application on the back end and a Vue.js application on the front end. There were several ways I could have put this video together; we could build it from scratch together, but I actually think the easiest way is to talk you through the finished code and then provide it in the description so you can go and take a further look. Of course, if you have any questions, you can reach out to us on Twitter or via email, and we are more than happy to help out and clarify anything further.

The first thing I want to talk to you about is the back-end application, and this is the entire thing. It's an Express.js web application where we have required and initialized Express, the Deepgram SDK, and the Axios HTTP library. All it exists to do is two things. The first is to generate brand-new Deepgram API keys that have minimal permissions and only work for ten seconds. That's enough time to initially connect to Deepgram, and if someone gets hold of one of these keys, it's useless after ten seconds. The second thing this server-side application does is translate phrases: you can make an HTTP request to it, specify the text and the target language, and it will return the translated text. And that's all it exists to do.
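To give you a feel for the shape of it, here's a minimal sketch of those two routes. I'm assuming the v1 Deepgram Node SDK's keys.create call here, and the route paths, environment variable names, and the iTranslate endpoint are placeholders rather than the project's exact values, so lean on the real code linked in the description rather than copying this verbatim.

```js
// server.js - a minimal sketch, not the project's exact code
const express = require("express");
const axios = require("axios");
const { Deepgram } = require("@deepgram/sdk"); // assuming the v1-style SDK

const app = express();
app.use(express.json());

const deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY);

// 1. Mint a short-lived, minimally-scoped Deepgram key for the browser to use.
app.get("/deepgram-token", async (req, res) => {
  const newKey = await deepgram.keys.create(
    process.env.DEEPGRAM_PROJECT_ID,
    "Temporary badge key", // comment shown in the Deepgram console
    ["usage:write"],       // just enough scope to transcribe
    { timeToLive: 10 }     // the key stops working 10 seconds after creation
  );
  res.json(newKey); // the front end pulls the .key property out of this
});

// 2. Proxy translation requests so the iTranslate key never reaches the browser.
app.post("/translate", async (req, res) => {
  const { text, lang } = req.body;
  // The URL and payload shape below are placeholders - follow iTranslate's docs.
  const { data } = await axios.post(
    process.env.ITRANSLATE_URL,
    { source: { text }, target: { dialect: lang } },
    { headers: { Authorization: `Bearer ${process.env.ITRANSLATE_API_KEY}` } }
  );
  res.json(data);
});

app.listen(3000);
```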
Most of the heavy lifting is done on the front end. As I mentioned, the front end is a Vue.js application, and this is how it works. At a high level, we have this quite short web page, which has two parts: main and aside. To show you how that translates here, this is the aside, and this is the main, this darker piece here. The aside is used just for navigation. What I want you to note is that the URL says mode=transcribe. When you click translate, because it was less code, it just refreshes the page with a new query parameter, translate, and if I hit badge it does the same with badge. That's important because, based on that URL query parameter, we display a different section: either transcribe, translate, or badge.

OK, let's talk through the Vue.js code. Again, it's not terribly long, but we'll take our time and work through it all. When the application starts, we do two things. The first is to set the mode based on that URL query parameter: if it's provided, we set settings.mode to whatever it was, and if it's missing, if you just go to the URL without a mode, we default you to transcribe. The next method that matters is navigateTo. Without going line by line, what it does is replace that query parameter, or add it if it doesn't exist, and then refresh the page with the new value. It's basically a very low-rent router for this project.

The other thing I want to show you in this first section is getting the user's mic. This is supported in most browsers: you can ask the user for access to their microphone, and you've probably seen those prompts before. It then creates a new MediaRecorder, which in turn lets us get raw data from the mic. In some browsers this isn't supported; at the time of recording, Safari doesn't support it without toggling it on, which you can't depend on users to do. So if it isn't supported, we just show the user a message saying it isn't supported in their browser.
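Roughly, those two start-up pieces, reading the mode from the query string and asking for the mic, look something like this sketch; the property and function names are illustrative, so check the project's actual script.js for the real ones.

```js
// Sketch of the start-up logic - names are illustrative, not the project's exact code.
const params = new URLSearchParams(window.location.search);

const settings = {
  // Default to transcribe mode when no ?mode= parameter is present.
  mode: params.get("mode") || "transcribe",
};

// A very low-rent "router": rewrite the query parameter and reload the page.
function navigateTo(mode) {
  params.set("mode", mode);
  window.location.search = params.toString();
}

// Ask the user for their microphone and wrap the stream in a MediaRecorder.
let mediaRecorder;
async function getMicrophone() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream);
  } catch (err) {
    // Covers browsers without MediaRecorder support as well as denied permission.
    alert("Your browser doesn't support microphone capture, or access was denied.");
  }
}
```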
Next, we're going to talk about how transcription works. To start off, I want to show you the transcribe section of the HTML. Straight away, when that section loads, we present the user with two buttons: wearer-only transcription or group transcription. When you click one of those, it runs the beginTranscription method with a different argument, either single or group, and based on that, the results are shown in one div or the other. There is a slight difference between them: for the group view, we need to add an indicator of who the speaker is so we can style it.

So when we click the buttons, the beginTranscription method begins, and it does probably the most heavy lifting in this application. The first thing we do is store the type of transcription so we can change which part of the HTML is rendered. We go and get that brand-new Deepgram token from our server side and extract the key from it. We'll talk about this line in a moment, but then what we do is establish a WebSocket connection with Deepgram using our key. Every quarter of a second we make data available from our mic, and when that happens we send the data to Deepgram. In turn, when data comes back, we hand it off to another method, which we'll talk about in a moment.

This line of code here actually does something quite simple, but I didn't initially realize I needed it. As more and more words are said, let me give you a demonstration here, the amount of text displayed on the page becomes taller than the page itself, and by default it won't automatically scroll: it stays at the very top and you can't see the new words being said. So all that line does is constantly scroll to the bottom of the page, every hundredth of a second, so as the page gets longer we move with it.

So now let's talk about what happens when the transcription comes back from Deepgram. What we do is push the data into the phrases.pending array. You may ask yourself: what is pending, what is this is_final, and what is phrases.final? Why are we talking about final? When Deepgram is in live transcription mode, it sends data back to us quite rapidly with an interpretation of the words that were said, and it keeps doing that for any given phrase until it becomes confident in what it has heard, at which point it marks the phrase as final and moves on to the next one. We want to show data to users as quickly as we can, so while a phrase is still pending, before it's marked as final, we keep updating it; this pending/final split is what lets us navigate the data that comes back. So we have our pending data, and if it's final we push it into the final array, meaning that phrase is no longer going to be altered by Deepgram, and we empty out the pending array. Additionally, if we are translating, this is the point where we also go off to iTranslate and begin translating, but we'll talk about translation in just a moment.

The other things I want to draw your attention to here are a couple of computed properties. The group transcript just adds together the final and the pending arrays. The single transcript only returns words spoken by the initial speaker: if I'm speaking with someone else, Deepgram will pick up their words and return them too, and what this computed property does is say, hey, if this isn't the first speaker, don't bother returning it.
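Pulling that together, the flow can be sketched roughly like this: fetch a short-lived key, open the WebSocket with diarization turned on, feed it mic data every quarter of a second, and sort the results into pending and final. Treat it as an outline of the approach rather than the project's exact code.

```js
// Sketch of beginTranscription and the result handling - an outline, not the exact code.
const phrases = { pending: [], final: [] };

async function beginTranscription(type) {
  settings.transcriptionType = type; // remember single vs group so the right section renders

  // 1. Get a fresh, short-lived key from our own server.
  const { key } = await fetch("/deepgram-token").then((r) => r.json());

  // 2. Open a WebSocket to Deepgram's live endpoint, with diarization on
  //    so that each word carries a speaker number.
  const socket = new WebSocket("wss://api.deepgram.com/v1/listen?diarize=true", [
    "token",
    key,
  ]);

  socket.onopen = () => {
    // 3. Hand mic data to Deepgram every quarter of a second.
    mediaRecorder.addEventListener("dataavailable", (event) => {
      if (socket.readyState === WebSocket.OPEN) socket.send(event.data);
    });
    mediaRecorder.start(250);
  };

  // 4. Handle interim and final results as they arrive.
  socket.onmessage = (message) => {
    const data = JSON.parse(message.data);
    if (!data.channel) return; // ignore non-transcript messages
    const alternative = data.channel.alternatives[0];
    if (!alternative.transcript) return;

    phrases.pending = alternative.words; // keep updating the phrase in progress
    if (data.is_final) {
      phrases.final.push(...alternative.words); // lock this phrase in
      phrases.pending = [];
    }
  };

  // 5. Keep the newest words on screen as the transcript grows.
  setInterval(() => window.scrollTo(0, document.body.scrollHeight), 10);
}

// "Wearer only": keep just the words Deepgram attributes to the first speaker.
function singleTranscript() {
  return [...phrases.final, ...phrases.pending]
    .filter((word) => word.speaker === 0)
    .map((word) => word.word)
    .join(" ");
}
```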
Right, let's talk about translation then. Whenever data comes back from Deepgram, if we're in translate mode, which is denoted by translate in the URL, we also go off and translate the phrase with iTranslate. iTranslate accepts just a string, so we take the array that comes back from Deepgram and turn it into a space-separated string, and we also indicate whether or not this is the final utterance. So here we have translatePhrase. We go off to our server-side translate route handler and specify the language that was chosen when the user clicked a button, more about that in a moment, and when data comes back we push it into the array, which in turn displays it to users.

So let's talk a little bit about language selection. The first thing I want to show you is this languages.js file that I created. iTranslate is wonderful: they have this lovely long list of languages they support, and you specify the language you want by providing a short two- or five-character code. But nowhere do they give you a structured document that pairs the labeled languages, like Bosnian, with their two- or five-character codes, so I've done that for my application. I did it manually, and you're welcome to take it away as well. Here we have the codes we need to provide when we make an API call and the label that a user would understand and want to click on. I also manually added an RTL property to Hebrew. That means right-to-left: Hebrew is read right to left, and we need to factor that in when we display it to users.

Over in our script.js, just as a note in case you missed it, we're loading in the languages here so we can refer to them with Vue directives. In translate mode we render a different button for every language in that array and show the label of that language. When you click a button, we begin the translation with the code, the code that the user doesn't need to see or care about but that we need as developers. You'll also see here that we have direction styling being applied with textDirection; we'll talk about that again in just a moment.

So we begin translation. Beginning translation is setting the code in a place where we can access it later, and then it just begins normal transcription, because that's the first step, right? We need to get the transcription, and once it's returned we then want to go and translate it. So we go through the whole process of beginning transcription and getting transcription results, except this time translatePhrase is called as well, and then we go ahead and translate the phrase like normal.

The final things to note here are these computed properties. We have translatedTranscript, which adds together all of the final phrases and the pending phrase. And then there's this textDirection computed property: if the language in the languages array has an RTL value, we set the return value to RTL; otherwise it's LTR, left to right, which is the default. So if other supported languages that read right to left come up, we can just add that flag in languages.js and it will automatically be applied in the UI.
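Here's a rough sketch of how those translation pieces hang together, with illustrative names and just a few example languages filled in; the real languages.js covers iTranslate's full list.

```js
// languages.js - a hand-built list pairing iTranslate's codes with readable labels.
const languages = [
  { code: "bs", label: "Bosnian" },
  { code: "es", label: "Spanish" },
  { code: "he", label: "Hebrew", rtl: true }, // read right to left
  // ...and so on for the rest of iTranslate's supported languages
];

// Sketch of the translate-mode logic - names are illustrative.
const translation = { code: null, final: [], pending: null };

function beginTranslation(code) {
  translation.code = code;      // remember which language button was clicked
  beginTranscription("single"); // translation starts with a normal transcription
}

// Called with each phrase Deepgram returns while we're in translate mode.
async function translatePhrase(words, isFinal) {
  const text = words.map((word) => word.word).join(" "); // iTranslate wants a plain string
  const response = await fetch("/translate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, lang: translation.code }),
  });
  const translated = await response.json(); // shape depends on what /translate returns
  if (isFinal) {
    translation.final.push(translated);
    translation.pending = null;
  } else {
    translation.pending = translated;
  }
}

// Computed-style helpers: the full translated transcript, and which way to render it.
function translatedTranscript() {
  return translation.pending
    ? [...translation.final, translation.pending]
    : [...translation.final];
}

function textDirection() {
  const language = languages.find((l) => l.code === translation.code);
  return language && language.rtl ? "rtl" : "ltr";
}
```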
The final mode, of course, is badge. This one's really brief: these are just hard-coded values. You can be fancier if you want and source these values from elsewhere, but I've just hard-coded them directly in here.

So that's a summary of the application. Once again, it's a Node.js back end with a Vue.js front end, using Deepgram for transcription and iTranslate for translation. When we're translating, we first send our voice data to Deepgram, then take the returned text and send it off to iTranslate.

Right, now it's time to talk about how you deploy this project for yourself. If you are a developer and you want to host this somewhere other than Glitch, please do refer to the code link in the description; you can take the code and deploy it wherever you wish. But by default we're going to deploy it on Glitch, because it's free and it's easy to take the code from where it is and deploy it for ourselves. So go to the description, find the Glitch remix URL, and click it. Once your project is remixed, go to your .env file and update your Deepgram API key, which, once again, you can get from the Deepgram console by creating a new key with admin permissions set to never expire, plus your Deepgram project ID and your iTranslate API key. Put those values in the .env file.

Next, come to public/index.html and update the values on lines 38 through 40. I may well add to this project over time, but it will still be around that line count. Just to clarify what this looks like so you know what you're updating: it's the badge-mode markup, so you go ahead and update the values inside these angle brackets. Then hit preview, open the preview in a new window, and take note of that URL; it will be different for you once you've remixed. The other thing to bear in mind is that if you sign up for a Glitch account, this will be hosted for free. If you do not sign up for a Glitch account, it will also be hosted for free, but only for a limited period of time, and then it will be destroyed and you won't be able to access it at that URL anymore. So this may be the opportunity to sign up for a Glitch account, again for free, and it will host your application for you. But do take note of this URL.

Once you've done all of those steps, jump back to your Pi and join me in the next section, where I'll show you how to set it up. You can launch the browser on the Raspberry Pi and type in your full Glitch URL; don't forget the HTTPS at the beginning, not just HTTP. You'll see a prompt to allow access to the microphone, which you can confirm with the touchscreen, and then you can go ahead and use the application. You'll notice, of course, there's all this extra browser chrome up the top. You can go in here and hit the full-screen button, and now we've just got our live transcription badge working. Not a problem. Again, those are a few extra steps you may not want to do every time, so what I'm going to encourage you to do is look in the description, where there's a link to a blog post published on our blog on how to automatically launch Chromium in kiosk mode on the Raspberry Pi. I encourage you to run through that setup.
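The blog post walks through it properly, but the usual shape of that setup on Raspberry Pi OS's LXDE desktop is a single extra line in the autostart file, with your own Glitch URL in place of the placeholder below:

```
# Append to /etc/xdg/lxsession/LXDE-pi/autostart
# (or to ~/.config/lxsession/LXDE-pi/autostart for just the current user)
@chromium-browser --kiosk https://your-remixed-project.glitch.me/?mode=transcribe
```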
And that is basically it for this project. So now you have a fully working badge that will transcribe your voice or a group's voices, or translate, and of course there's the badge mode too. Hopefully you found that interesting and you can go away and put this together for yourself. There aren't that many parts: once again, there's the Raspberry Pi with the SD card, the screen, a power bank, and a mic with an adapter so it actually plugs into the Pi. All of that will be in the description. The way I attach this to my body is actually by wearing thick denim dungarees, overalls, and it just pops into the bib, though I'm in the process of having a case designed for this device. If you have any questions, do feel free to reach out, and hopefully you can be walking around being live transcribed for those who need it. See you soon.