Speaker 1: Thanks for your patience tonight, everyone. Setting up custom hardware in a foreign environment, which is what a meeting room or seminar room up at the university is, is always a bit of a challenge. But we're finally there, so thanks for sticking with us, and I hope everyone has had dinner by now. I will hand over to Rod, who will present how he's using Google voice typing in Google Docs to speed up his writing. Like so many of us, me included, he isn't a touch typist, so this can be quite useful, especially when the eyesight is failing: it makes writing longer documents a bit easier when you can just use an audio interface. And maybe the Star Trek and Star Wars voice interfaces are slowly becoming reality, and that's going to be a bit more useful, especially for seniors like Rod, for instance. So without further ado, I will hand over to Rod.
Speaker 2: What I wanted to do first is show you the audio module. The key to the whole thing is that Google needs to have control of the microphone input, otherwise it can't work. And it turns out that the audio module is monitoring the audio input or the HDMI input. So you have to turn that off, and then Google can get control of the module that it needs, the driver.
Speaker 3: Maybe Rod just point out that at the moment you've got a Logitech camera and microphone on top of this monitor. So that's the USB based microphone that will pick up the audio, or is picking it up now.
Speaker 2: So this is the volume control module, and it's now set up so that it can monitor the microphone, the webcam microphone. But in order to get it there, what we had to do was go to the built-in audio, the HDMI output, and actually turn that off. And then what it does is monitor the dummy output. You can't get rid of the dummy output. Why it does that, I don't know, but it does. But you can turn it to silence, like that. So that now frees the HDMI and allows the Google speech to text to at last take control of it. Yeah, yes, take control of the...
Speaker 3: Just click on the screen, I think.
Speaker 2: Yeah, yes, that's right. Yes, so now we should be able to show you...
Speaker 3: Just point out that you've had to log into your Google account and select Google Docs, right?
Speaker 2: Yes, and then this is the Google Doc. It only works in Google Docs, and you have to use the Chrome browser. So then you go to Tools, Voice Typing, and then click to speak.
Speaker 4: So we can't go and see the bottom of the menu.
Speaker 3: Oh, okay. Do everything up here.
Speaker 2: Yes, so that should start translating my speech to text. It's quite a trick to speak clearly and yes, in a way that it can easily recognize.
Speaker 3: I'm just gonna reduce the width of your screen a little bit.
Speaker 5: It seems to guess how long a word goes in there.
Speaker 2: It's picking up yours.
Speaker 5: Yeah, yeah, but you know, you see it needs a gap and then it fills in the word that exactly goes in that gap.
Speaker 6: Not wrapping. That's okay.
Speaker 4: We're down now. Right at the top, scroll up. Up, up.
Speaker 6: Up, up. No, I don't get it.
Speaker 3: No, that's it. Clear it out. I'm gonna start again. We will start again. No, the microphone's turned off. Delete all files. There we go. We turned the microphone on. Okay, so we're away. Yes.
Speaker 2: So, now it's working again. Will, we're now.
Speaker 5: Initiate self-destruct. Yeah, yes, it's.
Speaker 3: No, it's my voice, your voice.
Speaker 2: It's getting in. Yes, so. So, now let's start again and just demonstrate.
Speaker 5: Yeah, it's already got a guess.
Speaker 2: Yeah, it's not scrolling down. The way we, oh yes, it is.
Speaker 5: It's like, it knows it's gonna go there. It's just pretending to do the, I don't know.
Speaker 2: Yes. So, the trick is to speak slowly and clearly, and it should work the way it's supposed to. Yeah. One thing I found to be a problem was that it wouldn't recognize the punctuation marks the way I wanted it to. It turns out that I was looking at a how-to, an American one, and it said, commands: comma, full stop.
Speaker 5: Period.
Speaker 2: Period, yes, comma, period.
Speaker 5: Exclamation point, not exclamation mark.
Speaker 2: So, it turns out Google voice typing is clever enough to get the location here, and it uses full stop instead of period.
Speaker 5: And it puts in a dot for full stop, instead of the period.
Speaker 2: Yeah, yes. It's just, so, to start again. Yes, it's surprising how well it picks up things.
Speaker 3: Especially from the audience, sorry.
Speaker 2: Yes. New line. Let's try again.
Speaker 5: Antidisestablishmentarianism. Can't pick it up, no?
Speaker 6: I'm thinking about it.
Speaker 5: I was thinking about you, Leo.
Speaker 2: Yes, it's a ...
Speaker 5: Can I do antidisestablishmentarianism?
Speaker 2: Yeah, well, it's best to stick with the ordinary things. Ordinary things until you get used to it. New line. Let's carry on. I think we've scrolled off the screen.
Speaker 4: We've scrolled off the screen.
Speaker 5: Gold got the screen. Gold. Gold. There's scrolls of them there. Bills. Go, gold.
Speaker 3: I think you can say stop recording and ... Good. We'll stop, and then you say resume.
Speaker 2: Yes. Well, let's see if I can find the documents that I was using that are the most useful. I'll just have to start another file.
Speaker 6: So ... Hold on. Yes, hold on. Now, you think ...
Speaker 3: Open a file. You mean a new document?
Speaker 2: Well, one that's in the My Drive. That's right. There we are.
Speaker 6: And ...
Speaker 4: Click on hide at the bottom there.
Speaker 5: You want to show that screen that we've got.
Speaker 6: Yeah, I understand, right? I think that's bringing that border down there. Let's go again. What am I doing? Scroll.
Speaker 5: Was it click on the name or not click on the name? Was there a name just below the name? Just below? If you scroll down a little bit, there's a name showing. Yeah, I see it in between. In between.
Speaker 6: Yeah, that's right.
Speaker 5: You click on that to open the document?
Speaker 2: I wanted to leave it.
Speaker 5: What is that one? See if we can still back up. Back up, rather. There's a name underneath each document.
Speaker 6: Yeah.
Speaker 2: Yes, I had it all worked out in time, but yes.
Speaker 6: Oh, no, shh. So we've already got it, haven't we?
Speaker 5: That's the same one. That's the same one, yeah.
Speaker 2: Yeah, that's right. It says. What I want is the.
Speaker 5: Even though it wasn't saved, it's auto-saved, isn't it? Yeah.
Speaker 6: Yes.
Speaker 3: According to this, the stop and stop recording and say clear, delete the last sentence. Delete the last sentence?
Speaker 2: Yes, there's a whole series of Google test.
Speaker 3: Did you ever say smiley emoji?
Speaker 2: Yeah. Which is that one you've got there?
Speaker 3: Oh, I just did a Google search on that.
Speaker 5: Commands. Is that the AT&T document again? So if you say stop, it doesn't, because you know, Telegram speaks to all the stop. Right. Do you want to go on the tab, VTT, top right? That's the document you were looking for.
Speaker 7: Yeah.
Speaker 5: Yeah. Click on that. And then we can turn on the.
Speaker 3: Turn on the microphone. Turn on the microphone. Do I type this in? Click some. Should I clear all?
Speaker 5: Clear all. They're hurting.
Speaker 6: They're hurting.
Speaker 2: Yeah, it's easier said than done. Stop. It is easier said than done.
Speaker 5: Do you have to use Google? Google stop.
Speaker 2: No, it's not the way to do it.
Speaker 5: What's YouTube?
Speaker 2: Yeah.
Speaker 5: It doesn't like me.
Speaker 2: Maybe, thank you.
Speaker 5: I'll conjecture, okay.
Speaker 6: I can get that. Yeah, it's in there. I'm going to use that.
Speaker 5: I wonder if this is running on a custom GPU PC, but I'm not aware of that yet.
Speaker 4: I mean, possibly. It's not necessary. You can run TensorFlow Lite in JavaScript. Client side, potentially.
Speaker 5: You can do the pre-train. Yeah.
Speaker 4: But, depending on how sophisticated it is, it might have to run on some larger things in the cloud then.
Speaker 1: But, I'm quite impressed, to be honest.
Speaker 6: Yeah, it's...
Speaker 1: How well it's actually picking up, and that it takes the context into account in order to determine what will be the most likely words. So, not just plain, oh, here's word after word, but really sort of like the context that we have. Not always right, but we can always fix things up afterwards if you at least get 90, 95% of your text right.
Speaker 4: That's already quite a big saving if you can just talk.
Speaker 3: Well, it's recording me, so you can transcribe a meeting automatically.
Speaker 2: So, this is one of the how-to documents that I found. It was very useful, except for putting me wrong about saying period instead of full stop, because that's what made it difficult for me to get the punctuation commands to work. You've got how to use voice commands: say comma, period when appropriate. Say new paragraph to start a new paragraph and keep speaking. New line to continue speaking. You could see that one was working when I was dictating. And another important one is select. So, say select and then say what you want it to select. And then you can do various things to it. For example, you can say select such and such and then delete, or various other things. Now, you see, you couldn't create a document
Speaker 5: like this with speech to text, could you?
Speaker 4: Yeah, it's a little bit difficult.
Speaker 2: Yes, actually, it's surprising how much you could do once you really got it going. So, this is how to use voice formatting. Select is the key thing. You say select and then you say bold to make it bold. Unselect, then say underline to underline, and so on. Apply heading one, and so on. Oops, I don't know what's happened there. Yes, so, there's more about how to do that. And I found that this was a very useful set of pro tips for using voice typing. Speak at a clear and moderate pace so that the algorithm can pick up your words and commands. Try to write a paragraph at a time and then go back to edit words and phrases later. Use a good microphone. Well, it's amazing how well it's done with this thing. And so on. And this one is particularly useful: make the undo command your friend. Get comfortable with using the select and unselect commands. And finally, you don't have to stop using the mouse and keyboard. You can challenge yourself to get as much as possible done with voice typing, but keep on using the others. Now, yes, and the other one is Google's documentation. So, this is the other one. It's got a pretty complete guide to using all the commands. So, that's select; get really familiar with it, and that's telling you all the things that you can do with it. Similarly, format your document.
Speaker 5: So, you can select the map. So, you can just say the word map and then you can select it.
Speaker 6: So, you can use formatting commands,
Speaker 2: apply heading one to six, normal, subtitle, title, blah, blah, blah. And then you can do just about anything if you use select and... go on doing it. You can even add tables and things. Move around your document: yes, you can go to end of paragraph, start of column, start of line, start of row, go to next, and so on. So, you can really do just about anything once you've really got going with it, I think. Anyway.
Speaker 3: It also seems to have... For example, I demonstrated it to my daughter, and she was doing it on her laptop, and she said, you know, type this out or something like that, and that came out. And then she giggled because it came out exactly as she said it. But it didn't try to put giggle, giggle, giggle. It knew to ignore it. And then she went ha, ha, ha, and it put ha, ha, ha. So, it seems to recognize background noise and ignore it.
Speaker 2: Yes, it's really very good at ignoring background noise. In general, I accidentally left the radio on quite loud but it seemed to be, well, it did stop it from being as accurate as it was, but it managed to carry on in spite of that.
Speaker 3: Great. Cool. Excellent. Just one other thing is, it also works with Chromium. You know, if you...
Speaker 1: Yeah, right. Since Chrome is downstream from Chromium, yeah. But it doesn't in Firefox. That's right.
Speaker 2: No, you have to be... Chromium has to be... It has to be using Chromium browser and it has to be a Google Doc.
Speaker 5: Yeah. How is it that the browser is specific at all? Is it only a... Because it's using...
Speaker 3: Rather than an open standard? If you go with Firefox into Google Docs, right, then when you click on the Tools menu, there's no voice typing option there.
Speaker 5: What if you lie about your user agent?
Speaker 3: I mean, you can probably fool it.
Speaker 2: I doubt that, because it's using Google's artificial intelligence. It's got to be a Google Doc.
Speaker 5: Yeah, which is accessible from any browser. What browser APIs are they using? Because all the smarts is on their server side. Right? It's getting audio from your browser, which any browser can do. But with Google Docs, they can control
Speaker 4: whether you can have access to it or not.
Speaker 5: Yeah, but Google Docs is a server side function, yeah. That was the client side. But that... That was client, yeah. There's a lot of JavaScript running client side. Yeah, which works with Firefox. Right, with Google Docs, you can use Google Docs, with Firefox. Yes, but I'm not sure whether you get the full set, actually.
Speaker 4: I mean, I've had it now multiple times that sites start to look a bit funny sometimes with Firefox.
Speaker 1: With everyone jumping ship from Chrome, because they are starting to move away from supporting ad blockers, well, Firefox might rise again from the ashes. Anyhow, still very much impressed with what you can do. And since Rod was showing off the flashy things, a couple of months ago, Ian was presenting...
Speaker 4: If you have a few microphones, I'll turn mine on.
Speaker 3: Turn off the microphone.
Speaker 6: If you go behind here,
Speaker 3: go back to... Yeah. Right on the left.
Speaker 6: And then turn the microphone off. All right. Can you do it while you've got the feedback?
Speaker 1: I'll make myself presenter, don't worry. I am. Peter will present. Yep, Peter will present. Okay, thank you. Thank you. Yep. So, I presume everyone can hear me again. Ian can probably test whether he can hear me. Yep. So that's the flashy side of things. A couple of months ago, I think Ian was presenting at the Python Coqui STT, which is an open source framework that...
Speaker 3: It's supposed to do both.
Speaker 1: Yeah, there's Coqui STT and TTS, but since we're doing speech to text, I'm just looking at that. So that's a deep learning toolkit that mainly builds on Baidu's DeepSpeech, which Mozilla used to develop as well for a while, and then that whole team got laid off, and I think Coqui STT is more or less the successor of that. At least Mozilla is still pursuing, bless you, their Common Voice audio recordings and whatnot, which is basically collecting data so that you can actually build these models, and they usually need a lot of data. Um... So, Ian, do you want to switch the projector over to the room PC? So it might be a bit of a bigger screen for the audience here. You're good. Yep. Okay. So yeah, Coqui TTS allows you basically to do things like that. And I've actually been using that at work for text-related things, but today I'm going to talk about Coqui STT, in sort of like open source land. So I did create Docker images for Coqui STT and the other ones. There are basically two, one of them for actually training models, which we don't need to do because Coqui also makes an English model available, which you can use. It's actually a checkpoint, so you need to convert it to a TensorFlow Lite model. But that TensorFlow Lite model you can then, in theory, use in your own application. To keep things simpler, I just used a Docker container for that, so I don't have to worry about any dependencies. And so what actually happened: I was looking on the weekend for libraries with which I can do audio recording in Python, which is something I've never actually done before. So over here on the right I have my Coqui STT Docker container, which can transcribe WAV files into text. And then I need something that I can use for recording through a microphone. I came across the sounddevice library in Python.
I know it's a little bit more Python than actually Linux, but we're running the Docker container on Linux, so there you go. And the communication then runs via Redis, an in-memory database or a message broker, depending on how you look at it. So the user interface, written in GTK in Python, I know, records the audio and sends it via Redis; the Docker container listens on a particular channel, grabs the WAV data, transcribes it, in theory, and then sends back the transcript, and the GTK GUI then displays that. So that's the rough overview. I'm not quite 100% sure whether it will all work with audio and whatnot. So I'm just making some fonts here a bit larger.
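Note: The round trip described in this turn (GUI publishes WAV data, the container transcribes it and publishes the text back) could be sketched roughly as follows. This is a minimal illustration, not the speaker's code. The channel names, function names, and the `transcribe` callable are all hypothetical, since the talk does not give them; the only library calls used are redis-py's `Redis()`, `publish()`, `pubsub()`, `subscribe()` and `listen()`.

```python
"""Sketch of a GUI <-> transcriber round trip over Redis pub/sub.

All channel and function names are hypothetical placeholders.
"""
from typing import Callable

AUDIO_CHANNEL = "audio_in"              # hypothetical: GUI -> container
TRANSCRIPT_CHANNEL = "transcript_out"   # hypothetical: container -> GUI


def send_audio(client, wav_bytes: bytes) -> None:
    """GUI side: publish one recorded WAV clip on the audio channel."""
    client.publish(AUDIO_CHANNEL, wav_bytes)


def serve_once(client, message: dict,
               transcribe: Callable[[bytes], str]) -> None:
    """Container side: handle one pub/sub message.

    redis-py delivers dicts with a "type" and a "data" key; only
    "message" entries carry a payload, the rest are subscription notices.
    """
    if message.get("type") != "message":
        return
    text = transcribe(message["data"])
    client.publish(TRANSCRIPT_CHANNEL, text)


def run_worker(transcribe: Callable[[bytes], str]) -> None:
    """Blocking loop for the container; assumes Redis on localhost:6379."""
    import redis  # third-party: pip install redis

    client = redis.Redis()
    pubsub = client.pubsub()
    pubsub.subscribe(AUDIO_CHANNEL)
    for message in pubsub.listen():
        serve_once(client, message, transcribe)
```

Keeping `serve_once` separate from the blocking loop means the message handling can be exercised with a stand-in client, without a running Redis server or a speech model.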
Speaker 5: Is Redis an easy way to do that? Because you could just watch a directory. I wrote a Python library for inotify.
Speaker 1: Yes, you can do these things, but in order not to wear out SSDs when you're doing sort of like thousands and thousands of these, in-memory is sort of like the better alternative, I would say.
Speaker 5: Well, you can use a tmpfs file system, which is in memory anyway.
Speaker 1: Which you can. Right, so up here I'm running my Docker container. It's something that I wrote that just listens: that's the name of the Docker image that I'm using, that's the script that I'm running, and I'm just listening on that particular incoming audio channel, and I'm sending whatever I can transcribe out on another channel. And that's the English model that I've converted and used. It's in verbose mode, so we can actually see if data's coming through and whatnot. Down here, I'm also listening on the transcript channel for whether anything's coming through. You can see that I already tried Hello and Hello World before. To make it easier, as a first test, I'm just sending an audio file in. This is a little tool that I wrote for broadcasting a binary file on the audio channel, one which I've pre-recorded and which I believe just says Hello World. So if I send that through again, you can see it's processing up there, and yep, it really did Hello World there. So those are the basics, that the Docker container is integrated. But we wanted to have a user interface, and I have a little GTK one for that, and that is the interface. I'm just clearing that. I'm not quite sure whether it works with the BBB recording as well, so we'll just see what happens. I've set it up so that it records three seconds of audio and then sends it off. How are you? Thank you. Well, thank you, but not thank you, but close enough, sorry. It's a bit tricky here, making the font larger, but it's getting there. It's not as flash, so I basically have to say for how long it will record things. The font is too small.
Yes, not the font, but maybe it's that I'm talking with a mask on; maybe it's a bit muffled, and it can't really understand that properly. But I thought that was something relatively easy to do. Unfortunately, it took me a few hours longer than I thought it would, with running multiple threads and whatnot and getting them to work nicely together with GTK, so that was a little bit trickier. But at the end of the day, it's the same problem that we had with Rod as well: the audio setup is always a bit finicky, and you have to make sure... well, at least with this one, I had to crank up my audio gain quite a bit so that it would actually record loud enough. Quiet voices are bad.
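Note: The fixed-length recording and the gain problem mentioned in this turn could be sketched like this. It is an illustration under stated assumptions, not the speaker's code: the 16 kHz mono 16-bit format and all function names are assumptions, and only the three-second clip length comes from the talk. The recording step uses the sounddevice calls `rec()` and `wait()`; the WAV packing and level check use only the standard library.

```python
"""Sketch: record a short clip, wrap it as WAV, and check its peak level."""
import array
import io
import math
import wave

SAMPLE_RATE = 16000  # assumption: 16 kHz mono, a common speech-model input
CLIP_SECONDS = 3     # the talk records three-second clips


def record_clip(seconds: float = CLIP_SECONDS,
                rate: int = SAMPLE_RATE) -> bytes:
    """Record mono 16-bit PCM from the default input device."""
    import sounddevice as sd  # third-party: pip install sounddevice numpy

    frames = sd.rec(int(seconds * rate), samplerate=rate,
                    channels=1, dtype="int16")
    sd.wait()  # block until the recording has finished
    return frames.tobytes()


def pcm_to_wav(pcm: bytes, rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw mono 16-bit PCM in a WAV container, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def peak_dbfs(pcm: bytes) -> float:
    """Peak level relative to 16-bit full scale; 0 dBFS is the maximum."""
    samples = array.array("h")  # signed 16-bit, native byte order
    samples.frombytes(pcm)
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / 32768)
```

A very low peak value on a test clip (far below 0 dBFS) would be one hint that the input gain needs raising before blaming the speech model.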
Speaker 5: Did you check the gain in Audacity or something?
Speaker 1: I'm not using Audacity because then I have to reboot my laptop. Really? It screws up my PulseAudio completely. I have no audio afterwards, so I try not to use it anymore.
Speaker 5: Because they're waiting for this, you know? Yeah. I mean, you've got 100%, so. Yeah.
Speaker 1: Yeah, so basically, I'm just going into my sound settings. On the input, I usually operate a lot lower, but then a lot of the other applications do an automatic adjustment of that particular gain. So even though I probably use, in practice, a much higher gain, I don't see that in my settings, because they fiddle with it later on themselves. I also always had to hold up my microphone so that I was speaking properly into it rather than just having it dangling or whatnot. So that was definitely something. But anyway, that was something I was spending a few hours on this afternoon, trying to get something going. But yeah, so I thought this is the, oops. Not quite the flash side of running things. Another interesting tool is Whisper from OpenAI. They've released a multilingual tool which can also translate the audio, so not just transcribe but at the same time translate as well. And they have also made some Python code available for using those models that they trained. Unfortunately, they've only made the use of the trained models available and not how you can actually build your own models, which would have been far more interesting, but I guess that's their secret sauce, so they don't really want to give that away too much. But anyway, I thought someone else might want to play around with it and see what other languages it can handle. So there are various models available. I'm not sure whether you really need to have a, yeah, that's on the GPU, whether you really have to run it on the GPU. Some of these models you can run on CPU only as well. So if you start with the tiny one, which has relatively few parameters, that is, variables that the network has to learn, you might get away with that.
So yeah, as you can see, it can get quite memory hungry, and you can imagine what Google's fancy... whoops, Ian, you popped up a menu. You triggered a menu. Thank you. So Google's back end, since it's most likely running at their back end, will probably have some beefy hardware running there, and I'm not sure how many users it can actually handle concurrently in its data centers. But anyway, that is my little spiel here, showing that you can do basic things at your end as well with what's out there. It's not as flash, but on the other hand, it does not run in the cloud. So this is something that you might consider when you're a bit concerned about privacy and whatnot, about what you're actually sending out there. Because at the end of the day, I'm not sure what the terms and conditions are, whether they retain the right to record whatever you're saying and transcribing for further improving their models, so that there will then be people listening to snippets of audio and correcting the transcripts, and that gets fed into another training run for the model for continuously improving it. But yeah, so that's it, I think, from us here. Thanks for the very patient audience tonight. Thanks for sticking around for so long.
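Note: For anyone wanting to try the Whisper models mentioned in this turn on CPU, a minimal sketch might look like the following. The function names and the audio path are hypothetical. The memory figures are approximate values recalled from the Whisper README and should be treated as assumptions, not measurements. The library calls used are `whisper.load_model()` and `model.transcribe()` from the `openai-whisper` package, which also needs `ffmpeg` installed.

```python
"""Sketch: pick a Whisper model size and transcribe or translate a file."""

# Approximate memory needs in GB per model size, smallest first.
# Assumption: rough README figures, not measured on any machine.
APPROX_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def largest_model_for(memory_gb: float) -> str:
    """Return the largest model whose approximate need fits the budget."""
    fitting = [name for name, need in APPROX_MEMORY_GB.items()
               if need <= memory_gb]
    if not fitting:
        raise ValueError("no Whisper model fits in %.1f GB" % memory_gb)
    return fitting[-1]  # dict preserves insertion order, so last is largest


def transcribe_file(path: str, model_name: str = "tiny",
                    translate: bool = False) -> str:
    """Run Whisper on CPU; translate=True asks for English output."""
    import whisper  # third-party: pip install openai-whisper

    model = whisper.load_model(model_name, device="cpu")
    task = "translate" if translate else "transcribe"
    # fp16=False avoids the half-precision path, which CPUs do not support
    result = model.transcribe(path, task=task, fp16=False)
    return result["text"].strip()
```

A call such as `transcribe_file("clip.wav", largest_model_for(2))` would then run the `small` model on a hypothetical local recording; starting with `tiny` keeps the first download and run short.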