Speaker 1: word recognition. So basically, this paper introduced the concept of dynamic time warping. People can speak the same word differently: when you speak the same word, the same features will appear in the sound recording, but not necessarily at the same times as when someone else speaks it. So, as an example of dynamic time warping, you can look at these two curves, these two waves. Let's assume these waves are of someone speaking. It's the same word, and people speak the word differently, but they still have features in common. What we do is try to find a mapping between these features, in a way that finds the similar patterns and matches them across time. Say we take these two as two time series; each time step, for instance, can be matched to one or more time steps from one wave to the other. That's how they recognize a word as what it is, even though the word can be spoken very differently. So, that's how it started. And now, of course, this technology is commoditized. It's found on all our phones and computers, so it's easy for anyone to do. Actually, even if you're not a phone user, you can do voice recognition on your computer in a very simple way. All you have to do is go to any command line with Python installed and type pip install SpeechRecognition. That downloads the package, and then you can run python -m speech_recognition. Oh, oops. Okay, so there's something wrong with this one, so I'll switch over to my other machine. Sorry. So, this is the speech recognition library, and it's very easy to use. I don't know how it failed to install on my other computer, but I'm assuming it's because it's an M1 Mac; the processor might be a bit too new for the program, but it should be fine here on an Intel computer. So, after you've installed SpeechRecognition, all you have to do is... Okay: hello, SMS. Yeah, so it said: hello, SMS. Basically, whatever you say, it'll be able to catch it and tell you what you're saying. So, to create a system of considerable accuracy in this kind of thing, you need a lot of training data. You need many samples of people saying things, and you have to map them to labels: for a particular sound recording, you have to train the algorithm using an alignment that is labeled by a human being, so you have to know which part of the recording corresponds to which word.
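A minimal sketch of the dynamic time warping idea just described, in plain Python. The example sequences here are made up; a real recognizer would compare per-frame acoustic feature vectors rather than scalar samples.

```python
# Minimal dynamic time warping (DTW) sketch.
# x and y are two time series (e.g. per-frame features of two utterances
# of the same word); we compute the cheapest alignment cost when each
# step of one series may be matched to one or more steps of the other.

def dtw_distance(x, y):
    n, m = len(x), len(y)
    INF = float("inf")
    # cost[i][j] = best cost of aligning x[:i] with y[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # x advances alone
                                 cost[i][j - 1],      # y advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Two toy "recordings" of the same shape, spoken at different speeds:
fast = [0, 2, 4, 2, 0]
slow = [0, 1, 2, 3, 4, 3, 2, 1, 0]
print(dtw_distance(fast, slow))              # small: warped shapes match
print(dtw_distance(fast, [4, 4, 0, 0, 4]))   # larger: different pattern
```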
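And a script roughly equivalent to the `python -m speech_recognition` demo above, assuming the SpeechRecognition package is installed (plus PyAudio for microphone access); `recognize_google` sends the captured audio to Google's free web recognizer.

```python
# Record from the default microphone and transcribe it, roughly what
# `python -m speech_recognition` does. Assumes:
#   pip install SpeechRecognition pyaudio
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate noise floor
    print("Say something...")
    audio = recognizer.listen(source)

try:
    print("You said:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
```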
Speaker 2: Yeah, so...
Speaker 1: I'm sorry. So, one of the places where it's easy to get such a data set is Parliament transcripts. Basically, in Parliament you have sound recordings of people speaking, and you also have people who prepare the Hansard, a transcript of whatever has been said. So, for many different languages, it's quite easy to take recordings of whatever is said in Parliament together with the official reports of what was said, create an alignment, and use that for training a speech recognition engine. Of course, you have to train different models for different languages, because the corpus of each language is different; every language has its own vocabulary. And this is a website run by Alpha Cephei, an AI company. They are developing this thing called Vosk, and they provide trained models for use in speech recognition. You can see the different models trained for English, Indian English, Chinese, Russian, French, German, Spanish, Portuguese, Greek, Turkish, Vietnamese. So, for every language that you want to do speech recognition on, you'd have to find your training data and train your own models. The Parliament dataset is one of them, and a lot of other datasets exist for training such models. Now, what Vosk can give you is quite interesting. If you give Vosk a sound recording, it can give you a JSON file that tells you, for every word it detects in the recording, the start and end time of the word, and the model's confidence that the predicted word is actually the word that was said in the recording. And what people do with this data, of course, is things like this. Okay, so this is a paper about recipes. It's a bit far off, but it's an application of speech recognition. There are text recipes available online, and there are also video recipes available online. What these people want to do is match steps in the text recipes to steps in the video recipes. Say a text recipe says to crack an egg, scramble it in the same pan, and mix it through the vegetables. Someone might not know how to do that; perhaps they're not very good in the kitchen, and they need some visual cues as to how they might crack an egg. So, what this program tries to do is match parts of text recipes onto parts of video recipes, such that if someone doesn't know how to do a specific action in a text recipe, they can just click on a link and be brought to the correct portion of a video recipe, which shows them exactly how to achieve that step of the recipe they've been using to cook. How they do this is by creating transcriptions of a lot of YouTube videos. They have identified over 4,000 dishes, and they have collected both text and video recipes for each of these dishes.
And for each video that they've collected, they transcribe the video using speech-to-text software, and they break the transcript up into different steps. So, they have reduced a multi-modal problem into one where everything is text, and now they can do the matching with ease. That's one of the things that speech recognition can do. And I believe in a previous talk I mentioned a system for matching slides to parts of lecture videos; that's something similar that would not have been possible without speech recognition. So, yeah. That was a very short talk, but any questions?
Speaker 3: Oh, your talk is very, very short. But how do we do it? Can we develop an app to do this?
Speaker 1: Yeah. So, basically, to develop an app to do this, we can use one of the speech recognition libraries that are already available, or use one of them to train your own speech recognition model, or you could even come up with your own speech recognition AI model and train it, coding it from scratch. So, I guess there are three levels: you can use someone else's trained models, you can use someone's algorithm to train your own model, or you can use your own algorithm and your own data and train from scratch. It's possible, but it would be very time-consuming if you're not a professional. Oh, by the way, this is a shameless plug for Telegram, but let me share my screen again. I want to invite you to two Telegram groups. The first one is Speech Recognition Help. This group is for beginners: if you want to train your own speech recognition model, you can go and ask questions, like how do I do this, how do I do that, how do I find data sets to train on, and they'll be happy to help. I think this Telegram group is maintained by people from CMU Sphinx, who have moved on to Alpha Cephei to create the Vosk models I was talking about. The second Telegram group is more for the experts who already know everything, but who want to share their findings, their data sets and everything. They'll be sharing, you know, newly annotated data sets for Indian English, Finnish, and whatever. They're always training more models for more languages, so that this can benefit more people, I guess. Yeah. So.
Speaker 4: One question. On WhatsApp and Telegram, you have an option where you can just click on it, and when you talk, the words start coming out. Yes? So, yeah, how do we make it more accurate?
Speaker 1: Okay. So, that one. The accuracy of that speech recognition model would depend on the size of the model. If you're using an Android or Apple phone, you'd be using either the model trained by Google or the one trained by Apple. The problem with these models is that, in order to get better accuracy, you want to use a very large one. Let me show you the size of these models. Sorry, wrong share. Okay. So, you can see that the small model for English can be 40 megabytes, with an error rate of around 9%. But a slightly more accurate model, with an error rate of 7%, is one gigabyte. That's huge. And 129 megabytes would give you around an 8% error rate. The thing is, if you want to put this on a phone, the phone's computational power may not be sufficient to run the one-gigabyte model for speech recognition. In fact, a one-gigabyte model might use too much space on your phone. So, perhaps it's possible to improve the accuracy on the phone, but I think Google and Apple have chosen to use a less accurate model, one which takes up less space.
Speaker 4: Yeah. So, in your opinion, which one is more accurate, the Google or the Apple model?
Speaker 1: Unfortunately, I don't actually use the feature very much. But I do observe that my visually impaired friends, a few of whom I help in school, you know, to write exams and such, prefer iPhones.
Speaker 4: So, between WhatsApp and Telegram, they use the same model from-
Speaker 1: I think, okay, so, it depends. If it's the speech recognition that is included in your keyboard, I guess that's provided by the operating system provider. But I'm not sure if Telegram and WhatsApp have that feature built in, or if they're using the operating system's one in the keyboard. So, I'm not sure.
Speaker 4: Okay. Yeah. Actually, the feature being used on the phone is quite accurate. Even the lower-capacity model you mentioned is really very powerful. In fact, it could be very useful for old people: you can activate the phone with your voice to call for help.
Speaker 1: Yeah. So, actually, I think we have some smart speakers that do that, right? Like the Amazon smart speaker: you can say, "Hi, Alexa," and Alexa will say, "What's up?" And then you can say, "Help, help."
Speaker 4: No, you can actually activate the phone to call the police or whatever, you know. For example, you just shout, "Hey Siri, call 999." Sorry, it's calling.
Speaker 1: That's the danger of having Siri on your-
Speaker 4: Yeah. You can hear the phone being activated. Yeah. It's pretty accurate, you know. Yeah. You just shout into the phone, even if it's off, and it goes to that particular person that you want to call.
Speaker 1: Maybe the drawback of that is the phone is constantly using up its battery because it's sampling recordings, right? And then it waits until it hears, hey Siri, and then it will act.
Speaker 4: But then, yeah, the consumption of battery is not very high, you know. That's why it's the same with TraceTogether. People thought that, you know, it's going to consume a lot of battery power. Not so. Yeah. So, they should have started by saying, if you have the app, they will pay you $10. I think a high percentage of people would then be having TraceTogether, you know.
Speaker 1: Instead, they went from the big token to a small token, and they decided to adopt the SGSecure method, right? They forced every single national serviceman to install it on their phone so that you can be secure. So, that's unfortunate. Yeah. But yes, the small model is indeed good; the word error rate is not that much higher for the smaller model.
Speaker 4: It's very, very useful. Yeah. It's not 100% accurate, but what you can do is just say out what you want to say, and after that go back, look through, and see whether you can correct all those errors.
Speaker 1: Yeah. So, I guess that in itself is quite useful.
Speaker 2: So,
Speaker 1: let me see if I can play around with Vosk. So, let's try to play around a bit with Vosk, I guess, if we can. So, this is basically it: Vosk provides a Python API for you to do speech recognition on audio files. All you have to do is download the Vosk API and the models into a directory on your computer, and then you can just run a Python script and do speech recognition. Here you can see that if we try to transcribe, let's say, an audio file... okay, so now you can see: "okay, we are going to talk about something, something, something." It's already started generating the transcript. So, it's actually quite easy to use. Let me see if I can show the installation. Yeah, you just have to do a pip install, and that's good enough, actually. Yeah.
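A minimal transcription script along the lines of Vosk's standard Python example, assuming `pip install vosk`, a model unpacked into `./model` (for example one of the models from the Alpha Cephei site), and a mono 16-bit PCM WAV file; the `recording.wav` name is just a placeholder. With `SetWords(True)`, the JSON results carry the per-word start/end times and confidences mentioned earlier.

```python
# Transcribe a WAV file with Vosk. Assumptions: `pip install vosk`,
# a model directory at ./model, and recording.wav as mono 16-bit PCM.
import json
import wave
from vosk import Model, KaldiRecognizer

wf = wave.open("recording.wav", "rb")  # placeholder input file
model = Model("model")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # request per-word start/end times and confidence

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # Each finalized chunk is JSON; its "result" list holds one
        # entry per word with "start", "end" and "conf" fields.
        print(json.loads(rec.Result()).get("text", ""))

print(json.loads(rec.FinalResult()).get("text", ""))
```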
Speaker 4: So, is this Python for the PC?
Speaker 1: Yeah, this Python works cross-platform. You can do it on Linux, on OS X, on Windows; it's all possible. And the purpose of this is just to change your spoken words into text. Yes.
Speaker 4: I see. Can it be linked to electrical gadgets, maybe, to, you know, do some action? You see, for example, this Siri that I have here on the iPhone. It will... say: Hey Siri, what is two times six?
Speaker 2: Yeah. Two times six is 12.
Speaker 4: Okay. Yeah. So, you can talk to the phone, you know, and then it's almost like talking to another person, and they reply with an intelligent answer. Can Python do that?
Speaker 1: So, I guess you can. Like, you can have a single-board computer, say a Raspberry Pi, that has Python, and then you can use one of the smaller models and program it to control a robot, for instance. Then you can tell the robot "move left" and it will move left, or "move right" and it will move right. So, yeah, that's possible, and it shouldn't be too difficult to do such a thing. You just need to have it set to listen periodically for commands from the user. So, actually, I've left the speech recognition running, and it's been picking up parts of the conversation that we've been having.
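A sketch of that command-listening idea, modeled on Vosk's standard microphone example. The `move_left`/`move_right` handlers are hypothetical stand-ins for real robot control code, and it assumes `pip install vosk sounddevice` plus a small model in `./model`.

```python
# Keep transcribing short chunks of microphone audio and react to
# keywords. Assumptions: pip install vosk sounddevice, model in ./model.
import json
import queue
import sounddevice as sd
from vosk import Model, KaldiRecognizer

q = queue.Queue()

def callback(indata, frames, time, status):
    q.put(bytes(indata))  # hand raw audio frames to the main loop

def move_left():  print("robot: moving left")   # hypothetical action
def move_right(): print("robot: moving right")  # hypothetical action

COMMANDS = {"left": move_left, "right": move_right}

model = Model("model")
rec = KaldiRecognizer(model, 16000)
with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                       channels=1, callback=callback):
    while True:
        if rec.AcceptWaveform(q.get()):
            text = json.loads(rec.Result()).get("text", "")
            for word, action in COMMANDS.items():
                if word in text.split():
                    action()  # trigger the matching command
```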
Speaker 2: It's interesting. So, let's see.
Speaker 4: Any idea who the researchers doing voice recognition currently are?
Speaker 1: Yeah, so if you want to know who they are, just join the second group, because the second group has all the researchers. Let's see, so they'll be talking about the papers. Yeah, speech recognition error correction, and you have so many people working on it. And, let's see, on GitHub you have this Indian English speech repository with the experiments and the scripts. They actually have 10 terabytes of recordings, 15,000 hours of recordings of Indian English, because Indian English sounds different. So, yeah, they're sharing their problems, their algorithms and stuff, and their papers.
Speaker 4: And you mentioned that this is only in the Telegram app?
Speaker 1: Yeah, this is in Telegram. So, the first one is to seek help. As for the second one, they might block you if you try to ask beginner questions, but they're still friendly, I guess. Yeah, interesting.
Speaker 4: So, what actually happens? You just click on this link, and then you will be accepted?
Speaker 1: So, you click on this link, and it'll ask you to open it in Telegram. Open it in Telegram, and they'll ask you whether you want to join the group, and you can just join. Okay. Yeah. So, if you want to read the latest papers about this, this group has a load of papers and a load of code for speech recognition. Yeah.
Speaker 2: And in fact, yeah, so, it's this one about...
Speaker 1: keyword spotting with AI on microcontrollers.
Speaker 2: Yeah. So,
Speaker 1: I'm sorry for the very short length of the presentation this week.
Speaker 2: So, are there any more questions?
Speaker 5: Do you know anybody working on Singlish?
Speaker 1: Singlish. Actually, if I'm not wrong, people have actually paid Singaporeans to record speech for them, I think. I don't know specific people who are working on recognition of Singlish, but I have heard of that.
Speaker 2: Oh, yes. So, let's see.
Speaker 1: So, yeah, A*STAR is doing it. This is kind of funny, but there are Singlish-speaking robots and other ways to make AI work for Singapore and beyond. It's basically saying that, you know, the models are not so accurate for Singlish, and you might have to mimic a Western accent to be recognized. But, well, A*STAR is training Singlish AI. And SCDF is using an advanced speech recognition system to transcribe and log every distress call in Singlish, or even in dialect. That's kind of cool. And they're also using it in the state courts. So, yeah, I guess people at A*STAR are doing it. What about dialects? Dialects?
Speaker 2: Yeah, I think they do.
Speaker 1: Yeah, so, there's the National Speech Corpus. So, I think it's
Speaker 2: Hmm.
Speaker 1: This is actually interesting. Oh, but this is English. So, this is an example of a prepared data set that can be used for training your own models; it's Singlish. And this one would be on Hokkien, Singapore Hokkien. The challenge is that for Singapore Hokkien, there's insufficient training data for them to build such a model. This is from 2016, but I've heard that people are actually being paid to speak Singapore dialects and be recorded in conversation. Basically, they get conversation prompts, they talk about the topic, and they're recorded. And that is essentially going to be used as training data for speech recognition models for the dialect. Yeah.
Speaker 5: Thank you for that, about dialects. I mean, do they have the power to recognize them? I think there are the 55 dialects...
Speaker 1: You mean languages in China itself? Ah, I see. That's interesting. Yeah, I've heard that not everyone in China speaks Mandarin, so they need to reach out to some more remote areas, and there they speak dialect. Yeah.
Speaker 5: Or other languages, because, I mean, in China, not everyone speaks Mandarin. In some remote places they still have people speaking dialects or their own languages.
Speaker 1: Yes, yeah. Wow. I mean, I think even Shanghai has its own dialect, right, and that's not even remote. I've heard that the Shanghai dialect is quite difficult for Mandarin speakers to understand. So, I think companies are actually paying through the nose to collect training data to power these applications. And this can be quite valuable; once you collect your corpus, it's not going to go away. In a lot of instances in AI, the problem is not the algorithms; the problem is a lack of training data, and so training data can go a long way toward improving the performance of a lot of things.
Speaker 5: Yeah. That's, I mean, that's the reason why those earlier speech recognition systems actually needed to be trained. Now they're moving to untrained ones, no need to train; they will probably recognize it anyway.
Speaker 1: Supervised learning. Do you remember the robot at the Discovery Centre or the Science Centre, I forget which, where you could talk to the robot? It wasn't very accurate, but I don't know; that was a long time ago.
Speaker 5: But it really depends on, I mean, what kind of library they have, I mean, the voice library.
Speaker 2: Yeah. What is it, this one, the talking robot? Is it this one? I'm not sure.
Speaker 1: I remember visiting such a robot on a school field trip, a long time ago. And you could talk to the robot. It was so cool at the time.
Speaker 6: Are you actually using this speech recognition with chatbot technologies?
Speaker 1: Chatbots. Yeah, I've noticed a lot of organizations using chatbots so that they can cut down on their contact centre staff, I guess. Quite interesting. Yeah. So, basically, if you have time and you know a language that is not already here, you can try to train your own model. And you can do it with Vosk.
Speaker 4: I came in late, so just a few questions about this speech recognition. First of all, you've got to digitize Hokkien, Cantonese and Shanghainese, right? Yeah, you have to. How do you digitize it? I can't imagine how you digitize a language.
Speaker 1: Okay, so I guess you have to record people speaking the language, and you have to write transcripts of what they actually said. That's the ground truth, right? You record something and then you say, this is what was said. And then you have to align the parts of the sound recording with your transcript, so that when you train, the algorithm knows, okay, this part of the audio corresponds to this word in the language. And you have to give it many examples of what each word sounds like. And then you apply the dynamic time warping algorithm; that's from 1978, by the way, but still used today, though it's one of the more basic ones. You use it to note that every word in the language has distinct features, but the speed at which different speakers say the word differs. So, how do you align them? You warp the time so that you can match the similar features from one sample to the other, and then you know that, okay, you've found this pattern in the time series. See the sketch below.
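Extending the earlier DTW sketch: the same cost table can be backtracked to recover the alignment path itself, i.e. which time step of one recording matches which of the other. Again this is illustrative only, over scalar toy sequences rather than real acoustic features.

```python
# DTW with backtracking: return the matched index pairs, not just the
# total cost, so one step of a fast utterance can map to several steps
# of a slow one.

def dtw_path(x, y):
    n, m = len(x), len(y)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: cost[p[0]][p[1]])
    return list(reversed(path))

fast = [0, 2, 4, 2, 0]
slow = [0, 1, 2, 3, 4, 3, 2, 1, 0]
print(dtw_path(fast, slow))
# The peak of `fast` maps to the peak of `slow`: one step matched to many.
```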
Speaker 4: This is it.
Speaker 1: So, Chinese, like, one word is normally one syllable, right?
Speaker 4: So you have a Chinese transcript at the same time?
Speaker 1: Yeah, so you record speech in Chinese, and then you create a Chinese transcript of whatever you recorded, and you align the transcript with the sound recording, and then use that to train.
Speaker 4: How are you going to use it? Sorry, how are you going to apply it? What's the application?
Speaker 1: So, okay, let's say you train a Hokkien model: a model that takes a spoken word in Hokkien as input and gives Chinese character text as output. Then you can incorporate it in a mobile application, for instance, so that when I say "I need help" in Hokkien, the phone will recognize the words and then try to call 999 for the police, for instance.
Speaker 4: Maybe I can demonstrate to Dr. Ho using my phone. Yeah, yeah. You can see that the phone is off, you know, and then what I normally do is say: Hey Siri, call 938-288-97 on speaker.
Speaker 2: Calling 938-288-97 on speaker.
Speaker 4: Can you hear? Yes, I can hear. So, you can see it's so powerful, you know. If an elderly person falls down, they just need to shout at the phone. This is something I normally like to share with elderly people, because it's a great help to them: the phone is off, you know, but the moment you shout "Hey Siri"... Siri is for iPhone; for those on Android, you use "Hey Google," and you just use the same words. They recognize numbers very easily, but with names it's not so easy. And, let's say, in cold countries, if you want to type out the words, your fingers are already frozen, so a lot of the time people talk to the handphone, and you'd be surprised at the accuracy the handphone is able to give you. Other features you can use with voice recognition are switching on the fan or switching on the aircon; you just shout, you know. And you can link that with Google Home: you can say, "Hey Google, fan on" or "fan off," and it just goes on or off. So I think in the next phase, if they can make it much more accurate, we won't need to go and switch on the fan or the phone. We just shout out, and you find the fan functioning, the lights functioning, the aircon functioning. That's beautiful. I'm trying to learn that, you know; it is very useful. You see, you can even ask Siri... just again, another demo: Hey Siri, a joke.
Speaker 2: Did you hear they've opened a restaurant on the moon. Great food, no atmosphere.
Speaker 4: Now the trouble is, the moment you say "Hey Siri," all the phones in your household are activated. But if you are in the park and you ask Siri to tell you about the news and so on, you don't have to key anything in; you just talk to the phone, and it will tell you jokes, it will give you the news, and with simple words it will understand you perfectly, you know. But the moment you say "Hey Siri," everyone with an iPhone will have theirs activated. So that's a danger. They have made it very accurate, enough to activate a lot of phones; I mean, they more or less recognize everybody's voice. As Siyuan just mentioned, different people talk in different ways, but somehow these phone companies have been able to roughly understand what you're saying. Previously it was not accurate; only now have I noticed the accuracy has improved tremendously. Just calling a phone number: instead of typing, wouldn't it be so easy, right, to just say "call Dr. Ho"? So, these are some of the things where I think voice recognition is going to help us tremendously. And one thing I noticed in the hospital, which is quite a pity, you know: they don't have dialects. Some of these elderly people can only speak in dialects. Speech recognition should be able to help the nurses speak to the elderly people: the nurse can talk in English or in Mandarin, and then it changes it to Hokkien or Cantonese or whatever. Cantonese, Google Translate has it, because Hong Kong uses it a great deal. But Hainanese, Hokkien: not available. These are some of the things Singapore should do, you know. Since we have already stopped people from speaking in dialects, but the elderly are still using them, it should be done and added to Google Translate, because Google Translate uses speech recognition a great deal. And the latest thing is they're scanning words and changing them to whatever language you want on the photograph itself.
Speaker 3: Hey Siri. Don't activate my phone. Transfer $1,000 to BC Lim. Your phone didn't recognize my voice.
Speaker 4: Anyway, those with Android, you can try with Hey Google.
Speaker 3: Hey Google, everybody transfer $1,000 to me.
Speaker 1: I think not too long ago, there was this Burger King advertisement in the US that tried to activate everyone's Google or Siri, something like that.
Speaker 4: Sometimes, just to demonstrate Google instead of Siri: Hey Siri, open the Google app. Okay, so this is an error. Instead of opening the Google app, it opened Google Maps. So sometimes, since I'm on Siri, I don't bother. If you want to get into the Google app... let's see. I can't quite remember, but I've noticed on iPhone you have the advantage of using the Google app and also Siri. I can't remember; it suddenly slipped my mind because I've stopped using Google.
Speaker 1: I think for Google, it's like okay Google.
Speaker 4: Yeah. Hey Google, what is the time? Okay, I'd have to open Google, I can't remember. But you can hear me say: Hey Siri, what is the time now?
Speaker 2: It's 8.33pm.
Speaker 4: Okay, so you can say, Hey Siri, what is the news.
Speaker 2: Sorry, news isn't available on the iTunes Store in your country.
Speaker 4: Okay, so news is not available on Siri, but it's available on Google. If you have Google Home, it's able to recite the whole news for the day. So, these are some of the amazing things that I'm seeing in voice recognition.