Unlocking Speech Recognition: Real-World Applications
Explore Speechmatics' journey through speech-to-text technology, addressing challenges, applications, and future innovations for AI-driven speech solutions.
Webinar: Automatic speech recognition for real-world applications
Added on 01/29/2025

Speaker 1: Okay. Welcome, everybody. We've got a good set of joiners, so we'll get going. Just before

Speaker 2: we start, you can ask any questions you have through the UI all the way through this presentation, and we'll try and answer as many of those at the end as we can. So, a quick introduction. I'm Ian Firth, and I'm responsible for looking after the products at Speechmatics. And in case you're not sure who Speechmatics are, we are a machine learning company that specializes in speech recognition. So, in its simplest form, we take voice and convert it to text, and we do that with some fantastic accuracy. If you understand a bit about machine learning and artificial intelligence, you might think that sounds like a fairly simple thing to do. You just build a training set, do some deep learning, and away you go. But as you'll see, even declaring that we have fantastic accuracy is a statement that is full of questions. So, let's talk about that. So much of AI and machine learning is academic. And although that sounds great, when it comes to applying it, well, quite often it just falls apart. So, today we're going to go on a journey through the world of speech recognition and talk through some of the challenges of applying this amazing technology in the real world, and explain a little bit about how Speechmatics can help with solving real needs and add some real value to speech applications. So, speech-to-text is used in many applications, all of which have different characteristics and challenges. Some of these challenges are for the speech-to-text system itself, and some are related to the platform that consumes this service and the environment within which it operates. So, to start with, I just wanted to briefly talk through some of the generic issues that speech recognition has, all of which need some thought if you're going to get a system working that meets your needs. These loosely fit into three categories. There are some around capturing the audio, some issues around the environment in which it runs, and some things you need to think about to do with the performance and attributes of the way it operates. So, to start with the capture section, there are many, many audio encoding formats, all with different characteristics. Some of these codecs are lossy or have varying bit rates, and they tend to be less good for speech. They were often primarily designed to support music, and their mechanisms for compression can cause all sorts of artifacts in speech that can make it hard to recognize. This can impact the sharpness and edges of the sounds, or impact the higher-frequency aspects that add subtleties to the shape of the sounds, which makes them difficult to recognize. Then there's the quality of the recording equipment and the microphones that you use. These can also have a big impact. Recording a conversation might need multiple microphones for each person, and the pickup patterns and directional attributes of the microphones might be important if you want to capture all speakers equally. Then the number of speakers themselves can be even more of an issue. We don't all speak in the same way. Generally, as soon as you put two people talking together, they talk over each other, and we all talk in what I would refer to as an unprepared speech mode, where there's a lot of thinking going on as you're speaking, so there tend to be a lot of breaks in the conversation, words get repeated, and there are lots of ums and uhs in the sentences. All of those things have an impact on the transcription.
When it comes to the environmental aspects, speech recognition needs to operate where the people are. You can't get away from that, so there can be noise around them, and the way in which they speak can be impactful too. There are often many different accents, and terminology is used in different ways in different conversations. It's quite surprising, or it certainly surprised me over the last year, how much made-up language gets used in company names, products, place names, and other usage. When it comes to the performance aspects, you really need to think about the performance characteristics you need for your applications. Does the speech need to be transcribed in real time? Can it be processed after the whole recording is complete? Regardless of whether you need real time or not, you need to think about the processing time or latency, and the trade-off that all has against the accuracy you need to obtain. You'll see on this slide that I refer to WER, W-E-R, which stands for Word Error Rate, and we're going to come back to that in a little bit because there are some complexities around that in itself. So, there are some things that can be done for all of these attributes. If you want a system that works, you have to acknowledge these issues and engineer around them. They can't be ignored. Here at Speechmatics, we take these things quite seriously, and we have built several things to help with those aspects. We've built channel separation to cope with meetings and overtalk. This allows multiple-channel audio to be transcribed directly, which removes the need for dealing with overtalk in the same way. Alongside this, we support hundreds of audio codec formats so that we can do the best we possibly can with the audio formats you have, in case you can't control that within your environment. Our models are accent and noise tolerant, enabling audio to be used in real-world scenarios, and that's very important. And we have the ability to inject new words, terms, and phrases to improve recognition, to support those made-up words, places, and company names that I talked about before. And in the last area, performance, we have very low-latency real-time recognition, as well as pre-recorded modes with a very fast turnaround time. And we work really hard to optimize the language models that we have to support the use cases that are needed for those real-world speech applications. So, at this point, it's clear that you can't just assume that a recording is just a recording. There's a lot of hype about speech recognition being done, reaching human parity, and academically that might be true if you're just measuring that word error rate that I referred to on a known test set that can be trained against and has clear speakers, no noise, one speaker at a time, et cetera. You know, pretty much anyone can do that. But can these systems be used in real-world conditions with all the attributes that I've discussed? They all kind of go quiet when you start talking to people about that. The world is a global marketplace, and people travel more than ever before, and I don't believe that there's really such a thing as a British accent or an American accent. In reality, people travel all over. So, how do you know what accent they're going to have? People grow up in America and then move to the UK. After five years, what accent do they have? And even if you think you can solve that, if there are multiple people in the conversation and they all come from different places, then what do you do?
Many of the speech-to-text solutions would make that your problem. You have to choose. Before the speech-to-text session starts, you choose what accent they have to get the best accuracy. So, how does the call centre know before the call whether the caller is going to be from New Zealand, for example? Even after the call, can you tell a New Zealand accent from an Australian one, or a Canadian from an American, accurately enough to make a decision on that? So, why choose? Speechmatics understands this issue and takes it as our problem. So, we need to solve that. For our models, you just choose English and we do the rest. This is the sort of real-world aspect that the academic discussions just ignore. So, now you have an idea of the challenges. The next thing that is always asked is, is speech-to-text accurate? Well, I'd like to answer that with another question, which is, what is accurate? There is a common technique used in speech recognition to measure accuracy called word error rate, or WER for short. I mentioned it earlier and said I'd come back to it. This is a measure of how many words were wrong, added, or removed when compared to a perfect reference text. And basically, it's math. And again, academically, that's great. You can compare different systems. But is math the right answer here? Does it tell the full story? Well, I can

Speaker 1: tell you the answer to that is no, for several reasons. Let's see, let's take a simple statement

Speaker 2: like this that says, "that is not true." If the system transcribes "that is not blue" instead, this is still a believable recognition. It's only slightly different. And there's one error, one word is wrong. So that's a 25% error rate. What if the recognition transcribes "that is true"? That's one word removed. That's still a 25% error rate. So mathematically, one sentence is as good as the other, apparently. But in that last sentence, the whole meaning is inverted. So that's really not the same level of error, in my opinion. And in fact, a reader might detect the error between true and blue themselves, but it's much harder to read in context and understand that what was meant was that it's not true when actually it says it is. That's a much harder error to detect. And I think this is much more of an issue.
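
(A minimal sketch to make the arithmetic above concrete: word error rate computed as word-level edit distance divided by the number of reference words. This is an illustrative implementation, not any vendor's scoring tool; the two hypotheses are the ones discussed above.)

```python
# Minimal word error rate (WER) sketch: count substitutions, insertions and
# deletions via word-level edit distance, then divide by the number of
# reference words. Illustrative only, not any vendor's scoring tool.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = "that is not true"
print(wer(reference, "that is not blue"))  # 0.25 -- colour wrong, meaning survives
print(wer(reference, "that is true"))      # 0.25 -- "not" dropped, meaning inverted
```

Both hypotheses score 25%, which is exactly the point made above: the metric cannot see that only one of them inverts the meaning.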

Speaker 1: So then there are other factors to consider for the real world. Is it just the words you care

Speaker 2: about? Here's a graph from a survey that I did over the summer, where nearly every respondent said they measure and compare systems using word error rate. But when we asked what's important to them, they responded that punctuation, speakers, timing, capitalization, names, acronyms, formatting, and terminology were all factors they cared about, and there were more. If you just measure word error rate, you're not really measuring any of this. So you need to be really careful when comparing systems and evaluating your needs. You need to look beyond that mathematical equation and find something that really works for you, not something you can write a paper about. You need to make this work for real if you want to deploy speech in any real way. In fact, some applications might not even care about the transcript, just the ability to spot keywords or identify topics. And then word error rate is even further from the right thing to measure. It just doesn't represent what you're trying to measure and can be very, very misleading. So if you use this measurement to compare speech systems before you start, you might make the wrong decisions. By the way, that survey I referred to is still open. So if you want to add your thoughts, the survey link will be in the email you get at the end of this webinar. And I would love to hear more thoughts about how you want to measure speech. So when you ask about speech recognition generally, the very first thing people think about is communicating with robots: Siri, Google Home, Alexa, your car. And then people think, well, that's just a bit of a gimmick, because asking somebody to turn the lights on or wake you up at 7am is pretty basic and kind of pointless. It doesn't add an enormous amount of value. So just taking these gimmicks into the business world is not what we see happening. Speech recognition within businesses is so often about improving human-to-human communication, and to that end it is very powerful. Like most of the useful AI and machine learning, it removes the tedious part of people's jobs and allows focus on the tasks that humans need to do properly. It's not about removing humans and replacing them with automation, but rather making humans into superhumans and delivering things at scale that previously would not be possible due to the mechanics of requiring humans to do repetitive and error-prone tasks. So I wanted to have a look at a couple of examples where Speechmatics speech technology is being used in those human communication models. The first one I wanted to talk about is probably the most obvious use, and that is providing subtitling or captioning. More and more content needs to be subtitled. In the old world, subtitling was restricted to the content that was forced to be subtitled for accessibility due to regulation, even though there was a growing demand across the world for support. This can be restricted to the news and some mainstream channels or programs, and this was, and still is, done using some very intensive processes. There is great value to the deaf, the hard of hearing, people with learning disabilities, and people whose first language is not English in having access to captions. But since the regulation does not force it, at least currently in Europe, and it's expensive, it's just not done on enough programs to satisfy that need. And this extends to online content too.
People watch online content in lots of places where they can't have the sound on, so they expect subtitling in order to understand that content. If you look at Facebook, for example, the amount of content delivered by video is growing at a vast rate as it's much easier and more friendly to consume, but not by all. Getting all content accessible in this way is now possible with technology. It's very fast, so fast, in fact, that we can do it in real time. So I just wanted to show you a very short video of a BBC programme live streaming online, which shows the Speechmatics captions being generated from that live stream. I'm just going to show you a few seconds of this video so you can get a feel for the ability here. You'll be able to see the BBC subtitles at the same time as the Speechmatics subtitles. The BBC ones are in yellow and the Speechmatics text will be displayed in white

Speaker 3: below it. So let me just show you that. Around the world, we're on the brink of a new era in technology which will transform lives and change the way we live. This has the potential to bring us huge benefits, but many are anxious about what it means for jobs. That is why in the UK, alongside creating the right environment for tech companies to flourish through our modern industrial strategy, we are investing in the education and skills...

Speaker 2: Okay, I hope the video was viewable for you. It might have been a little bit laggy. It will be available in all its full glory in the recording you get after the webinar too, if you want to see it again. But basically, you can see from this that the BBC subtitles in yellow are actually a few seconds behind those auto-generated ones. The BBC ones are actually slightly better than our raw transcription, because speed alone is not enough. This technology is used by Speechmatics partners like Red Bee and Screen Subtitling Systems to provide this in a production form to support full solutions, making sure things like the location of the titles are correct and that the read rates are suitable, ensuring it's consumable by end users. These partners add formatting and house styles and slow down the output to really make sure it's readable and better than just the plain ASR output you just saw, which comes up very fast. Due to the high regulatory standards associated with specific live programmes like the news, speech recognition alone does not meet those standards, but the ability to use this technology adds enormous value to the pure text. Without this, large events such as Wimbledon or the Olympics would not normally be captioned, as they aren't regulated. This means services like the Red Button, which is an added feature on UK television, can use some of the Speechmatics technology to make it possible to add captions where otherwise there wouldn't be any, supporting the ability to add value to channels and programmes which aren't regulated. So next I wanted to show you a different application, this time within the contact centre space. The contact centre solution from Redbox can be used to convert speech within recorded calls into text conversations. This enables search functions for spoken words and phrases across recorded calls, and the ability to view the text conversation as well as listen to the audio call. Transcription searches can be used to locate calls where the words or phrases were either said or not said, within the entire call or within a particular channel or call direction. There are a couple of options you can define for a transcription search. Just click the options icon and enter the criteria. From here you can define the search type, spoken or not spoken, and where in the call to search: anywhere, from the beginning, or at the end. And there's a confidence level that allows you to set a margin of error, allowing you to ignore the lower-confidence matches. This allows you to cope with things like pronunciation or speed of speech. In the example showing here, we've located all calls that have the words "voice data" in them. Now let's say you want to check that the call agents are highlighting a voice data promotion in the first 30 seconds of their calls. You can change the settings to spoken in the first 30 seconds. Now only those calls get listed in the results. Similarly, you could change the search to not spoken in the first 30 seconds, allowing you to highlight calls that weren't complying with the requirements. Once that search is performed, you can highlight the call you're interested in, click the main play button, and listen to the call to hear what was said. The media player also highlights the specific locations of the phrases that were searched for, allowing you to see exactly where things were said without having to listen to the whole call.
And then when you've found the bit you want to listen to, you can see the conversation pop up in an instant-messenger-style view. That's the second use case. This type of use case makes audio indexable and searchable, something that applies to many other scenarios as well as the contact-centre one shown here.
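
(To illustrate the kind of "spoken / not spoken in the first 30 seconds" search described above, here is a minimal sketch over a transcript with per-word timings and confidences. The transcript structure and field names are assumptions for illustration only, not Redbox's or Speechmatics' actual output format.)

```python
# Sketch of a "spoken in the first N seconds" search over transcripts with
# word timings. The transcript structure (word, start time in seconds,
# confidence) is a hypothetical format used purely for illustration.
from typing import Dict, List

def phrase_spoken(words: List[Dict], phrase: str,
                  within_seconds: float = 30.0,
                  min_confidence: float = 0.7) -> bool:
    """Return True if `phrase` appears, in order, starting within the first
    `within_seconds` of the call, ignoring low-confidence words."""
    target = phrase.lower().split()
    usable = [w for w in words if w["confidence"] >= min_confidence]
    for i in range(len(usable) - len(target) + 1):
        window = usable[i:i + len(target)]
        if (window[0]["start"] <= within_seconds
                and [w["word"].lower() for w in window] == target):
            return True
    return False

call = [
    {"word": "thanks",    "start": 1.2, "confidence": 0.95},
    {"word": "for",       "start": 1.5, "confidence": 0.97},
    {"word": "calling",   "start": 1.7, "confidence": 0.96},
    {"word": "about",     "start": 2.1, "confidence": 0.90},
    {"word": "our",       "start": 2.4, "confidence": 0.93},
    {"word": "voice",     "start": 2.6, "confidence": 0.92},
    {"word": "data",      "start": 2.9, "confidence": 0.91},
    {"word": "promotion", "start": 3.2, "confidence": 0.88},
]
print(phrase_spoken(call, "voice data"))  # True: said within the first 30 seconds
```

Negating the check gives the "not spoken in the first 30 seconds" compliance search mentioned above.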

Speaker 1: There is one more short demonstration I want to show you, and that's a transcription service

Speaker 2: from a company called TranscribeMe. TranscribeMe offers a transcription service that provides a full workflow around transcription, delivering highly accurate transcripts across a wide range of content. It's really simple to use: you just upload the audio that needs transcribing, and it then offers a range of transcription levels, from fully automated transcription through to a range of human-augmented transcripts. Speechmatics helps by providing the automatic baseline transcript, which can be used for the pure automation model, but also helps TranscribeMe efficiently provide high-quality transcriptions to its customers. You simply place an order, and after a processing time of your choice, transcripts are returned, and these can be viewed and edited really easily. These transcripts can then be accessed in easy-to-use forms such as Word or PDF documents, and you can play back the audio and compare, just like this.

Speaker 1: Mathematicians, philosophers, and computer scientists, and we sit around and think

Speaker 4: about the future of machine intelligence, among other things. Some people think that some of these things are sort of science fiction-y, or out there, crazy. But I like to say, okay, let's look at the modern human condition. This is the normal way for things to be. But if we think about it, we're actually recently arrived guests.

Speaker 2: So I wanted to show you some of the value that can be added by human transcription today. The top view here shows the raw Speechmatics output. This is transcribed exactly as it's said, with a few small errors, but the TranscribeMe transcript improves this by adding readability. You can see here that there is more punctuation in the TranscribeMe output, and some of the words that were not needed have been removed. This is where the speaker used mannerisms. He said "like" quite a lot, and even repeated two words whilst he was thinking as he spoke. That was a great example of the unprepared speech that I talked about earlier. So as you can see from these three very different examples, speech can be very valuable within many applications, and we've shown just a few of those to whet your appetite a bit. But the attributes needed to make these valuable are different, and it's not just a case of sending some audio to a service and displaying the results. As you saw in those demos, speech recognition worked pretty well. So you might ask, are we done? The answer to that is certainly no. There is still lots to be done to continue improving and innovating in the speech recognition area. English is still by far the biggest worldwide demand, and the many accents of English present an ongoing challenge here, one that we're rising to, but we can always get better. Academically, Speechmatics has proven that we can build pretty much any language. We have a digital employee called Al. Al is an automated linguist, a set of components that learn how to build new languages using learnings from the languages we've already built. This removes the need for employing many specialist linguists for each new language, and reduces the time to build a language from the many months it might otherwise take. Reducing this complexity is not just about building languages fast; it also enables us to do research and development and make improvements to existing languages very fast. This does not remove the need for data, though. Al enables us to do some of this with less data than ever before, but getting that data can still be hard. There are two sources of data needed for speech recognition to work: acoustic data and language data. This means that we need recordings with accurate transcriptions, along with a lot of text that can be used to model the language. We have the technology to build models very fast using Al, iteratively, starting with small amounts of data, maybe a few tens of hours, that can be used to demonstrate that it works and see what linguistic challenges are uncovered, and then to iterate, throwing more and more data at it, until the language is understood. Then we can build with increasing amounts of data to reach higher and higher accuracies as required, and we are happy to go through that process with customers that have a commercial need for a language. If you want to embark on that journey with us, you can get in touch. Otherwise, we are currently focused on a core set of about 28 languages, so we've already been on that journey 28 times. The map on your screen shows where in the world Speechmatics technology can help people, based on the first or official language spoken in each of those countries, so we have good coverage of the world.
Last year, we enhanced our language accuracy across the board through continuous improvement in data and algorithms, and this is not just about improving the lower-quality languages. We lifted accuracy in English by 16% across all of our use cases, at the same time as making it a global model covering many accents and dialects. These improvements are in real-world conditions where it's noisy, with overtalk, accents, etc., as we've talked about already. This enables better use in the complex use cases that we've discussed. Our latest improvements in accuracy have led us to actually remove our separate English dialect models completely, which means we can focus on delivering the best English as one of our core languages. We're the first company in the world to do this, and it simplifies the use of our models, reducing the deployment footprint and increasing accuracy for these use cases,

Speaker 1: which is a win-win-win situation.

Speaker 2: So, we are trying to enhance our language accuracy

Speaker 1: through algorithms and data, as we just said,

Speaker 2: and we've got that 16% increase this year, and we will continue to make further improvements. And, yes, we will release new features that help you adopt speech better, more efficiently, and integrate better. But, more importantly, our technology needs to be really easy to apply in those real-world use cases we've talked about. We really care about those real-world integrations, not just building for academia or status. We want to power the world and add real value to the solutions that help people. So, I'm going to stop there. We've talked about the challenges of speech recognition, the things that can be done to make it work in the real world, and what accuracy means to customers. Let's open it up to some questions now. Feel free to ask anything you like in relation to usage or languages. I'm particularly interested in some of the questions you might have around accuracy and what it might mean to you. So, feel free to put your questions into the webinar, or we can get in touch after the webinar. We have some questions that were sent in as part of joining the webinar as well, so we can start going through some of those

Speaker 1: while we also look at some of the new questions coming in. So, the first question that I have got here is,

Speaker 2: what is the most important factor for automatic speech recognition for real-world applications? And I think, really, we've covered that in the webinar. Really understanding what you're trying to achieve and setting some realistic expectations for your tasks is really important. Speech recognition can be really powerful when it's used and applied in the right way. But if you use it the wrong way, it can be quite disappointing. So, really knowing what you want to achieve and ensuring you have an environment in which that can work and succeed is very key. Don't fall into the trap of assuming that because a company has an API in the cloud, you can just throw some audio at it, get the text back, and you're done. We really enjoy working with customers and partners to make sure that they're getting the best results they can out of the speech, so we're happy to help people through that journey. Okay, I have another question here, which really relates to the ability to do real-time. I think it asks, does Speechmatics allow online transcription of live speech? I think we covered that by showing the real-time streaming captions. This is done through a very simple WebSocket interface. You can simply stream your audio into our technology and you get the text streamed back in real time. We provide what I would call the best-guess word for the recognition almost instantaneously. And then, within a one or two second delay, you can get the final output that's as accurate as it's going to be. And this is suitable for use in many different use cases and models. So there's a question here about the ability

Speaker 1: to add words to our models.

Speaker 2: One of the features we do have is the ability to inject extra words that you need to be recognized as part of your transcription, because our language model can't possibly have every word that exists in the world in it, especially some of those made-up words. So it's possible for you to inject some context as part of starting a speech recognition session which contains words, terminology, people's names, place names, anything that you think might not be in our model. If our models don't have that word, we will add it to the model so it will be recognized. If we do have it, but we don't believe it's a very likely word, we will improve the likelihood of it being recognized. So you can effectively adapt the speech in real time towards the application you are trying to solve, on a per-transcription basis. So if you were doing a meeting solution, for example, you could inject the names of all of the people in the meeting at the beginning of that meeting, so you wouldn't have to have everybody in the world's name in the system, just the ones you needed at that point. And this can be very powerful not just for transcribing, but for reducing editing later, because it enables you to make sure that people's names are spelt correctly and recognized. Yeah, so there were a couple of questions around what languages you can use to integrate and interface with Speechmatics. Our API for both the real-time and the batch services is very simple. For our pre-recorded service, we have a RESTful API available, which is very simple to use. And it's very easy to code against a RESTful API in nearly every modern programming language there is. So that's very straightforward. Integrating with our real-time system requires a WebSockets connection, but it's still very simple to use. Our software is available in many forms. You can use a Speechmatics-hosted service where we operate and manage the service for you, and you send your audio to us and we process it. But our services are also available to be run within your own environment. So you could run it inside your own cloud, or inside your own data center or premises, in a private model that enables you to control the data flows and doesn't require giving your data to us to operate. So you can run it in a completely private mode, enabling you to apply the speech into whatever application you need.
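
(Two hedged sketches of the integration points just described. First, the pre-recorded REST flow with per-job vocabulary injection: the endpoint URL, header, field names, and config layout below are illustrative assumptions, not the documented Speechmatics API. The point is only that extra words and names ride along with each job.)

```python
# Hedged sketch: submit a pre-recorded file for transcription and inject
# per-job vocabulary (attendee names, product terms). The URL, field names
# and JSON layout are hypothetical placeholders for illustration only.
import json
import requests  # pip install requests

API_URL = "https://example.invalid/v2/jobs"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"                      # placeholder credential

# Per-job context: names and made-up terms a general language model is
# unlikely to contain (all invented examples).
extra_vocab = ["Speechmatics", "Firth", "Zylotech", "Acme-Fone"]

config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        # Hypothetical field name for the injected words and phrases.
        "additional_vocab": [{"content": w} for w in extra_vocab],
    },
}

with open("meeting.wav", "rb") as audio:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"data_file": audio},           # the audio to transcribe
        data={"config": json.dumps(config)},  # job configuration as a form field
    )
response.raise_for_status()
print("job id:", response.json().get("id"))
```

(Second, the real-time WebSocket flow mentioned in the previous answer: stream audio chunks in, print best-guess partial results almost immediately and final results a second or two later. Again, the URL, message names, and JSON fields are hypothetical placeholders rather than the real protocol; the sketch uses the Python websockets library.)

```python
# Hedged sketch of a real-time streaming client. The endpoint, message types
# and JSON fields below are hypothetical placeholders, NOT the actual
# Speechmatics protocol; consult the real API documentation for that.
import asyncio
import json
import websockets  # pip install websockets

WS_URL = "wss://example.invalid/v2/realtime"  # placeholder endpoint

async def stream_file(path: str, chunk_bytes: int = 8000):
    async with websockets.connect(WS_URL) as ws:
        # Hypothetical session-start message.
        await ws.send(json.dumps({"message": "StartRecognition", "language": "en"}))

        async def sender():
            with open(path, "rb") as audio:
                while chunk := audio.read(chunk_bytes):
                    await ws.send(chunk)       # raw audio bytes
                    await asyncio.sleep(0.25)  # pace roughly like real time
            await ws.send(json.dumps({"message": "EndOfStream"}))

        async def receiver():
            async for raw in ws:
                msg = json.loads(raw)
                if msg.get("message") == "AddPartialTranscript":
                    print("partial:", msg.get("transcript", ""))  # best guess, may change
                elif msg.get("message") == "AddTranscript":
                    print("final:  ", msg.get("transcript", ""))  # settled output

        await asyncio.gather(sender(), receiver())

# asyncio.run(stream_file("call.wav"))
```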

Speaker 1: Yes, so as I talked about a bit earlier, there are obviously lots and lots of audio formats that people have,

Speaker 2: and some formats are going to work better for speech recognition than others. Generally, if you're in control of the format, lossless codecs like FLAC or WAV are the best way to go, because there are no compression artifacts on them, but obviously they are larger to transmit and store. There are some codecs that are designed for use with speech, which are also very good, tend to be pretty lossless, and work very well. If you're not in control of the format, what we recommend is that you send the files to Speechmatics exactly as you have them, because our service converts them internally and will do the right conversion. If you try to adjust the audio files in any way that affects the bit rate, you can introduce all sorts of artifacts that can have some fairly bad effects on the recognition. So you have to be very careful if you're converting and storing things in different formats. Okay, so there are some questions around accuracy within noisy environments and telephony-type environments. These are always challenging environments. Noise is probably the biggest enemy of speech recognition, so anything that can be done to reduce the noise before the audio is processed is always a good thing. Using good-quality microphones, or making sure that agents at a call center, for example, are trained to use their microphones in the right way, is often a huge step towards improving the accuracy. But beyond that, Speechmatics train our models on a very diverse set of audio files that cover as many of the use cases as we possibly can. So they cover noisy environments and telephony environments, where we're learning from people speaking the way people really speak, because the way you speak on a telephone call is quite often different from the way you speak elsewhere. Well, it's certainly different from the way a trained speaker would speak on a television program, for example. So we train on a diverse set of data that is close to the use cases we know customers want to address. That really helps with the accuracy in those environments. We actually have real-world test sets that we use internally to measure those specific use cases, to make sure that we are able to operate in those real-world scenarios. It leads back to that statement where people think speech recognition is very good because they test it against perfectly high-quality audio. We explicitly don't do that. We test it against the audio that our customers have, to make sure that it meets those cases. But obviously, it goes back to setting expectations. If you have very noisy calls where people are doing a lot of mumbling and there's lots of overtalk, you will not get as good an accuracy as if people were speaking really clearly. So understanding what you're trying to achieve is very important in that respect.
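
(A small sketch of the two situations just described: if you control capture, convert once to a lossless format such as FLAC; if you don't, pass the file through untouched and let the service do the decoding. The ffmpeg flags used are standard, but treat this as an illustration of the principle rather than a required pre-processing step.)

```python
# Sketch of the two cases above. If you control capture, convert once to a
# lossless format (FLAC here) via ffmpeg; otherwise send the file exactly as
# you received it, since re-encoding lossy audio adds artifacts that can hurt
# recognition. Requires ffmpeg to be installed on the system.
import subprocess
from pathlib import Path

def prepare_audio(path: str, controls_capture: bool) -> str:
    src = Path(path)
    if not controls_capture:
        # Don't re-encode: pass the original file through untouched.
        return str(src)
    dst = src.with_suffix(".flac")
    # Lossless compression: smaller than WAV, no compression artifacts.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-c:a", "flac", str(dst)],
        check=True,
    )
    return str(dst)

print(prepare_audio("interview.wav", controls_capture=True))   # -> interview.flac
print(prepare_audio("caller.mp3", controls_capture=False))     # -> caller.mp3 untouched
```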

Speaker 1: So it's worth, I've got a couple of questions here

Speaker 2: about things like what's the difference between this and Dragon Dictate, and whether it's absolutely pure transcription, including things like the ums and errs. The Speechmatics service currently does what we would call long-form transcription rather than dictation. So it doesn't respond to editing commands like "full stop" or "comma"; it would actually write the words "full stop" if you said full stop. So it transcribes what's said rather than allowing you to use it in a dictation form, which is what I'd expect. And to that extent, it doesn't transcribe ums and errs. It assumes that those are not things it's trying to represent within the speech. It just transcribes the words that it hears at this point.

Speaker 1: Okay, so there was a question, I think probably relating to the video we showed, about our text all showing in lowercase

Speaker 2: without much punctuation as well, I think, and the capital letters weren't appearing after the full stops. What I should have said is that that video was made about six weeks ago. Since then, we've had a lot of feedback from people who have been using it, and we have added punctuation and capitalization improvements that make sure we do much better capitalization than you saw in that video. And it doesn't just capitalize letters at the beginning of sentences; it capitalizes names and proper nouns and other things like that. It's just that I didn't have time to remake that video before the webinar. I ran out of time, so I have to apologize for that.

Speaker 1: But we do have that function. Yeah, there are several questions

Speaker 2: that have been asked around either diarization or speaker change detection. We do have the ability to detect speaker changes. In our batch system, we have the ability to identify speakers and repeat speakers. And in our real-time system, we are getting very close to releasing a feature that enables speaker changes to be provided in real time. It's not quite available yet, but it's coming on the roadmap very, very soon. Speaker change and diarization are quite challenging. It's quite hard to do, so it is not an exact science. We are working very hard to work out how we can continue to improve the diarization. It was less of an issue when the accuracy of the words themselves wasn't as good as it needed to be. Now that the accuracy of the actual words is getting better and better, focus is beginning to move to some of those other elements of accuracy that we talked about earlier in the presentation: speakers, speaker identification, and diarization. It's one of the reasons why we're very interested in understanding what it is people think makes accurate speech recognition, so we can prioritize that work and, hopefully, do the right thing. But diarization and speaker change identification are certainly fairly high on that list right now.

Speaker 1: Okay, so there's a question here around GDPR.

Speaker 2: It's the world's favorite four-letter acronym at the moment. The question asks how we've tackled the need to ensure compliance with GDPR when handling customers' recordings. So one of the things that I talked about a few minutes ago was Speechmatics' ability to deliver our speech technology into the environments you need. If you require any form of data management and ownership of data, you can deploy our technology within your own security boundaries and manage the data storage entirely yourself. Our technology is designed to be as stateless as possible, so there is no long-term data store inside our solutions, which means you don't have to worry about the Speechmatics part of your solution becoming part of your data-purging processes, et cetera. So we essentially enable GDPR support by enabling you to operate the software within your own bounds and not have to worry about who you send the data to and how they store it. So that's quite easy to support. Our technology is also very easy to deploy yourselves, so solving the GDPR problem by taking the technology into your own bounds does not just fire up another problem about how you deploy and implement our speech. We've worked very hard on making it very, very simple and quick to deploy. I've got a kind of fun question here, which would be fun to answer. The question says, is a Star Trek universal translator device a reality, really, or is it something we will never see? And I think that's quite fun. Here at Speechmatics, we're really interested in innovation and the future in some of these areas. And there are plenty of people in the machine learning area who are looking at reading brainwaves directly and ignoring the speech element completely. Speech recognition itself has really only become a reality over the last few years. Even though it's been in academia for tens of years, it's only become a reality in the real world, for real use cases, in the last few, as compute has come to the forefront and neural network technology has become better and better. So I'd never say never. I think there's a long way to go, but I think it will be possible. And it just shows how much of a fun journey there is still to go in these sorts of areas. So I'll be looking forward to being part of that, assuming I haven't retired

Speaker 1: by the time we get there.

Speaker 2: So there was a question around whether our global English, and its ability to work with lots of different dialects, is just about collecting more data. And data, obviously, is king in machine learning. You have to have the data before you can do anything. But you also can't just throw huge amounts of poorly understood data in and assume it's going to solve the problem. So there's a lot of data management, understanding what the data is, and applying it in the right way to make that work, along with a lot of testing and iteration. So the core of making global English is certainly around having a lot more data from a lot more speakers, but there is a huge amount of effort in the way that's processed and ingested into the system to make it work. Our global English is aimed at coping with the English accents that are prominent in the places where the use cases we support are. So our global English does not necessarily mean it will be able to transcribe every English accent in the world, but it is and should be able to work very well within the countries that have the prime use cases we are trying to solve. And we are constantly looking at extending those boundaries into other areas as there is demand for English speech recognition in those countries.

Speaker 1: So that seems to be most of the questions covered. If anyone has any other questions, feel free to submit them. We will answer any others that come through after the webinar.

Speaker 2: Thank you very much for listening. We will send you the recording afterwards, and we will see you in the next webinar. But if you are attending Enterprise Connect in March or NAB in April, please do drop by our stand and say hi. I'd be happy to show you some more, talk to you some more, and learn about your needs, whether it's about accuracy or anything else. It would be great to talk. So thank you very much for your time today.

Speaker 1: And hopefully we'll see you soon. Thank you.
