Exploring Speechmatics' Advanced Speech-to-Text Tech at NAB Las Vegas
Join Chuck Joiner at NAB Las Vegas as he discusses Speechmatics' cutting-edge speech-to-text technology with Reggie, highlighting its accuracy and versatility.
MacVoices 22102 NAB - Speechmatics Delivers Fast and Accurate Transcriptions

Speaker 1: From NAB in Las Vegas, Speechmatics delivers accuracy and speed in transcriptions. This is MacVoices. This edition of MacVoices is supported by Kolide. Kolide sends employees important, timely, and relevant security recommendations for their Linux, Mac, and Windows devices, right inside Slack. Meet compliance objectives in a remote-first world without resorting to rigid device management. Try Kolide free for 14 days. Visit kolide.com/macvoices to sign up today. That's K-O-L-I-D-E dot com slash macvoices. Enter your email when prompted to receive your free Kolide gift bundle after trial activation. Kolide.com/macvoices. MacVoices is in Las Vegas for NAB. I'm Chuck Joiner. Folks, we stopped by the Speechmatics booth to talk to Reggie about their speech-to-text recognition system and what it can do for you. Reggie, good to see you.

Speaker 2: Thank you, Chuck. It's great to be here.

Speaker 1: So this is the first two-fisted microphone thing that we've done.

Speaker 2: Yes. I'm afraid my microphone isn't picking up any of your audio. So what we're going to do is use this one to test out our real-time speech-to-text demo.

Speaker 1: So that brings up the first question. Now, we're in a very, very noisy environment here. Am I expected to need a microphone when I'm playing with your system?

Speaker 2: No. What we have is a self-supervised learning system. So we train on millions of hours of data. And what that does is give us a real advantage on these noisy audio scenarios.

Speaker 1: OK. And what speech recognition engine are we using in the background for your system?

Speaker 2: This is our own proprietary speech recognition system. We've got it in 33 different languages. And we think it's the best in the world.

Speaker 1: Well, let's find out. Let's go for it. If you'll take us through a demo. Sure. Should I hold the microphone? I'll hang on to the microphone. That way, you're free to manipulate.

Speaker 2: So right now, what we're doing is we're spinning up a Docker container in the background for our real-time system, and you can see this on the screen. What you can see here in the light blue format is our audio recognition model. That is detecting the words that I'm speaking and what they sound like, and displaying them on the screen. And then once it turns to black, that's the language model setting in and using the context of the sentence to improve the accuracy. I'll also demonstrate some of our entity recognition. For instance, if I say speech should be worth $3.2 billion and should grow 50% year on year, that will display in a really nice way, I hope.
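
The "light blue, then black" behavior Reggie describes is the usual partial-versus-final pattern in streaming speech-to-text: a fast first pass puts provisional words on screen, and a language-model-refined final result then replaces them. Here is a minimal sketch of that rendering logic in Python; the on_partial/on_final callbacks are hypothetical stand-ins, not the actual Speechmatics client API.

```python
# Minimal sketch of the "light blue, then black" behaviour: provisional
# (partial) results are overwritten as they change, and final results replace
# them once the language model has settled. The callback names are hypothetical.

class LiveCaptionRenderer:
    def __init__(self) -> None:
        self.committed = ""   # finalized text (black in the booth demo)
        self.pending = ""     # provisional text (light blue in the booth demo)

    def on_partial(self, text: str) -> None:
        # Partials keep changing as more audio arrives, so overwrite rather than append.
        self.pending = text
        self.redraw()

    def on_final(self, text: str) -> None:
        # A final result supersedes the partial that preceded it.
        self.committed += text + " "
        self.pending = ""
        self.redraw()

    def redraw(self) -> None:
        # In the demo this is a UI update; here we just reprint the line,
        # showing the pending text in a different terminal colour.
        print(f"\r{self.committed}\033[94m{self.pending}\033[0m", end="", flush=True)
```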

Speaker 1: It just did.

Speaker 2: If I don't look, it's all better.

Speaker 1: So can this detect between voices? In other words, I'm speaking to you. If I talk into the microphone, you talk into the microphone. Does it recognize each of us? Or is it just picking up? Is it just doing raw speech to text?

Speaker 2: So right now, for our real-time system, it will pick up any voice without distinguishing between them. We have a speaker change function, but to be honest, it doesn't work as well as it should. We're releasing real-time speaker diarization, that is, speaker separation, which will detect you consistently throughout the whole document. But that's not ready yet for this demo.

Speaker 1: I think it's an interesting combination here that I get speed as part of the transcription. But then I get accuracy as it has a chance to work on it just a few seconds longer.

Speaker 2: Yeah, well, why not have both? Some of our partners like to have the instant recognition as low as 2.5 seconds, for instance, because they need it just to display something immediately or to act on it in a call center. If someone says, "I'm going to leave," you want to give them a discount as quickly as possible. And then some people need that accuracy at around five seconds for compliance requirements with subtitles, for instance, so it needs to be as accurate as possible.
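
The 2.5-second versus 5-second trade-off maps onto a configurable output delay in the real-time transcription config. A hedged sketch of the two setups follows; the field names (enable_partials, max_delay) reflect my reading of Speechmatics' v2 real-time API and should be checked against the current documentation.

```python
# Two real-time configs illustrating the latency/accuracy trade-off described
# above. Field names are assumptions based on the v2 real-time API; verify
# against the docs on speechmatics.com before use.

call_center_config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "enable_partials": True,  # show words immediately so an agent can react fast
        "max_delay": 2.5,         # finalize quickly, e.g. to catch "I'm going to leave"
    },
}

subtitle_config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "enable_partials": False,  # only emit finalized text
        "max_delay": 5.0,          # give the language model more context for accuracy
    },
}
```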

Speaker 1: So what kind of accuracy percentage do you guys claim?

Speaker 2: We're claiming an accuracy of around 90%, but that's obviously a range depending on your use case. In broadcast media, it does really well, because in general there are lots of microphones and quite good audio conditions, and so that's the perfect environment for our system.

Speaker 1: How about the inevitable problems of background noise like we're dealing with here? And accents? Because you've got one, I've got another. How does it handle those?

Speaker 2: So the system is accent agnostic, I think is the term I meant to use. And that means that it doesn't matter what kind of accent you have, it can pick it up. We're training on all this kind of data. So it will pick up your voice, no matter your nationality or your accent.

Speaker 1: So how does this integrate in with my projects?

Speaker 2: So right now, we're the speech-to-text behind a lot of different backend software, for instance 3Play Media. But also, I just signed a deal with Closed Caption Creator, and they're using our system. We can export in SRT format, which integrates with a lot of different tools. If you go to speechmatics.com, you can get an API key, and if you know a bit of coding, you could do it yourself, and that will produce subtitles for you. It's just a bit of work. We don't produce a user interface, because we work in so many different use cases that there's no single one that would be suitable. So we aim to work through partners to produce these kinds of interfaces, and you can look them up on our website.
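
For the "get an API key and do it yourself" path, a subtitle workflow might look like the sketch below: submit an audio file as a batch job, poll until it finishes, then download the transcript as SRT. The endpoint paths, field names, and status value follow my reading of the v2 batch REST API and are assumptions to verify against the documentation on speechmatics.com.

```python
# Hedged sketch of a do-it-yourself subtitle pipeline against the batch REST
# API. Endpoints, field names, and the "done" status value are assumptions
# based on the v2 batch API; check the current docs before relying on them.
import json
import time

import requests

API_KEY = "YOUR_API_KEY"  # issued after signing up at speechmatics.com
BASE = "https://asr.api.speechmatics.com/v2"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

config = {"type": "transcription", "transcription_config": {"language": "en"}}

# 1. Submit the audio file as a transcription job.
with open("episode.wav", "rb") as audio:
    job = requests.post(
        f"{BASE}/jobs",
        headers=HEADERS,
        files={"data_file": audio},
        data={"config": json.dumps(config)},
    ).json()
job_id = job["id"]

# 2. Poll until the job has finished processing.
while True:
    status = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS).json()
    if status["job"]["status"] == "done":
        break
    time.sleep(10)

# 3. Fetch the transcript as SRT subtitles and save it next to the audio.
srt = requests.get(
    f"{BASE}/jobs/{job_id}/transcript",
    headers=HEADERS,
    params={"format": "srt"},
).text
with open("episode.srt", "w") as out:
    out.write(srt)
```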

Speaker 1: But if I just want raw speech-to-text, I can come to you and get what you're showing here?

Speaker 2: Exactly, yes.

Speaker 1: What kind of pricing is there for something like this?

Speaker 2: So it starts off at $2.75 an hour for our batch transcription. And then for enterprise deals, we can go down really, really far. If you're doing millions and billions of hours, for instance, and I really hope that you are, you can go down to $0.24 an hour.
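
As a quick back-of-envelope check on those two rates (illustrative arithmetic only, not a quote):

```python
# Rough cost comparison at the two per-hour rates quoted above.
pay_as_you_go = 2.75    # USD per audio hour, batch
enterprise_rate = 0.24  # USD per audio hour, at very high volume

hours = 1_000
print(f"{hours} hours pay-as-you-go: ${hours * pay_as_you_go:,.2f}")    # $2,750.00
print(f"{hours} hours enterprise:    ${hours * enterprise_rate:,.2f}")  # $240.00
```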

Speaker 1: Wow, okay. So that has to be with millions and billions of hours. But still, there are entities out there that are doing that. So that makes it super affordable.

Speaker 2: It does. So really, at that point, a lot of the cost that we're incurring is just the hosting cost, and that adds up to quite a lot at a million hours. So yeah, you have to charge something.

Speaker 1: Oh, absolutely. So you mentioned SRT. What other formats can I export my transcription into?

Speaker 2: So the main one is our JSON format. If I demonstrate this here, what you can see in the little dark blue box there, maybe you can't see that on the podcast, but that's giving us a timestamp and a confidence score. We also classify certain entities. For instance, if I say my phone number is 07582588103, that's hopefully going to tag that as a number. And then, for instance, if you have a credit card number that you need to block out, we can do that as well quite easily on the backend.
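
To make the JSON description concrete, here is a short sketch of consuming that output: per-word timestamps, confidence scores, and entity classes, with a hypothetical redaction rule for number-like entities such as the phone or credit card numbers mentioned above. The field names (results, alternatives, start_time, confidence, entity_class) reflect my reading of the json-v2 format and should be checked against the current schema.

```python
# Hedged sketch of reading the JSON transcript format described above.
# Field names are assumptions based on the json-v2 schema; verify before use.
import json

with open("transcript.json") as f:
    data = json.load(f)

for item in data.get("results", []):
    best = item["alternatives"][0]        # top hypothesis for this token
    word = best["content"]
    conf = best.get("confidence", 0.0)
    start = item.get("start_time", 0.0)

    # Hypothetical redaction rule: mask anything tagged as a number-like
    # entity, e.g. the phone number in the demo or a spoken credit card number.
    entity = item.get("entity_class") or best.get("entity_class")
    if entity in {"number", "phone_number", "credit_card"}:
        word = "*" * len(word)

    print(f"{start:8.2f}s  {conf:.2f}  {word}")
```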

Speaker 1: Very nice. Very nice. The website where folks go to learn more.

Speaker 2: Sorry?

Speaker 1: The website where folks go to learn more.

Speaker 2: Exactly. The website is speechmatics.com.

Speaker 1: Reggie, thank you so much for the time. Great demo.

Speaker 2: Thank you, Chuck. Thanks very much.

Speaker 1: Folks, we'll have more from NAB in Las Vegas. I'm Chuck Joiner. This is MacVoices. Visit macvoices.com for show notes and to connect with Chuck on social media. Get involved in our Facebook group or like our Facebook page, and get more out of your Apple tech with MacVoices Magazine, free on Flipboard and on the web. And if you find value in it all, consider supporting us through either our Patreon campaign at patreon.com/macvoices or by making a one-time donation via the PayPal link on our front page and in the show notes of each episode. You will join these fine people who help bring you MacVoices. Advertising handled by BackBeat Media at backbeatmedia.com. Bandwidth provided by Cachefly at cachefly.com.
