Speaker 1: Hello and welcome to our second Live2Live, live from our new office studios in central London. There are actually four of them, and we both got lost on our way over here, but we've found our spot. The studio is packed with all of one vehicle and one person. We're going to be talking to you today about how to differentiate between media and event captioning services. I'm here with Tom Orton. Tom and I know each other from Red Bee Media. Can you tell us a little bit about yourself? Actually, we should introduce Speechmatics first, since we're hosting this. If you joined us last week, you'll already know who we are, and not much has changed in the last week. We are an automatic speech recognition company: we build the speech engine that turns what you say into text, and we provide that to the world. We are the engine, and we look to great partners like yourselves who take what we do and bring it to the real world. Please tell us about Red Bee Media and how you integrate with speech recognition technology.
Speaker 2: Yeah, thanks for that. My name is Tom Orton, and I head up the products area for access services, media management services and broadcast services within Red Bee Media. Red Bee Media provides media services, typically to broadcasters, but of course media is everywhere these days, so it could also be streaming services, content owners, government services and things like that. They all come into the ambit of what Red Bee Media does. There are three main propositions. Show: how you organize and present content, so streaming services or broadcast television, for instance. Supply: how you get content from one place and show it successfully in another, and its metadata as well, which is surprisingly difficult. And Enrich: how you add metadata to help people discover programs in their program guide, or crucially, how they can access those programs if they've got sensory impairments, if they're deaf or hard of hearing or blind or partially sighted. That's typically subtitles or closed captions, audio description and sign language translation. And although we talk about deaf and hard of hearing people, I'm sure many people watching this webinar will have used subtitles at some stage or another, so it's got a far wider audience than just those who are deaf or hard of hearing, and I should also say subtitles can provide specific assistance to people with cognitive impairments and disorders as well. And that's where your technology comes in. Speechmatics, as you say, we've known each other for a while. A while ago Red Bee decided, you know, we knew that artificial intelligence and machine learning driven technologies would change how we produce subtitles and closed captions, and the key thing for us was that we wanted to lean into that. We didn't want to say, no, we don't want anything to do with that. Historically, when I first started doing subtitling, people just typed very quickly. I typed very quickly. It's a very manual process, and we did it for live as well: we had two people typing alternate sentences, at something like 100 words per minute. Someone said to me recently, you've got secretarial skills. It's worthwhile. But then we moved to speaker-dependent speech recognition, where you have individual re-speakers who say all the punctuation and all the speaker changes. They are very, very skilled people. And the challenge of bringing, you know, artificial…
Speaker 1: What does that look like? So if you're trying to imagine a re-speaker, I think it's a foreign concept.
Speaker 2: It really is. And people are always surprised when they see it. This is people literally sitting in a studio booth, getting low-latency audio from the encoding system, listening to what's being said and repeating it with punctuation, switching between different colors. It's incredible to listen to. I know you've seen this, and it's a really, really impressive skill.
Speaker 1: So they're basically watching TV, saying what they're hearing at the same time.
Speaker 2: At the same time and getting it to come out accurately as well.
Speaker 1: Why does that need to come out accurately?
Speaker 2: So the importance of it coming out accurately is obviously that we need to make sure that the information is being conveyed correctly. Historically, the technology was, we call it speaker-dependent speech recognition, because you would train that model to listen to that person's voice very accurately.
Speaker 1: And it cancels out all the noise you might have had.
Speaker 2: It cancels out all the noise. You've got a very protected environment. Now, of course, the big step change, which we saw coming with companies like Speechmatics and more generally, was the ability to take automatic speech recognition and apply it speaker-independently to the content of the program, rather than having that person doing the re-speaking. That's really important because it potentially brings down costs and enables accessibility across a far wider range of content than ever before.
Speaker 1: And that's because, obviously, re-speaking is a very highly skilled job and very exhausting to do. I think you can only do it for 20 or 30 minutes, maybe an hour. 30 minutes typically, yeah. And then you've got to switch out. So if you've got thousands of TV channels to cover, having thousands of re-speakers across all those channels just isn't economically viable.
Speaker 2: Absolutely. It enables more accessibility, which is always how we've seen it. But we had to bundle that expertise into an automatic package, and that isn't easy, right? And as you say, not everyone's aware of how challenging it is to produce high-quality subtitles for audiences who rely on those subtitles to access the content successfully. So we've had to work very hard with the likes of Speechmatics, but also internally, because you cannot just throw technology at that problem. As our head of technology said yesterday in a conversation about this, we were talking about plug-and-play technologies, and he said: plug into what? Play what? Those are crucial questions about how you take technology like automatic speech recognition and turn it into a service that can be used on broadcast television.
Speaker 1: And currently, speech recognition is used in two main areas within broadcast. You talked about the live side: replacing re-speakers, or used where re-speakers can't be, you attach it directly to the audio and you get live captions out. And is there also an offline use?
Speaker 2: Absolutely. I mean, we see two uses, and you're absolutely right. Sports and news will typically be live, but there's a vast amount of content out there that will be pre-recorded. Drama, documentaries, things like that. We do a lot of that. And typically, pre-recorded content is expected to be extremely high-quality because you have time to get it right. Unlike the live environment, we all know things can go wrong live. We may talk a bit about that in a sec, but in a pre-recorded environment, it needs to be at the highest level of quality. Everyone expects that. So we've used batch processing for a while to really speed up the process of getting to a final file that can be used. So where you end up is effectively using that batch automatic process and then QC-ing it with a person. Improves efficiency, improves productivity, but you still have the person in there just to make sure it's all coming out correctly. And that was actually our first application of automatic speech recognition, was really to speed up that process and make it more efficient.
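To make the batch-then-QC workflow Tom describes a little more concrete, here is a minimal Python sketch. The `run_batch_asr` call and the cue structure are hypothetical placeholders, not Red Bee's or Speechmatics' actual API; the point is only the shape of the pipeline: an automatic draft first, then a human sign-off before the file is used.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start: float       # seconds into the programme
    end: float
    text: str
    confidence: float  # engine confidence, 0.0 to 1.0

def run_batch_asr(audio_path: str) -> list[Cue]:
    """Hypothetical stand-in for a batch ASR job that returns timed cues."""
    raise NotImplementedError("call your ASR provider's batch API here")

def prepare_for_qc(audio_path: str, review_threshold: float = 0.85) -> dict:
    """Produce a draft subtitle file and flag low-confidence cues for a human QC pass."""
    cues = run_batch_asr(audio_path)
    needs_review = [c for c in cues if c.confidence < review_threshold]
    return {
        "draft": cues,             # full automatic draft
        "flagged": needs_review,   # cues a subtitler should check first
        "ready_to_publish": False, # only flipped after human sign-off
    }
```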
Speaker 1: Okay. And you mentioned the live use for sports. I've just written down BT Sport in massive letters here. Tell us about BT Sport and rugby and how that involves Speechmatics, because when I heard about this, I was jumping for joy.
Speaker 2: It really is fantastic. It's a UK broadcast first, and it relies on a few things. It relies on a customer, a broadcaster, who is really committed to providing more accessibility and will work with you to get there. And it relies on our internal teams working with Speechmatics to get to a place where we can provide the service that they want. So, UK broadcasting first: we are using automatic speech recognition to drive ARC, our automatic captioning solution, product, whatever you want to call it, to provide over 200 additional hours of accessibility for BT Sport programming a month, which is absolutely incredible. That's 200 hours that wouldn't otherwise have been accessible. It gives them a cost-appropriate way of generating far more accessibility across far more content. I've been amazed at how successful it's been. We've all been watching avidly, and it's been accurate. It's been giving accessibility to programs which wouldn't otherwise have had it. It's a real milestone. But what you don't see with milestones is how much work has gone into getting there, making sure you've got something which is appropriate for that content.
Speaker 1: And I've spent most of my life trying to, I guess, irritate my friends and family by telling them that Speechmatics technology is all around them. Now if they go onto BT sport, they can genuinely see that, and that just feels so real.
Speaker 2: Absolutely. And the good position we're in is that we've always worked with BT Sport, and an awful lot of their programming will still be done with human re-speakers, the technique we just talked about. What this gives them is another option, allowing a wider range of programming to be made accessible. Especially in this transitional technology space, it's all about giving broadcasters, and indeed anyone else, options they can use to enable more accessibility.
Speaker 1: Excellent. So just moving on, I'd like to talk to you about errors in speech recognition. Now I've written here Princess Eugenie's dress. Tell me, what am I talking about there?
Speaker 2: So, with automatic speech recognition, and indeed with subtitling and captioning generally, a lot of people will think about errors. And we need to think about what sort of errors matter and which sort don't. Now, Princess Eugenie's dress: that was the royal wedding, a few years ago now. And this was done with human re-speaking, a very, very high-profile event. So Speechmatics wasn't being used? No, Speechmatics wasn't, it was re-speaking. Live, anything can happen, right? And our re-speakers are excellent, but all it takes is one small misrecognition to change Princess Eugenie's beautiful dress into Princess Eugenie's beautiful breasts. A very unfortunate speech recognition error. Now, the reason that's interesting, and we can all laugh about it, and obviously it was both a moment of horror and amusement, especially at this distance, is that if you're a deaf or hard of hearing person relying on subtitles to watch that, you can probably make a silent correction. You will probably understand what's happened, and the subtitler will also correct it. And that's critical: what we worry about most is where a deaf or hard of hearing audience member can't actually tell that a mistake has been made, so it almost has the status of misinformation. But of course, if you're a broadcaster and that gets onto Twitter and there's a load of publicity about it, that in itself is a quality problem as well. Broadcasters are aware of risk both in terms of their audiences, wanting to make sure they're delivering a good quality service, and in terms of being protected against Twitter-worthy or meme-worthy errors, and that's a very good example of it.
Speaker 1: And of course there's tooling around to prevent these kinds of errors. Obviously Jeremy Hunt has just become the new Chancellor of the Exchequer, and we now provide a system that will make sure these kinds of errors are cancelled out, that they can't go the wrong way. I assume that gets brought into what Red Bee gives to its customers.
Speaker 2: Yeah, absolutely. I mean, we've got two or three layers of safeguard, actually. We've got the Speechmatics profanity filters, we have our own internal rule sets which can be applied depending on the market, depending on the broadcaster or whoever we're providing the service for, and we have a timed set of filters as well, because of course there are watershed rules too. So you have to do two things: you have to make sure you are providing a safeguard, but there may also be occasions, post-watershed, where if you leave that filter on you're not providing a proper level of accessibility to that audience.
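As an illustration of the layered safeguards Tom mentions (an engine-level profanity filter upstream, broadcaster-specific rule sets, and time-of-day watershed rules), here is a rough Python sketch. The rule list and the 21:00 watershed time are assumptions for illustration only; real rule sets are market- and customer-specific.

```python
from datetime import time

# Hypothetical broadcaster-specific replacement rules, applied after the
# engine's own profanity filter has already run upstream.
HOUSE_RULES = {
    "damn": "d***",
}

WATERSHED = time(21, 0)  # 9pm watershed assumed for illustration

def apply_safeguards(text: str, broadcast_time: time) -> str:
    """Apply house-rule replacements, relaxing masking after the watershed
    so post-watershed content keeps its full accessibility."""
    if broadcast_time >= WATERSHED:
        return text  # post-watershed: don't over-filter, keep captions faithful
    words = []
    for word in text.split():
        words.append(HOUSE_RULES.get(word.lower(), word))
    return " ".join(words)

# Example: the same caption is masked pre-watershed but left intact afterwards.
print(apply_safeguards("well damn that was close", time(19, 30)))
print(apply_safeguards("well damn that was close", time(22, 15)))
```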
Speaker 1: Yeah, absolutely. And I also need to mention at this point that while this video is going out, you'll be seeing some live captions. Those are LinkedIn's live captions, not ones we're providing. They're not very good, so if you're working at LinkedIn and you're seeing this, and you want a nicer captioning solution from Red Bee and Speechmatics combined, we can make that much better for you. I know we're being asked why it's not on our system yet, but we will get it there.
Speaker 2: I just wanted to make a point regarding that, actually, which is that sometimes people think automation is automation, when it actually takes a lot of work to ensure that automatic captioning is appropriate for the content. We have a media library where, as you know, we quite often test new Speechmatics releases, and we test other speech engines as well. We've always been very clear with Speechmatics: if you're not the best, we won't work with you, basically. It's a good place to be, right? Yeah. And we do that evaluation a lot, because media is its own space, and improvements for some use cases may not be improvements for media. Making sure that we're using the speech engine in a media-appropriate way, and that we're able to take the speech engine output and apply it in a way that is appropriate for this specific use case, is a big part of how we're successful together and for our customers as well.
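The evaluation Tom describes, re-testing each engine release against an in-house media library, comes down to measuring error rates per content type. Below is a hedged sketch of that idea using a simple word error rate calculation; the genre labels and test pairs are invented for illustration and do not reflect Red Bee's actual test harness.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical per-genre test pairs: (reference transcript, engine output).
test_set = {
    "news":  [("the chancellor announced new measures", "the chancellor announced new measures")],
    "sport": [("a stunning try in the corner", "a stunning cry in the corner")],
}

for genre, pairs in test_set.items():
    scores = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    print(genre, round(sum(scores) / len(scores), 3))
```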
Speaker 1: And what do you see as next in that automation process? We've talked about the speech recognition numbers, and there are additional features around it. So what do you see coming around the corner that you'd feel comfortable putting in front of your users and your customers?
Speaker 2: Yeah. Where are the improvements at the margin, basically? Not all of it is to do with the production bit, which is worth saying first: ideally you want a fully automated workflow. So are you integrated with scheduling systems? Are you integrated with play-out systems? Very boring, but you want to make sure that you're not just watching the washing machine, that you've got a fully automated workflow and that it's working successfully. Watching the washing machine? That's me. Well, you know, that phrase came from a mistake we made early on, which is that we just threw some technology at a problem. We automated a bit in the middle, but people were still just waiting for the automatic outputs to come through. It didn't end up saving any money, because people were effectively watching the washing machine so they could take the laundry out. So you've got to think in a smart way about how you use automation. There are bits around the edges, certainly, and there are integration points. But in terms of the ASR specifically, sorry to keep banging on about Speechmatics, I know you have some interest in that, the thing we want to see is improvements in speaker differentiation. If you think about a news program, it's really important that you can tell when one politician is speaking as opposed to another. And this is for live use cases. We'd see a productivity improvement on the file side, but live, it needs to be reliable enough to enhance the accessibility, and it mustn't be confusing. So we've got quite a high bar for when that speaker differentiation is appropriate. Now, you will see some people talking about whether you can use different mic setups and things like that. Perfectly feasible, you can do things like that, but that's not a normal broadcast feed; we don't normally get that level of signal complexity back. So really what we would like ASR to be able to do, much more reliably than it can at the moment, and I think it's near, is identify differences in speaker.
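To show why reliable speaker differentiation matters downstream, here is a small sketch of how diarization labels might drive the colour changes that UK-style subtitles use to distinguish speakers. The label format and colour palette are assumptions for illustration; real engine output formats and broadcaster conventions vary.

```python
from itertools import cycle

# Assumed subtitle colour palette, cycled as new speakers appear.
PALETTE = ["white", "yellow", "cyan", "green"]

def colour_by_speaker(words: list[dict]) -> list[dict]:
    """Assign a consistent colour to each diarized speaker label.

    Each item is assumed to look like {"text": "...", "speaker": "S1"}.
    A misattributed label here means a wrong colour on screen, which is
    why the bar for live diarization accuracy is so high.
    """
    colours = cycle(PALETTE)
    assigned: dict[str, str] = {}
    out = []
    for w in words:
        spk = w["speaker"]
        if spk not in assigned:
            assigned[spk] = next(colours)
        out.append({**w, "colour": assigned[spk]})
    return out

# Example: a two-speaker news exchange.
stream = [
    {"text": "Minister, will taxes rise?", "speaker": "S1"},
    {"text": "We have no plans to do that.", "speaker": "S2"},
]
for w in colour_by_speaker(stream):
    print(w["colour"], w["text"])
```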
Speaker 1: And on that note, we'd like to plug our new real-time speaker change feature. You've done that fantastically. We really appreciate that. And what do you see as the next frontier in terms of the difficult use cases, the edge cases where speech recognition isn't quite working?
Speaker 2: If I go back to the team and ask them where they see the big challenges, there are two consistent answers: music and quizzes. Quizzes is an interesting one, actually, because firstly there's a load of sounds in quizzes to indicate wrong answers, right answers and so on; there's some sonic context that is difficult to capture in pure ASR, if you like. Timing is a challenge, and who is speaking when is a challenge. But you've also got people sometimes giving single-word answers, and if you get a misrecognition on a single word, that's a classic example where it's very difficult for a deaf or hard of hearing person to tell whether it's accurate or not. That would be a good example of the sort of error we really want to avoid for deaf and hard of hearing people: an error where they can't tell whether it's an error or not. Right. So there's that, and music as well. For festivals and things like that, and Eurovision, we do quite a lot of lyric preparation so we can make sure the lyrics are... You personally do lyric preparation? I have done in the past, but our team... Are you singing as well, or just writing? But singing, of course, isn't always conveyed very well by automatic speech recognition. You've got the music context, which presents challenges, and you've also got lyrics which may be sung in a way that isn't easy for the ASR engine to convey accurately. I've been told that rap is often poorly conveyed, and you've got to think very carefully about how your subtitles are coming across to your audience. You can't have a situation where some things are conveyed less well than others. So it feels like, especially in these particularly hard use cases, there's still a lot of room for skilled subtitlers to work side by side with automation to deliver accurate subtitles.
Speaker 1: Do we have questions we need to go through? Great. One more piece then, just for you. This has always interested me: nowadays, as we work with more international media, we've almost changed the way in which we speak. What would have been considered high rhetoric 50 years ago isn't going to be as effective when you're speaking to an audience who might be speaking your language as a second language, or vice versa. We've changed the way we speak to be able to reach a wider audience. Do you ever think we'll get to the point where, for ASR, we actually start to change how we train presenters, or how we train people, to cope better with speech recognition itself?
Speaker 2: Well, I mean, we all already do it silently at the moment, don't we? When we've got our phone, we speak in a slightly different way to make sure our speech is recognized accurately. So I think we're already aware of that process. Two examples, which take either side of the coin. There was a Los Angeles news station, a few years ago now, that was looking to trial speech recognition. I won't say which engine it was or which company, but they had a Latino weather presenter, quite well-known, and it wasn't coming out very well for him. And they asked him to speak differently. That seems to me a very bad example of what you're talking about: you're asking people to adapt quite important aspects of who they are and the community they belong to in order to make automation work. That seems to me suboptimal, to put it mildly. On the other hand, I think we all have experiences where we try to ensure we're speaking clearly to make sure we're understood by all sorts of different audiences. And you could look at whether, in certain contexts, commentators are aware that they are more likely to be making their content accessible if they're speaking in a certain way. There may well be appropriate times where there's at least a level of awareness. Equally, you don't want to take away the flavor of things. You wouldn't want sports commentary to be delivered in a way that was flattened and devoid of emotion. We need to find a balance in those areas.
Speaker 1: So what we should probably do after this is work out which one of the two of us gets better speech recognition on our voice.
Speaker 2: I wouldn't want any speech engine trained on mine. But, no, let me say: we both speak in a certain way, which reflects quite a narrow section of society. And media needs to be for everyone. National broadcasters and public service broadcasters will have all sorts of people presenting things, from all areas of the community, as they should. Speech engines need to be able to reflect that accurately. If one area of the community is served less well than another, that is not a place where we can be. So one improvement I have seen, and if we're talking about improvements in the future it's worth talking about improvements in the past, is a huge shift in the range of dialects and accents that speech engines get right. Hugely beneficial.
Speaker 1: And building on that: in the training sets for our engine, on the unlabeled side, we've gone from a basis of tens of thousands of hours to millions of hours of audio. That is, of course, across accents and languages, which again gives us that robustness. That's what we're pioneering towards: we want to be as equal as possible across all the different voices out there.
Speaker 2: Yeah, and that's been a big technology improvement. It's a good example of where technology improvements translate into a quite tangible ability for major broadcasters to use it on all content with confidence.
Speaker 1: Just checking the time. Is that five minutes? Let's wrap it up. So, just to say thank you very much on that front. It's been great speaking to you. I managed to avoid saying all the things I shouldn't say. Uncharacteristic. I know.
Speaker 2: Next time. No, it's a pleasure to be here. I'm genuinely proud of what Red Bee and Speechmatics and what we're achieving there. I do see it as a paradigm shift in how accessibility is being provided. And we love working together, so long may it continue. Exactly.
Speaker 1: Brilliant. Ta-ra.