Challenges and Solutions in Live Video Captioning: Insights from LinkedIn's Heather Herford
Heather Herford discusses the complexities of implementing live video captioning at LinkedIn, highlighting the lack of standards and innovative solutions.
Best Practices for Live Captioning
Added on 09/26/2024

Speaker 1: Welcome back to Almost Live here at Streaming Media West 2016. I have with me Heather Herford, who's a live video producer at LinkedIn. Heather, thanks for joining us.

Speaker 2: Thank you, Tim.

Speaker 1: So I understand one of the challenges you're trying to solve, which is a massive challenge for all of us, is closed captioning for live. There are a bunch of different parts of that. How are you breaking down the problem and looking at the solutions?

Speaker 2: Well, it really started out of a necessity, a need at LinkedIn to add captioning to our own internal all-hands meetings, in an effort to make our work environment more accessible and inclusive. I have a background in language, in IT, and in production, so it was a great project for me to take on. I'd actually worked on adding closed captioning to a nationwide television channel back in the early 2000s, when the FCC mandate came out to add captioning to all broadcasts. So I had some experience, I raised my hand, and it turned out to be even more challenging than I thought. I think the hardest thing right now is that the standards that exist in the broadcast world have not been adopted for the online world. And the reasons behind that, as I'm discovering, are not so cut-and-dried.

Speaker 1: And I go back in this part of the industry 18 years. I remember SMIL files, which were part of what Real implemented, and then you had other tech solutions on Windows Media. So we had the format issues; that was one issue. What are the other issues you're finding as to why things weren't adopted, or why there's no standard per se, timed text, etc.?

Speaker 2: Yeah, so for captioning, live is different than video on demand. I'm really focused on live, because for on-demand there actually are some pretty decent solutions out there. There are lots of different file formats that work, and most of the platforms support more than one, so you've got options. When it comes to live, many of the players don't actually support true closed captioning.

Speaker 1: So you're saying the on-demand players from a particular video platform might support the timed text, but the live video player does not. Okay, so it's not as easy as the old line 21 days, where we did the insertions.

Speaker 2: No, and that's exactly the issue: there's not really a standard that's been adopted across online media players. YouTube had a solution for a while that had captioners going in and adding captioning information on the back end, and it wasn't great for the kind of programming that we do at LinkedIn, which is constant narration without a lot of breaks. It really struggled to keep up; we'd lose huge amounts of text. So YouTube was one of the first to add support for the 708 standard, which is the digital...

Speaker 1: Equivalent of line 21, yeah.

Speaker 2: Exactly, in the broadcast world. That's nice because that's the information all of the captioning gear that's out there, the caption encoders, spits out. So we added a closed caption encoder to our signal flow. When we're live on YouTube, that works great, but many of the other platforms don't support it. The internal player that we use, for example, doesn't have any way of taking that data and decoding it in the player. So, essentially, what we do as a workaround right now, internally, is create two streams, one with burned-in captions and one without, and we let the viewer decide which experience they want, instead of having that nice toggle back and forth.

Speaker 1: So essentially you're doing what we used to call timecode burn back in the day, when we transferred from film. So literally they choose...

Speaker 2: Open captions.

Speaker 1: Wow, that's crazy.

Speaker 2: Obviously not an elegant solution. It's a workaround for now, and like I said, it's all platform-dependent. We just took our biggest conferences to Ustream, and Ustream supports the 708. So we were able to get a really nice, great user experience, closed captioning right on our main...

Speaker 1: And what does the sort of death of Flash do to what you're doing? Because one of the beauties of Flash is that, inherently as a player, it had some timed text capabilities in it. Obviously we're now moving to HTML5 players. Are you finding that the HTML5 players out there, the companies have thought about this, or is it still sort of a hodgepodge of some support it, some don't?

Speaker 2: I'm finding a hodgepodge. YouTube and Ustream are the only big players that support it that I've found in the live space. Others will say they support it, but they don't actually support the standard, the 708 standard, and they offer a different solution which oftentimes isn't a great user experience. I've seen some where the captions, and I use quotes, "captions," because it's actually a scrolling transcript that pops up in a separate window. And with most of those solutions, you are obligated to use the captioning provider that they're partnered with, and the quality tends to not be very good. I always like to point out that comprehension is tied to accuracy. So if you and I are talking and you only understand 70% of what I say, this won't be a very good conversation.

Speaker 1: Which brings me to an interesting question. Having worked with some speech-to-text solutions in the last decade or so, one of the ideas was, well, we'll just plug a speech-to-text engine in there and put the output up for captioning. But in reality, unless it's trained, you're getting 65% accuracy. The other option is to have somebody sit there and type it, and of course, as we've all seen watching live news, some of it's phonetic, etc. You can come back later and clean it up for the on-demand asset. But where is the optimal solution for how to do that? What's your take on that?

Speaker 2: So there's actually an in-between, which I discovered in the process of adding captioning at LinkedIn. Because it is really hard to train traditional transcriptionists to get beyond a certain accuracy level. The really exceptional ones that can get into the 90% range are few and far between, and they're in high demand. So as we scale, as the volume of content being produced increases, that's not going to work. What I found is that in other parts of the world, people are using voice writers: a speech-to-text solution where there's still a human being who's taking in the content and re-speaking it.

Speaker 1: So the speech-to-text system is trained for them; they hear it. And that would also help you from a translation standpoint, if you went multilingual, I would assume. Okay.

Speaker 2: Yeah, so it takes care of that accuracy issue, and it keeps the human element; artificial intelligence is not there yet. So you still have a human making a decision and interpreting in those moments where it's, you know...

Speaker 1: Where it matters, frankly. That's fascinating. So they may be interpreting somebody from their own language, or they may be interpreting somebody from another language.

Speaker 2: We're doing all English right now, and the accuracy we're seeing is incredible. And the cost: I don't want to throw numbers around, but it's actually a lot cheaper than what I was paying for live captioning back in 2002.

Speaker 1: So what we used to get upset with our mothers or our grandmothers for, sitting next to us and telling us everything that was being said on TV, has now actually turned into something lucrative for a person who can do that.

Speaker 2: It is, it's a skill. And here's the interesting part when it comes to scale: transcriptionists take several years to get trained up. It's a skill, and it's actually an aging labor pool. But re-speakers, or voice writers as they're also known, can be trained in just a matter of months.

Speaker 1: And what percentage of those voice writers have a LinkedIn profile?

Speaker 2: You know, I don't know yet, but there's a real business opportunity there in the United States, like I said. And I know in Europe, where everything is subtitled and captioned and done in multiple languages, there's a huge demand, and at scale.

Speaker 1: And it's the kind of thing that you wouldn't physically have to be in the room to do, either.

Speaker 2: In fact, what we're doing is using an EEG encoder with iCap. So our captioning provider is remote. They dial in with a code to the iCap cloud software, we send our program audio there, so they're just hearing the audio, and they're sending the captioning data back in.

Speaker 1: Given the fact that it's a stream and you have multiple seconds of delay anyway, the fact that they're getting a real-time audio feed, like off of a conference call, means that the captions then sync up to the video.

Speaker 2: Well, some of EEG's encoders have a feature where you can introduce even more delay to close that gap. So that's what we do: we close that gap and get it within, you know, five seconds. It actually varies, and sometimes the captions will be right on, or even lead by a second or two, which, I always wonder what the viewer thinks.

Speaker 1: The lead could be a strange thing: in two seconds, he's going to say this. Well, Heather, fascinating conversation. Thank you very much.

Speaker 2: Thank you, Tim.

Speaker 1: Again, this has been Heather Herford from LinkedIn, live video producer, talking about the challenges of live video captioning.
