Speaker 1: Thanks, Matt. Hi, everyone. One more talk before the break, so let's get after it. I'm Leon, and this is why video captioning needs built-in viewer feedback right now, in 2022, and how we can do it. By the way, shout-out to Dan from Paramount, who set up this talk brilliantly by talking all about subtitles, captions, and standards, so we'll get through things much faster today. My project is DubSee. It's a platform where audience members improve captions for their favorite content creators. And I'm here to talk about something a little different. I consider myself a captioning and subtitling specialist, but you don't have to be a video engineer like any of us here to agree with me that every video should be captioned correctly. I believe every video, whether it's racked up one view or a billion views, should have a corresponding caption track that accurately displays what is said and heard in that video.

And while there are technical guidelines that specify correct or accurate captioning, as Dan talked about yesterday, from the FCC, the ADA, and the W3C, for the purposes of this talk I'll just give my simple definition of correct captions: if you're watching a video with the sound on and the captions on, you should not notice any errors in the captions relative to what you're seeing and hearing. And if you're watching a video with the sound off, you can easily follow along with what's in the video and you don't feel like you're missing anything in your understanding of what's going on. It's a simple definition. In other words, the captions serve their purpose.

The reasons for having correct captions are very clear, and Dan talked about them yesterday. We need videos to be accessible. We need them to be watchable, so if I'm boring you and you want to watch a little video on your phone with the sound off, you can. We need to know what's in the videos. And finally, we need compliance, whether it's accessibility or FCC rules, depending on where that video lives.

And the reason I'm talking about this today is that the amount of video content out there far exceeds the amount of correct captions. That's because today, accurate captioning ultimately relies on humans, not machines alone. Today, videos are typically fed into auto-captioning programs that use artificial intelligence and machine learning with natural language processing to convert speech into text, which is then time-stamped and synced with the original audio to create captions. These auto-transcriptions are very impressive technically, but they still need to go through a process of human input and review in order to meet any captioning standard. So if you've been watching a video for longer than a minute and you haven't noticed any errors, you can thank the humans on that one.

And that's because captioning is just a difficult task. The first major hurdle is speech-to-text accuracy, which for auto-generated transcriptions is between 85 and 95%. It's impressive, but it's not the 99% we need for accessibility compliance, or to not notice any errors as we're watching a video: at a typical speaking rate of around 150 words per minute, 90% accuracy still leaves roughly 15 wrong words every minute. The second hurdle for the machines to overcome is dynamic content. When you add in background noise or street noise, unique names, overlapping conversations, emotional dialogue, and multiple accents all at the same time, the accuracy starts to drop significantly.

And finally, even if you get speech-to-text right, you still have speech-to-context issues. Dan gave us a nice example yesterday of how "four to five minutes" of baking is clearly not the same thing as "45 minutes." That's technically still a speech-to-text error, because "four to" and "forty" are different words. But here's a really interesting example of a speech-to-context error. The speaker asks: what's the next trend in tech in the next five to ten years? And the caption reads: what's the next trend in tech in the next 9:55 years? So where does that come from? Obviously, it's converting "five to ten" into a time, which is completely inappropriate given that context. And this seems like an extreme example, but it's live right now on TikTok and many other captioning applications, where we see "five to ten years," "nine to five job," and my favorite, a "two to three bedroom house" that became a "2:58 bedroom house." And what's funny about that one is that it's actually a video where someone took a Reddit post, took the text, converted it to speech, ran that speech back through speech-to-text, and got it wrong. So it's a very funny example, but it's happening right now in production.
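As a rough illustration of how that kind of error can happen, here's a toy sketch of an inverse text normalization pass, the post-processing step that rewrites spoken numbers as digits and times. Everything here is hypothetical, a caricature of the failure mode rather than any real captioning pipeline:

```ts
// A toy inverse-text-normalization rule showing how "five to ten"
// can come out as "9:55": the rule reads "M to H" as a clock time
// (M minutes before hour H) with no sense of the surrounding context.
const NUMBER_WORDS: Record<string, number> = {
  one: 1, two: 2, three: 3, four: 4, five: 5, six: 6,
  seven: 7, eight: 8, nine: 9, ten: 10, eleven: 11, twelve: 12,
};

function naiveInverseNormalize(text: string): string {
  return text.replace(/\b(\w+) to (\w+)\b/g, (match, m, h) => {
    const mins = NUMBER_WORDS[m];
    const hour = NUMBER_WORDS[h];
    if (mins === undefined || hour === undefined) return match;
    // "five to ten" => 5 minutes before 10 o'clock => "9:55"
    return `${hour - 1}:${String(60 - mins).padStart(2, '0')}`;
  });
}

console.log(naiveInverseNormalize('the next trend in tech in the next five to ten years'));
// => the next trend in tech in the next 9:55 years
console.log(naiveInverseNormalize('a two to three bedroom house'));
// => a 2:58 bedroom house
```

The same rule happily produces the "2:58 bedroom house," which is exactly the point: without context, a range of years and a clock time look identical to the machine.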
Then there are the limitations of machine transcription, the things machines just cannot do yet. Dan mentioned this yesterday, but in general, they struggle to identify speakers in dynamic situations. They struggle with atmospherics, doing only very basic ones like [music] and [laughter], which is not very helpful. They can't do singing, and music videos? Forget about it. Punctuation: they can handle phrases and sentences, but they struggle with punctuation as formatting, like when someone is being quoted. Timings: they're often off by a word or two, or a caption stays up too long or not long enough. And finally, positioning, which Dan talked about yesterday: that's done only by humans right now, and very, very poorly by AI, if at all.

So that means that if we want accurate, correct captioning, I think you'll agree with me that we need more humans. So where do we get them? Media companies and content creators are already doing what they can, and we can't just hire professional transcribers to transcribe every video out there. So as the title of this talk suggests, there's a whole other group of people that can help in the transcription process: the viewers, of course. And by viewers, I mean anyone watching a video, while they're watching that video. I'm not asking them to review transcriptions. I'm just proposing that we open up every video, within the video player itself, to take in viewer feedback on the captions they see while they're watching.

And I think opening up viewer feedback is the only way to get the most correct captions across the largest number of videos in the shortest amount of time. It seems like a no-brainer to let any willing and able viewer easily give feedback in an intuitive way. Right now, those hyper-aware viewers just don't have a good way to give it. They have to resort to direct messages, emails, comment sections, or even creating and submitting their very own caption files. There has to be an easier way. And finally, the machines, in order to get better and better, need as much feedback data as they can possibly get.

So I don't think I'm proposing something that new. This user feedback model has worked well in other applications, and I think it will work here. So I have a question for everyone. Where have we seen a similar system?
Where have we seen a system where users report feedback, giving feedback is entirely optional, users verify others' feedback, and, because a handful of people contribute, the system is better for everyone? Any ideas? Any suggestions? Sorry, louder. Maps? Yes, the answer is Waze. If you're unfamiliar, Waze is the mapping and directions app that became popular in part because of its reporting system, which allows drivers to alert other drivers to many kinds of hazards out on the roads with just a tap on their screens. Drivers on Waze alert each other about stopped cars, traffic, accidents, road debris, and, most importantly, police cars. Then the system polls other drivers to confirm those reports. So whether or not any individual driver contributes, everyone gets a better experience out on the road. And I think we can do the same thing with video captions.

So I think we need a Waze-style viewer feedback system. Somehow we get feedback: that is, we identify caption errors and collect proposed changes from viewers. Then, understanding that this is a crowd-sourced system, we need to verify those changes, both to see if they're correct and to prevent abuse of the system. And finally, we need a mechanism that publishes those changes and updates the captions for everyone to watch.

The types of feedback I'm proposing in this system are flags, changes, and votes. A flag marks that a caption has an error. A change proposes what the caption should actually say. And finally, votes are where viewers agree or disagree that a flag or a change is valid. To make voting more effective, viewers can only vote on flags or changes proposed by others.

And here's a little mock-up of the idea. Say I'm watching a video about starting my own software-as-a-service company when I see that the captions actually say "software as a surface." That's not an entirely new field of materials science; it's just a captioning error. So when I see that error, what I want to be able to do is immediately tap my screen once to flag it. Once I've done that, I, or anyone else who has seen that error, can submit a change by tapping again, correctly updating "surface" to "service." Finally, another viewer comes along and is polled: was that change correct? Do they agree or disagree?

So now, where do those votes go? The votes in this system are meant to confirm valid changes and stop abuse. Viewers get to agree or disagree on whether a flag or a change should be kept or removed. And the exact rules of that system are set by the content owners or administrators. So for example, they might decide that if X percent of votes disagree, a flag isn't considered valid, or they might decide that a proposed change is considered valid once it gets votes from X percent of the total view count. Whatever they decide on, we then either notify those owners or auto-approve the change and go straight to publishing. The voting system allows media companies to decide exactly how they'd like to implement feedback, in case they want in-house teams to actually handle making those changes.

And so that's our model: we're getting feedback, we're verifying changes, and we're deciding whether those changes get auto-approved and published directly to the caption file, or whether we notify those administrators.
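To make that flow concrete, here's a minimal sketch of what the three feedback types and one configurable approval rule might look like. The type names, fields, and thresholds are illustrative assumptions on my part, not DubSee's actual schema; in the proposal, the policy values are whatever the content owner configures:

```ts
// The three kinds of viewer feedback described above.
type Flag = { cueId: string; viewerId: string };
type Change = { cueId: string; viewerId: string; proposedText: string };
type Vote = { targetId: string; viewerId: string; agree: boolean };

// Owner-configurable rules for when crowd feedback becomes a real edit.
type Policy = {
  minVotes: number;      // don't decide anything on a handful of votes
  approveRatio: number;  // e.g. 0.8: 80% agreement approves the change
  rejectRatio: number;   // e.g. 0.5: majority disagreement removes it
  autoPublish: boolean;  // publish directly, or notify administrators
};

type Decision = 'approved' | 'rejected' | 'pending' | 'notify-owner';

function decide(votes: Vote[], policy: Policy): Decision {
  if (votes.length < policy.minVotes) return 'pending';
  const agreement = votes.filter((v) => v.agree).length / votes.length;
  if (agreement <= policy.rejectRatio) return 'rejected';
  if (agreement < policy.approveRatio) return 'pending';
  return policy.autoPublish ? 'approved' : 'notify-owner';
}
```

Keeping the policy as plain data reflects the point above: a media company with an in-house captioning team can set autoPublish to false and treat the crowd purely as a triage layer, while a solo creator can let validated changes flow straight through.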
And finally, you see happy viewers in that little section right there, happily watching the updated video.

So now we have our plan, and it's time to build it. And I think I'm at the right conference to do so, because it's in part thanks to Mux, through Mux Player and the Media Chrome project, that we have the tools to extend the video player with components. That way, we can easily add new experiences within the video player itself, a huge win for viewers. So on the front end, if anyone's familiar with Media Chrome, what I've done is very simple: all we need to do is integrate our caption feedback component right into that media controller as if it were just another media element, because essentially, it is. So we have our caption feedback state object, a caption feedback element, and even a caption feedback button, which resembles that captions button we all know and love.

For this demo, I'm using a sample by Tom Scott, an amazing YouTuber; you should check him out. And I think you can already see what's wrong in this transcript. This video is about Tower Bridge, but of course, Tower Bridge doesn't have basketballs; it has bascules, by the way, which open and close to let ships through. Now, if I'm watching on a normal player and I want to fix this error, I'd personally have to message Tom, write a comment, or create my own caption file to make a change. But I don't want to do that. I just want to fix it.

So we can go ahead and immediately enable feedback mode. On the bottom right of the player, we press that green button, and now feedback mode is enabled. It looks exactly the same as the normal player, except we have that green button, and a little green icon on the top left. Then we click that icon on the top left to flag the caption that has the error. Now, I don't know what a bascule is, so I keep on watching the video and I don't propose a change. But someone else comes along who does know his bridges and proposes one. So first, we pop the original caption up into a box for people to edit what they think that caption should say; then we type the correction into that change box and hit submit. Then what we'll see is other people's flags and changes on the bottom left, and thanks to the Media Chrome element, this is all part of a traditional video player, so when we want to watch the video again, we just move our cursor or fingers away and those elements disappear. So we're not disrupting the video-watching experience in any way. And finally, someone else comes along and gets the option to agree or disagree that we made the correct change. And so after all of that, we have the correct caption in production, ready for everyone to see.
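For a rough sense of what that front-end piece might look like, here's a minimal sketch of a flag control written as a plain custom element that can sit inside a Media Chrome control bar, next to the stock <media-captions-button>. The element name, the data attribute, and the /api/flags endpoint are all hypothetical shorthand for the component described above, not DubSee's actual code:

```ts
// A sketch of a caption-flag button for a Media Chrome player.
// It reads the currently displayed cue off the showing text track
// and reports it to a (hypothetical) feedback API.
class CaptionFeedbackButton extends HTMLElement {
  connectedCallback() {
    this.innerHTML = '<button aria-label="Flag caption error">⚑</button>';
    this.querySelector('button')!.addEventListener('click', () => this.flag());
  }

  flag() {
    // Walk up to the surrounding <media-controller> and find its media.
    const video = this.closest('media-controller')?.querySelector('video');
    const track = Array.from(video?.textTracks ?? []).find(
      (t) => t.mode === 'showing',
    );
    const cue = track?.activeCues?.[0] as VTTCue | undefined;
    if (!video || !cue) return;

    // Send the flagged cue, with its timing, to the feedback back end.
    fetch('/api/flags', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        videoId: video.dataset.videoId, // hypothetical data attribute
        start: cue.startTime,
        end: cue.endTime,
        text: cue.text,
      }),
    });
  }
}

customElements.define('caption-feedback-button', CaptionFeedbackButton);
```

Because it lives inside the controller like any other control, it can show and hide with the rest of the player chrome, which is what keeps the watching experience undisrupted.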
And the back end of this is quite intuitive and simple. We have our caption feedback component, which lives in the video player. It receives feedback from viewers, and it separately talks to its own API, which polls those viewers, runs the voting behind the scenes, and then either notifies the administrators or auto-approves those changes right into the caption VTT file, or whatever type of file you want to use. And what I really like about this system is that we don't have to actually touch the caption file at all, except to update it. So the video player can stay as is. We don't have to refresh it when we're publishing changes, because we only publish changes occasionally anyway, since we have to wait on the polling system. And as a result, the next time the video player loads, it refreshes with the new caption file, whenever that happens to be. We don't have to touch the video player otherwise. In theory, and I know it's just in theory, it's a completely plug-and-play component for the video player.

And so in summary: we need to engage viewers to improve captions, and I propose a system that does it by integrating feedback into the player and, on the back end, regulating that feedback with voting and approval automation. And I think built-in feedback is a good idea, and it gives us a template we can use across many other applications. In live captioning, all we have to do is publish to a live dictionary instead of a caption file, which then corrects that same name or phrase when it's mentioned later in the stream. In a lot of live events, sports for example, the same names are said so many times in one stream that when a name comes out wrong the first time, you can fix it later in the stream and then commit those validated changes to the caption file of the replay. For translation, it seems like a no-brainer to let multilingual viewers update auto-translations. And finally, for AI and ML training: captions are just the start of AI-generated content that ends up directly in front of viewers. We're going to have AI-generated text, AI-generated images, and even AI-generated video coming. And if captions are an indication of what's to come, they're going to need a lot of feedback as well.

And I think that viewer interaction within the video player is such a powerful tool, and yet right now it's wasted on only one thing: skipping ads. So, I'm Leon. Thank you so much for listening. This is, of course, just a proof of concept, but I'd love to hear anyone's thoughts and takes on it, and if you'd like to help me build it, let me know. I'm Leon Makovetsky, and all my links are behind that QR code. Thank you.