Enhancing Media Content with AI-Driven Captioning Solutions by Intera Systems
Discover how Intera Systems' AI/ML-driven solutions streamline caption creation and verification, saving time and cost while ensuring high-quality media content.
ML-Based Solutions for an Efficient Captioning Workflow

Speaker 1: Hello, welcome to the presentation. My name is Sana and I'm from Intera Systems. At Intera Systems, we help media companies ensure content readiness and delight viewers through our AI/ML-driven solutions. In this presentation, we will discuss how a cohesive ML-based solution for caption creation and verification helps media companies deliver while saving time and money and leaving room for creativity. So let's get started.

Fifty years back, the first captions appeared in analog television broadcasts, giving the hard of hearing access to the spoken word of television content. Since then, captions have transformed manyfold and are ubiquitous across all modes of content transmission. Captions are vital for the modern broadcast industry, from both a public-welfare standpoint and a business and regulatory standpoint. Major broadcasters have long mandated captions, and various regulatory compliances have now been enacted to ensure equal opportunities for people with hearing disabilities. Captions are mandatory, and they should be accurate, synchronous, complete, and properly placed. Besides, with the rise in global online content consumption, captions and subtitles in local languages give viewers an opportunity to watch and comprehend foreign-language content with ease. In Europe, although there are officially only 24 languages, more than 200 are spoken. Similarly, in Asia there are more than 2,300 languages spoken by over 4.3 billion people. Thus, using captions and subtitles, mass audiences can be reached globally, more views can be garnered, and content can be better monetized.

But still, many video service providers struggle to add quality captions to content and meet regulations, due to technical and financial challenges and, of course, a lack of trained professionals. You might be surprised to know that captioning costs around $5 to $10 per minute of content, and an expert usually takes five to eight hours to caption one hour of content. To understand what makes it so difficult, we should first look at what goes into good-quality captioning. You need a language expert who types fast, is aware of the context, and can comprehend the jargon. They should also be adept at captioning rules, which are quite complex: line breaks should come at natural points, meaning we should not split a noun from its adjective, a first name from a last name, a verb from its negation, and so on, and segmentation should be done at clause boundaries. Justification should be proper to ease reading; captions should be placed near the speaker to avoid too much eye movement; proper start and end times should be given; captions should not cross shot change boundaries; more time should be given for difficult words, multiple speakers, or a complex scene; and captions should maintain a proper reading speed. At the same time, speaker identification marks need to be added, and audio description is also required. In short, captions are not just audio transcription; they are much, much more. Captioning is tedious, time-consuming, needs experts, and is a very costly affair. Broadcasters need an efficient and economical way to incorporate it into their workflow and to comply with regulations. We now have active research and impressive progress in a lot of ML-based technologies, which can come to our rescue.
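To give a feel for how mechanical many of these rules are, here is a minimal sketch of an automatic rule checker. It is purely illustrative and not Intera Systems' implementation: the Caption structure, the 32-character and two-line limits, and the 20-characters-per-second reading-speed ceiling are assumptions chosen for the example, and real guidelines vary by region and broadcaster.

```python
from dataclasses import dataclass

MAX_CHARS_PER_LINE = 32      # assumed line-length limit, as discussed in the talk
MAX_LINES_PER_FRAME = 2      # assumed lines per caption frame
MAX_CHARS_PER_SECOND = 20.0  # assumed reading-speed ceiling; real rules differ


@dataclass
class Caption:
    start: float       # start time in seconds
    end: float         # end time in seconds
    lines: list[str]   # displayed text lines


def check_caption(cap: Caption) -> list[str]:
    """Return a list of rule violations for one caption frame."""
    issues = []
    if len(cap.lines) > MAX_LINES_PER_FRAME:
        issues.append(f"too many lines: {len(cap.lines)}")
    for line in cap.lines:
        if len(line) > MAX_CHARS_PER_LINE:
            issues.append(f"line too long ({len(line)} chars): {line!r}")
    duration = cap.end - cap.start
    if duration <= 0:
        issues.append("end time must be after start time")
    else:
        cps = sum(len(line) for line in cap.lines) / duration
        if cps > MAX_CHARS_PER_SECOND:
            issues.append(f"reading speed too high: {cps:.1f} chars/sec")
    return issues


# Example: a two-line caption shown for only 1.5 seconds is flagged as too fast.
print(check_caption(Caption(10.0, 11.5, ["I'm having nightmares", "that I am being chased"])))
```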
Though we must acknowledge that AI/ML technologies are currently not 100% accurate, even in their current state they can free you from tedious tasks by giving you a possible caption prediction along with a confidence score. You can accept the predictions where the confidence score is high, and only review the predictions where it is not. So just imagine being presented with predictions and selecting the appropriate ones, tick, tick, tick, ticking them off, and you are done. Yes, a lot of burden can be taken off your head using a combination of ML technologies for audio, video, and natural language processing. Instead of captioning, you will only be reviewing, reducing considerable human effort and thus cost.

Let's quickly look at the economics of machine learning in captioning. Considering an average industry rate of $7 per minute, with two hours of content every day, your monthly cost will be around $25,000. With a one-time investment in an AI/ML solution, you can bring the cost down by around 80%, spending just $1 per minute for review. We must say that investing in ML for captioning is very lucrative.

So let's look at the technologies that are involved in caption generation. There have been recent advancements in automatic speech recognition, burnt-in text detection, object detection, speaker identification, scene change detection, and natural language processing for introducing proper line breaks and punctuation. These have opened doors for automated captioning. The ML solutions can automatically transcribe text, segment it, place it, and present it to you for review. Accuracy also keeps increasing with time, with self-learning models, as you keep adding more and more out-of-vocabulary words to the dictionary. Sometimes you will be surprised by what such solutions can achieve. For example, before STT, that is speech-to-text, noise can be removed using noise-reduction techniques and good results can be achieved, while a human transcriber would struggle to figure out speech amidst the noise.

So let's look at these steps one by one. The first step is transcription, that is, automatic speech recognition. It is the most complex part of the solution. Here, speech activity is detected and then converted to text using acoustic and language models. The good news is that the complexity and accuracy of these models have increased at an exponential rate over the last decade. In this step, we also identify changes in speaker, that is speaker diarization, and also punctuation to some extent, using silences and modulations in the extracted speech features.

The next step is segmentation. The basic rule of captioning is that we generally keep 32 characters in one line and two lines per caption frame. But using only these basic rules does not give us optimal captioning. For example, look at the column marked as incorrect. Here the first line reads 'I'm having nightmares that I am', 32 characters reached, so we go to the next line; 'being chased by these giant', two lines reached, so we move to the next caption frame; 'robotic claws'. You can see that this does not lead to a coherent reading. If we instead consider the segmentation that can be achieved using natural language processing, through identification of phrases and clauses, we get 'I'm having nightmares' as one caption and 'that I am being chased by these giant robotic claws' as the second. Similarly, consider the second example.
'Please listen carefully to Joe', 32 characters reached, so we move to the next line, 'Tribune before you act'. Here you can see that splitting the first name from the last name does not make for a good reading experience. If we instead use the result we can achieve through natural language processing, we get optimal captions that read 'Please listen carefully' and then 'to Joe Tribune before you act' (see the short segmentation sketch further below). Also, scene change detection technology aids in assigning timestamps, ensuring that captions do not cross shot change boundaries. For a human captioner, considering all of these rules and applying them at the same time is not feasible; it is done in a very iterative, very time-consuming process, which a machine can achieve in an instant.

The next step is placement. Usually, captions are placed at the bottom center, but sometimes there is already useful information present at the bottom in the form of text, like match scores, as you can see in the picture. In that case, text detection can tell us to place the captions at the top of the display rather than the bottom.

The last step is the review. When ML has done its magic, there comes the review step. It is still a very important step, and an intelligent player designed specifically for it greatly eases the process. It gives useful insights while editing, like calculating the current reading speed. It can give useful spelling suggestions, flag when captioning rules are violated, play captions along with audio, display how the final captions will look over video, highlight shot change boundaries, display a waveform to ease time marking, and so on.

Sometimes a program transcript is already available, prepared while the program was being shot. The same transcript can then be used for generating captions using a technology which is not exactly speech-to-text but is very similar; the correct term is forced alignment. Since these transcripts are very accurate, the result is production-ready captions. I must say that timestamping is the most successful application of ML technology in captioning workflows. Once you have accurate captions, you can use machine translation to generate subtitles for different languages.

Another workflow is quality assurance for captions. Using an ML-based solution in this other part of the workflow, quality assurance, is even more attractive. ML may not be very accurate in telling what is in the audio, but it is very good at ruling whether a given text is present or not. So, if you already have captions or are receiving captioned content, you can rely on ML predictions to judge whether the captions are accurate, and you can even correct them using the same technology employed in captioning. I will give you some examples. Sometimes a shift or drift gets introduced in captions during editing, maybe due to human error; machines are very good at detecting and correcting such shifts and drifts in time. Or during transcoding or muxing, the wrong caption track gets matched with an audio track; an ML-based solution can reliably flag the mismatch, as none of the captions will be found in the audio. So, as a broadcaster accepting content from many different providers and geographies, you don't need subject-matter and language experts for every language; you can rely on ML-based solutions.
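The shift-correction idea can be made concrete with a small sketch. This is a generic illustration, not Intera Systems' implementation: it assumes you already have, for a sample of captions, the times at which the same text was actually detected in the audio (for example from ASR word timestamps or forced alignment), and it estimates a constant offset; fitting a line over the same pairs would similarly expose a gradual drift.

```python
from statistics import median


def estimate_shift(caption_times, detected_times):
    """Estimate a constant timing shift (in seconds) between caption cue
    start times and where the same text was actually found in the audio.
    Using the median keeps the estimate robust to a few bad matches."""
    deltas = [d - c for c, d in zip(caption_times, detected_times)]
    return median(deltas)


def correct_shift(captions, shift):
    """Apply the estimated shift to (start, end, text) caption tuples."""
    return [(start + shift, end + shift, text) for start, end, text in captions]


# Example: every caption cue is roughly 1.2 seconds early relative to the audio.
caption_starts = [10.0, 14.5, 19.0, 23.8]
detected_starts = [11.2, 15.7, 20.1, 25.0]
shift = estimate_shift(caption_starts, detected_starts)
print(f"estimated shift: {shift:+.2f} s")  # about +1.20 s
```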
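Returning to the segmentation step described earlier, here is a minimal sketch of phrase-aware line breaking. It is only an approximation of the clause- and phrase-level segmentation discussed in the talk: it uses spaCy's named entities and noun chunks (assuming spaCy and the en_core_web_sm model are installed) to forbid breaks inside a phrase, and a simple greedy strategy to fill each line, rather than any particular product's NLP.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")


def phrase_safe_breaks(doc):
    """Token indices where a line break does not split a named entity or a
    noun chunk, so 'Joe Tribune' or 'giant robotic claws' stays together."""
    forbidden = set()
    for span in list(doc.ents) + list(doc.noun_chunks):
        forbidden.update(range(span.start + 1, span.end))
    return [i for i in range(1, len(doc))
            if i not in forbidden and not doc[i].is_punct]


def segment(text, max_chars=32):
    """Greedily build caption lines of at most max_chars characters,
    breaking only at phrase-safe points."""
    doc = nlp(text)
    safe = phrase_safe_breaks(doc)
    lines, start = [], 0
    while start < len(doc):
        # If everything that remains fits on one line, we are done.
        if len(doc[start:].text) <= max_chars:
            lines.append(doc[start:].text)
            break
        fitting = [b for b in safe if b > start and len(doc[start:b].text) <= max_chars]
        later = [b for b in safe if b > start]
        # Prefer the last phrase-safe break that still fits; otherwise fall
        # back to the first safe break (that line will simply run long).
        end = fitting[-1] if fitting else (later[0] if later else len(doc))
        lines.append(doc[start:end].text)
        start = end
    return lines


for sentence in ["I'm having nightmares that I am being chased by these giant robotic claws.",
                 "Please listen carefully to Joe Tribune before you act."]:
    print(segment(sentence))
```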
So, we can say that by using ML technologies you will save significant time and expense, and you will also avoid privacy issues, as you don't need to send your content anywhere else. It frees the human captioner from worrying about segmentation, placement, scene change detection, and even writing the text. The captioner can focus on difficult audio segments and do more creative tasks, like adding audio description or emphasizing and italicizing a particular piece of information, and you can have peace of mind knowing that your captions are of production quality. Today, the industry has developed many ML-based solutions to meet the captioning challenges we discussed. Intera Systems' captioning product is one such automated solution. It is unique in that it is a one-stop solution for all captioning needs, allowing broadcasters and media professionals to address requirements from caption generation to QC, auto-correction, review, editing, and export. Thank you for watching. For more information, feel free to reach us at info@interasystems.com.
