Speaker 1: Hello, welcome all. My name is Sana and I am from Intera Systems. Let me tell you more about us. At Intera Systems, we help streaming media companies ensure content readiness and delight viewers through our AI/ML-based solutions. In this talk, we will see how a cohesive ML-based solution helps media companies expedite caption generation and verification at a global scale and with ease.

Captions are quite old; in fact, the first captions appeared about 50 years ago. Since then, captions have come a long way and are present everywhere. They improve accessibility, brand recall, and memory retention, promote globalization, work around audio issues, and much more. Because of these benefits, major broadcasters mandated them long ago, and various compliance regulations have since been enacted. In short, captions are now mandatory. But captions are vital not just from a regulatory point of view, but also from a business point of view. Today, content in foreign languages is consumed globally only with the help of captions and subtitles. Captions and subtitles help content reach global audiences, garner more views, and monetize better.

But this is easier said than done. Even today, a lot of service providers struggle to add quality captions, for various technical and financial reasons. You might be surprised to know that it costs around $5 to $10 per minute of video. An expert takes around eight hours to caption one hour of content. There is a persistent lack of trained professionals. It is a tedious task, and there are strict quality guidelines for it. And in OTT delivery, where the video needs to be maintained at various quality levels, frame rates, and resolutions, a lot of editing and transcoding is involved; any transcoding or editing step, if not done properly, can degrade the captions or lose them altogether. Thus, manual captioning is impractical when the content volume is huge.

What makes it so difficult? To understand that, we must know what goes into good-quality captioning. First of all, you need a language expert who types fast and accurately and is aware of the context and the jargon. The person should be adept at caption segmentation rules. These rules are quite complex; they govern how captions are broken into proper parts, and if segmentation is not done properly, it can lead to misunderstanding and confusion. In languages where word boundaries are not separated by spaces, for example Japanese, an incorrect segmentation might change the meaning and create confusion. The text justification should also be proper. Proper start and end times should be assigned, taking care of scene-change boundaries, with more time given for difficult words, multiple speakers, and complex scenes. Captions should maintain a proper reading speed, and the speed should be lowered when the target audience is kids. They should be placed so as to avoid blocking objects of interest in the scene, and near the speaker to avoid excessive eye movement. Speaker identification needs to be done, and audio description needs to be added. In short, captions are not just audio transcriptions; a lot goes on in the mind of a captioner while creating them.

So we can understand that captioning is tedious, time-consuming, needs experts, and is a costly affair. What media companies need is an efficient and economical way to incorporate it into their workflows and meet regulations.
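To make one of these rules concrete, here is a minimal sketch of an automated reading-speed check of the kind a captioning tool could run. The thresholds and the Caption structure are illustrative assumptions, not figures from this talk.

```python
from dataclasses import dataclass

# Illustrative reading-speed limits in characters per second
# (real guidelines vary by broadcaster and by target audience).
MAX_CPS_ADULT = 17.0
MAX_CPS_CHILDREN = 13.0  # lower reading speed when the audience is kids


@dataclass
class Caption:
    text: str
    start: float  # seconds
    end: float    # seconds


def reading_speed(cap: Caption) -> float:
    """Characters per second for one caption block."""
    return len(cap.text) / max(cap.end - cap.start, 1e-3)


def flag_fast_captions(captions, for_children=False):
    """Return the captions that exceed the allowed reading speed."""
    limit = MAX_CPS_CHILDREN if for_children else MAX_CPS_ADULT
    return [c for c in captions if reading_speed(c) > limit]


caps = [
    Caption("I'm having nightmares.", start=1.0, end=2.5),
    Caption("Please listen carefully before you act.", start=2.5, end=3.3),
]
# Only the second caption is flagged: 39 characters in 0.8 s is far too fast to read.
print(flag_fast_captions(caps))
```

A review player can surface such flags next to the caption predictions, so the captioner only looks at the blocks that break a rule.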
So what can ML-based solutions do for us? There have been various advancements in ML-based solutions, but they are still not 100% accurate. Even so, at their current state, a lot of the burden can be taken off your shoulders by using a combination of ML technologies for audio, video, and natural language processing. They can generate caption predictions for review, along with confidence scores. And if an intelligent player is bundled with the solution, it can make the reviewing process effortless. Just imagine being presented with caption predictions and confidence scores: whenever the score is low, you review the prediction, tick it off, and you are done. It certainly reduces time, manual effort, and cost.

The economics are completely in favor of machine learning for captioning. Considering an industry average of $7 per minute of video, if you have two hours of content daily, for a month you will incur around $25,000. With just a one-time investment in an AI/ML-based solution, you can reduce this cost by about 80%, spending only around $1 per minute for review. Machine learning solutions are considerably cheaper.

So now let's understand the various technologies involved in caption generation: voice activity detection, speech-to-text conversion (also called speech recognition), speaker identification, natural language processing for punctuation and segmentation, scene change detection, and burnt-in text detection. Such a solution can transcribe the text, segment it, place it, and present it to you for review along with confidence scores. Accuracy also increases over time with self-learning models. You may be surprised by what these solutions can achieve. For example, a noise reduction step can be applied before transcription to get good results, whereas a human captioner would struggle to understand dialogue in the middle of noise.

Caption generation is a four-step process, so let's see in detail how machine learning helps in each of these four steps. The first step is transcription. It is one of the most complex steps in caption generation. First, voice activity is detected, and then the speech is converted to text using acoustic and language models. There is continuous learning here, as more and more out-of-vocabulary words are added to the speech recognition dictionary during review, so the predictions keep improving. At this step, speaker diarization is also done: speaker changes are detected and speaker names are identified. Some punctuation is also added at this step, using the silences and pauses in the audio and the modulation in the extracted speech features.

The second step is intelligent segmentation, where we use natural language processing. You must be wondering what segmentation is. The caption text cannot fill the entire screen, or it will block the video; a caption block can have one, two, or at most three lines, and each line should not contain more than 42 characters, otherwise it will run out of the display area. So if you have a very long sentence, you need to break it into multiple caption blocks. This segmentation needs to be done considering natural pauses and clause boundaries, so that the meaning of the sentence is not lost; a minimal sketch of this idea follows.
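As a rough illustration of the difference between hard-wrapping at the character limit and clause-aware segmentation, here is a toy sketch. The BREAK_BEFORE word list is a purely illustrative stand-in for a real NLP model.

```python
import textwrap

MAX_CHARS_PER_LINE = 42  # common caption line-length limit

# Toy stand-in for an NLP model: words this sketch treats as the start of
# a new clause or phrase. A real system would use proper clause detection.
BREAK_BEFORE = {"that", "before", "because", "which", "while", "and", "but", "by"}


def naive_segments(text):
    """Hard-wrap at the character limit, ignoring meaning."""
    return textwrap.wrap(text, MAX_CHARS_PER_LINE)


def clause_aware_segments(text):
    """Prefer starting a new caption line at a clause or phrase boundary,
    while still respecting the character limit."""
    lines, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and (len(candidate) > MAX_CHARS_PER_LINE
                        or word.lower() in BREAK_BEFORE):
            lines.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines


sentence = "I'm having nightmares that I am being chased by these giant robotic claws."
print(naive_segments(sentence))         # splits mid-clause: "... I am being" / "chased by ..."
print(clause_aware_segments(sentence))  # "I'm having nightmares" / "that I am being chased" / "by these giant robotic claws."
```

The toy word list obviously over-triggers on common conjunctions; the point is only that good segmentation needs linguistic information, not just a character count.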
Now let's consider an example: "I'm having nightmares that I am being chased by these giant robotic claws." Segmented at proper clause boundaries, it remains easy to follow. On the other hand, if we segmented simply whenever we reached the limit of 42 characters, we would get rather incomprehensible captions, like "I'm having nightmares that I am being chased by these giant" followed by "robotic claws." Also consider the next example. The correct segmentation would be "Please listen carefully to Joe Tribbiani" followed by "before you act." Here we have considered clause boundaries and kept the first name together with the surname. Using only the character-limit rule, we would instead get "Please listen carefully to Joe" followed by "Tribbiani before you act." So proper segmentation really eases comprehension. At this step, timestamping is also done once the segments are formed, and scene change detection is employed so that the captions do not run over scene-change boundaries.

The third step is placement. Captions are generally placed at the bottom center of the screen, but sometimes there is already an object of interest at the bottom of the screen. For example, in this match video you can see a scoreboard and some match information; if we placed the captions at the bottom center by default, they would block that information. Here, a caption generation system would use burnt-in text detection to detect this and direct the engine to place the captions at the top rather than at the bottom.

And once the system has done its magic, it's time for review. It is the last step, but a very important one in captioning. An intelligent player greatly eases this review process. It helps you review the captions along with the audio and video, and gives useful insights while editing. It can tell you when you have made a spelling mistake, or offer spelling suggestions. It can flag when captioning rules are violated, for example when you have exceeded the limit on the number of characters in a row or the reading speed limit. It can play the captions along with the audio, display how the final captions will look over the video, highlight shot changes for you, display audio waveforms to ease time marking, and so on. So these are the four steps in which machine learning comes in very handy in the process of caption generation.

Another use for ML-based solutions is timestamping. Sometimes a program transcript is already available; it might have been created when the program was being shot. It can be aligned and timestamped to create captions; the correct term here is forced alignment. Because these transcripts are very accurate, the result is production-ready captions. Once you have accurate captions, you can use another machine learning application, machine translation, to generate subtitles in different languages. In fact, timestamping is the most successful application of ML technology in captioning workflows.

Another AI-based application is quality assurance. ML-based solutions might not be 100% accurate in telling you what is present in the audio, but they can very accurately rule out what is not present in the audio, or flag when it is not present at the intended time. And once quality issues are identified, they can also be corrected using the same technology employed in captioning. For example, let's say that because of some human error there is a frame-rate mismatch, and because of it there is a shift or drift in the captions. Machine learning solutions can very reliably detect and correct that. Similarly, if a wrong caption track is muxed with the audio track, it will be reliably flagged as a mismatch error, because none of the caption text will be found in the audio.
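A minimal sketch of two such automated QC checks follows, assuming an ASR transcript of the audio is available. The frame rates, threshold, and function names are illustrative assumptions, not anything from the talk.

```python
def rescale_captions(captions, authored_fps=25.0, actual_fps=23.976):
    """Correct a drift caused by a frame-rate mismatch: captions timed
    against the wrong rate fall further out of sync as the program runs,
    so a linear rescale of the timestamps realigns them."""
    factor = authored_fps / actual_fps
    return [(start * factor, end * factor, text) for start, end, text in captions]


def looks_like_wrong_track(captions, asr_words, min_overlap=0.2):
    """Flag a caption track whose words barely appear in the audio
    transcript -- a strong hint that the wrong track was muxed in."""
    caption_words = {w.lower().strip(".,!?")
                     for _, _, text in captions
                     for w in text.split()}
    if not caption_words:
        return True
    overlap = len(caption_words & set(asr_words)) / len(caption_words)
    return overlap < min_overlap


caps = [(600.0, 602.5, "Please listen carefully before you act.")]
print(rescale_captions(caps))   # ten minutes in, the ~25 s drift is corrected
print(looks_like_wrong_track(caps, ["totally", "unrelated", "dialogue"]))  # True
```

In practice the mismatch factor could be estimated by comparing caption timestamps against ASR word timings; the correction itself then reduces to a rescale like this.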
As a broadcaster accepting content from many different providers and geographies, you do not need subject-matter experts for every topic or language; you can comfortably rely on ML-based solutions.

To conclude, using ML-based solutions saves time, human effort, and cost. Not only that, it also addresses privacy concerns, as you do not have to send your content anywhere else. It reduces human error and allows time for more creativity: the captioner is freed from constantly thinking about segmentation, placement, and typing text, and can focus on the more difficult audio segments, add audio descriptions, and so on. And you can have peace of mind knowing that your captions are of production quality.

Now, let me take a moment to introduce Intera Systems' Maiden Captions, a product designed specifically for this business need. It has three modules, a Caption Generation module, a Caption QC module, and a Caption Correction module, all well integrated with a feature-rich player for review. It is an application based on a machine learning approach and helps you create and manage captions in file-based workflows. It is available on-premises and in the cloud, and as of now supports more than 14 languages. For OTT platforms, it has all the transcoding steps for creating captions at different frame rates, video resolutions, and timecode formats. I must say it is a one-stop solution for all your captioning needs.

That was all for now. Thanks for watching. For more information, please reach out to us at info@interasystems.com.