Understanding SMPTE 2110-40: Enhancing Closed Captioning and Subtitling Workflows
Bill McLaughlin discusses the SMPTE 2110-40 standard, its impact on closed captioning and subtitling workflows, and the advantages over traditional SDI systems.
Live Closed Captioning and Subtitling in SMPTE 2110-40 - Bill McLaughlin, EEG
Added on 09/08/2024

Speaker 1: Okay, it's my great pleasure to introduce Bill McLaughlin from EEG. That's right, EEG, it's right in the corner, right in front of me, how did I not realize that. And live closed captioning and subtitling in 2110, again, a very important thing, particularly for the U.S. market. So take it away, Bill.

Speaker 2: This presentation is about the 2110-40 standard, which is for ancillary data. So that's a part of everything Wes just showed you. We're going to go into a little bit more detail, not so much on the payload specs for that standard, but on the type of workflow changes and non-changes, basically how is 2110-40 different from putting VANC data in SDI. Just a quick intro on EEG and my role. We are the leading equipment company, especially in North America, for closed caption data, and we handle world subtitling data as well. The closed caption and subtitle data is used both to live insert data from speech recognition services and from human transcription services, and also to play back files when there might be a transcript file that had been generated independently before the program runs, which is then bound into the SDI ancillary data in real time. That's a common workflow, and we'll look at both of those. So how does 2110-40 work, and what is our workflow? First, on the basics, as Wes describes in the overall talk, when you look at 2110, your major difference from SDI is that instead of having a single stream where essentially you have time-domain multiplexing between video, audio, and ancillary data, you have three different media streams or flows. You have a multicast stream of -20 data for video, obviously the largest. Maybe 100 or 1,000 times smaller than that, you have -30 data for audio in a separate multicast. And then, the lowest bit rate of all, you have the -40 data for ancillary. Now the -40, -30, and -20 data can be transmitted over the network in separate streams and recovered and put back together because they share a relationship to the same media PTP clock, so you can identify where each one belongs. The ancillary data essentially has, after its PTP timestamp and the header information, a field number for interlaced versus progressive and those kinds of things, and then the core payload is basically the same VANC packets that you have in SDI video. So it's VANC and HANC data according to the SMPTE 291 standard, and you're going to be able to encode all of the same types of data that you would encode in SDI ancillary data in 2110-40. In that sense, a lot of the workflows are the same. Now there are some new standards underway in the 2110 group regarding taking data that is not traditional SDI ancillary data and making that into a separately timed stream, but for now, the -40 standard sticks to the legacy SDI formatting of the ancillary data. So it's a wrapper around a wrapper. That's fine, it's still a very small amount of data. For captioning, you're going to be looking at the SMPTE 334 packet, with DID 0x61 and SDID 0x01, for US and Canada captioning. For the EU, UK, and other PAL markets, OP-47 teletext would usually be the way to carry live subtitling data. And then there are the ARIB STD-B37 VANC packets for the ARIB markets. So again, these ancillary data packets that have been defined and used in SDI are still really going to be the same for 2110. Now, because they're the same data packets, you don't necessarily need a really smart new conversion workflow to do this.
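
To make the packet format a bit more concrete, here is a minimal Python sketch, not from the talk and not a bit-exact RFC 8331 implementation, of how the SMPTE 291 portion of a -40 ANC packet carrying a SMPTE 334 caption payload might be assembled: DID and SDID words with parity bits, a data count, user data words, and a checksum, packed as contiguous 10-bit words. The RTP header and the RFC 8331 location fields are omitted, and the two example payload bytes are placeholders rather than a real CEA-708 caption data packet.

```python
# Hypothetical sketch: assembling the SMPTE 291 portion of a 2110-40 ANC packet
# for a SMPTE 334 caption payload (DID 0x61, SDID 0x01). RTP framing and the
# RFC 8331 location header (line number, horizontal offset, etc.) are omitted.

def to_10bit(value8: int) -> int:
    """Extend an 8-bit value to a 10-bit ANC word: b8 = even parity of b0-b7, b9 = NOT b8."""
    b8 = bin(value8 & 0xFF).count("1") & 1
    return ((b8 ^ 1) << 9) | (b8 << 8) | (value8 & 0xFF)

def build_anc_words(did: int, sdid: int, user_data: bytes) -> list[int]:
    """Return the 10-bit word sequence DID, SDID, DC, UDW..., CS per SMPTE 291-1."""
    words = [to_10bit(did), to_10bit(sdid), to_10bit(len(user_data))]
    words += [to_10bit(b) for b in user_data]          # assumes 8-bit payload bytes
    cs = sum(w & 0x1FF for w in words) & 0x1FF         # sum of the 9 LSBs, mod 512
    cs |= (((cs >> 8) & 1) ^ 1) << 9                   # b9 of the checksum = NOT b8
    words.append(cs)
    return words

def pack_10bit(words: list[int]) -> bytes:
    """Pack 10-bit words contiguously and pad to a 32-bit boundary (word_align)."""
    bits = "".join(f"{w & 0x3FF:010b}" for w in words)
    bits += "0" * (-len(bits) % 32)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# The two user-data bytes below are placeholders, not a real CEA-708 CDP.
payload = pack_10bit(build_anc_words(did=0x61, sdid=0x01, user_data=b"\x96\x69"))
print(payload.hex())
```
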
It pretty much will work with a basic SDI to IP gateway. So for example, if you had an SDI caption encoder, you could have an IP core system, but let's say you have an SDI island and you haven't replaced the caption encoder yet. You can pretty much bring the IP data into an IP to SDI gateway, put it through SDI equipment that inserts ancillary data, take it back out through a conventional SDI to IP gateway, and you have a new ancillary data stream. So just to illustrate, this basically is data that's directly and losslessly transferable, and you have a simple start like that. Of course, you're creating multicasts in your network for video and for audio on both ends that are identical and are not adding any new value, and that's a lot of bandwidth and a lot of extra equipment. So if we do native 2110-40 insertion, and often that's even going to be doable in a pure software environment because the data rate is not very high, all we really need is the -30 audio signals, the ones that we're going to use to create live captions, because the text is essentially going to be a live translation of the audio through some transcription process. So we're going to read in the -30 audio and we're going to transmit the -40 ancillary data. We're only pulling one multicast that's maybe in the megabits range, and we're only emitting a multicast that's going to be less than even a megabit. So we have a very efficient workflow. We can put many channels of closed captioning on a single server, and we're not stuck, as we would be in an SDI plant or in a straightforward, direct, time-domain-multiplexed SDI to IP mapping like 2022, pushing around extra video to all of these channels of caption insertion. So we've pretty much gone through what all these advantages are. You have a density advantage and you have a big bandwidth advantage, and that's going to save you money on the switch, on the fibers, and on the SFPs, and the caption insertion is designed here to be a very simple thing. Where are those captions coming from? That matters, because we're going to look at some of the elements that can make this more complex. The captions come essentially from outside of the video workflow, in a way that is not directly synchronous to the video flow. Transcriptions are going to be generated by any process, human or automatic, on a level of words and sentences. They're not going to be generated on a level where each video frame has unique data. So effectively your caption encoder is responsible for taking this relatively slow-moving asynchronous data, bringing it into the video domain, and making it synchronous with the frames. In 2110, that process is basically governed through PTP: for every frame of video, a 2110-40 packet is going to be emitted. Even if it's blank of ancillary data, or even if it carries a captioning packet with no new text, you're going to get a packet every frame, and when there's caption text to send, that'll go out. Now you might need more than simply a 2110-40 stream coming out of each instance of a transcription process. And the reason to think about that is that in a lot of live caption and subtitling workflows, you're going to have a mix of sources.
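
As a rough illustration of that per-frame emission model, here is a minimal sketch of a native caption inserter loop. It assumes a transcription queue that fills asynchronously and made-up multicast parameters; the RTP packaging, PTP-locked timestamps, and real ANC payload construction are left out.

```python
# Hypothetical sketch of a native 2110-40 caption inserter loop: consume live
# transcription text asynchronously and emit exactly one ancillary payload per
# video frame (blank when there is no new text), paced at the frame rate.
# The multicast address, frame rate, and build_caption_payload() are invented
# stand-ins; RTP headers and PTP-derived timestamps are omitted entirely.
import socket
import time
from queue import Queue, Empty

ANC_GROUP, ANC_PORT = "239.10.40.1", 5004    # made-up -40 multicast destination
FRAME_RATE = 30000 / 1001                    # e.g. 29.97 fps

def build_caption_payload(text):
    # Placeholder: a real inserter would wrap the text in a CEA-708 CDP inside
    # a SMPTE 334 ANC packet and an RFC 8331 payload (see the earlier sketch).
    return text.encode("utf-8") if text else b""

def run_inserter(transcript_queue: Queue) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)
    frame_period = 1.0 / FRAME_RATE
    next_frame = time.monotonic()
    while True:
        try:
            text = transcript_queue.get_nowait()   # caption text arrives asynchronously
        except Empty:
            text = None                            # nothing new: still emit a blank frame
        sock.sendto(build_caption_payload(text), (ANC_GROUP, ANC_PORT))
        next_frame += frame_period                 # pace emission to the frame cadence
        time.sleep(max(0.0, next_frame - time.monotonic()))
```
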
Captions that are already in upstream content could be post-produced material, or anything like a commercial or an interstitial that has captions already on it, and you're going to need to merge the live captioning data into that stream. It needs to flow smoothly according to the caption standards like 708 or DVB teletext or subtitles. So the caption encoder is often going to need to have a 2110-40 stream input as well: effectively, it's going to take in a multicast that already has caption data some of the time, or in some of the language services, and it's going to be able to mix that with new live data at different times or in different languages, and then output a master caption stream that includes all of the captions that were prerecorded as part of the file, any captions that could have been inserted at an upstream point, and also captions that are inserted locally and live. Once you have that caption stream, the good news is that in some workflows you may not need to independently encode those captions against multiple videos. In a current SDI workflow, if you have two networks, or a clean feed and a feed with graphics on it, you may actually need two closed caption encoders to separately strike the ancillary packets onto each of the two SDI signals. With 2110-40, you're not going to need to do that, because if you have a single source of the caption multicast, you can tune a receiver, a multi-viewer, any 2110 device, to receive that same ancillary signal associated with a different audio and video signal. So again, since these are separate streams, you only need one copy of a given live closed captioning stream, and then you can send that to multiple screens and multiple record processes and use it as needed. Two things we want to talk about in terms of these ancillary multicasts are a couple of architectures that I think will both have some place in this world. There's no requirement that you put all of the ancillary data in a single multicast stream. The same thing is going on with audio, and you can see that in the IS-08 demo in terms of audio grouping; you have the same question with ancillary data. The question is, do we have expert devices that make different packet types, like captions, timecode, SCTE 104 triggers, HDR information, and does each of these experts generate its own multicast which is independently receivable at other points in the plant? That has a lot of advantages, because at that point no one device needs to be your expert device in all of these different standards, you can route them independently, and no single device needs to take them all apart. But what that does mean is that a receiver that does need to put these all together, either to become a gateway to SDI or to finally put data out to an OTT platform or over-the-air broadcast, is going to need to be able to receive potentially two or four or N different multicasts. And it has to arbitrate which of these, say in SDI, is going to go on which line of data; they can't all appear right at the beginning of, say, line 10. They have to have different positions. So it requires some intelligence regarding ancillary data processing in the 2110 receiver.
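
A sketch of the caption merge described above, mixing upstream captions with locally inserted live captions, might look like the following. The AncPacket structure and the "live captions replace upstream captions while a live session is active" policy are simplifications for illustration, not the behavior of any specific product; a real encoder would splice at the CEA-708 service level rather than replacing whole packets.

```python
# Hypothetical sketch of the merge decision inside a caption encoder that has
# both an upstream 2110-40 input and a live transcription source.
from dataclasses import dataclass
from typing import Optional

CAPTION_DID, CAPTION_SDID = 0x61, 0x01   # SMPTE 334 caption ANC (US/Canada)

@dataclass
class AncPacket:
    did: int
    sdid: int
    payload: bytes

def merge_frame(upstream: list[AncPacket],
                live_caption: Optional[AncPacket],
                live_active: bool) -> list[AncPacket]:
    """Build the output ANC packet list for one video frame."""
    out: list[AncPacket] = []
    for pkt in upstream:
        is_caption = (pkt.did, pkt.sdid) == (CAPTION_DID, CAPTION_SDID)
        if is_caption and live_active:
            continue                      # drop upstream captions while live captioning runs
        out.append(pkt)                   # pass everything else through untouched
    if live_active and live_caption is not None:
        out.append(live_caption)          # insert the locally generated caption packet
    return out
```
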
On the other hand, if you have a simple receiver, just a simple SDI gateway that doesn't want to make complicated decisions about your ancillary data and just wants to take a single -40 multicast, or maybe two if you have the blue and the amber networks, then you have a complex transmitter and a simple receiver. The reason the transmitter needs to be more complex to deal with that receiver is that it's essentially going to produce the multicast streams additively. So you're going to have captions in the packet, and then if you had a different device that wanted to put in a timecode, it would add that and regenerate the packets. You could do this without incurring a frame delay, but you're going to need a device that respects other packets in the chain in order to create the single multicast that has multiple DIDs and SDIDs inside it. So, a combination of the two when that's required: let's say you have the simple transmitters and the simple receivers. You may have a need for a device that's essentially a side-chain keyer architecture for the ancillary multicasts, meaning we have these ancillary data multicasts, they each represent a different important expert feed, and we're going to need to combine them for the receiver into a single multicast. Perhaps we're going to do that with an expert device that actually has some settings that say, here is the order to put the packets in the multicast, and here are the horizontal and line number offsets to put into these packets once they're combined. That device is a little bit of an ancillary expert at that point for the whole plant. So, 2110-40 is actually the most recently ratified of the core standards for video, audio, and ancillary data. It was, I believe, officially done in March of last year, just before the NAB show. So where are vendors in the support for this? It's something that I've personally been involved in a lot as part of the interops and IP showcases, and we have come a good way since pre-standardization deployments; a lot has happened in the past year. At this point, and you can see this in the JT-NM Tested catalog, most of the vendors of SDI to IP gateways and 2110 playout are supporting ancillary data now, at least for simple packets that are well known. If you have closed captions associated, for example, with an MXF file on your playout, most of the playout systems are able to put those captions out in at least one national standard that meets the market they're intended for. For SDI to IP gateways, most of them are now able to take a completely generic load of ancillary data from SDI and convert it to -40, or take it from -40 and convert it back to SDI. Again, because the -40 data is still a SMPTE 291 payload, the gateway doesn't have to have expert knowledge of any of that data. It's an important question to ask when buying products, to confirm that they support arbitrary ancillary packets in -40, in any valid location for SDI, but this is mostly testing very well. The IP multi-viewers that are in use at this booth also show closed captions.
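
A combiner like that might, in outline, look like the following sketch: per frame, gather the packets from each expert flow, apply a configured placement (output order, line number, horizontal offset) per DID/SDID, and emit one merged list. The placement table values are invented examples, not recommendations from the talk or from the standards.

```python
# Hypothetical sketch of an ancillary combiner that merges several expert -40
# flows (captions, timecode, SCTE 104, ...) into one multicast for a simple
# receiver. The configuration format and line/offset values are illustrative.
from dataclasses import dataclass

@dataclass
class AncPacket:
    did: int
    sdid: int
    payload: bytes
    line_number: int = 0          # vertical position to use when mapped back to SDI
    horizontal_offset: int = 0

# Per-(DID, SDID) placement policy: (output order, line number, horizontal offset).
PLACEMENT = {
    (0x61, 0x01): (0, 11, 0),     # SMPTE 334 captions
    (0x60, 0x60): (1, 9, 0),      # ancillary timecode
    (0x41, 0x07): (2, 13, 0),     # SCTE 104 messages (example placement only)
}
DEFAULT_PLACEMENT = (99, 12, 0)   # unknown packet types go last, on a spare line

def combine_frame(flows: list[list[AncPacket]]) -> list[AncPacket]:
    """Merge one frame's packets from N expert flows into a single ordered list."""
    merged: list[AncPacket] = []
    for flow in flows:
        for pkt in flow:
            _, line, offset = PLACEMENT.get((pkt.did, pkt.sdid), DEFAULT_PLACEMENT)
            pkt.line_number, pkt.horizontal_offset = line, offset
            merged.append(pkt)
    # Stable sort by configured order so packets always land in a predictable spot.
    merged.sort(key=lambda p: PLACEMENT.get((p.did, p.sdid), DEFAULT_PLACEMENT)[0])
    return merged
```
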
Some of them decode captions in both the North American 608/708 format and in the European OP-47 format. The OP-47 format is still a little less common, and that's probably the case in SDI workflows too, but we'll probably continue to see some improvement in the decoders available in that area. Finally, the test and measurement equipment for -40 is still a little bit of a work in progress. Most of the market-leading products at this point have the ability to show you that a -40 stream exists and to show you some basic information about its timing, but the level of support that you would have in an SDI analyzer product, where you could really break down the VANC packets in more detail, is for the most part still under development in the test and measurement world, so a little bit of a caveat: this might take some extra help. There is a Wireshark dissector, still in development but open source and freely available, that will capture the 2110-40 stream and provide you with a full breakdown. Again, because your standard 2110-40 data rate is maybe only 50 or 100 kilobits per second, capture is very manageable, unlike the video, where it's probably going to require some special equipment and be hard to do with just commercial off-the-shelf hardware. So to summarize what we can do with 2110-40: we have a good path between SDI ancillary data and 2110-40 ancillary data, we have mostly good vendor support, and I think we're starting to see customers deploy 2110 workflows to a point where there's actually a need to think about some of the more complex aspects of the data that we're using: live closed caption insertion versus pre-recorded captions, getting that all into the same stream, and actually starting to have a full studio workflow that's going to require multiple expert devices. And that's something that I think, basically, we're well prepared for. As the standard has become a year old, it has pretty much been proven to be interoperable between different vendors, and roughly one year after ratification, I think it's going very well by that metric. So that's all I have for you, short of your questions, so thank you.
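
Where full test and measurement support is missing, even a bare-bones monitor can confirm that a -40 flow is present and that its bit rate is in the expected tens of kilobits per second. The sketch below joins a made-up multicast group and reports packets per second and bit rate; it does no RTP or RFC 8331 parsing and is only a sanity check.

```python
# Hypothetical sketch of a minimal -40 flow monitor: join a multicast group,
# count packets, and report the observed bit rate once per second.
import socket
import struct
import time

GROUP, PORT = "239.10.40.1", 5004            # made-up -40 multicast address

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
sock.settimeout(1.0)

packets, byte_count, window_start = 0, 0, time.monotonic()
while True:
    try:
        data, _ = sock.recvfrom(2048)
        packets += 1
        byte_count += len(data)
    except socket.timeout:
        pass                                  # no packet this second; keep reporting
    now = time.monotonic()
    if now - window_start >= 1.0:
        kbps = byte_count * 8 / 1000 / (now - window_start)
        print(f"{packets} pkts/s, {kbps:.1f} kbit/s")   # expect tens of kbit/s for -40
        packets, byte_count, window_start = 0, 0, now
```
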

Speaker 1: So just a question for you. Do you think that people are going to be sending caption streams in multiple languages commonly on 2110 networks, because it's so much easier than any system we've had before? Do you think that's going to increase the number of captions in production?

Speaker 2: You know, it certainly can support more languages of captions. That's something that isn't hard to do in terms of the IP production. I think, if you look at the drivers for why you don't have multiple languages of captioning more often, especially in the United States, it comes down to the fact that there isn't a regulation to do it, and the generation of the data is expensive. So I think a more relevant reason to have multiple captions now is that it's becoming cheaper and cheaper to do AI translation. Whereas AI transcription can still be a bit controversial in the caption community, in terms of whether it's providing the quality that's needed, AI translation of an accurately transcribed program really works quite well. So I think that's something that's going to happen. That can be a driver for having multiple languages of subtitles, as long as there's audience interest in really seeing it and a measurable impact on the business side.

Speaker 1: Well, good. And then the only other thing I want to know is, are they ever going to figure out how to spell the names of the baseball players correctly?

Speaker 2: Yeah, you should feed that back to your local station. It's amazing. They need to get the accented characters right and all that.

Speaker 1: Yeah, or just, you know, even something. All right. Anybody else? Any questions? Okay. Well, thank you very much. Appreciate it. Take care. Bye bye.
