Comprehensive Guide to Creating Accurate Captions Using YouTube and oTranscribe
Learn a four-step workflow for generating, editing, segmenting, and fine-tuning captions using YouTube's ASR, oTranscribe, and the Aeneas alignment tool.
A Captioning Workflow

Speaker 1: This captioning workflow consists of four steps. Step 1: generating a raw transcript using YouTube's automatic speech recognition technology. Step 2: editing the raw transcript in oTranscribe, a free text editor you can use in your web browser. Step 3: segmenting and aligning the edited transcript with the audio using a script that combines Perl and Aeneas, a Python-based forced alignment tool. Step 4: checking the accuracy of the timestamps in the resulting captions file.

Step 1. Generating a raw transcript. I decided to use YouTube's automatic speech recognition technology to create a raw transcript for a few reasons. First, YouTube is free. Second, the word error rate (WER) has been good in my testing, coming in at 4% compared to 7% for Speechmatics, 7% for Popup Archive, 8% for Trint, 9% for the Google Speech API, 13% for Google Voice Typing, 14% for a trained Dragon profile, 23% for an untrained Dragon profile, and 26% for IBM Watson. Third, many people are familiar with YouTube. After uploading the audio or video as an unlisted video to YouTube, I wait for the automatic closed caption file to be generated. I then use a script built on youtube-dl and mainly sed commands to download the auto-captions file and clean up the text, so that it becomes a plain text file without any of the formatting of the VTT file from YouTube.

Step 2. Editing the transcript. I then open the raw transcript file in oTranscribe, a web-based text editor. This tool is free and allows me to quickly play back the part of the audio that I want to hear while I'm editing. The text is not time-coded to the audio, which would be great, but the shortcuts are handy and my work is saved every second. While I'm editing the transcript, I make sure to include important information, such as speaker identification and important non-speech sounds. I place brackets around this information, which matters for the alignment process in the next step.
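To make the download-and-cleanup stage of Step 1 concrete, here is a minimal Python sketch. It assumes the youtube-dl Python module is installed; the workflow's actual script uses youtube-dl together with sed, so the VTT cleanup below is only a stand-in for those sed commands, and the video URL, option values, and file names are illustrative placeholders.

```python
import re
import youtube_dl

# Download only the auto-generated English captions (no video) as a VTT file.
# "VIDEO_ID" is a placeholder for the unlisted video's ID.
ydl_opts = {
    "skip_download": True,
    "writeautomaticsub": True,
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "outtmpl": "raw_captions.%(ext)s",
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])

# Strip the VTT formatting so only plain transcript text remains,
# roughly what the sed commands in the original script accomplish.
lines = []
with open("raw_captions.en.vtt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if (not line or line == "WEBVTT" or "-->" in line
                or line.startswith(("Kind:", "Language:"))):
            continue
        line = re.sub(r"<[^>]+>", "", line)   # drop inline timing/styling tags
        if lines and line == lines[-1]:       # drop rolling duplicates in auto captions
            continue
        lines.append(line)

with open("raw_transcript.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```

The resulting raw_transcript.txt is the plain text that then gets loaded into oTranscribe for editing in Step 2.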

Speaker 2: The text on the right says, save the planet, kill yourself. Kill yourself.

Speaker 3: Kill yourself. These things are an expression of what we might call liberal environmentalism. The basic idea behind liberal environmentalism is that mankind is bad for the planet.

Speaker 1: Once the editing is done and the transcript is perfect, as best as I can tell, I export it as a text file from oTranscribe.

Step 3. Segmenting and aligning the transcript. The next step in the process is to segment the transcript so that the sentences are broken into caption-ready chunks. In keeping with best practices in captioning, we want caption blocks that do not exceed 35 characters per line. We also want caption blocks with two consecutive lines when possible, which can help reduce the rate at which the captions change and make them easier to read. Finally, we want the end of a sentence always to appear at the end of a caption block. For this step, it is also important that the full stops used in abbreviations or honorifics are not treated as sentence endings. The segmentation of the transcript is performed in the first part of a script that I have created for this step in the captioning workflow. After segmentation, the new chunks are aligned with the audio file in the second part of the script. The alignment is computed via a wonderful Python library called Aeneas, developed by Alberto Pettarin. In the Aeneas portion of my script, I can adjust some useful parameters so that the aligner works just the way I want for my video, such as how long the head and tail of the video are, and whether to remove the non-speech segments of the audio from the alignment process. By executing this script, I end up with a captions file and an HTML file that I can use for checking the accuracy of the timestamps.

Step 4. Checking the accuracy of the captions file. With the captions file in hand, I am now at the final step in my workflow, which is to fine-tune the timestamps in my captions file, if necessary. To do this, I use an HTML editor called finetuneas, which allows me to quickly check the timestamps and download the corrected captions file.
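To illustrate the segmentation rules described in Step 3, here is a simplified Python sketch rather than the Perl script mentioned above: it protects full stops in a small, illustrative list of abbreviations, splits the transcript into sentences, wraps each sentence to lines of at most 35 characters, and pairs lines into two-line caption blocks so that every sentence ending closes a block. The abbreviation list and file names are assumptions.

```python
import re
import textwrap

MAX_CHARS = 35  # captioning best practice: at most 35 characters per line
ABBREVIATIONS = ["Mr.", "Mrs.", "Ms.", "Dr.", "e.g.", "i.e.", "etc."]  # illustrative list

def split_sentences(text):
    """Split on sentence-ending punctuation, but not on abbreviation full stops."""
    protected = text
    for abbr in ABBREVIATIONS:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    sentences = re.split(r"(?<=[.!?])\s+", protected)
    return [s.replace("<DOT>", ".") for s in sentences if s.strip()]

def segment(text):
    """Return caption chunks of one or two lines, each line <= MAX_CHARS,
    with every sentence ending falling at the end of a chunk."""
    chunks = []
    for sentence in split_sentences(text):
        lines = textwrap.wrap(sentence, width=MAX_CHARS)
        # Pair consecutive lines into two-line caption blocks where possible.
        for i in range(0, len(lines), 2):
            chunks.append("\n".join(lines[i:i + 2]))
    return chunks

with open("edited_transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

# Write blank-line-separated fragments, the layout that aeneas' "subtitles"
# text type accepts (each fragment may span multiple caption lines).
with open("segmented.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(segment(transcript)) + "\n")
```

For the alignment part of Step 3, here is a minimal sketch using the aeneas Python API (Task and ExecuteTask, as documented in the library's examples). The head/tail lengths and non-speech handling correspond to the kinds of parameters mentioned above, but the specific values and file paths are assumptions, and the workflow's actual script may invoke aeneas differently.

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Task configuration: English audio, blank-line-separated "subtitles" text input,
# SRT output. The head/tail lengths (in seconds) and the non-speech handling are
# example values only.
config_string = (
    "task_language=eng"
    "|is_text_type=subtitles"
    "|os_task_file_format=srt"
    "|is_audio_file_head_length=2.0"
    "|is_audio_file_tail_length=1.0"
    "|task_adjust_boundary_nonspeech_min=1.0"
    "|task_adjust_boundary_nonspeech_string=REMOVE"
)

task = Task(config_string=config_string)
task.audio_file_path_absolute = "/path/to/audio.mp3"        # placeholder paths
task.text_file_path_absolute = "/path/to/segmented.txt"
task.sync_map_file_path_absolute = "/path/to/captions.srt"

# Compute the forced alignment and write the captions file.
ExecuteTask(task).execute()
task.output_sync_map_file()
```

The resulting SRT file is what then gets checked, and fine-tuned if necessary, in Step 4.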

Speaker 2: What is the point? What's the point of Orthodox Christianity? There's so many versions of Christian faith out there to choose from. Why Orthodox?

Speaker 1: Once I confirm that the captions file is accurate, I can upload it to YouTube or Amara, or include it as a sidecar file in my preferred video editor.
