For most legal videos and training content, SRT is the safest simple caption file, while VTT is better for web players that need styling, placement, or metadata. Pick the format based on where the video will play, what the platform accepts, and whether you need clean audit-ready text or richer on-screen control.
This guide explains the main caption file formats in plain English, with practical tips for compliance, editing, timecodes, encoding, and speaker labels.
Key takeaways
- SRT is a plain-text caption format with sequence numbers, timecodes, and caption text.
- VTT is also plain text, but it supports web-friendly features like cue settings, positioning, and limited styling.
- For legal video, clarity, exact timing, speaker labels, and version control matter more than fancy styling.
- For training video, VTT can help when captions need better placement or must work smoothly in HTML5 players.
- Caption format choice can affect accessibility workflows, platform upload rules, editing speed, and long-term storage.
- Always keep a clean master transcript or caption file before converting to other formats.
What caption files contain and why the format matters
A caption file is a text file that tells a video player what words to show and when to show them. It usually includes timecodes, caption text, and sometimes extra instructions such as text position or speaker labels.
Caption file formats matter because video platforms do not all read them the same way. A file that works well in one training system may fail, lose styling, or show captions in the wrong place in another system.
For legal and training work, captions do more than help viewers follow the audio. They support accessibility, review, search, quoting, translation, and recordkeeping.
Captions also help teams avoid confusion when audio is hard to hear. This can happen in depositions, recorded interviews, webinars, safety training, compliance training, and court-related video review.
Most caption files include these parts:
- Timecodes: Start and end times for each caption cue.
- Caption text: The spoken words and useful non-speech audio, such as [laughter] or [door closes].
- Line breaks: How text appears on screen.
- Speaker labels: Names or roles, when needed for clarity.
- Formatting instructions: Supported by some formats, but not all.
The main question is not “Which format is best?” The better question is “Which format fits this platform, this audience, and this recordkeeping need?”
SRT vs VTT: the practical difference
SRT and VTT are the two caption file formats most teams see first. Both are text-based, easy to open in a basic text editor, and widely used.
What is an SRT file?
SRT stands for SubRip Subtitle. It is one of the simplest and most common caption file formats.
An SRT file contains numbered caption blocks. Each block has a sequence number, a start and end time, and the caption text.
A basic SRT cue looks like this:
    1
    00:00:03,000 --> 00:00:06,500
    Good morning. Please state your name for the record.
SRT uses a comma before milliseconds, such as 00:00:03,000. That small detail matters because some systems reject files with the wrong punctuation.
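As a rough illustration of why the punctuation matters, here is a minimal Python sketch of a strict timestamp check. The function name srt_time_to_ms and the exact pattern are assumptions made for this example; real platforms may be stricter or more forgiving.

```python
import re

# SRT timestamps use a comma before the milliseconds: HH:MM:SS,mmm
SRT_TIME = re.compile(r"^(\d{2,}):([0-5]\d):([0-5]\d),(\d{3})$")


def srt_time_to_ms(stamp: str) -> int:
    """Convert an SRT timestamp to milliseconds, rejecting wrong punctuation."""
    match = SRT_TIME.match(stamp.strip())
    if match is None:
        raise ValueError(f"not a valid SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


print(srt_time_to_ms("00:00:03,000"))  # 3000
```

A timestamp written with a period, such as 00:00:03.000, raises an error here, which is roughly how a strict uploader would behave.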
SRT is a good choice when you need a clean caption file that many platforms can accept. It works well for simple training videos, legal review clips, internal archives, and video sharing platforms.
What is a VTT file?
VTT stands for WebVTT, or Web Video Text Tracks. It was designed for the web and works well with HTML5 video players.
A VTT file usually starts with the line WEBVTT. It uses timecodes like SRT, but with a period before milliseconds, such as 00:00:03.000.
A basic VTT cue looks like this:
    WEBVTT

    00:00:03.000 --> 00:00:06.500
    Good morning. Please state your name for the record.
VTT can include cue settings for position, alignment, and other display details. The WebVTT standard defines how this format works for web text tracks.
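To show what cue settings look like in practice, the sketch below builds one VTT cue in Python. The helper format_vtt_cue is hypothetical. The line and align settings are defined by WebVTT, but the values chosen here are only examples, and a given player may ignore them.

```python
def format_vtt_cue(start: str, end: str, text: str, **settings: str) -> str:
    """Build one WebVTT cue; settings go on the timing line as name:value."""
    timing = f"{start} --> {end}"
    if settings:
        pairs = " ".join(f"{name}:{value}" for name, value in settings.items())
        timing = f"{timing} {pairs}"
    return f"{timing}\n{text}\n"


cue = format_vtt_cue(
    "00:00:03.000",
    "00:00:06.500",
    "Good morning. Please state your name for the record.",
    line="10%",     # example value: ask for placement near the top
    align="start",  # example value: align text to the start edge
)
print("WEBVTT\n\n" + cue)
```

The settings sit on the same line as the timecodes, separated by spaces. That is why a cue with settings still opens cleanly in a basic text editor.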
VTT is often the better choice for learning management systems, web apps, browser-based video, and training portals. It gives developers and media teams more control than SRT.
The main SRT vs VTT differences
- Header: VTT starts with WEBVTT; SRT does not.
- Timecode punctuation: SRT uses commas for milliseconds; VTT uses periods.
- Numbering: SRT cues are numbered by convention, and many tools expect the numbers; VTT cue identifiers are optional.
- Styling: VTT supports more styling and cue settings than SRT.
- Web support: VTT is built for web video and HTML5 text tracks.
- Simplicity: SRT is easier to read, edit, and convert.
If you need simple captions for broad platform support, start with SRT. If your video will live in a web player or training platform that supports VTT, use VTT.
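The header and punctuation differences are small enough to sketch in a few lines of Python. This is a minimal example that assumes simple SRT input with no styling tags; the function name srt_to_vtt is made up for illustration, and a real workflow should still test the result in the target player.

```python
import re

# Matches an SRT timing line such as "00:00:03,000 --> 00:00:06,500".
TIMING = re.compile(r"(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Return WebVTT text for simple SRT input (styling tags are not handled)."""
    body = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    # Only timing lines are rewritten, so commas in the caption text survive.
    body = TIMING.sub(r"\1.\2 --> \3.\4", body)
    return "WEBVTT\n\n" + body + "\n"


sample = "1\n00:00:03,000 --> 00:00:06,500\nGood morning. Please state your name for the record.\n"
print(srt_to_vtt(sample))
```

The sketch keeps the SRT sequence numbers. WebVTT allows an optional identifier line before each timing line, so the numbers remain valid there.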
Related caption and subtitle formats you may see
SRT and VTT cover many everyday needs, but legal and training teams may also meet other formats. Some are used for broadcast, DVD workflows, editing tools, or platform-specific uploads.
TXT or plain transcript files
A TXT file may contain the full spoken text without timecodes. It is useful for review, search, legal notes, translation, and editing before captions are created.
A transcript is not the same as a caption file. Captions need timecodes so the words match the video.
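Going the other way is easy: a rough reading copy can be pulled out of a caption file by dropping the numbers and timecodes. The Python sketch below assumes well-formed SRT with a blank line between cues, and srt_to_plain_text is a hypothetical helper name. The result is a review aid, not a formatted transcript.

```python
def srt_to_plain_text(srt_text: str) -> str:
    """Drop cue numbers and timing lines, keeping only the caption text."""
    kept = []
    for block in srt_text.replace("\r\n", "\n").strip().split("\n\n"):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        # The caption text is everything after the timing line.
        for index, line in enumerate(lines):
            if "-->" in line:
                kept.append(" ".join(lines[index + 1:]))
                break
    return "\n".join(text for text in kept if text)


sample = (
    "1\n00:00:03,000 --> 00:00:06,500\nGood morning. Please state\nyour name for the record.\n\n"
    "2\n00:00:07,000 --> 00:00:09,000\nMy name is Jordan Smith.\n"
)
print(srt_to_plain_text(sample))
```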
SBV
SBV (SubViewer) is a simple caption format best known from YouTube caption workflows. It includes timecodes and text but is less common than SRT and VTT.
If a platform asks for SBV, you can often convert from SRT. Always review the converted file because timecode punctuation and line breaks may change.
TTML, DFXP, and XML-based formats
TTML and DFXP are XML-based caption formats. They can hold more structure and styling than SRT.
These formats may appear in broadcast, enterprise, and accessibility workflows. They are harder to edit by hand, so most teams use software to manage them.
SCC
SCC is a caption format often tied to broadcast and legacy closed caption workflows. It can carry closed caption data used in certain television and professional video systems.
SCC is not a format most legal or training teams should edit manually. Use it when a broadcaster, vendor, or platform specifically asks for it.
ASS and SSA
ASS and SSA are subtitle formats that support advanced styling, such as fonts, colors, and screen placement. They are common in some entertainment and fan-subtitle workflows.
They are usually not the first choice for legal or compliance training. Extra styling can create review and compatibility problems if the platform does not support it.
Embedded captions
Some videos carry captions inside the video file instead of in a separate sidecar file. This may be useful for delivery, but it can make editing and version control harder.
Sidecar files like SRT and VTT are easier to update without re-exporting the whole video. They also help teams keep a clear text record.
Compliance, accessibility, and legal video needs
Caption format affects compliance because it affects whether captions display correctly, can be reviewed, and can be preserved. The format alone does not make a video compliant.
Accessibility rules and policies often focus on the outcome: people who are deaf or hard of hearing must be able to access the audio information. In the W3C's Web Content Accessibility Guidelines (WCAG), Success Criterion 1.2.2, Captions (Prerecorded), calls for captions on prerecorded audio content in video.
For legal video, captions should help viewers understand speech without changing the meaning. This is especially important for depositions, hearings, recorded statements, training on legal duties, and internal investigations.
Key compliance and legal review factors include:
- Accuracy: Captions should match the audio as closely as the use case requires.
- Timing: Captions should appear when the words are spoken, not seconds before or after.
- Speaker clarity: Viewers should know who is speaking when it matters.
- Non-speech audio: Sounds that affect meaning should be included, such as [alarm sounds].
- Readability: Captions should not move too fast or cover key visuals.
- Version control: Teams should know which caption file matches which video version.
SRT can support these needs when the platform displays it correctly. VTT may help when you need more control over placement, especially if captions cover legal exhibits, slides, faces, or important screen content.
Do not use captions as a legal transcript unless your process requires and supports that use. A legal transcript may have formatting, certification, speaker identification, and review needs that captions do not meet.
If you need both captions and a transcript, create or keep a strong transcript source. GoTranscript offers closed caption services for caption files and video accessibility workflows.
Selection guide: which caption format should you choose?
Use the delivery platform as your first filter. Then check editing needs, compliance needs, and long-term storage needs.
Choose SRT when:
- You need a simple file that many platforms accept.
- You want an easy file to open, review, and edit in a text editor.
- You do not need custom placement or styling.
- You need captions for a legal review video, basic training video, webinar, or internal archive.
- Your team may need to convert the file into other formats later.
SRT is often the best starting point for teams that want clean, portable captions. It is also a practical format for review because people can read it without special tools.
Choose VTT when:
- Your video plays in a website, web app, or HTML5 video player.
- Your learning management system prefers or requires VTT.
- You need cue positioning or alignment.
- You want to include notes or metadata in a web-friendly format.
- You need better control so captions do not cover slides, forms, or legal exhibits.
VTT gives you more control, but that does not mean every platform will honor every VTT feature. Test the file in the actual player before final delivery.
Choose another format when:
- A broadcaster asks for SCC or another broadcast format.
- A video vendor requests TTML, DFXP, XML, or a platform-specific file.
- An editing system needs a certain caption export format.
- You need burned-in subtitles, where text becomes part of the video image.
When in doubt, ask the platform or vendor for its exact caption file requirements. File extensions alone are not enough because some platforms also enforce encoding, timecode, and line-length rules.
A quick decision checklist
- Where will the video play? Website, LMS, court system, internal portal, or video platform?
- What formats does the platform accept? SRT, VTT, SCC, TTML, or something else?
- Do captions need styling or placement? If yes, VTT or another advanced format may help.
- Will people edit the file by hand? If yes, SRT is usually simpler.
- Do you need a separate transcript? If yes, keep a transcript file apart from the caption file.
- Will the file support legal review? If yes, track versions and avoid undocumented edits.
Basic handling tips for clean caption files
Small formatting errors can break a caption upload or create confusing playback. Use a careful file handling process, especially for legal and training materials.
Timecodes
Each caption cue needs a clear start time and end time. Captions should appear close to the spoken words and remain on screen long enough to read.
Watch for these timecode issues:
- Overlapping cues that confuse the player.
- Large gaps where speech has no captions.
- Captions that appear too early and reveal answers before a speaker says them.
- Captions that appear too late and make testimony or training steps hard to follow.
- Wrong millisecond punctuation when converting SRT and VTT.
For legal clips, timing matters because people may compare captions to video moments. For training, timing matters because captions must match steps, warnings, and on-screen actions.
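Overlaps and long gaps are simple to flag automatically once cues are reduced to start and end times. The Python sketch below assumes the cues are already parsed into millisecond pairs and sorted by start time. The function name and the five-second gap threshold are illustrative assumptions, and a flagged gap may simply be silence.

```python
def find_timing_issues(cues, max_gap_ms=5000):
    """Flag timing problems in a list of (start_ms, end_ms) pairs sorted by start."""
    issues = []
    for index, (start, end) in enumerate(cues):
        number = index + 1
        if end <= start:
            issues.append(f"cue {number}: end is not after start")
        if index > 0:
            previous_end = cues[index - 1][1]
            if start < previous_end:
                issues.append(f"cue {number}: overlaps previous cue")
            elif start - previous_end > max_gap_ms:
                issues.append(f"cue {number}: gap of {start - previous_end} ms before cue")
    return issues


print(find_timing_issues([(3000, 6500), (6000, 8000), (20000, 22000)]))
```

A person still needs to review each flagged cue against the audio, because only the recording shows whether a gap is missing speech or a real pause.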
Encoding
Use UTF-8 encoding unless the platform says otherwise. UTF-8 handles common punctuation, names, symbols, and many non-English characters better than older encodings.
Encoding problems can turn quotation marks, accented names, or symbols into strange characters. Always open and preview the uploaded file after conversion.
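One way to catch encoding problems before upload is to decode the file strictly as UTF-8 and stop if that fails. This Python sketch uses the standard utf-8-sig codec, which also removes a leading byte order mark; the function name decode_caption_bytes is an assumption for the example.

```python
def decode_caption_bytes(raw: bytes) -> str:
    """Decode caption bytes as UTF-8, failing loudly on other encodings."""
    try:
        # "utf-8-sig" also strips a leading byte order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as error:
        raise ValueError(f"file is not valid UTF-8 (byte offset {error.start})") from error


print(decode_caption_bytes("Witness: José Núñez".encode("utf-8")))
```

A file saved in an older encoding such as Latin-1 usually fails this check as soon as it contains an accented name, which is better than finding strange characters after publishing.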
Speaker labels
Speaker labels help when more than one person speaks. They are important in depositions, interviews, panel training, role-play training, and meetings.
Use clear labels such as:
- Attorney: Please describe what happened next.
- Witness: I heard the alarm at about 3 p.m.
- Trainer: Pause before you enter the restricted area.
Keep labels consistent across the file. Do not switch between “Attorney,” “Counsel,” and a person’s name unless your style guide requires it.
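A quick count of labels can reveal stray variants. The Python sketch below assumes labels appear at the start of a caption line in the "Label: text" style shown above. The pattern is a rough heuristic and the function name is hypothetical, so ordinary text such as "Note: ..." would also be counted.

```python
import re
from collections import Counter

# Assumes a label is a capitalized word or short phrase followed by ": ".
LABEL = re.compile(r"^([A-Z][\w .'-]{0,30}):\s")


def count_speaker_labels(caption_lines):
    """Count each speaker label so that stray variants stand out."""
    labels = Counter()
    for line in caption_lines:
        match = LABEL.match(line)
        if match:
            labels[match.group(1)] += 1
    return labels


lines = [
    "Attorney: Please describe what happened next.",
    "Witness: I heard the alarm at about 3 p.m.",
    "Counsel: And what did you do then?",
    "Attorney: Thank you.",
]
print(count_speaker_labels(lines))
```

In this sample the single "Counsel" label stands out next to two "Attorney" labels, which is the kind of switch worth checking against the style guide.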
Line breaks and reading speed
Good captions are easy to read without pulling attention away from the video. Break lines at natural phrase points when possible.
Avoid splitting names, legal terms, or training commands in ways that change meaning. Also avoid long caption blocks that fill too much of the screen.
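Line length and reading speed can be estimated with simple arithmetic. In the Python sketch below, the limits of 42 characters per line and 20 characters per second are illustrative defaults only, not values from this guide or from any platform; use your own style guide's numbers. The function name is also an assumption.

```python
def check_readability(start_ms, end_ms, text, max_line_chars=42, max_chars_per_second=20):
    """Return warnings for one cue; both limits are illustrative defaults."""
    warnings = []
    for line in text.split("\n"):
        if len(line) > max_line_chars:
            warnings.append(f"line has {len(line)} characters")
    duration_s = (end_ms - start_ms) / 1000
    if duration_s <= 0:
        warnings.append("cue has no positive duration")
    else:
        rate = len(text.replace("\n", " ")) / duration_s
        if rate > max_chars_per_second:
            warnings.append(f"reading speed is {rate:.1f} characters per second")
    return warnings


print(check_readability(3000, 6500, "Good morning. Please state\nyour name for the record."))
```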
Non-speech sounds
Include sounds when they help the viewer understand the content. Examples include [siren], [phone rings], [laughter], [inaudible], or [overlapping speech].
Do not overload captions with every small noise. Focus on sounds that affect meaning, context, safety, or the record.
File names and version control
Use file names that match the video and version. This helps prevent teams from uploading captions for the wrong cut.
A practical file name may include the project, date, language, and version:
- deposition-smith-2025-04-12-en-v2.srt
- safety-training-module-3-en-final.vtt
Keep old versions if the video has legal or compliance value. Record who changed the file and why when edits affect meaning, speaker labels, or timing.
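One lightweight way to record changes is to store a checksum next to the editor's name and reason, so anyone can later confirm which exact file was reviewed. This Python sketch is only an assumption about how a team might do it; the field names and the version_record helper are made up for illustration.

```python
import hashlib


def version_record(file_name: str, content: bytes, editor: str, reason: str) -> dict:
    """Describe one saved version of a caption file for a change log."""
    return {
        "file": file_name,
        # The SHA-256 digest changes if even one character of the file changes.
        "sha256": hashlib.sha256(content).hexdigest(),
        "editor": editor,
        "reason": reason,
    }


record = version_record(
    "deposition-smith-2025-04-12-en-v2.srt",
    b"1\n00:00:03,000 --> 00:00:06,500\nGood morning.\n",
    editor="reviewer-1",
    reason="Corrected speaker label in cue 14",
)
print(record["file"], record["sha256"][:12])
```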
Editing and conversion pitfalls to avoid
Caption editing looks simple, but a small change can create a real problem. This is why legal and training teams should review captions in the final video player, not only in a text editor.
Do not assume conversion is perfect
Many tools can convert SRT to VTT or VTT to SRT. Conversion can still change punctuation, cue numbers, positioning, notes, or special characters.
After conversion, check the file for missing captions, broken timing, and strange characters. Also check whether VTT styling survived the move to another format.
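Part of that check can be automated by comparing the caption text of the two files cue by cue. The Python sketch below assumes both files separate cues with blank lines, and the helper names are hypothetical. It looks only at cue counts, cue text, and the Unicode replacement character that often marks an encoding failure, so timing and styling still need a manual look.

```python
def cue_texts(caption_text):
    """Return the caption text of each cue for SRT or simple VTT input."""
    texts = []
    for block in caption_text.replace("\r\n", "\n").strip().split("\n\n"):
        lines = block.split("\n")
        # Blocks without a timing line (such as the WEBVTT header) are skipped.
        for index, line in enumerate(lines):
            if "-->" in line:
                texts.append("\n".join(lines[index + 1:]).strip())
                break
    return texts


def compare_conversion(source_text, converted_text):
    """List differences in cue count and cue text between two caption files."""
    source, converted = cue_texts(source_text), cue_texts(converted_text)
    problems = []
    if len(source) != len(converted):
        problems.append(f"cue count changed: {len(source)} -> {len(converted)}")
    for number, (before, after) in enumerate(zip(source, converted), start=1):
        if before != after:
            problems.append(f"cue {number}: text changed")
        if "\ufffd" in after:
            problems.append(f"cue {number}: contains a replacement character")
    return problems


srt = "1\n00:00:03,000 --> 00:00:06,500\nGood morning.\n"
vtt = "WEBVTT\n\n00:00:03.000 --> 00:00:06.500\nGood morning.\n"
print(compare_conversion(srt, vtt))  # []
```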
Do not let captions cover key visuals
Captions can block slide text, exhibit numbers, medical labels, warning icons, or faces. This can weaken both accessibility and comprehension.
If placement matters, VTT may help in web players that support positioning. If the platform does not support placement, adjust the video layout or caption style during production.
Do not mix transcript edits with caption edits
A transcript edit may improve grammar or readability, but a caption should usually follow the spoken audio. Changing captions too much can make them less faithful to the recording.
For legal video, avoid cleaning up speech in ways that change meaning. For training, avoid rewriting instructions unless the caption file follows an approved script.
Do not ignore platform rules
Some platforms reject files because of encoding, invalid timecodes, unsupported tags, or the wrong extension. Others accept the file but display it poorly.
Upload and test captions before you publish or share the final video. Test on the devices your audience will use when possible.
Do not use burned-in text when users need control
Burned-in captions are always visible because they are part of the video image. This can help when a platform does not support caption files.
But burned-in captions cannot be turned off or restyled, and because they are part of the picture rather than text, assistive technology cannot read them as text. Use sidecar caption files when user control and accessibility options matter.
Common questions
Is SRT or VTT better for legal video?
SRT is often better for simple legal review because it is easy to read, edit, and store. VTT may be better if the legal video plays in a web system that needs caption placement or other web text track features.
Can I rename an SRT file to VTT?
No, renaming the file extension is not enough. You must also change the content: add the WEBVTT header as the first line, followed by a blank line, and change the comma before the milliseconds in each timecode to a period.
Do captions count as a transcript?
Captions and transcripts overlap, but they are not the same. Captions are timed for video playback, while transcripts are usually easier to read as a document.
Should speaker labels appear in captions?
Use speaker labels when they help the viewer understand who is speaking. They are especially useful in legal video, interviews, panel training, and scenes with off-screen speakers.
What encoding should I use for caption files?
UTF-8 is the safest default for most modern caption files. It helps preserve punctuation, symbols, and names with accent marks.
Can I use one caption file on every platform?
Sometimes, but not always. Many platforms accept SRT, while web players often prefer VTT, and broadcast or enterprise systems may require other formats.
Do I need captions for training videos?
Captions are often needed for accessibility and are also useful for noisy spaces, quiet viewing, search, and review. Check your organization’s accessibility rules and the platform’s caption requirements before publishing.
Final thoughts
Caption file formats are not just technical details. For legal and training video, they affect access, clarity, review, editing, and how well the video works on its final platform.
Use SRT when you need a simple and portable caption file. Use VTT when your web or training player supports it and you need more control over how captions appear.
If you need help creating captions, transcripts, or clean files for different platforms, GoTranscript offers professional transcription services that can support your video workflow.