Transcribe Any Audio or Video Using ElevenLabs ScribeV2 (Full Transcript)

Step-by-step guide to uploading files, tagging events, adding key terms, editing text, and exporting transcripts (including SRT/VTT) with ElevenLabs.

[00:00:00] Speaker 1: In this video you're going to learn how to transcribe any audio or video file using ElevenLabs, whether it's a podcast, a meeting recording, a YouTube video or even a full-length film. You can get a highly accurate transcription with timestamps, speaker labels and even entity detection in just a few clicks, and I'm going to walk you through the entire process step by step.
Inside ElevenLabs you can find a speech-to-text tool powered by their latest model, Scribe V2, which is one of the most accurate transcription models available in the world right now. It supports over 90 languages, it can handle files up to 10 hours long, and it does things like speaker diarization, which means it can tell who's speaking when, even with up to 32 different speakers. It also picks up non-speech audio events like laughter or applause, which is actually really useful depending on what you're transcribing. So let's begin and I'll show you all how it works, and if you want to follow along you can click the first link down in the description below.
Inside of ElevenLabs, to begin transcribing your content, all we want to do is go ahead and click on speech-to-text, and here we can start transcribing our video or audio by clicking transcribe files in the top right. Here, as you can see, all I have to do is simply drag and drop my file. So let's say we had a YouTube video with multiple speakers: I could drag this in and drop it. As you can see, we've got 30 minutes of audio, 240 megabytes, and I can even preview the audio if I want to, and if I wanted to upload something else I could go ahead. Now, this video right here is in English, and I could leave it up to the AI to detect the language or I could go and select it myself. Below that we can also choose to tag audio events. So if there are things like clapping, laughter or footsteps, Scribe V2 will detect that in our audio and tag the audio event with a timestamp.
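Editor's note: the speaker describes diarization as telling "who's speaking when". A minimal sketch of what that makes possible, assuming a hypothetical list of word records with `text`, `start` and `speaker` fields (this record shape is an illustration only and is not taken from the video or from ElevenLabs' output format): consecutive words from the same speaker are merged into turns and printed in the "[HH:MM:SS] Speaker: text" style used on this page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word. This record shape is hypothetical."""
    text: str
    start: float  # seconds from the start of the file
    speaker: str  # diarization label, e.g. "Speaker 1"


def group_turns(words):
    """Merge consecutive words from the same speaker into (speaker, start, text) turns."""
    turns = []
    for word in words:
        if turns and turns[-1][0] == word.speaker:
            speaker, start, text = turns[-1]
            turns[-1] = (speaker, start, text + " " + word.text)
        else:
            turns.append((word.speaker, word.start, word.text))
    return turns


def format_turn(turn):
    """Render a turn as "[HH:MM:SS] Speaker: text"."""
    speaker, start, text = turn
    total = int(start)
    stamp = f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
    return f"[{stamp}] {speaker}: {text}"


# Made-up sample data for illustration.
words = [
    Word("Welcome", 0.0, "Speaker 1"),
    Word("everyone.", 0.4, "Speaker 1"),
    Word("Thanks", 65.2, "Speaker 2"),
    Word("for", 65.5, "Speaker 2"),
    Word("having", 65.7, "Speaker 2"),
    Word("me.", 66.0, "Speaker 2"),
]
for turn in group_turns(words):
    print(format_turn(turn))
```

Each turn keeps the start time of its first word, which is why a speaker change produces a new timestamped line.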
We could then go ahead and choose to include subtitles, and we can add key terms. For example, if you have specific brand names or unique names that you want to make sure the AI gets right, you can go and add those right here. So for example we could go and add ElevenLabs. Next we simply click upload files, and as you can see our file begins uploading, and once it's done ElevenLabs will then transcribe it using Scribe V2.
And while it's processing, let's go ahead and transcribe some more audio. This time I'm going to drag in the audio track to one of my YouTube videos. So imagine I wanted to turn the YouTube video into a blog or have a transcript so my viewers can follow along, I can do so. Once again I can go through the same events. This time I don't want to tag audio events but I do want to include subtitles, and we're also going to include ElevenLabs as one of the key terms, and then I simply click upload.
And we can literally just see that the video that's 30 minutes long has just finished transcribing. Let's go ahead and click on it so we can take a look. Here we have the full transcription of our video. So at the top we've got the first line of the person being intro'd onto the scene. We then have the music while the speakers sit down on the stage, and then as you can see we're switching back and forth between the two speakers. I can go in and edit this text just like I would anywhere else, and I can even follow along with the exact video, and if we want we can choose to run a spell check just to make sure that all of the transcription is correct. And after running a spell check we can see that it's removed the stutter from "which" and turned it just into "which", and likewise with "pleasure" above. So we can go and accept those, and we can then go and preview our video. And as you can see we now have an accurate transcription, and we can go ahead and click on export, and here we can export it as a text file, a PDF, a DOCX, JSON, HTML and even SRT or VTT.
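Editor's note: the SRT export mentioned here is a plain-text format of numbered cues, each with a start and end time written as HH:MM:SS,mmm. A minimal sketch of how timed segments map onto that format, assuming hypothetical `(start_seconds, end_seconds, text)` tuples as input (the segment timings below are invented; this is not how ElevenLabs generates its export):

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before the milliseconds)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Invented timings for illustration.
segments = [
    (0.0, 2.5, "In this video you're going to learn"),
    (2.5, 5.0, "how to transcribe any audio or video file."),
]
print(to_srt(segments))
```

Working in whole milliseconds before splitting into hours, minutes and seconds avoids rounding a value such as 59.9996 seconds into an invalid "60" seconds field.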
And if we want to render this as SRT or VTT, we actually need to have opted in to subtitles when transcribing. So if we go back to my YouTube video, which should now be finished, as we can see right here. If I open this up, here on the top right I can click export, and here we can download as SRT and VTT, and that is how to transcribe your content.
And if we go back, the last thing I want to show you is Scribe Real-Time V2, which is the fastest, most accurate transcription model in the world. So if I click on try the demo and then we simply click transcribe, we are now talking on camera, and as you can see it's transcribing what I'm saying in real time straight into ElevenLabs. And this is a transcription API that you can go connect to any product or tool that you're building to get real-time transcription for your live content.
If you have any questions about how to transcribe your audio or video with ElevenLabs, let us know in the comment section down below. And if you enjoyed this video and you want to see more, please hit that like button and don't forget to subscribe. Thanks for watching.
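Editor's note: SRT and VTT carry the same cues and differ mainly in that WebVTT starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds. A minimal sketch of converting one to the other, assuming well-formed SRT input (the helper name `srt_to_vtt` is made up for this example; styling and positioning features of either format are ignored):

```python
import re

# Matches an SRT timestamp such as 00:01:05,250 and captures the two halves.
SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert SRT text to WebVTT: add the header and use '.' before milliseconds."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            line = SRT_TIME.sub(r"\1.\2", line)  # only touch timing lines
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


srt = """1
00:00:00,000 --> 00:00:02,500
Thanks for watching.
"""
print(srt_to_vtt(srt))
```

Restricting the substitution to lines containing `-->` leaves any commas in the spoken text untouched.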

Summary

The video explains how to transcribe audio/video files using ElevenLabs’ Speech-to-Text tool powered by ScribeV2. It covers uploading files (drag-and-drop), selecting or auto-detecting language, enabling audio event tagging (e.g., laughter/applause), opting into subtitles, adding key terms for better accuracy, and then editing and spell-checking the transcript. It also shows exporting in multiple formats (TXT, PDF, DOCX, JSON, HTML, SRT, VTT) and notes that SRT/VTT require subtitles enabled at upload. Finally, it demonstrates Scribe Real-Time V2 for live, real-time transcription and mentions the availability of an API for integrating real-time transcription into products.
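Editor's note: in the video, key terms are supplied before transcription so the model gets brand names right. The sketch below shows a different, simpler technique that serves a similar goal after the fact: rewriting known variant spellings in finished text. The variant patterns and the `normalize_terms` helper are assumptions made for illustration and say nothing about how ElevenLabs applies key terms internally.

```python
import re

# Hypothetical variant spellings (as regex fragments) mapped to the preferred form.
KEY_TERMS = {
    "ElevenLabs": [r"11\s?labs", r"eleven\s?labs"],
    "Scribe V2": [r"scribe\s?v2"],
}


def normalize_terms(text, key_terms=KEY_TERMS):
    """Replace known variant spellings with the preferred key term (case-insensitive)."""
    for preferred, patterns in key_terms.items():
        for pattern in patterns:
            text = re.sub(rf"\b{pattern}\b", preferred, text, flags=re.IGNORECASE)
    return text


print(normalize_terms("Inside 11 labs you can find scribev2."))
```

The word-boundary anchors keep the patterns from firing inside longer words, at the cost of missing variants the list does not anticipate.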

Title

How to Transcribe Audio/Video with ElevenLabs ScribeV2

Keywords

ElevenLabs
11Labs
ScribeV2
speech-to-text
transcription
speaker diarization
timestamps
speaker labels
entity detection
audio event tagging
subtitles
SRT
VTT
export formats
spell check
real-time transcription
Scribe Real-Time V2
transcription API
YouTube transcript
podcast transcription
meeting recording

Key Takeaways

ElevenLabs’ Speech-to-Text uses ScribeV2, supporting 90+ languages and files up to 10 hours.
You can enable speaker diarization for up to 32 speakers, plus timestamps and labels.
Optional audio event tagging can detect non-speech sounds like laughter or applause with timestamps.
Add key terms (brand or unique names) to improve recognition accuracy.
Enable subtitles during transcription if you want to export SRT/VTT caption files.
Transcripts are editable in-app and can be improved via spell check to remove stutters/typos.
Export transcripts in multiple formats including TXT, PDF, DOCX, JSON, HTML, SRT, and VTT.
Scribe Real-Time V2 provides live transcription and is available via an API for integrations.
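
The limits and the subtitle requirement listed above can be sketched as a small pre-upload check. This is an illustration only: the function name `check_job` and its parameters are hypothetical, and the thresholds are simply the figures quoted in the video (10 hours, 32 speakers, subtitles needed for SRT/VTT), not values read from any ElevenLabs API.

```python
MAX_DURATION_SECONDS = 10 * 60 * 60  # "files up to 10 hours", as quoted in the video
MAX_SPEAKERS = 32                    # diarization limit quoted in the video
SUBTITLE_FORMATS = {"srt", "vtt"}    # formats that need subtitles enabled at upload


def check_job(duration_seconds, speakers, export_formats, subtitles_enabled):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("file is longer than 10 hours")
    if speakers > MAX_SPEAKERS:
        problems.append("more than 32 speakers")
    blocked = sorted(SUBTITLE_FORMATS & {f.lower() for f in export_formats})
    if blocked and not subtitles_enabled:
        problems.append("enable subtitles to export " + "/".join(blocked))
    return problems


# The 30-minute, two-speaker video from the walkthrough, exported as DOCX and SRT.
print(check_job(30 * 60, 2, ["docx", "srt"], subtitles_enabled=False))
```

Returning a list of problems rather than raising on the first one lets a caller report everything that needs fixing before the upload is retried.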

Sentiments

Positive: The tone is instructional and upbeat, emphasizing ease of use, speed, and high accuracy (e.g., ‘one of the most accurate,’ ‘just a few clicks’), and encouraging viewers to try the tool and ask questions.
