Build Real-Time Transcription with Scribe V2 Realtime (Full Transcript)

Learn Scribe V2 vs Scribe V2 Realtime, tokens for client-side streaming, partial vs committed transcripts, commit strategies, and live translation via the Chrome Translator API.

[00:00:00] Speaker 1: Scribe V2 Realtime is an extremely fast speech-to-text model, which can be used for live transcriptions. The combination of speed and quality enables use cases that just weren't possible before. For example, we were able to build this real-time language translator using Scribe V2 Realtime and the Chrome Translator API. Hello, I am talking to you from the void, and my goal is to teach you about ElevenLabs.

In the age of AI, it's important to understand the systems and how they work, and knowing what your vibe-coded apps are actually doing helps you prompt better and fix bugs faster. So in this video, we will cover the whole process at a high level: how it works, how it flows through Scribe Realtime, the transcripts, and the Chrome API, and what the committed transcripts and live transcripts are. If you want a step-by-step tutorial that's more focused on the actual code you would write, you can click the link in the description. But if you want to understand how all this works, make sure to keep watching.

ElevenLabs offers two models, Scribe V2 and Scribe V2 Realtime. Scribe V2 is optimized for accuracy. It is used for batch transcription, subtitling, and captioning at scale. Scribe V2 Realtime is optimized for ultra-low latency, so it is great for voice agents, meeting note-takers, and other live applications. So Scribe V2 is a great choice when transcription can be asynchronous, whereas Scribe V2 Realtime is used when you need the transcription to happen live. For a live language translation application, Scribe V2 Realtime is the obvious choice.

To use this model, you need to use the Speech-to-Text API. There are two things you need to do to get started. First, you need to initialize the Scribe instance, and second, you need to connect to the API. Before you start, of course, you need an ElevenLabs API key, and you need to install the SDKs.
Then, depending on where you are calling this API from, you will need to initialize it differently. If you're calling it server-side, you can initialize directly with the API key and use that instance to connect to Scribe. But exposing an API key to the client is a massive security risk. So if you are streaming client-side, which we are going to do, you will need to initialize using a single-use token. This token needs to be generated server-side to protect the API key. Since we are building a React application, we will use this token approach, and the token is passed in during the connection phase.

Accomplishing that first step, creating the instance of Scribe, is trivial. You just give it the model ID. Once we have initialized it, we can connect to the Scribe API. For the React language translation project, we will be using client-side streaming, like we said, and thus we need that single-use token. First, we retrieve a token from our backend. Then we pass this token to the connect command, and, as you can see, we use the microphone for the input.

Now, where does the actual transcription happen? When working with the Scribe API, there are two types of transcripts: partial transcripts and committed transcripts. Partial transcripts are the live transcripts. They are delivered over a WebSocket and returned to you as you speak. So if you say the phrase "the cat is", you will see those words streamed in in real time. The other transcript is the committed transcript. This transcription works in committed segments. When and how you choose to commit your transcripts will define how your transcript is segmented.

For this, you need to define a commit strategy. There are two options for commit strategies. The first option is fully manual, which gives you full control to decide when transcripts get committed. The best practice is to commit during silences or at other logical points, such as the end of a turn.
The other option is to let the Scribe API determine quality segments on its own using voice activity detection, or VAD. This approach automatically detects speech and silence. When a silence threshold is reached, the transcription engine commits that transcript segment automatically.

Now, a question that comes to mind is: why do you even need a commit strategy? As you are speaking, Scribe is trying to transcribe every single word. So if you say, for example, "I scream", this can be recognized as "ice cream" because there isn't enough context yet. But once you have the full message, for example, "I scream every time I see a spider", you can commit the whole phrase, and this one should be correct. You should not see "ice cream" for the first part, because you have the full context of that conversation. So the committed transcript should be more accurate, given that it has the full context. Basically, the partial transcripts should be shown live, and the committed transcripts can serve as an accurate transcription of the conversation history.

Now, to display the actual transcriptions. Once you are connected to Scribe, the transcription happens automatically. You can get access to those transcripts using the partial transcript or the committed transcripts properties. The partial transcript, again, is the real-time transcript of the current segment that is being transcribed. The committed transcript is the conversation history. It is a list of segments that have been committed while connected to Scribe. So that's all you need to get a real-time transcript of your conversation.

Now you can dress this app up with a nice UI or add even more features, like language translation. To create the real-time language translator from the demo, you pass the transcripts to the Chrome AI Translator API. This displays the output in real time. So thank you for watching. If you enjoyed this video, make sure to like and subscribe, and I'll see you in the next one.
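Editor's sketch: the split between the partial transcript (the current, in-progress segment) and the committed transcripts (the conversation history) can be pictured as a small state update. The event shapes and the names `TranscriptEvent`, `TranscriptState`, and `applyEvent` below are hypothetical and chosen for illustration only; they are not the Scribe SDK's types, and the events are hand-written rather than real model output.

```typescript
// Hypothetical event shapes for illustration; not the real Scribe SDK types.
type TranscriptEvent =
  | { kind: "partial"; text: string }
  | { kind: "committed"; text: string };

interface TranscriptState {
  partial: string;     // in-progress hypothesis for the current segment
  committed: string[]; // finalized segments, oldest first
}

const emptyState: TranscriptState = { partial: "", committed: [] };

function applyEvent(state: TranscriptState, event: TranscriptEvent): TranscriptState {
  if (event.kind === "partial") {
    // A new partial replaces the previous partial and never touches history.
    return { ...state, partial: event.text };
  }
  // A commit appends the segment to history and clears the in-progress text.
  return { partial: "", committed: [...state.committed, event.text] };
}

// Hand-written events mirroring the "I scream" example from the talk.
const events: TranscriptEvent[] = [
  { kind: "partial", text: "ice cream" },
  { kind: "partial", text: "I scream every time" },
  { kind: "committed", text: "I scream every time I see a spider." },
  { kind: "partial", text: "the cat is" },
];

const finalState = events.reduce(applyEvent, emptyState);
console.log(finalState.committed); // the one committed segment
console.log(finalState.partial);   // the live text for the next segment
```

In a UI built this way, the early "ice cream" guess only ever appears in the live line; the history shows the committed segment, and a translator would typically be fed from whichever of the two the app wants to display.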

Summary

The transcript explains ElevenLabs’ Scribe V2 and Scribe V2 Realtime speech-to-text models, focusing on building a live language translation app using Scribe V2 Realtime plus the Chrome Translator API. It contrasts Scribe V2 (accuracy, batch/asynchronous transcription) with Scribe V2 Realtime (ultra-low latency for live apps). It outlines setup steps: obtain an ElevenLabs API key, install SDKs, initialize a Scribe instance with a model ID, and connect to the Speech-to-Text API. For client-side streaming (e.g., React), it recommends using a server-generated single-use token to avoid exposing the API key. It then describes partial (live, WebSocket-streamed) transcripts versus committed transcripts (more accurate, segmented history) and the need for a commit strategy—either manual (commit on silences/turns) or automatic using voice activity detection (VAD). Finally, it shows how to feed transcripts into the Chrome AI Translator API to display real-time translation output.
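Editor's sketch: the idea behind a single-use token can be shown with a toy in-memory model. This is an assumption-laden illustration, not how ElevenLabs issues or validates tokens: the class `TokenIssuer` and its methods are hypothetical, the token values are predictable counters purely for readability, and a real backend would authenticate with the API key and return an unpredictable, short-lived value.

```typescript
// Toy model of single-use tokens. Hypothetical names; not the ElevenLabs API.
class TokenIssuer {
  private outstanding = new Set<string>();
  private issuedCount = 0;

  // Runs on the server, where the API key lives. The key itself never
  // leaves this side; only the token is handed to the browser.
  issue(): string {
    this.issuedCount += 1;
    const token = `demo-token-${this.issuedCount}`; // real tokens must be random
    this.outstanding.add(token);
    return token;
  }

  // Called when a client connects. Succeeds once per token, then never again.
  redeem(token: string): boolean {
    return this.outstanding.delete(token);
  }
}

const issuer = new TokenIssuer();
const token = issuer.issue();

const firstUse = issuer.redeem(token);            // true: token accepted
const secondUse = issuer.redeem(token);           // false: already used
const unknownUse = issuer.redeem("never-issued"); // false: never issued

console.log({ firstUse, secondUse, unknownUse });
```

The point of the pattern is that a leaked token is worth one connection at most, whereas a leaked API key would grant ongoing access.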

Title

How Scribe V2 Realtime Powers Live Transcription & Translation

Keywords

ElevenLabs
Scribe V2
Scribe V2 Realtime
speech-to-text
real-time transcription
low latency
WebSocket
partial transcripts
committed transcripts
commit strategy
voice activity detection
VAD
single-use token
API key security
React
microphone streaming
Chrome Translator API
live language translation

Key Takeaways

Use Scribe V2 for high-accuracy batch transcription; use Scribe V2 Realtime for ultra-low-latency live applications.
For client-side streaming apps, do not expose the API key—generate a single-use token server-side and connect with that token.
Partial transcripts stream live over WebSocket for immediate UI feedback; committed transcripts represent the more accurate, finalized conversation history.
Choose a commit strategy: manual commits at logical boundaries (silence/turns) or automatic commits using VAD with a silence threshold.
Committed segments improve accuracy by leveraging more context than partial, in-progress hypotheses.
To build a live translator, send transcripts from Scribe V2 Realtime to the Chrome AI Translator API and render translations in real time.
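Editor's sketch: the automatic commit strategy from the takeaways above (commit once silence has lasted past a threshold) can be illustrated with a small simulation. This is a simplified stand-in, not Scribe's VAD: the `VadFrame` type, the `findCommitPoints` function, the 100 ms frame size, and the 500 ms threshold are all assumptions made for the example, and the speech/silence labels are given as input rather than detected from audio.

```typescript
// Simplified illustration of silence-threshold commits. Hypothetical names.
interface VadFrame {
  durationMs: number;
  isSpeech: boolean; // assumed to come from some upstream speech detector
}

// Returns the timestamps (ms from start) at which a segment would be committed.
function findCommitPoints(frames: VadFrame[], silenceThresholdMs: number): number[] {
  const commitAtMs: number[] = [];
  let elapsedMs = 0;
  let silenceMs = 0;
  let hasUncommittedSpeech = false;

  for (const frame of frames) {
    elapsedMs += frame.durationMs;
    if (frame.isSpeech) {
      silenceMs = 0; // any speech resets the silence timer
      hasUncommittedSpeech = true;
      continue;
    }
    silenceMs += frame.durationMs;
    // Commit only once per segment, and only if there is speech to commit.
    if (hasUncommittedSpeech && silenceMs >= silenceThresholdMs) {
      commitAtMs.push(elapsedMs);
      hasUncommittedSpeech = false;
    }
  }
  return commitAtMs;
}

// Helper: n consecutive 100 ms frames of speech or silence.
const run = (n: number, isSpeech: boolean): VadFrame[] =>
  Array.from({ length: n }, () => ({ durationMs: 100, isSpeech }));

const frames: VadFrame[] = [
  ...run(3, true),  // speech
  ...run(5, false), // long pause: reaches the threshold
  ...run(2, true),  // speech
  ...run(2, false), // short pause: below the threshold
  ...run(1, true),  // speech
  ...run(6, false), // long pause: reaches the threshold
];

const commits = findCommitPoints(frames, 500);
console.log(commits); // [800, 1800]
```

The short 200 ms pause does not trigger a commit, so the words on either side of it stay in one segment. A manual strategy would replace the threshold check with an explicit call at a point the application chooses.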

Sentiments

Positive: Enthusiastic, promotional tone highlighting speed/quality benefits, practical guidance, and encouragement to like/subscribe.
