[00:00:00] Speaker 1: Scribe V2 Realtime is an extremely fast speech-to-text model, which can be used for live transcriptions. The combination of speed and quality enables use cases that just weren't possible before. For example, we were able to build this real-time language translator using Scribe V2 Realtime and the Chrome Translator API. Hello, I am talking to you from the void, and my goal is to teach you about ElevenLabs. In the age of AI, it's important to understand the systems and how they work, and knowing what your vibe-coded apps are actually doing helps you prompt better and fix bugs faster. So in this video, we will cover the whole process from a high level and how it works, how it goes to Scribe Realtime, transcripts, and the Chrome API, and what the committed transcripts and live transcripts are. But if you want a step-by-step tutorial that's more focused on the actual code you would write, you can click the link in the description. But if you want to understand how all this works, make sure to keep watching. ElevenLabs offers two models, Scribe V2 and Scribe V2 Realtime. Scribe V2 is optimized for accuracy. This is used for batch transcription, subtitling, and captioning at scale. Scribe V2 Realtime is optimized for ultra-low latency. So this is great for voice agents, meeting note-takers, and other live applications. So Scribe V2 is a great choice for transcription when it can be asynchronous, whereas Scribe V2 Realtime is used when you need the transcription to happen live. For a live language translation application, Scribe V2 Realtime is the obvious choice. To use this model, you need to use the Speech-to-Text API. There are two things you need to execute before you get started. First, you need to initialize the Scribe instance, and second, you need to connect to the API. Now, before you start, of course, you need to have an ElevenLabs API key, and you need to install the SDKs.
Then, depending on where you are calling this API from, you will need to initialize it differently. If you're calling it server-side, you can initialize directly with the API key, and use that instance to connect to Scribe. But exposing an API key to the client is a massive security risk. So if you are streaming client-side, which we are going to do, you will need to initialize using a single-use token. This token also needs to be generated server-side to protect the API key. Since we are building a React application, we will use this token approach. And this token is passed to the connection phase. To accomplish that first thing of creating the instance of Scribe, it's very trivial. You just give it the model ID. So now, once we have initialized it, we can connect to the Scribe API. For the React language translation project, we will be using client-side streaming, like we said, and thus, we need that single-use token. First, we will need to retrieve a token from our backend. We're going to pass this token to the connect command, and then, you see, we use the microphone for the input. Now, where does the actual transcription happen? When working with the Scribe API, there are two types of transcripts, partial transcripts and committed transcripts. Partial transcripts are the live transcripts. These transcripts happen via WebSocket, and they are returned to you as you speak. So if you say the phrase "the cat is", you will see those words streamed in, in real time. The other transcript is the committed transcript. This transcription works in committed segments. When and how you choose to commit your transcripts will define how your transcript is segmented. For this, you need to define a commit strategy. Now, there are two options for commit strategies. The first option is fully manual, which gives you full control to decide when transcripts get committed. The best practice for this is to do it during silences or other logical points like a turn model.
The other option is to let the Scribe API determine quality segments on its own using voice activity detection, or VAD. So this approach automatically detects speech and silent segments. When a silence threshold is reached, the transcription engine will commit that transcript segment automatically. Now, a question that comes to mind is why do you even need a commit strategy? So as you are speaking, Scribe is trying to transcribe every single word. So if you are saying, for example, "I scream", this can be recognized as "ice cream" because you don't have enough context. But when you have the full message, for example, "I scream every time I see a spider", you can commit this whole phrase and this one should be correct. You should not see "ice cream" for this first part because you have the full context of that conversation. So this committed transcript should be more accurate given that it has the full context. So basically, the partial transcripts should be shown live and the committed transcripts can be an accurate transcription of the conversation history. Now, to display the actual transcriptions. So once you are connected to Scribe, the transcription is automatically happening. You can get access to those transcripts using the partial transcript or the committed transcripts properties. Partial transcript, again, is the real-time transcript of the current segment that is being transcribed. The committed transcript is the conversation history. So this is a list of segments that have been committed while connected to Scribe. So that's all you need to get a real-time transcript of your conversation. Now you can dress this app up with a nice UI or add even more features like language translation. To create the real-time language translator from the demo, you pass the transcripts to the Chrome AI Translator API. This displays the output in real time. So thank you for watching. If you enjoyed this video, make sure to like and subscribe and I'll see you in the next one.
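
As a rough illustration of the partial versus committed distinction described in the video, here is a minimal TypeScript sketch of how an app might keep the two apart: each partial transcript overwrites the live line, and each commit appends to the conversation history and clears the live line. The event shapes and the names `TranscriptEvent`, `TranscriptState`, `applyEvent`, and `render` are hypothetical and exist only for this example; they are not the ElevenLabs SDK's types, and the events are hard-coded rather than received from Scribe.

```typescript
// Hypothetical event shapes for illustration only (not the SDK's types).
type TranscriptEvent =
  | { kind: "partial"; text: string }    // live guess for the current segment
  | { kind: "committed"; text: string }; // final text for a finished segment

interface TranscriptState {
  committed: string[]; // conversation history, one entry per committed segment
  partial: string;     // live text for the segment still being spoken
}

const initialState: TranscriptState = { committed: [], partial: "" };

function applyEvent(state: TranscriptState, event: TranscriptEvent): TranscriptState {
  if (event.kind === "partial") {
    // A partial replaces the previous live guess and never touches history.
    return { ...state, partial: event.text };
  }
  // A commit appends the final text to history and clears the live line.
  return { committed: [...state.committed, event.text], partial: "" };
}

function render(state: TranscriptState): string {
  // Show the history first, then whatever is currently being spoken.
  return [...state.committed, state.partial]
    .filter((segment) => segment.length > 0)
    .join(" ");
}

// Simulated stream: an early wrong guess is replaced once more context arrives.
const events: TranscriptEvent[] = [
  { kind: "partial", text: "ice cream" },
  { kind: "partial", text: "I scream every time" },
  { kind: "committed", text: "I scream every time I see a spider." },
  { kind: "partial", text: "the cat is" },
];

const finalState = events.reduce(applyEvent, initialState);
console.log(render(finalState));
```

In this sketch the early guess "ice cream" never reaches the history, because only committed text is appended to it. In a translator like the one in the demo, the same state could feed the translation step, for example by translating committed segments once and re-translating only the live line.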