How to Transcribe Audio with ElevenLabs Scribe V2 (Full Transcript)

Set up your API key, use the TypeScript SDK, and enable diarization and entity detection to turn audio files or recordings into rich transcripts.

[00:00:00] Speaker 1: ElevenLabs just released the best speech-to-text model in the world, but instead of talking about how good it is, let me just show you. It supports key-term prompting, so you can configure the model to recognize specific words, for example brand names like ElevenLabs, and entity detection, so you can extract information like phone numbers, 773-888-1989. It also supports speaker diarization for up to 48 speakers, which means you can associate and identify 48 different people. It has dynamic audio tagging and much more. And all of this can be built into your products using the ElevenLabs libraries. In this video, I'll show you how you can start transcribing audio with Scribe V2.

So it all starts with an API key, which you need in order to connect to ElevenLabs. Go to your ElevenLabs account, click the little developers icon over here, and click Create an API key. Give it access to all the things you need. At a minimum, you need speech-to-text here, but if you're just learning, I'd just enable all of them. Then you can create your key and start working with it. Once you have it, copy that value and add it to your environment variables, in your .env file, and it should look something like this. Make sure you call it ELEVENLABS_API_KEY. Then you need to install the ElevenLabs library using your favorite package installer, depending on which language you're using. Here, we're going to use TypeScript, so I'm using pnpm to install ElevenLabs. Here it shows I already have dotenv, but if you don't have it, make sure you do pnpm install dotenv. This will make it easier for you to manage your environment variables.

The example we're going to create here is just a simple back-end function using Node.js that takes an audio file and converts it to a transcription. It's going to use Scribe V2. I'll explain all the settings, and I want to keep the example really simple, because once you understand it, you can apply it to whatever project you want. To get it all started, we need to create an ElevenLabs client. Now, since we named our API key ELEVENLABS_API_KEY, the ElevenLabs client will automatically just bring it in through the dotenv config. Then, to start a transcription, we need a file. There's a public file you can find in the docs of Nicole talking; it's just a regular MP3 of someone speaking with a soft and whispery American accent. Once we have that MP3, we want to turn it into a Blob, which is a binary large object. It takes the buffer of raw bits and bytes and wraps it with its MIME type, audio/mp3, so it's typed as a file.

Now, to get the transcription from this file, we use the ElevenLabs client we created, call speech-to-text, and convert the file using our Scribe V2 model. We can output this directly to the console. So let's see how this looks. It's calling the API with the file that we have, and here's the whole transcription object we get back. Notice there are multiple things you get. First, the language code, which is English here, and it says the probability is pretty certain that this is English. And the transcription is correct.
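Put together, the flow described above might look something like this in TypeScript. This is a sketch, not the video's exact code: the @elevenlabs/elevenlabs-js package name and the Nicole sample URL are taken from the public ElevenLabs docs, and the scribe_v2 model id is an assumption based on the video, so check the current docs for the exact id.

```ts
// .env        →  ELEVENLABS_API_KEY=your_key_here
// install     →  pnpm install @elevenlabs/elevenlabs-js dotenv
import "dotenv/config"; // loads ELEVENLABS_API_KEY into process.env
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

// The client picks up ELEVENLABS_API_KEY from the environment automatically.
const elevenlabs = new ElevenLabsClient();

// Public sample file of Nicole talking, linked from the ElevenLabs docs.
const response = await fetch(
  "https://storage.googleapis.com/eleven-public-cdn/audio/marketing/nicole.mp3",
);

// Wrap the raw bytes in a Blob (binary large object) typed as MP3 audio.
const audioBlob = new Blob([await response.arrayBuffer()], { type: "audio/mp3" });

// Transcribe with Scribe V2 ("scribe_v2" model id assumed; see current docs).
const transcription = await elevenlabs.speechToText.convert({
  file: audioBlob,
  modelId: "scribe_v2",
});

// Full object: language code, language probability, text, words[], ...
console.log(transcription);
```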
"With a soft and whispery American accent, I'm the ideal choice for creating ASMR content, meditative guides, or adding an intimate feel to your narrative projects." This is very cool. But you also get all the words broken down for you in a nice array. It tells you the start of each word: we have "with", then a space, then "a", then a space, then "soft". It tells you exactly the timestamp when each word started, when it finished, the type of the word, and the text. Now, while I was recording, I was curious what logprob is, and it's the log of the probability with which the word is predicted, so basically how confident ElevenLabs is that the word was predicted correctly, and zero is the highest. So if we look at our data, it looks like all of these were predicted with a very high probability. You can extract just the text by reading the text property, and this way we get a clean text output, just like this.

There's also quite a handful of other features. For example, you have the key terms, you have the entity detection, multi-channel; diarization is somewhere up here. Yeah, diarize is just a true or false, so by setting this flag you get diarization for up to 48 speakers; there's a sketch of these options below. Entity detection can take either a string, so we can just do PII, people's personal information, or you can pass an array of the different types of entities you want to detect, so for example offensive language, personal information, and other types.

And that's the basics of Scribe V2. It's a very simple API, and really powerful once you put it into your applications, like I did in the app that I showcased in the introduction. That was an Astro application, which is my favorite, I guess, web framework. This application is server-side rendered, so I made sure my API is on the server. Make sure it's not exposed to the client, because you don't want your API keys in the client. And this is the speech-to-text endpoint, and you can see it does the exact same thing here. We create a client and we convert; we detect the personal information like phone numbers; we use Scribe V2. It's the exact same code we had. All I did here is add a bunch of checks to make sure that everything works nicely.

Now, what I did here on the front end is tap to record. But remember, we're taking in a file, so I take the recording we just created, I create a Blob, a binary large object file, from it, and I pass this to the back end, the speech-to-text API, and it gives me the transcription back. Then what I did is display it nicely in this list and extract the entities whenever you get them. You get a list of entities, so I guess let me show you that as well. If you are trying to identify entities, I have this console log with the transcription's entities, and these are the detected entities, which was the phone number. So on the front end, I take this, convert it to numbers, and you can have a cool-looking application like this. And the ideas of what you can do with this are infinite. So if you're looking to get started with transcription, I recommend going to elevenlabs.io, creating an account, getting an API key, and starting to build with Scribe V2.
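Here is a sketch of the flags mentioned above. The diarize boolean is documented in the ElevenLabs SDK; the entity-detection parameter name, its accepted values, and the shape of the returned entities are assumptions based on the video, so verify them against the current SDK reference.

```ts
// Same client and blob as before, now with diarization and entity detection.
const transcription = await elevenlabs.speechToText.convert({
  file: audioBlob,
  modelId: "scribe_v2", // model id assumed from the video
  diarize: true, // label up to 48 distinct speakers
  // Assumed parameter: per the video it takes a single string ("pii")
  // or an array of entity types (offensive language, personal info, ...).
  entityDetection: ["pii"],
});

console.log(transcription.text);     // just the clean transcript text
console.log(transcription.words);    // word-level timestamps, type, logprob
console.log(transcription.entities); // assumed field: detected entities
```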
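And a browser-side sketch of the tap-to-record flow described above, using the standard MediaRecorder API. The /api/speech-to-text path is just this demo's naming, not a fixed route; the point is that the recorded blob goes to a server endpoint so the API key never reaches the client.

```ts
// Browser side: record audio, wrap it in a Blob, send it to the backend.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream);
const chunks: Blob[] = [];

recorder.ondataavailable = (event) => chunks.push(event.data);
recorder.onstop = async () => {
  const audioBlob = new Blob(chunks, { type: "audio/webm" });
  const form = new FormData();
  form.append("file", audioBlob, "recording.webm");

  // The server route runs the same speechToText.convert call shown earlier,
  // keeping ELEVENLABS_API_KEY out of the client bundle.
  const res = await fetch("/api/speech-to-text", { method: "POST", body: form });
  const transcription = await res.json();
  console.log(transcription.text, transcription.entities);
};

recorder.start();          // tap to record...
// later: recorder.stop(); // ...tap again to stop and transcribe
```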

AI Insights
Summary
ElevenLabs released Scribe V2, a speech-to-text model featuring key-term prompting, entity/PII detection (e.g., phone numbers), multi-channel support, dynamic audio tagging, and speaker diarization for up to 48 speakers. The speaker demonstrates how to set up an ElevenLabs API key, store it in environment variables, install the ElevenLabs library in a TypeScript/Node.js project, create a client, load an MP3 file, convert it to a Blob, and call the speech-to-text endpoint with the Scribe V2 model to retrieve a transcription. The returned object includes language detection with confidence, full transcript text, and word-level timestamps plus log-probabilities. The speaker also shows enabling diarization and entity detection via flags, and discusses building a server-side API to keep keys private, sending recorded audio from the frontend as a Blob, receiving transcription plus detected entities, and rendering results in an app.
Title
Demo: Building speech-to-text with ElevenLabs Scribe V2
Keywords
ElevenLabs, Scribe V2, speech-to-text, transcription, TypeScript, Node.js, API key, environment variables, .env, key-term prompting, entity detection, PII detection, phone number extraction, speaker diarization, word timestamps, log probability, Blob, frontend recording, server-side API
Key Takeaways
  • Scribe V2 supports key-term prompting to improve recognition of specific terms like brand names.
  • Entity/PII detection can automatically extract sensitive information such as phone numbers.
  • Speaker diarization can label up to 48 different speakers in one audio stream.
  • The API response includes language detection confidence plus word-level timestamps and log-probabilities.
  • Keep ElevenLabs API keys server-side; send audio blobs from the client to a backend for transcription.
  • Setup flow: create API key → set env var → install SDK → create client → send audio file/blob to speech-to-text with model set to Scribe V2.
Sentiments
Positive: The tone is enthusiastic and promotional, highlighting Scribe V2 as 'best in the world' and emphasizing ease of integration and powerful features with practical demo steps.