[00:00:00] Speaker 1: ElevenLabs just released the best speech-to-text model in the world, but instead of talking about how good it is, let me just show you. It supports key-term prompting, so you can configure the model to recognize specific words, for example, brand names like ElevenLabs, and entity detection, so you can extract specific information like phone numbers: 773-888-1989. It also supports speaker diarization for up to 48 speakers, which means you can associate and identify up to 48 different people. It has dynamic audio tagging and much more. And all of this can be built into your products using the ElevenLabs libraries. In this video, I'll show you how you can start transcribing audio with Scribe V2. It all starts with the API keys that let you connect to ElevenLabs. Go to your ElevenLabs account, click the little developer's icon over here, and click "Create an API key." Give it access to everything you need: at minimum, you need speech-to-text, but if you're just learning, I'd enable all of them. Then you can create your key and start working with it. Once you have it, copy that value and add it to your environment variables, so your .env file, and it should look something like this. Make sure you call it ELEVENLABS_API_KEY. Then you need to install the ElevenLabs library using your favorite package installer, depending on which language you're using. Here, we're going to use TypeScript, so I'm using pnpm to install the ElevenLabs SDK. It shows I already have dotenv, but if you don't have it, make sure you run pnpm install dotenv. This will make it easier for you to manage your environment variables. The example we're going to create here is a simple back-end function using Node.js that takes an audio file and converts it into a transcription. It's going to use Scribe V2.
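The setup described above, sketched out. The SDK package name `@elevenlabs/elevenlabs-js` is an assumption; check the current ElevenLabs docs for the exact package for your language.

```shell
# Install the ElevenLabs SDK and dotenv with pnpm
pnpm install @elevenlabs/elevenlabs-js
pnpm install dotenv

# .env — the client looks for this exact variable name:
# ELEVENLABS_API_KEY=your_key_here
```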
I'll explain all the settings, and I want to give a really simple example, because once you understand it, you can apply it to whatever project you want. To get started, we need to create an ElevenLabs client. Since we named our API key ELEVENLABS_API_KEY, the ElevenLabs client will automatically pick it up once dotenv loads the config. Now, to start transcribing, we need a file. There's a public file, which you can find in the docs, of Nicole talking. It's just a regular MP3 of someone speaking with a soft and whispery American accent. Once we have that MP3, we want to turn it into a blob, which stands for binary large object. We give it the MIME type audio/mp3, and it takes the buffer of raw bits and bytes and wraps it as a file-like object. To get a transcription from this file, we use the ElevenLabs client we created, call speech-to-text, and convert the file using the Scribe V2 model. We can output the result directly to the console. So let's see how this looks: it fetches the file, calls the API with it, and here's the whole transcription object that we get back. Notice there are multiple things in it. First you get the language code, which is English here, with a probability saying the model is pretty certain this is English. And the transcription is correct: "With a soft and whispery American accent, I'm the ideal choice for creating ASMR content, meditative guides, or adding an intimate feel to your narrative projects." This is very cool. But you also get all the words broken down for you in a nice array. It tells you the start of each word: we have "with," then a space, then "a," then a space, then "soft." It tells you exactly the timestamp when each word started and finished, the type of the word, and the text.
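The fetch-blob-transcribe flow above can be sketched with Node 18+ built-ins (fetch, Blob, FormData) talking to the REST endpoint directly rather than through the SDK wrapper. The sample audio URL and the "scribe_v2" model ID are assumptions taken from the video; verify both against the ElevenLabs docs.

```typescript
// Sketch of the flow described above, using only Node 18+ built-ins.
// Assumptions: the sample file URL, the "scribe_v2" model ID, and the
// /v1/speech-to-text endpoint shape — check the current API reference.

const AUDIO_URL =
  "https://storage.googleapis.com/eleven-public-cdn/audio/marketing/nicole.mp3";

// Wrap raw bytes in a Blob (binary large object) with an audio MIME type,
// so the API receives it as a file.
function toAudioBlob(bytes: ArrayBuffer): Blob {
  return new Blob([bytes], { type: "audio/mp3" });
}

async function transcribe(): Promise<unknown> {
  // Download the public MP3 and turn it into a Blob.
  const res = await fetch(AUDIO_URL);
  const blob = toAudioBlob(await res.arrayBuffer());

  // Build the multipart request: the file plus the model ID.
  const form = new FormData();
  form.append("file", blob, "nicole.mp3");
  form.append("model_id", "scribe_v2"); // assumed model ID

  // The API key travels in the xi-api-key header, never in client code.
  const apiRes = await fetch("https://api.elevenlabs.io/v1/speech-to-text", {
    method: "POST",
    headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "" },
    body: form,
  });
  // Response includes language_code, language_probability, text, and words.
  return apiRes.json();
}
```

The SDK does the same thing with less ceremony; this version just makes the file-to-blob-to-request steps explicit.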
Now, while I was recording, I was curious what log prob is. It's the log of the probability with which the word was predicted, so basically how confident ElevenLabs is that it predicted this word correctly, and zero is the highest. If we look at our data, it looks like all of these were predicted with a very high probability. You can extract just the text by reading the text property, and that way you get a clean text output just like this. There's also quite a handful of other features ElevenLabs has. For example, you have key terms, you have entity detection, multi-channel, and diarization is somewhere up here. Yeah, diarize is just a true or false; by setting this flag, you get diarization for up to 48 speakers. Entity detection can take either a string, so we can just pass PII for people's personally identifiable information, or you can pass an array of the different types of entities you want to detect, for example offensive language, personal information, and other types. And that's the basics of Scribe V2. It's a very simple API and really powerful once you put it into your applications, like I did in the app that I showcased in the introduction. That was an Astro application, which is my favorite web framework. This application was server-side rendered, so I made sure my API call runs on the server and isn't exposed to the client, because you don't want your API keys in the client. And this is the speech-to-text endpoint, and you can see it does the exact same thing here. We create a client and we convert. We detect personal information like phone numbers. We use the Scribe V2 model. It's the exact same code we had; all I did here was add a bunch of checks to make sure everything works nicely. Now, on the front end, I tap to record. But remember, we're taking in a file, so I take the recording we just created, turn it into a blob, a binary large object, and pass it to the speech-to-text API back end.
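The options mentioned above can be collected into a request builder. The field names below (diarize, keyterms, entity_detection) are assumptions based on what the video shows on screen; verify the exact names and accepted values in the ElevenLabs speech-to-text API reference.

```typescript
// Hedged sketch of the request options discussed above.
// All field names are assumptions — check the official API reference.
interface ScribeOptions {
  model_id: string;
  diarize?: boolean; // enables speaker diarization for up to 48 speakers
  keyterms?: string[]; // bias recognition toward specific terms, e.g. brand names
  entity_detection?: string | string[]; // e.g. "pii", or a list of entity types
}

// Build the multipart form for the speech-to-text request.
function buildForm(file: Blob, opts: ScribeOptions): FormData {
  const form = new FormData();
  form.append("file", file, "audio.mp3");
  form.append("model_id", opts.model_id);
  if (opts.diarize !== undefined) {
    form.append("diarize", String(opts.diarize));
  }
  for (const term of opts.keyterms ?? []) {
    form.append("keyterms", term);
  }
  const entities = opts.entity_detection;
  if (entities !== undefined) {
    for (const e of Array.isArray(entities) ? entities : [entities]) {
      form.append("entity_detection", e);
    }
  }
  return form;
}
```

As a side note on log prob: since it is the natural log of the word's probability, a value of 0 means probability 1, and values close to 0 mean the model was very confident.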
And it gives me the transcription back. Then I display it nicely in this list and, whenever entities are detected, extract them; you get a list of entities back. So let me show you that as well. If you're trying to identify entities, I have this console log of the transcription's entities, and these are the detected entities, which in this case was the phone number. On the front end, I take that, convert it to numbers, and you can have a cool-looking application like this. The ideas for what you can do with this are endless. So if you're looking to get started with transcription, I recommend going to elevenlabs.io, creating an account, getting an API key, and starting to build with Scribe V2.
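The browser-side half of the flow above can be sketched like this: recorded audio chunks become a Blob, which is POSTed to the server-side endpoint. The endpoint path "/api/speech-to-text" is illustrative, not taken from the source; in the real app, a MediaRecorder's ondataavailable handler would supply the chunks.

```typescript
// Browser-side sketch of the tap-to-record flow described above.
// The endpoint path "/api/speech-to-text" is a hypothetical example.

// Combine recorded chunks into one audio Blob the backend can accept.
function chunksToBlob(chunks: BlobPart[]): Blob {
  return new Blob(chunks, { type: "audio/webm" });
}

// POST the recording to the server-side endpoint, which holds the API key
// and forwards the file to ElevenLabs, then returns the transcription.
async function sendToBackend(recording: Blob): Promise<unknown> {
  const body = new FormData();
  body.append("file", recording, "recording.webm");
  const res = await fetch("/api/speech-to-text", { method: "POST", body });
  return res.json();
}
```

Keeping the ElevenLabs call on the server is the important design choice here: the browser only ever talks to your own endpoint, so the API key never reaches the client.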