Exploring AI Voice Synthesis: Features and APIs
Learn about AI voice synthesis, its applications, and how to use APIs like ElevenLabs, OpenAI, Google Cloud, and PlayHT for creating natural-sounding voices.
Voice generation and synthesis: text-to-speech using the ElevenLabs, Google Cloud, OpenAI, and PlayHT APIs

Speaker 1: Well, it turns out that one of the most popular topics in generative AI today is voice generation: text-to-speech, voice synthesis, spoken text, whatever you want to call it. We have already passed the threshold where voices stopped sounding robotic and started to feel natural, like a real person speaking to you. Voice synthesis was already used in scenarios like GPS directions or asking a virtual assistant for things, but there was a barrier that kept it from being fully accepted, and it was precisely that robotic sound. Generative AI brought us to the moment where we can talk to these systems in real time, and that is why so many industries are trying to adopt this kind of technology. For example, you have probably seen ads for call centers and customer service where someone calls in, an artificial intelligence takes the call in real time, and the caller does not even realize they are talking to a bot. Or audiobooks: before, someone had to narrate them, and that will still happen, but there are already voices so realistic that you cannot tell the speaker is AI-generated. Many of our modern interactions are going to happen through these kinds of voices. So in this video I am going to do an analysis and share it with you: all these companies that offer voice synthesis as a service, and how to use their corresponding APIs, seen from the perspective of programmers who, in the end, want to automate certain processes. If you want the quick code, or to go deeper into the analysis, I leave my Medium article in the description where I explain everything. And know that this is not just any analysis: I am very interested in voice synthesis because I am integrating it into my own application, which is simply impressive once you have conversations with realistic voices, and it helps a lot for learning English. That is why I now understand well how voice synthesis works and what the alternatives are. The application sponsors this video, although it does not really sponsor anything because it is free; check it out if you want a demonstration of the audio quality I am talking about. And let's get to the video, I'll explain everything. In this notebook I have a small abstraction over all the APIs, so I can basically call the same method and just change the provider: ElevenLabs, OpenAI, Google, and PlayHT, which are the ones we are going to see today. The idea is to have everything one click away and evaluate how efficiently each one synthesizes the voice, and this test text also shows which ones can speak Spanish and which cannot. So I am going to click on the ElevenLabs one. Let's see how long it takes: about 4 seconds. And this is how it sounds.
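A rough sketch of the one-method abstraction described above might look like this; the function and helper names are illustrative stubs, not the notebook's actual code (each provider's real call appears later in the video):

```python
# A rough sketch of the notebook abstraction: call the same function and
# just change the provider. The per-provider helpers are stubs here; the
# real SDK calls are shown later in the walkthrough.
import time
from typing import Callable, Dict

def elevenlabs_tts(text: str) -> bytes: ...   # wraps the elevenlabs SDK
def openai_tts(text: str) -> bytes: ...       # wraps the openai SDK
def google_tts(text: str) -> bytes: ...       # wraps google-cloud-texttospeech
def playht_tts(text: str) -> bytes: ...       # wraps the pyht client

PROVIDERS: Dict[str, Callable[[str], bytes]] = {
    "elevenlabs": elevenlabs_tts,
    "openai": openai_tts,
    "google": google_tts,
    "playht": playht_tts,
}

def synthesize(provider: str, text: str) -> bytes:
    """Send the same text to whichever provider is selected, timing the call."""
    start = time.time()
    audio = PROVIDERS[provider](text)
    print(f"{provider} took {time.time() - start:.1f} s")
    return audio
```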

Speaker 2: This voice is from ElevenLabs, let's see if I can speak Spanish. El ratón es gracioso, la llama se apagó, pero yo vivo. ("The mouse is funny, the flame went out, but I am alive.")

Speaker 1: Well, yes, it has perfect Spanish and perfect English. And the sentence actually has a trick to it: it tests whether the voice can say the R's without turning them into L's, and whether it pronounces the double L well, because some voices say "lama" instead of "llama". Then the accents: it is not "apago", it is "apagó". And the same with the V in "vivo": in English that would be a hard V, but in Spanish it sounds softer, closer to a B. We'll run the same test on every provider; let's see. And of course, I have to mention that there are many configuration options. This is just the most basic setup, selecting a voice of roughly the same quality for each provider. The one from OpenAI sounds like this. It takes about 3 seconds.

Speaker 3: This voice is from OpenAI, let's see if I can speak Spanish. El ratón es gracioso, la llama se apagó, pero yo vivo.

Speaker 1: This is very good, honestly. We continue with Google Cloud. The voice comes back very fast.

Speaker 4: This voice is from Google Cloud, let's see if I can speak Spanish. El ratón es gracioso, la llama se apagó, pero yo vivo.

Speaker 1: And well, this one doesn't know how to speak Spanish. But look at the response time: incredible, about ten times faster than OpenAI. And the pure English voice, at least, is very good. And here we have PlayHT.

Speaker 5: This voice is from PlayHT. Let's see if I can speak Spanish. El ratón es gracioso, la llama se apagó, pero yo vivo.

Speaker 1: It also doesn't know how to speak Spanish. The response time was decent, but this company has some surprises too, and we're going to see them shortly. Well, let's start with the most popular one, which is ElevenLabs. You've probably already seen videos of statues that speak, with voices like this.

Speaker 3: The soul is tainted by the color of your thoughts.

Speaker 1: Videos of philosophy and motivation abound on TikTok, and most of them are made with ElevenLabs. You have to distinguish between two things: the platform, aimed at content creators, which we'll look at once I sign in, and the API, which is what we integrate into our own systems to get this kind of voice. And well, let's hear a test. This one is the sensual voice, the one that whispers.

Speaker 6: Your voice is sensual.

Speaker 1: It works for meditation and things like that. And well, I'm going to call it from the API. Once you've signed in, you'll find this interface where you can generate your text and download the audio, selecting whichever voice you like most. It goes beyond just generating audio, because you can create voices, clone your own voice, or dub your own videos, which is very good because it maintains consistency: it preserves how an audio of yours would sound across multiple languages. But we came here to talk about the API and voice generation. So we go to this section where you'll find your API key; this is the one you'll use in the Python library to make the connections and create a client that can send requests to ElevenLabs. You'll find it under Profile and API Key. Then, to install the library, we click here, and this is the Python library that will guide us. It redirects us to GitHub, which basically comes down to this instruction: pip install elevenlabs. After that, following the instructions should be very simple: just set your access key and start calling the different APIs, as we'll see in this example. It's a Jupyter notebook, as you probably already know, nothing unusual: just import the library and put in your key. Look, I already executed this cell earlier and then replaced the key with x's so it doesn't show up in the video, but the client already holds my real credentials, so this is going to work; put in yours. You can see, for example, all the existing voices with this call. It would be a good idea to save that to a CSV file so you can see exactly what each type of voice offers. For now, I'm going to try this one, called Mimi: a young woman's voice, like for children, with a somewhat Swedish-accented English. All the voices come with a description like that. So if we execute, this is basically the line that does the work: from the same client you call convert, passing the text you want, the ID of the voice, and a model. I almost forgot about the models: apparently the default is this monolingual one, but since we want it to speak Spanish and English in the same line, use this other one, multilingual version 2. What comes back is the audio, which we can gather into a bytes object, and having the bytes we can save an MP3 file if we want. Or there's this utility function in notebooks that renders an audio player so we can press play. So calling the function looks something like this.
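A minimal sketch of that call, assuming a recent version of the elevenlabs SDK (method names have changed across releases, so check the GitHub README for yours); the voice ID and sample text are placeholders:

```python
# Minimal ElevenLabs sketch; assumes a recent elevenlabs SDK release.
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")  # key from Profile > API Key

# List every available voice; dumping this to a CSV makes browsing easier.
voices = client.voices.get_all()

# The line that does the work: the text, the voice ID, and a model.
# Multilingual v2 handles Spanish and English in the same sentence.
audio = client.text_to_speech.convert(
    voice_id="SOME_VOICE_ID",              # placeholder, not Mimi's real ID
    model_id="eleven_multilingual_v2",
    text="Hola, sí hablo español. La llama está viva.",
)

# The SDK returns the audio as a stream of byte chunks; join and save as MP3.
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```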

Speaker 7: Hola, sí hablo español. La llama está viva. ("Hello, yes I do speak Spanish. The llama is alive.")

Speaker 1: Something worth mentioning is that here we can pass voice settings and hear how it takes on different tones, so if you don't like the default, you can look for the one you like best. And continuing with OpenAI: well, here we have it. Maybe not everyone knows that OpenAI sells voices too, voice synthesis, because it is well hidden; it doesn't seem to be their star product, although it is very good. If we go to the pricing page, we'll see there are some audio models: one to recognize audio, and another to synthesize text to speech. And searching the documentation, we find this Text to Speech section. Here we're going to see how it works.

Speaker 3: Here is the example.

Speaker 1: In the end it uses the same library as ChatGPT, so if you've used that before, you probably already have this installed, and all you have to do is change the reference: instead of client.chat, it will be client.audio. That simple. Let's see, just one more thing: once you sign in, it asks whether you want to use ChatGPT or the API. The API side is where we find the documentation we were just looking at, and also where we get our key. It offers to create a new one; you just give it an alias. I already created mine. I'm going to revoke this one at the end of the video, but it will do for now, so you can see how it works. Well, it is simple: you just pass the model, a voice, and the input, which is the text you want it to speak. Then you execute this line, and with this you read the bytes. So, executing it...
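A minimal sketch of the call just described, using the official openai library; the model and voice names come from OpenAI's public text-to-speech docs, not from the video:

```python
# Minimal OpenAI text-to-speech sketch (pip install openai).
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

# Same client as the chat API; only the namespace changes, to client.audio.
response = client.audio.speech.create(
    model="tts-1",    # "tts-1-hd" is the higher-quality, pricier model
    voice="alloy",    # one of the built-in voices
    input="Esta voz es de OpenAI. El ratón es gracioso.",
)

# The response body is the audio itself; write the bytes to an MP3 file.
with open("output.mp3", "wb") as f:
    f.write(response.content)
```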

Speaker 3: And well, let's see. Well, not bad.

Speaker 1: Not bad. Remember to install it: pip install openai. And if this seems very advanced, I have some older tutorials on the channel so you can pick up a little Python, because this should feel routine. And if not, write to me and tell me it's too advanced and you'd like something more basic, and we'll see what I do. Okay. If you haven't seen videos on this channel before, you should know that Google Cloud is the cloud I like the most; I know others, but this is the only one I truly master, and I never stop getting more out of it. Trying its voices is also interesting for this analysis, because I'll tell you up front: I don't think it has the best voice, but for the response time and the cost it fits me like a glove. Because, let's be clear, the most expensive part of all this is the voice generation itself.

Speaker 6: But let's see.

Speaker 1: That was a British voice. Installing the library: pip install it, and then here is how we import it, simply with this instruction. From there it's the same story, except that in Google, and in clouds in general, we authenticate with a service account; this is one of the easier parts, so I hope you already know how to create one. After that, a few more instructions, because you have to pass some objects to the final call, which would be client.synthesize_speech (there's a sketch of this call below). And I didn't mention it earlier, but the fact that a voice doesn't speak Spanish and English in the same line doesn't mean the provider has no Spanish voices; there are Spanish voices, English voices, Portuguese voices, any language that comes to mind. Now, PlayHT gives me 25% more free characters than ElevenLabs, which is why it caught my attention, and the quality is very good. One of the things they are focusing on is real-time communication, and for that you have to generate voice that is ultra light and fast while still sounding very good. That's why I have PlayHT on my radar, although at this point I'm not fully exploiting its features. Once you register, go to the developers section, where you'll see your user ID; copy it, along with the secret key, which you can create right there. Then check the documentation. It will suggest installing this library, which is the client that lets you connect to PlayHT, and it goes something like this. Once you have it installed, open your Jupyter notebook, or a plain script if you prefer, but I highly recommend using the library because it lets you stream the audio, which is basically this: you select a voice, and the voice list you can get by consulting the voices API. For example, here I saved it to a CSV file, which looks like this, and there you'll see all the voices you can use, as long as the voice is version 2, because those are the best voices and the ones that are streamable. So, simple and easy: you create a bytes object, iterate over what the API returns, and then you have the audio. That's the synchronous way to do it (sketch below).
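First, a minimal sketch of the Google Cloud call, assuming a service account is already configured through the GOOGLE_APPLICATION_CREDENTIALS environment variable; the voice name here is a placeholder:

```python
# Minimal Google Cloud sketch (pip install google-cloud-texttospeech).
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# The final call takes three objects: input text, voice selection, audio config.
synthesis_input = texttospeech.SynthesisInput(
    text="This voice is from Google Cloud."
)
voice = texttospeech.VoiceSelectionParams(
    language_code="en-GB",      # a British voice, as in the demo
    name="en-GB-Neural2-B",     # placeholder; list the voices to pick your own
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("output.mp3", "wb") as f:
    f.write(response.audio_content)
```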
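And a minimal sketch of the PlayHT flow, assuming the pyht client library; the voice manifest URL is a placeholder, so pull real ones from the voices API and stick to the version 2 voices, which are the streamable ones:

```python
# Minimal PlayHT streaming sketch (pip install pyht).
from pyht import Client
from pyht.client import TTSOptions

client = Client(user_id="YOUR_USER_ID", api_key="YOUR_SECRET_KEY")

options = TTSOptions(
    voice="s3://voice-cloning-zero-shot/EXAMPLE/manifest.json"  # placeholder
)

# tts() streams the audio back chunk by chunk; collect the bytes here, or
# feed each chunk straight to a player for real-time playback.
audio = b""
for chunk in client.tts("This voice is from PlayHT.", options):
    audio += chunk

with open("output.wav", "wb") as f:
    f.write(audio)
```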

Speaker 7: And it sounds like this.

Speaker 1: Remember that the code for all of this is in the description. Final conclusions, then: which is the best model? It depends on whether we want the best price or a balance. If we're after the best quality, the best audio, I think ElevenLabs has it, but it is not the option we would choose to wire into an application, because it is very expensive. As I told you at the beginning, you can dig into the article to see the costs in more detail, but in summary we have two models from OpenAI, two from Google, one from ElevenLabs, and one from PlayHT. As you'll notice, the one that gives us the most characters for the money is OpenAI with its standard model, and the quality is also very good. In second place are the Google voices, which are the ones I'm using right now, for one simple reason: OpenAI gives nothing away for free, while Google gives a million free characters for this type of voice. I'm also using the Studio voices, which are more expensive. ElevenLabs and PlayHT simply don't compare on price, because those two products, although they have good voices and we can connect to them, are made more for an end user: someone who edits videos on YouTube, who automates videos on TikTok, who creates quality content. That's who they are for. These are the providers I analyzed; tell me in the comments if you know of any others. And now, a demonstration of the voices I'm using. Most are from Google Cloud, but in the end I also included some from OpenAI and from ElevenLabs; those were high-quality voices that are reusable across all users, meaning I only paid for them once. When you see word examples in the app, these are the audios you'll hear. Example. That was a high-quality one. The normal ones I use in conversations; those are generated per user, because each person talks about whatever they want and there's no way to reuse them. Example.

Speaker 6: This is a fish that can generate electricity. It uses this electricity to hunt prey or to defend itself from predators. So now you know.

Speaker 1: If you're interested in trying this application, because I firmly believe it is more useful than Duolingo, you can find the project's social channels in the description. Follow me as well: here we talk about Google Cloud, about technology, about data. And see you next time, friends. One more thing: also follow me on social networks for more similar content.
