Speaker 1: Hey and welcome. Today I'm going to share with you my voice translator. This was an app that was surprisingly easy to build, but it's kind of mind-blowing. What it does is: I can record myself speaking in English and it will translate it into different languages. And the mind-blowing thing is that it will use my own voice to generate my speech in these other languages. So let's record something as an example. "Hello, good morning, my name is Misra and I'm excited to share this app that I made with everyone else." I stop it and then submit it, and after around 20-30 seconds I have it, in this example, in six different languages. So let's try the Turkish one: (in Turkish) "Hello, good morning, my name is Misra. I will share this app that I made with everyone." Or in Russian: "Hello. Good morning. My name is Misra and I'm glad to share this app that I made with everyone else." Or in Japanese: "Hello. Good morning. My name is Misra. I'm looking forward to sharing this app that I made with other people." I don't know, it's honestly kind of eerie to hear yourself speak languages that you cannot actually speak, but it was super exciting to build and play with, so let's get started and I'll show you how you can build this app. This app is really easy to build with Gradio. I'm going to use three technologies in total (not Gradio components, three technologies). The first one is AssemblyAI, to transcribe my English speech into text. Then I'm going to use the translate module in Python to translate it into any language that I want. And then I'm going to use ElevenLabs to take that text, in whatever language it is, and turn it into audio generated with my own voice. The first thing we're going to do is build a Gradio layout. The first layout that I showed you is a little bit more complicated, even though it has the same functionality.
So I will show you how to build this so that we don't have to get into a lot of Gradio details, but can just build everything together. Again, here I just record my voice and then I can get the generated audio in Spanish, Japanese and Turkish. Both versions will be on the GitHub repository, so whether you want this simplified version or the more complex interface, you can find both of them there; the link is in the description below. So let's start by importing Gradio. Now, I'm not a Gradio expert, but as far as I understand there are two ways to build Gradio apps. One of them is to use the Gradio Interface, which is a simpler version in which things are already connected to each other; you just need to specify what your input is going to be and what your output is going to be. But if you want to customize your layout a bit more, maybe group components together, using Blocks would be a better solution. Like I said, to not go into the details of Gradio too much, I'm just going to show you how to build with the Interface option. You give a name to your interface, and when you call launch on it, it starts your app and you can access it locally. All right, so what we're going to do is define some functions that will hold all the functionality. There will be a voice-to-voice function, which will be our main function. Then I will have an audio transcription function, in which I'm going to call AssemblyAI. Then we're going to have a translation function, let's call it text translation. And finally a text-to-speech function, in which we're going to call ElevenLabs. So basically these are the three components, or APIs, that I'm going to use, and this is where I'm going to call all of them from. I will just return true from these for now, and then I can show you what the interface looks like.
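Sketched out, the skeleton described above might look like this. The function names follow what the video describes, but the exact signatures are my guess; each stub just returns true for now:

```python
def audio_transcription(audio_file):
    # Will call AssemblyAI to transcribe the recording (filled in later)
    return True

def text_translation(text):
    # Will call Python's translate module (filled in later)
    return True

def text_to_speech(text):
    # Will call ElevenLabs to synthesize the cloned voice (filled in later)
    return True

def voice_to_voice(audio_file):
    # The main function wired into the Gradio interface; it will chain the
    # three helpers above once they are implemented
    transcription = audio_transcription(audio_file)
    return True
```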
All right, so inside the Interface what you need to specify is the function you're going to call, and then your inputs and your outputs. Our outputs are easy: you can just pass it a list, so we're going to create three output audio components, and their labels are going to be the languages we're going to use. So I can add maybe Spanish first, then another audio component which I will label as Turkish, and another one that I will label as Japanese. For the input we can create an audio input component. Again, this is going to be an Audio component. For the sources, there are multiple sources you can pass, like upload and microphone, but I only want users to record audio, so I will use microphone only. For the type it returns, I want it to return a file path (I misspoke at first; it should be file path, not file name), so it passes me the path of the audio file that is created after I record my voice. There are some other things you can specify, like show download button and show share button, but I don't really want to set those right now; the default values are good enough for me. And this audio input will be my input. All right, so if we run this we can quickly take a look at what the interface looks like. Let's save it first. There are two ways you can run Gradio apps. If you run python and then the name of your app, it will run once, but if you make changes you will not see those changes by refreshing your app. Instead, if you run gradio and then the name of your app, it will be updated every time you make changes and save; you can just refresh your browser and you will see the new version. So let's run it like that. Some errors, but we'll figure those out; for now, let's just see what it looks like. Allow the microphone. So it builds the whole interface. That's all we're going to do. You can record yourself here.
After you record yourself, you can press stop and then you will be able to see your waveform. You can either clear it or submit it. All right, so now that the interface is done, let's fill it in with all the functionality we want. Let's start with AssemblyAI for audio transcription. After I get the file path, it's going to be passed to the voice-to-voice function, so I can accept it here as the audio file. Then, instead of return true, I will call the audio transcription function to transcribe the audio. I'm going to pass this audio file directly there, and as a result I'm going to get a transcription response, let's call it that. Then we can fill in the audio transcription function. In here I'm going to call AssemblyAI to do the transcription. So let's import assemblyai as aai. Of course, now that we're importing these, make sure that you are installing them as you go along. Then I will set my API key; this is where you specify your API key with AssemblyAI. I create a transcriber, an AssemblyAI Transcriber, then I can call the transcribe function on this transcriber, passing it the audio file, and as a result I get a transcription back, which I can return. That's all I have to do. The transcription that I'm going to return is not going to be just the text; it's going to be the whole response that I get from AssemblyAI. The reason I want to do that is because I want to be able to see, in the main function, whether this transcription errored out or not. So let's go ahead and check that. To see how to do it, I can just go to the AssemblyAI docs, under speech recognition, and it will tell me how to catch errors. So it's like this here.
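Putting the transcription call and the docs' error-handling pattern together, a sketch might look like this. The imports are kept inside the functions just so the snippet stands alone, and the API key is a placeholder you'd replace with your own:

```python
def audio_transcription(audio_file):
    import assemblyai as aai  # pip install assemblyai

    aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"  # replace with your key
    transcriber = aai.Transcriber()
    # transcribe() blocks until the job finishes; return the full response
    # object (not just the text) so the caller can inspect its status
    return transcriber.transcribe(audio_file)

def voice_to_voice(audio_file):
    import assemblyai as aai
    import gradio as gr  # pip install gradio

    transcription_response = audio_transcription(audio_file)
    # A finished job is either errored or completed; if it failed, surface
    # AssemblyAI's error message in the UI as a Gradio error
    if transcription_response.status == aai.TranscriptStatus.error:
        raise gr.Error(transcription_response.error)
    return transcription_response.text  # translation and TTS come next
```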
We can check the transcription response's status, and if that status is AssemblyAI's error status, I can raise a Gradio error and just show the error message that is returned from AssemblyAI. If that is not the case, I can just get the text. I know that for this function to have finished, the transcription either needs to have errored out or completed; there are no other states this transcription response can be in. The text is going to be in the response's text attribute. One thing to note here: when you use the Python SDK of AssemblyAI, the code will hold here and wait for the transcription to finish running. All right, so we have the text in English and we want to translate it into other languages. As the next step I'm going to call the text translation function, pass this English text in, and as a result get the translation. How I'm going to do that is by using Python's translate module. I'll quickly show you the documentation; here it is, the official documentation. They have multiple providers. I've been using the free version and it's been working well enough, I think. Of course it could be a bit more context-aware at times, but overall, just for personal use, I think it's good enough. The free one is called MyMemory; that's the default option. If you want a paid option, for example Microsoft Translator, you can also switch to it within the translate module, and then you pay for it and you'll probably get a better translation, but I haven't tested it, so I don't know. So let's do from translate import Translator. Then, here, remove the return true. I'm going to create a translator. I need to specify from which language, which is English,
and to which language, which is Spanish. You just need to know the two-letter codes for the languages you want to translate from and to. I will call this the Spanish translator. As you can see here, if you want this voice-to-voice translator to go from a different language, something other than English, to another language, you can also do that; it's very easy to set up here. If you want to make it even more sophisticated, you could add some parameters and maybe let the user choose which language they want to translate from. These are all options. So then I call the Spanish translator and pass the text to it; I could just call the parameter text and pass it here, and then I will have created the Spanish text, which I can return. Oh wait, I forgot something here: on the translator I need to call the translate function. All right, so this does Spanish. Just to do it quickly, without adding too many complications, kind of hacking a solution together, I will add a Turkish translator by literally renaming things, and then a Japanese translator. You can do this in a much more efficient and nicer way, of course, but let's just get this working; we can worry about optimizations later. All right, so now I'm returning three different translations. I can also name them here: Spanish translation, Turkish translation, and Japanese translation. Now I have the texts. The last thing to do is to generate the audio with my voice in these languages, and for that it's honestly quite easy. All we have to do is go to ElevenLabs and get the required code. So let's go to ElevenLabs; I will go to their docs. They have really nice instructions: you need to install elevenlabs, you need to install python-dotenv, and then of course you need to have an ElevenLabs API key, and here's how you find it. Once you've created an account, you can go to My Account, Profile and API key, and you will be able to see your API key there.
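The quick-and-dirty translation step described above could be sketched like this, three near-identical translators exactly as in the video; the `translate` package's free MyMemory provider is the default, and the parameter name is mine:

```python
def text_translation(english_text):
    from translate import Translator  # pip install translate

    # Two-letter language codes: en -> es / tr / ja. To translate from a
    # different source language, just change from_lang.
    es_translator = Translator(from_lang="en", to_lang="es")
    tr_translator = Translator(from_lang="en", to_lang="tr")
    ja_translator = Translator(from_lang="en", to_lang="ja")

    es_translation = es_translator.translate(english_text)
    tr_translation = tr_translator.translate(english_text)
    ja_translation = ja_translator.translate(english_text)

    return es_translation, tr_translation, ja_translation
```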
Actually, before I show you the code, I will mention how to create your own voice. When you're on your ElevenLabs dashboard, after you've created an account like I showed you, you can get your API key, but if you want to render the translated text in your own voice, you first have to clone your voice. So you can go to Voices on your dashboard. As you can see, I already have my voice created. There are a bunch of different options. You can design a voice from scratch, or you can do instant voice cloning, which only needs one minute of audio of you speaking to clone your voice. There is also a big voice library: if you don't want to clone your own voice and just want a voice, you can choose one from there. Or there is professional voice cloning. That one is a paid feature, so you need to subscribe to ElevenLabs, but I subscribed once, I think I paid $25, and I got a very generous quota that I can experiment with, so I think it's worth it just to try this out. For professional voice cloning, all you have to do is provide at least 30 minutes of audio of you speaking. Of course, this is easy for me because I've been making videos; I have a lot of very long videos of me speaking. I only provided around 33 minutes, and those are the results that you saw at the beginning of this video, so it's pretty good: with only 30 minutes of audio you can get a very close clone. The upper limit, or the optimum they suggest, is around three hours; if you upload three hours of audio of you speaking, you will probably get a nearly flawless clone. So that's what you have to do before you start coding this section. Now let's go ahead and copy the code. You can specify your ElevenLabs API key as an environment variable, and then you pass it as part of the client that you're creating. I need to install and import elevenlabs. Let's see how to import it: from elevenlabs, VoiceSettings; from the elevenlabs client, ElevenLabs, okay.
So we can just copy and paste it here. I'll just remove this and specify my ElevenLabs API key here manually. Then let's copy this code into the text-to-speech function. All right, so what I need to pass it is the text, right? I'll already call this function here, text to speech, and pass it the Spanish translation, and I can just call the parameter text here too. Let's go line by line. On the client we call the convert function of text to speech. We need to specify the voice ID. By default they select Adam, but I want to select my own voice, so I can go figure out what the ID of that voice is. It's in your dashboard; if you click it, it will be copied, so you can just change this ID. I will leave all the other settings the same here. The model ID says eleven_turbo_v2, but I want the multilingual model so that I can use other languages. As for the voice settings, you can experiment with them in the speech section. For example, I could type "good morning, how are you" and it will generate that in my own voice, and then you can change the stability, similarity and style exaggeration and see what kind of level you like. The levels I like so far are 0.5 (50%) for stability, 0.8 (80%) for similarity, and 0.5 (50%) for style exaggeration; those work well for me. Speaker boost boosts the similarity of the synthesized speech to the voice at the cost of some generation speed. These are the settings I used before and they worked for me, so I'm just going to keep them as they are. I don't want it to play the audio, so this will just create the voice. There are some things I didn't import; I also need to import uuid to save the generated audio. We don't need to change any of the rest of this code: it just saves the audio that is created into a file, this is the file path, and we can just return the file path. It exactly fits in with what we want to do.
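The copied-and-adjusted ElevenLabs snippet, with the settings mentioned above, would look something like this. The API key and voice ID are placeholders, and this mirrors the docs snippet the video copies, so double-check it against the current elevenlabs package:

```python
def text_to_speech(text):
    # Generate speech in the cloned voice and save it to an mp3 file,
    # returning the file path. The voice ID comes from your dashboard.
    import uuid
    from elevenlabs import VoiceSettings          # pip install elevenlabs
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(api_key="YOUR_ELEVENLABS_API_KEY")
    response = client.text_to_speech.convert(
        voice_id="YOUR_VOICE_ID",                 # your cloned voice's ID
        optimize_streaming_latency="0",
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_multilingual_v2",        # multilingual, not turbo
        voice_settings=VoiceSettings(
            stability=0.5,           # 50%
            similarity_boost=0.8,    # 80%
            style=0.5,               # 50% style exaggeration
            use_speaker_boost=True,  # more similarity, some speed cost
        ),
    )

    # Save the streamed audio chunks to a uniquely named file
    save_file_path = f"{uuid.uuid4()}.mp3"
    with open(save_file_path, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)
    return save_file_path
```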
All right, so we can go back here and say this is the Spanish audio path, the file path that was returned to us. But what I've seen is that if we pass this path directly to Gradio, it is not able to play it. So what we have to do is from pathlib import Path and convert this path into a pathlib Path before passing it to Gradio; just an extra step we have to do here. I just need to call Path here, and then I'll name the result path. Not great naming, but you know, I'm hacking things together here. I'm also just going to completely butcher it and call the same function three times with a different translation text each time, and then do the same thing here three times. All right, so I have taken my English voice, I have transcribed it using AssemblyAI, I have translated it using the translate module in Python, and finally I have taken these translations and generated audio from them in my own voice. And here are the file paths that point to those files. The only thing I have to do now is to return these paths: the Spanish path, Turkish path and Japanese path. Gradio is so simple: basically, because I'm returning these from the main function that I gave to the Interface, voice-to-voice, it will pass them one by one to these audio components.
Unless I made a mistake here, a typo or some sort of bug, it should work, so let's run it and see. Oh, of course, I haven't passed my API keys yet, so let me do that and then run this again. All right, I pasted my AssemblyAI and ElevenLabs API keys where they need to go, so let's record again: "Hello, it is a beautiful day today, but I'm a little bit cold because the AC in this room is blowing really hard." I stop it and listen to a quick bit: "Hello, it is a beautiful day today." Okay, it looks like my voice is clear, so let's submit and hope it works this time. All right, exciting. (In Spanish:) "Hello, today is a beautiful day, but I'm a little cold because the air conditioning in this room is blowing very hard." Nice. (In Turkish:) "Hello, today is a beautiful day, but I got a little cold because the air conditioning in this room..." It's really bad for your throat. (In Japanese:) "Hello. Today is a beautiful day, but the air conditioner in this room is blowing a lot, so it's a little cold." So that gives us an idea of what we're talking about. Yeah, so if you want a more sophisticated interface like the one I showed, where you see the audio, you can play the audio, you can download the audio, and you also see the text, that is possible with Gradio. You just need to customize it a little more, and it adds so much code that I didn't want this tutorial to be super clunky. But like I said, you can find all of this code on our GitHub repository for this tutorial. It's been very eerie to hear my voice in many different languages, saying things that I wouldn't even normally say, because I had my friends record things for me and then we generated them in my own voice. But I'm also curious what kind of apps you can build with this technology.
I've heard recently that Samsung started their own voice-to-voice translation: when you're having a phone call with someone who maybe speaks Chinese, they hear your voice in Chinese and you hear their voice in English, though I think it's not their actual voice, it's more like a generic AI-generated voice. I can also imagine using this when you want to send WhatsApp voice messages to your friends: you could record them in your friend's language, and I'm sure that would surprise them quite a bit. Or I can even imagine this being used for practice; for example, if it says "¿Cómo está su salud?" ("How is your health?"), I can try to say that in Spanish and see how close it is to what it's saying. Because it's also my voice, it's easier to imitate, maybe. If you have a good idea of how you would use this technology, leave a comment and let us know. I'd be very curious to hear it; maybe we can even make a tutorial out of your idea, I think that'd be really cool. I hope you liked this tutorial. If you want something similar, you can go check out Smitha's video where she teaches you how to build an AI voice bot. It's like a ChatGPT bot, but instead of texting with it, you can speak with it and it will speak back to you, which I think is quite interesting. She will teach you how to do that in just 20 minutes. Thanks for watching. I hope you had fun and I will see you in the next video.