Guide to Google Speech-to-Text API Setup in Python
Learn how to set up Google Speech-to-Text API in Python. From environment creation to using transcription models, this guide covers it all.
Google Cloud Speech-To-Text API With Python For Beginners
Added on 01/29/2025

Speaker 1: Hey, how's it going, guys? In today's video, I'm going to show you how to get started with the Google Speech-to-Text API in Python. With the Google Speech-to-Text API, you can convert speech to text, transcribe videos, and even recognize custom keywords. Now let's look at the agenda. For this lesson, I'm going to cover how to create a Python virtual environment first that is dedicated to this project. Then we activate the environment and install the Google Speech-to-Text API Python package. Then I'll show you how to set up a Google Cloud project, enable the Speech-to-Text API, and create a Google service account. Then I'll show you two examples of how to transcribe short and long audio files. Now let's dive into the tutorial. Before we do anything, I want to cover the pricing real quick. In terms of pricing, the first 60 minutes are free every month, and after the 60 minutes it's going to be $0.006 per 15 seconds. And that's all you need to know when it comes to Google Speech-to-Text API pricing.
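(As a quick worked example, assuming the $0.006 per 15 seconds rate quoted above: a 10-minute recording beyond the free tier is 40 fifteen-second increments, so roughly 40 × $0.006 = $0.24.)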

Speaker 2: Let's create the Python virtual environment. Right, so here in this Python venv folder or directory, these are all my Python virtual

Speaker 1: environments. You can create as many Python virtual environments as you want based on the number of projects. Now I'm going to launch my terminal, and I'll be using bash as my default terminal. To create a Python virtual environment, I'm going to type python -m, where -m stands for module, followed by the module that we're going to use to create a virtual environment, which is venv, followed by the virtual environment name. I'm going to name the environment speech-to-text-demo, enter. And once the virtual environment is created, I'm going to cd into the directory, and it's going to be cd speech-to-text-demo. Right, so here let me go into my project folder. For demonstration purposes, I have this 15-second audio file. Now if I play the audio, I don't know if you can hear it, since I don't have the system voice recording turned on. Alright, so if we look at the audio length, it's a 15-second audio file, and we're going to transcribe this file later on when we write the Python program. Alright, so let me close the file. Now to use the Python virtual environment, I need to activate the environment first, and to do that, we need to run the activate file in the Scripts directory. Alright, so here I'm going to type source. Because I'm using bash, I need to type the source command, but if I'm using PowerShell or just the regular command prompt, I can simply type the file path directly. Alright, so to run the activate file in bash, I'm going to type source, followed by the folder path, followed by the file name, enter. Then we can install the Speech-to-Text API Python package, and to install the package, I'm going to type pip install google-cloud-speech,

Speaker 2: enter. And once the Python package is installed, we can go ahead and move on to the next step.
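For reference, these are the commands dictated above, written out. The Scripts/activate path assumes a Windows-style virtual environment activated from bash, as in the video; on macOS or Linux the activate script lives under bin/ instead, and your folder paths will differ.

python -m venv speech-to-text-demo
cd speech-to-text-demo
source Scripts/activate
pip install google-cloud-speech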

Speaker 1: The next step is to create a Google Cloud project. Alright, so open your browser and navigate to console.cloud.google.com. Alright, so I'm going to just type the URL in the address bar and hit enter, and you should see a page that is similar to mine. If you don't have an account, simply sign up for an account; it's free. Note that some of the Google Cloud services may not be available until you enable billing on your account. Before we can use any Google API service, we need to create a project first. So here, from the dropdown, I'm going to click on New Project. Now I'll name the project Google Speech Demo and click Create. You can think of a project as an app: basically, within a Google Cloud project, we're going to add the different services that this project will be able to use. Right, so once we create a project, we need to select it as the active project, so I'm going to select Google Speech Demo. Now we're going to enable the Speech-to-Text API by clicking on the navigation menu, APIs & Services, then clicking on Library. In the search bar, I'm going to search for speech to text, enter. You should see a few options; we want to click on Cloud Speech-to-Text API. And here we want to enable the API by clicking on Enable. And here you should see the billing required window pop up. Because this is one of the APIs that requires billing to be enabled, I'm going to click on Enable Billing.

Speaker 2: Now I'll choose my billing account and set the account. Hmm, it looks like the loading window is not responding.

Speaker 1: Alright, so what I'm going to do is click on the navigation menu and go to Billing directly. Make sure that you select the correct project, then we're going to click on Link a Billing Account. Alright, so let me try again. Looks like I might have hit my limit on how many projects I can enable billing for. Alright, so let me close this, and I'm going to choose a different project. Assume I'm still using the Google Speech Demo project, even though I'll choose a different project here. Now going to APIs & Services, Library, then Speech-to-Text API, and make sure that you enable the Cloud Speech-to-Text API. Now I want to create a Google service account by clicking on the navigation menu, APIs & Services, then clicking on Credentials. A service account is a special type of Google account that belongs to your application or a virtual machine. The service account is used to authenticate your application when making API requests to Google Cloud Platform services. To create a service account, we want to click on Create Credentials, then choose Service Account. Now we need to give this account a name; I'm going to name the account speech demo, then click on Create and Continue. Now we need to assign a role to this account to limit what types of permissions the account has, and for demonstration purposes, I'm going to choose Owner. At this point, we're finished creating the account, so I'm going to click on Done. Now to authenticate the application, we need to download the service account key file. Alright, so under Service Accounts, I'm going to click on speech demo, which is the account I just created; we can find the account by looking at the Name column. Alright, so click on the account.

Speaker 2: At the top, I want to click on Keys.

Speaker 1: And here, click on Add Key, then click on Create New Key. Now we want to save the file as a JSON file, so make sure that you choose JSON as the key type, then click on Create.

Speaker 2: Alright, so I'm going to save the JSON file to my project directory.

Speaker 1: I'm going to paste the folder path, enter. And for the file name, I'm going to name the file sa, which stands for service account, underscore speech

Speaker 2: demo.

Speaker 1: Alright, so if I go back to my project folder, yes, I saved the JSON file in the wrong location. Let me move the file to my project directory. Now at this point, we can go ahead and write the Python programs or the Python scripts. Alright, so I'm going to create two files, demo one.py and demo two.py.

Speaker 2: Then I'll launch my code editor, and I'll be using VS code for this exercise.

Speaker 1: Alright, so I'm going to open the demo one.py file. Now the first thing I want to do here is import the libraries. Alright, so here I'm importing the io library to convert the audio file into binary streams. From google.oauth2, I'm going to import the service_account module. Then from google.cloud, I'm going to import the speech module. Alright, so here I'm going to construct the speech client object to connect to the speech endpoint. So here I'm going to grab the file path of the client JSON file, and I'll assign the file path to the client_file variable. Then, using the service_account.Credentials.from_service_account_file function, we can pass the client file path to create a credentials object. And when we create the speech client object, we can authenticate the connection, or the application, by providing the credentials object to the credentials parameter. I'll name the speech client object client. Now I want to upload the audio file, and I can do that by using the io module. So here I'm creating a variable to store the file path to the audio file. Then I'm going to insert a with statement to open the audio file using the io.open function, in read-binary mode, and I'll name the output f, which stands for file. Then we can load the content by using the f.read method, and I'll name the output content. Then we need to convert the content object to a format that the Google Speech-to-Text API will be able to recognize. In this case, we need to use the speech.RecognitionAudio class, and we'll pass the content object to the content parameter. I'll name the output audio. Now, one thing that is very important to know: when you are transcribing an audio file that is longer than one minute, or more than 60 seconds, you need to upload the file to Google Cloud Storage, and we'll load the audio file from Google Cloud Storage directly. Since my audio file is only 15 seconds long, I can transcribe it directly from my local drive. Now, we know there are different audio file formats: MP3, WAV, MP4, etc. Based on the audio file that you are using, in this case a WAV file, I need to manually configure the audio file properties. The first thing I'm going to configure is the encoding type. If you're using a different audio file type, you can go to the audio encoding reference, which I'll link in the description below; there's a list of supported encoding types, which you can go through on your own. For the sample rate, I'm going to set that to 44,100, since I'm using a WAV file. And the language code is optional. Now once we have the configuration settings and the audio file data, we can make the request call to transcribe the audio file. So here I'm going to reference the client object, then I'll use the recognize method, and I'll provide the configuration settings and the audio file's content. Now if I run line 10 all the way to line 21... oh, actually, I forgot to create the client object. Let's do that. So I'm going to run this code block right here to create a response object. Alright, so here I'm seeing that my client is not created.
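For reference, here is a minimal sketch of the short-audio script being built in this section, assuming the service account key was saved as sa_speech_demo.json and using a placeholder audio file name (audio_sample.wav); the encoding, sample rate, and language code below are assumptions you should adjust to match your own file.

import io

from google.oauth2 import service_account
from google.cloud import speech

# Authenticate with the downloaded service account key file (file name assumed).
client_file = 'sa_speech_demo.json'
credentials = service_account.Credentials.from_service_account_file(client_file)
client = speech.SpeechClient(credentials=credentials)

# Read the local audio file as bytes (only suitable for audio under 60 seconds).
audio_file = 'audio_sample.wav'  # placeholder file name
with io.open(audio_file, 'rb') as f:
    content = f.read()

audio = speech.RecognitionAudio(content=content)

# Configure the request; LINEAR16 is a common encoding for WAV files,
# and 44,100 Hz matches the sample rate mentioned in the video.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=44100,
    language_code='en-US',  # assumed; see the note about the language code above
)

# Synchronous recognition for short audio.
response = client.recognize(config=config, audio=audio)
print(response)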

Speaker 2: Alright, so let's take a look.

Speaker 1: And here I'm getting a file not found error, and that's because I forgot to include the file extension. Let me try again. So I'm going to run this entire code block.

Speaker 2: Alright, so it looks like the transcription is finished. Now if I print the response object. Oh, I have a typo.

Speaker 1: And here's the output that is returned from the response object. So here we have the results key, which contains the transcription results that relate to this API call. Then we find alternatives, followed by transcript, and here's the transcription from the audio file. So from the audio file, the text returned is: Google Cloud Speech-to-Text API is a service provided by Google Cloud that allows developers to convert audio to text using machine learning technology. And this next part is a bit unclear: it says it can transcribe your time or pre-recorded in over 120 languages. From this transcription, we have a confidence value of almost 90%, and the result end time is when the audio ends. For this API call, I will have used 15 seconds of usage. Now I want to retrieve the transcript. From the response object, we need to reference the results attribute, then alternatives. And this should be results.

Speaker 3: So what's going on here?

Speaker 1: Oh, I see. So when I print response.results, it's going to return a list, because there can be multiple transcriptions involved. In that case, we need to reference the element's index, and it's going to be zero, the first element. Now if I print response.results[0].alternatives, it's going to return a list of transcriptions, but here we can simply reference the transcript attribute to get the text directly. And this should be transcript. Right, so let's see. Actually, this should be alternatives, so we can't just bypass the results attribute. So let's do this: I'm going to insert a loop. For result in response.results, I want to print the result's alternatives, referencing the first element, which is going to be the transcription, then the transcript. Now if I run this loop, it's going to return the transcript text.

Now with the Google Speech-to-Text API, there are actually different models that we can use to make our transcription a little more accurate. Alright, so let me paste the link, which I'll also include in the description below. Alright, so if I scroll down to transcription models, here's a list of models that we can use based on the environment, or the use case, that your audio file comes from. For this transcription, I want to use the video model, which is best suited for audio from video clips or other sources, such as podcasts, that have multiple speakers. Now going back to the script, I want to go back to the RecognitionConfig class, and here I want to add a new parameter called model, and I'll insert the model ID I want to use. In this case, I want to use the video model to enhance the transcription accuracy. I'm going to run this code block. Now if we look at the transcription this time, we have: Google Cloud Speech-to-Text API is a service provided by Google Cloud that allows developers to convert audio to text using machine learning technology. It can transcribe real-time or pre-recorded audio in over 120 languages. So as you can see, by specifying a model based on your use case, we can actually increase the transcription accuracy just like that.

Now let's assume that your audio file is longer than 60 seconds. In that case, let me open the demo two.py file. So here I'm going to grab this code block. I'm not going to run the operation; I'll simply show you the script that I use to transcribe audio from Google Cloud Storage. So here, let me close my terminal. To transcribe an audio file that is longer than 60 seconds, we'll pass the Google Cloud Storage object URI first, and it's going to use the syntax gs:// followed by the bucket and object path. The RecognitionConfig object is still the same, except that when we load the audio file, we need to use the uri parameter, and we'll pass the object's URI. Then from the speech client object, we need to run the long_running_recognize method, and we'll pass the config object and the audio file. So basically, this long_running_recognize method is going to perform the transcription operation on the file from Google Cloud Storage. Once the transcription operation finishes, we can follow the same steps and print the transcript using the same loop. But to retrieve the result, we need to use the operation.result method, and here we can set the timeout duration. I like to set the default to 90 seconds; if a file is longer than 5 minutes, then I'll change the timeout duration to maybe 120 seconds or 200 seconds. It really depends.
So this is going to be everything I'm going to share in this video. And if you enjoyed this video, please give this video a like and subscribe to my channel. And I'll see you guys next time. Bye bye.
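For reference, here is a minimal sketch of the long-audio script described above, including the per-result loop and the video model. The bucket name, object path, and key file name are placeholders (the video does not show the actual bucket), so substitute your own values.

from google.oauth2 import service_account
from google.cloud import speech

# Same authentication as the short-audio example (key file name assumed).
credentials = service_account.Credentials.from_service_account_file('sa_speech_demo.json')
client = speech.SpeechClient(credentials=credentials)

# For audio longer than 60 seconds, reference a file uploaded to Google Cloud Storage
# instead of passing local bytes. The gs:// URI below is a placeholder.
audio = speech.RecognitionAudio(uri='gs://your-bucket-name/long_audio.wav')

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # assumed WAV/LINEAR16
    sample_rate_hertz=44100,
    language_code='en-US',  # assumed
    model='video',          # the model chosen in the video for multi-speaker audio
)

# Asynchronous (long-running) recognition for long audio.
operation = client.long_running_recognize(config=config, audio=audio)

# Block until the operation finishes; raise if it takes longer than 90 seconds.
response = operation.result(timeout=90)

# Print each transcription segment, as in the loop shown earlier.
for result in response.results:
    print(result.alternatives[0].transcript)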
