Guide to Google Cloud Speech-to-Text in Python
Learn to use the Google Cloud Speech-to-Text API in Python, covering pricing, setup, and practical code examples for transcribing audio efficiently.
Getting Started with Google Cloud Speech-To-Text API in Python

Speaker 1: Hey, how's it going, guys? In this tutorial, I'll be covering how to use the Google Cloud Speech-to-Text API in Python. The Speech-to-Text API is one of the Google Cloud service products. Using it, we can transcribe media files such as MP3 and WAV files, or transcribe audio as it streams, such as a podcast. Since I'm trying to keep this tutorial short, I'll only be covering the things I think are most important to get you up and running. Let's go to the product page first. Going to cloud.google.com/speech-to-text takes you to the product page, where you can find all the information related to the Speech-to-Text API. The first thing I want to talk about is pricing. Whenever we use a Google Cloud service product, we need to know how much the service costs, and the pricing page has a pricing table. The pricing structure is divided into two models, the standard model and the enhanced model, and within each model there are two tiers: if you let Google log your usage data, Google gives you a discount; if you don't give Google access to log your usage, there is no discount. Now, the difference between the standard model and the enhanced model is that the enhanced model incorporates data from phone calls and video, which gives it better voice-recognition performance than the standard model. That basically translates to: if you want a more accurate result, use the enhanced model rather than the standard model. Looking at the cost, the first 60 minutes are free for both the standard and the enhanced model, and that quota resets every month, meaning every month you can use up to 60 minutes of the Speech-to-Text API for free. After the first 60 minutes, the standard model costs $0.006 per 15 seconds without data logging, or $0.004 per 15 seconds if you opt in to let Google log your usage. For the enhanced model, it's $0.006 per 15 seconds with data logging and $0.009 per 15 seconds without. That's everything I want to cover in terms of the pricing structure. Next, I want to talk about the quotas and usage limits. Per the quotas and limits page, if you are providing a local media file, the Speech-to-Text API limits the file to less than 10 megabytes and the media length to no more than one minute. If you want to transcribe a media file that is longer than one minute or larger than 10 megabytes, you need to upload it to Google Cloud Storage first, which I'll show you an example of in this video. That's everything I want to go over on the product page.
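Since usage is billed in 15-second increments, the monthly math is easy to sanity-check. Here's a minimal sketch, assuming the per-15-second rates quoted above and that billing rounds usage up to the next increment:

```python
import math

# Rates quoted above, in USD per 15-second increment (assumed current as of recording).
RATE_STANDARD_NO_LOGGING = 0.006
RATE_STANDARD_LOGGING = 0.004

def estimate_cost(audio_seconds, rate_per_15s, free_minutes=60):
    """Estimate the monthly Speech-to-Text cost for a given amount of audio."""
    billable_seconds = max(0, audio_seconds - free_minutes * 60)
    increments = math.ceil(billable_seconds / 15)  # usage is billed in 15-second blocks
    return increments * rate_per_15s

# Example: 3 hours of audio in a month, standard model, no data logging.
# 180 minutes minus the 60 free = 120 billable minutes -> 480 blocks -> $2.88.
print(estimate_cost(3 * 3600, RATE_STANDARD_NO_LOGGING))
```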

Speaker 2: Now I want to go into the Google Cloud console.

Speaker 1: Before we can use any Google Cloud service, we need to enable the service first. But first, make sure you sign up for a free Google Cloud account if you don't have one, and next, make sure you create a Google Cloud project. For this video, I'm going to use my Google Cloud demo project. Once you choose the project you want to use, click on the navigation menu, then APIs & Services, then Library. The first API we need to enable is the Speech-to-Text API; the API name is going to be Cloud Speech-to-Text API. Make sure this API is enabled. Next, we want to enable the Cloud Storage API.

Speaker 2: Let me go ahead and enable this API first. The next one is going to be... here, let's go back. The next one is going to be Cloud Storage.

Speaker 1: And from this list, we can enable both the Cloud Storage API and the Google Cloud Storage JSON API. Once you enable all the required Google Cloud services, we can create a Python program to interact with the Speech-to-Text API. Now open your Python editor; here I'm going to use VS Code. To be able to interact with the Speech-to-Text API in Python, we need to install the Python SDK first. To install the library, we use the command pip install --upgrade google-cloud-speech. In this tutorial, I'll show you two examples, which I think will cover the majority of use cases. Now, let me go into my project folder first. For the examples, I'll be using these three media files. This one is a roughly three-minute-long media file, "My second story is about love," from the speech Steve Jobs gave in 2005 at Stanford. This media file is three minutes and 18 seconds long, and the file size is going to be six megabytes. This file I'm going to upload to Google Cloud Storage first, and then I'll show you how we can pass it to the Speech-to-Text API. For the local files, I'm going to use these two files, one MP3 file and one WAV file. The reason I want to use these two files as examples is to show you the performance difference between an MP3 file and a WAV file. Let me play the demo audio, the WAV file, first. This is a six-second-long clip: "Hi, my name is Jay. I teach Python and Excel on YouTube." So the audio's content is: my name is Jay, and I teach Python and Excel on YouTube. Now let's go into our code editor. In the demo.py script, I'm going to import the os module, and let me increase the font size. To import the speech library from Google Cloud, we want to import the speech module. Oh, and one more thing: make sure you download the client service key as a JSON file. Here my client service key is named client_service_key.json, and I'm going to copy its file path into my Python script. The way we authenticate to a Google Cloud service is via an environment variable, and we can create a temporary environment variable using os.environ, a mapping that lets you set environment variables from the os module. The Google Cloud client is going to look at the GOOGLE_APPLICATION_CREDENTIALS environment variable to locate the client service key file. So here we set the environment key GOOGLE_APPLICATION_CREDENTIALS equal to the client service key path. Then I can create my speech client; I'll name it speech_client, equal to a speech.SpeechClient object. Let me run this code block first. And I created the speech client successfully. Let me terminate the session. If I comment out line four and try to run this code block, since I didn't provide a client service key, I'm going to get a DefaultCredentialsError: could not automatically determine credentials, and it gives you the explanation: please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and rerun the application. Now let me uncomment this line and create the speech client instance. For the first example, I'm going to show you how to transcribe a local media file, and as I mentioned before, the file size has to be less than 10 megabytes and the file length less than one minute. Now let me grab the media file paths first, one MP3 file and one WAV file.
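That authentication setup boils down to a few lines. A minimal sketch, assuming the downloaded key sits next to the script (the file name here is just the one I used in the video):

```python
import os
from google.cloud import speech

# Point the client library at the downloaded service-account key.
# GOOGLE_APPLICATION_CREDENTIALS is the variable it checks for credentials.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "client_service_key.json"

# Create the Speech-to-Text client; this raises DefaultCredentialsError
# if the environment variable above is missing or points to a bad file.
speech_client = speech.SpeechClient()
```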
I'll name the two file-path variables media_file_name_mp3 and media_file_name_wav. But I want to go to the documentation first. Let's take a look at the Cloud Speech-to-Text documentation: under Reference, we want to go into the REST reference, and under the REST reference, look at the RecognitionAudio resource. When we upload a media file, we need to wrap it in a RecognitionAudio object first. This resource takes two different types of content: one is a Google Cloud Storage object URI, basically the object's file path, or we can create the object by providing the media's byte content, which is what you use when you want to upload a local file. So let's go back to our Python script. I'm going to grab the byte strings of both media files first. I'm going to open the media file, the MP3 file, and read it as binary. Let's call the file handle f1, and I'll name the byte data object byte_data_mp3, equal to f1.read(). Then I create the speech RecognitionAudio object, and since I'm providing the byte string, I need to use the content parameter, equal to byte_data_mp3. I'll name the output audio_mp3. Oh, I forgot the equals sign. I'll copy this code block, just make a copy, and rename the variables: this one is going to be f2, and it's going to be the WAV file. This is step number one: load the media files to transcribe. For step number two, we need to configure the media file output, using the RecognitionConfig resource. Below it in the docs is the JSON representation of the inputs. We can specify things like the language code, in case the speech is in a different language; the sample rate hertz, which I think is a required field; the media encoding; the Speech-to-Text API model; and whether or not you want to use the enhanced model. It covers other properties too, which I'll let you read in the documentation, since it has a description for each field. Now, to create my RecognitionConfig objects, I want to configure the MP3 file first. I'll name the object config_mp3, equal to speech.RecognitionConfig. I know that for the MP3 file, the sample rate hertz is going to be 48,000. I also want to enable punctuation, and the parameter name is enable_automatic_punctuation; I'll set that to True. And the language code is going to be English. That's everything I'm going to provide in terms of configuration. I'll copy this code block, paste it, and change mp3 to wav. Since I'm uploading a WAV file, we also need to specify the audio channel count, which for this WAV file is going to be two, and the sample rate hertz is going to be 44,100. Now I want to create my config objects and my audio objects, so I'll run line 11 all the way to line 34. I have a typo; oh, this should be enable_automatic_punctuation. Let me try again. Okay, I'm able to create both objects successfully. Step number three is going to be transcribing the RecognitionAudio objects. And this comment should be "load media files," not "transcribe media files." Because I'm transcribing a local media file, the file size is smaller than 10 megabytes, and the media length is less than one minute, we can use the recognize method here.
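Here's a rough sketch of what steps one and two look like put together; the file names are hypothetical stand-ins, and the sample rates and channel count are the ones from my demo files, so adjust them for your own audio:

```python
from google.cloud import speech  # continuing from the setup sketch above

# Step 1: load each local media file as a byte string and wrap it
# in a RecognitionAudio object via the content parameter.
media_file_name_mp3 = "demo_audio.mp3"  # hypothetical file names
media_file_name_wav = "demo_audio.wav"

with open(media_file_name_mp3, "rb") as f1:
    byte_data_mp3 = f1.read()
audio_mp3 = speech.RecognitionAudio(content=byte_data_mp3)

with open(media_file_name_wav, "rb") as f2:
    byte_data_wav = f2.read()
audio_wav = speech.RecognitionAudio(content=byte_data_wav)

# Step 2: configure the transcription for each file.
config_mp3 = speech.RecognitionConfig(
    sample_rate_hertz=48000,
    enable_automatic_punctuation=True,
    language_code="en-US",
)
config_wav = speech.RecognitionConfig(
    sample_rate_hertz=44100,
    enable_automatic_punctuation=True,
    language_code="en-US",
    audio_channel_count=2,  # the demo WAV is stereo
)
```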
If we go into the speech module, we have two different methods: the recognize method and the long_running_recognize method. The recognize method is used when you want to transcribe a local media file, or any media file that is less than 10 megabytes and less than one minute long; otherwise, you use the long_running_recognize method. The long_running_recognize method requires one extra step, which I'll show in a second. Let me show you how to use the recognize method first. To transcribe the RecognitionAudio object, from the speech client object we call recognize. We provide the config object to the config parameter — let's transcribe the MP3 file first, so that's config_mp3 — and we provide the RecognitionAudio object to the audio parameter. I'll name the output response_standard_mp3. Now I'll make a copy of this code block and rename mp3 to wav. Let me go ahead and execute these two code blocks. It's saying audio_wav is not defined; oh, this should be f2. Let me create the config_wav object. Right, so here, let me print response_standard_mp3, and let me go ahead and create the response_standard_wav object. I'm going to print both objects so you can compare the results. If we look at the MP3 file first: "Hello, everyone. My name is Jay" — here it's recognizing the J as "Jerry" — "and I teach Python in Excel on YouTube," where it should be "and." Versus when I upload the WAV file: because WAV files can store a lot more data, we get a more accurate result. Here, the first line is "my name is Jay," and the second line is "I teach Python and Excel on YouTube," which is the result I'd expect. So far I've covered how to use the recognize method. Now I want to show you the last example, which is how to transcribe a long media file; this is going to be example three, transcribing a long media file, and it builds on examples one and two. To transcribe a long media file, we need to upload the file to Google Cloud Storage first. Let me go into my cloud console, into the dashboard. From the dashboard, we click the navigation menu, go into Cloud Storage under the Storage section, and select the Cloud Storage browser. On my Cloud Storage, I already created a bucket, which is equivalent to a folder; I named the bucket speech-to-text-media-files. Let's go into the bucket. Inside, I have two media files, and we'll transcribe the WAV file. This file is 33.4 megabytes, and we know the media length is three minutes and 18 seconds. From the object details page, we want to grab the URI link, which is this link right here. Now I'll create a variable called media_uri and assign the URI link to it. And from the speech client object, I'll create the RecognitionAudio — let me check something — okay, I want to assign the media_uri variable to the uri parameter. I'll name the output long_audio_wav. Oh, this should be uri. Next, I want to configure the audio output. So from the speech client, RecognitionConfig: we need to specify the sample rate hertz, which is going to be 48,000; we want to enable punctuation; and I want to set the language code to English. For the audio model, let's use enhanced — and the parameter should be use_enhanced; I'll set that to True — and let's use the video audio model. I'll name the output config_wav_enhanced. Let me create these three objects.
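Once the typos get sorted out (as you'll see next, the resources come from the speech module, not the client object), example three's setup looks roughly like this; the bucket and object names are hypothetical stand-ins for the gs:// URI you'd copy from your own object details page:

```python
from google.cloud import speech  # continuing from the sketches above

# Step 1: point RecognitionAudio at the file already uploaded to Cloud Storage,
# using the uri parameter instead of content.
media_uri = "gs://speech-to-text-media-files/steve_jobs_speech.wav"
long_audio_wav = speech.RecognitionAudio(uri=media_uri)

# Step 2: configure the output, this time opting into the enhanced "video" model.
config_wav_enhanced = speech.RecognitionConfig(
    sample_rate_hertz=48000,
    enable_automatic_punctuation=True,
    language_code="en-US",
    use_enhanced=True,
    model="video",
)
```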
And it's saying the attribute isn't found. Let me type the resource name again: RecognitionConfig, which... again, oops. Oh, I know why: this should be speech, not speech_client, because the resources come directly from the speech module, right? So let's go back, and the RecognitionAudio should come from speech as well. Okay. Now, once we've created the RecognitionAudio object and the RecognitionConfig object, we need to process the media file. So from speech_client, we call long_running_recognize; basically, we're just replacing the recognize method with the long_running_recognize method, so I can simply copy and paste the parameter statements. I'll name the output operation. Now let me go ahead and create the operation object — and here I'm getting an AttributeError. Oh, this should be long_running_recognize. Let me try again. Now this time I'm getting a different error; let's see, let me go ahead and recreate everything from scratch. Okay, this time everything worked. Now, if I print the operation object, it simply prints the object ID, and what I'm looking for is the transcribed result. To retrieve the result, we reference the operation object's result method — oh, I did this incorrectly; this should be result followed by a timeout argument. This statement here is what actually performs the transcription task. I'm going to name the output response. Now, if I print the response object... this is what the result looks like, and I see that something is incorrect. Let me take a look. Oh, this should be long_audio_wav, not audio_wav. Now let me rerun the operation. So I have a typo; this should be RecognitionAudio. Let me rerun the process again. And here are the results. Let's take a look: here we have all the results for the media file. If I want to print just the transcript or the confidence value, what we need to do is iterate over each result. So here: for result in response.results — let me just double-check. Okay, so response.results returns us a list. If I simply want to print the transcript, I reference the transcript attribute, and if I want to print the confidence value, I'll type result.confidence. Let me try. I'm not so sure this is correct. And here I'm getting an AttributeError; let me take a look. I think the correct syntax should be result.alternatives. Let me take a look. It should be alternatives. So this should be result.alternatives.
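Stepping back, the long-running call we just debugged comes down to this; the 90-second timeout is an arbitrary value I'd assume is comfortable for a three-minute file:

```python
# Kick off the asynchronous transcription job for the Cloud Storage file.
operation = speech_client.long_running_recognize(
    config=config_wav_enhanced,
    audio=long_audio_wav,
)

# Block until the job finishes (or the assumed 90-second timeout expires);
# this call is what actually performs the transcription.
response = operation.result(timeout=90)
```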

Speaker 2: And it's an empty line.

Speaker 1: Now let me try again. Another AttributeError... oh, I see. Okay, so we need to reference the element location, and the element location is always going to be zero. Right, so here we have the transcription, as well as the confidence value on a separate line. All right, that's everything I wanted to share in this video. Hopefully you guys found it useful, and as always, thank you for watching. I'll see you in the next video.
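So the working loop ends up like this; each result carries a ranked list of alternatives, and index zero is the top-ranked one:

```python
# Iterate over the transcribed segments, printing the top alternative's
# transcript and its confidence score on separate lines.
for result in response.results:
    print(result.alternatives[0].transcript)
    print(result.alternatives[0].confidence)
```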
