Implement Google's Speech-to-Text API in Projects

Learn how to integrate Google's Speech-to-Text API, convert audio, and transcribe efficiently. Follow this guide for successful API implementation.

How to Use Google's Speech-to-Text API with Python (2025)

Added on 01/29/2025

Speaker 1: Hi guys, welcome back to the channel and today I'm going to show you how you can implement Google's speech-to-text API in your project. So guys, let's get straight into it. All right.

Speaker 2: So first of all, head over to your Google Cloud console and then search for the Speech-to-Text API.

Speaker 1: Now you need to enable this API. If it is not enabled in your console, then an Enable API button will appear here. All right, so after you enable it, you have to download the GCP keys for your account, which are available in the IAM & Admin section. Then you head over to the Service Accounts section, and you can select any service account

Speaker 2: you have created before.

Speaker 1: Go to the Keys tab, and now you can create a new key with this button, Create new key. All right, so now we can head over to the coding part. So for this video, I have taken this audio as a sample, which goes like this.

Speaker 3: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun.

Speaker 1: Right, so we are going to transcribe this audio. First of all, we need to install the requirements. For this, we need to install the Google Cloud Speech library for Python, which goes like: pip install

Speaker 2: google-cloud-speech. Now install this library.

Speaker 1: Right, after installing, we can head over to the coding part. And first of all, obviously, import this library from google.cloud.

Speaker 2: And then import speech_v1. I'll just import it as speech. Now I can import os just to refer to my Google key.

Speaker 1: All right. So now we can begin the implementation part. First of all, you need to create a client for Google Speech, which can be created by client equals speech

Speaker 2: .SpeechClient, and I'll create this with from_service_account_file, which I can refer to as my GCP key .json file, because the file currently exists in my

Speaker 1: current directory. All right, so the client is set up; now we can run its functions. The thing with the Speech-to-Text API is that the audio needs to be mono audio instead of stereo audio. So first of all we have to create a function that converts our stereo audio to mono audio. To do that, I'll just create a function, convert_to_mono, and I can pass in

Speaker 2: input_file and output_file.
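As a rough sketch of the client setup just described: the file name `gcp_key.json` and the helper name `make_client` are assumptions for illustration, since the video only says the key is a .json file in the current directory. The `google-cloud-speech` import sits inside the function only so the snippet can be loaded before the package is installed.

```python
import os

# Hypothetical file name: the video only says the service account key is a
# .json file sitting in the current directory.
KEY_PATH = os.path.join(os.getcwd(), "gcp_key.json")


def make_client(key_path=KEY_PATH):
    """Create a Speech-to-Text client from a service account key file."""
    # Fail early with a clear message if the key file is missing.
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"Service account key not found: {key_path}")

    # Imported lazily so this sketch loads even without google-cloud-speech.
    from google.cloud import speech_v1 as speech

    return speech.SpeechClient.from_service_account_file(key_path)
```

Calling `make_client()` with a valid key should return a client whose `recognize` method is used later in the walkthrough.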

Speaker 1: All right, for this we have a library in Python: we can import AudioSegment from pydub, so from pydub import AudioSegment. This AudioSegment is used for converting a stereo file to mono. Now we can implement the functionality by running audio equals AudioSegment.from_file, passing in our input file as the parameter, and then we can set its channels to one with set_channels, which converts it to mono-channel audio. All right, now we can export the audio by running audio.export with the output file, and the format we can keep as wav. So this function will be used to convert our stereo audio to mono audio, which is necessary for this Speech-to-Text API. All right, now we can also write a function to transcribe the audio. I'll just name it transcribe_audio, and I'll just pass the input file as the

Speaker 2: parameter. All right, now we can open the file with the with open statement,

Speaker 1: and we need to read the file in binary mode, so we have to write rb for that.
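The stereo-to-mono helper described a moment ago might look like the following sketch. It assumes the third-party `pydub` package is installed (`pip install pydub`); returning the output path is a convenience added here, not something the video shows.

```python
def convert_to_mono(input_file, output_file):
    """Write a single-channel WAV copy of input_file to output_file."""
    # pydub is a third-party package; imported lazily so this sketch can be
    # loaded before it is installed.
    from pydub import AudioSegment

    audio = AudioSegment.from_file(input_file)
    # set_channels returns a new segment rather than modifying in place.
    mono = audio.set_channels(1)
    # export returns the open output file handle, so close it when done.
    mono.export(output_file, format="wav").close()
    return output_file
```

For WAV input and output, pydub should not need ffmpeg; other formats generally do.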

Speaker 2: and I'll just open it as audio_file. After opening the audio file, we can work with the audio content. So first of all we need

Speaker 1: to extract the content from the wav file, and to do that, audio_content equals audio_file.read(); that will read the audio and return its content as raw bytes. Now we can run the Speech-to-Text API functions. We can run the client functions, like audio equals

Speaker 2: speech.RecognitionAudio with content equals audio_content.
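Reading the file and wrapping its bytes could be sketched as two small helpers. The helper names `read_audio_bytes` and `make_recognition_audio` are made up for this example; `speech.RecognitionAudio(content=...)` is the library call the speaker refers to.

```python
def read_audio_bytes(input_file):
    """Return the raw bytes of an audio file, read in binary mode."""
    with open(input_file, "rb") as audio_file:
        return audio_file.read()


def make_recognition_audio(audio_content):
    """Wrap raw audio bytes in the object the Speech-to-Text client expects."""
    # Imported lazily so this sketch loads even without google-cloud-speech.
    from google.cloud import speech_v1 as speech

    return speech.RecognitionAudio(content=audio_content)
```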

Speaker 1: So content, basically, wraps the audio content in an object that is readable by the Speech-to-Text client. Now we can set up our recognition configuration. We can configure it with config, which is speech.RecognitionConfig. Now we can enter our specific configuration. For demonstration purposes, I'll just put in the standard configuration I require, starting with an encoding parameter, which I can set to speech.RecognitionConfig.AudioEncoding.LINEAR16. This is the most basic encoding type we can use in the config. All right, now we can set sample_rate_hertz to the sample rate. For this, we need to pass in the actual rate of the audio file. So for that, we can use another module that is built into Python, the wave module from the standard library. So we can import wave, and now we can open the audio sample with the wave module, and then we can get the sample rate by running the wave file's getframerate function, which will return the sample rate of the audio file. So we have extracted the sample rate here, and we can put it in the config. Now we also need to set the language code, the language the original audio is in. So now we have completed the configuration for the speech client. Now we can actually run the function to transcribe the audio, which is client.recognize. I'll just save that in a response variable, which is client.recognize with config equals config and audio equals audio. In order to actually see the transcription we need to parse the response, and for parsing we can loop through response.results and print the transcript as result.alternatives, its zeroth index, .transcript. Similarly, if you want to print the confidence metric, we can use .confidence to print the confidence of that audio recognition.
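Putting the configuration steps above together, a minimal sketch might look like this. The function names and the `"en-US"` default are assumptions; the video does not say which language code was used. `client` is expected to be a `speech.SpeechClient`.

```python
import wave


def get_sample_rate(wav_path):
    """Read the sample rate (frames per second) from a WAV file header."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate()


def recognize_wav(client, wav_path, language_code="en-US"):
    """Send a mono LINEAR16 WAV file to the API and return the raw response."""
    # Imported lazily so this sketch loads even without google-cloud-speech.
    from google.cloud import speech_v1 as speech

    with open(wav_path, "rb") as audio_file:
        audio = speech.RecognitionAudio(content=audio_file.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        # Must match the file's real rate, so read it from the header.
        sample_rate_hertz=get_sample_rate(wav_path),
        # Assumed value; set this to the language spoken in the audio.
        language_code=language_code,
    )
    return client.recognize(config=config, audio=audio)
```

Reading the rate from the header avoids hard-coding a value that may not match the file.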

Speaker 2: Now we can actually return response, which is the original response we got from the API.
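Once the response is returned, pulling out the transcript and confidence could be sketched as below. The helper name `summarize_response` is hypothetical, and `fake_response` is a stand-in object with made-up values that only mimics the shape of a real response (`results`, then `alternatives`, then `transcript` and `confidence`), so the example runs without calling the API.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Return (transcript, confidence) pairs, one per result in the response."""
    pairs = []
    for result in response.results:
        if not result.alternatives:
            continue
        # The first alternative is the one the recognizer ranks most probable.
        best = result.alternatives[0]
        pairs.append((best.transcript, best.confidence))
    return pairs


# Stand-in shaped like a recognize() response; the values are made up.
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(
            alternatives=[
                SimpleNamespace(transcript="hello world", confidence=0.9)
            ]
        )
    ]
)

for transcript, confidence in summarize_response(fake_response):
    print(f"Transcript: {transcript}")
    print(f"Confidence: {confidence:.2f}")
```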

Speaker 1: I think now we can run the functions. First of all we need to convert our audio to a mono audio file with the convert_to_mono function; then we can transcribe the audio, passing the mono audio as the parameter. Now if I run this file it should give me the transcription and confidence. So the name of the audio was sample audio instead of audio sample. Now it should give me the transcription of the audio. And you can see that it has converted to mono and you have also got the sample rate, which is 44100. So yeah, there's the transcription for the speech we passed to the function, which is: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun. That is the exact transcription of the audio we passed in, and it also gives us the confidence parameter. This basically represents how confidently the client recognized the audio transcript. So yeah, that's how we can use the Speech-to-Text API for transcribing audio. This can be used in your projects. If you are using a speech-to-text client or something like that, you can implement this in your project. So yeah, thanks for watching.