Master OpenAI Whisper: Speech-to-Text Essentials
Learn how to use OpenAI's Whisper model for speech-to-text. From setup to transcribing and translating, become an expert with our step-by-step guide.

Speaker 1: Hello everyone. In this video, we're going to learn how to use OpenAI's Whisper model, which provides state-of-the-art speech-to-text. Speech-to-text can be used for a wide range of applications like conversational apps, voice commands, transcribing different types of audio, and translation from different languages. The API is fairly straightforward to access, but there are a few tips and tricks you can learn to make you an expert at Whisper speech-to-text.

Let's start by going to the documentation. Go to platform.openai.com/docs and, on the left-hand sidebar, go down to the Endpoints section and click on Speech-to-Text. OpenAI's speech-to-text has two endpoints, Transcriptions and Translations. Transcriptions is for doing speech-to-text in the same language, so if you have German audio, it's going to transcribe it into German text. You access the Transcriptions endpoint with client.audio.transcriptions.create, then you provide the model, which is whisper-1, and the audio file, and it returns an object which contains the text. Translations is when you want to transcribe a foreign language into English. So, for example, German audio would be transcribed into English text. You access that endpoint with client.audio.translations.create. There are 98 supported languages for Whisper, and one of the great things about it is that it automatically detects the language. So you don't have to specify the language; it will detect it and do any translation if that's needed.

You can add timestamps to the transcription by including the timestamp_granularities parameter. Whisper can handle a maximum of 25 megabytes, which is generally an audio file of 20-25 minutes. If the file is longer than that, you can break it up into different segments and process them separately with something like PyDub.

You can also add a prompt as an additional parameter, and this is to improve the quality of the output. One of the benefits is that if your audio has a lot of industry-specific jargon, adding some of that jargon to the prompt can help the model figure it out and account for it. If you split your audio up into different segments, it can sometimes help to use the text from the previous segment as the prompt for consistency. Adding an example of well-formatted, punctuated text in the prompt can also improve the output. The model naturally removes a lot of filler words like "um" and "ah", but you might actually want to keep those, so you can include a prompt with text that contains these words. Why would you want to keep them? Well, maybe you have some type of voice coaching app, or an app that needs to analyze somebody's speech to grade it or provide advice on how to improve it. And for some languages with very different writing systems, it helps to include an example in the prompt. Finally, for improving readability, if you're not getting the results you want from the prompt parameter, you can also do post-processing with GPT-4, basically running the transcription through GPT-4 with instructions to apply a certain type of formatting.

Okay, so let's code this up into an app. I'll open up Visual Studio Code, and I've created a project folder, Whisper Tutorial. I'm going to create a file, main.py. The way this works, if I go back to the documentation, is that first I have to create an audio file and then provide that file to the OpenAI client.
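As a rough sketch of the two endpoints described above, using the current OpenAI Python SDK (the German audio file name is just a placeholder; the API key is read from the environment):

```python
from openai import OpenAI

client = OpenAI()  # uses the OPENAI_API_KEY environment variable

# Transcriptions: same-language speech-to-text (German audio -> German text).
# No language parameter is needed; Whisper detects the language automatically.
with open("german_audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcription.text)

# Translations: foreign-language audio -> English text (German audio -> English text).
with open("german_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )
print(translation.text)
```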
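Word- or segment-level timestamps come back when you also request the verbose JSON response format. A minimal sketch (the file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

with open("output.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],  # or ["segment"], or both
    )

# Each entry carries the word plus its start and end time in seconds.
for word in transcript.words:
    print(word.word, word.start, word.end)
```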
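For files over the 25 MB limit, the splitting approach mentioned above could look roughly like this with PyDub (chunk length and file names are arbitrary; PyDub needs ffmpeg installed):

```python
from pydub import AudioSegment  # pip install pydub
from openai import OpenAI

client = OpenAI()

audio = AudioSegment.from_file("long_recording.mp3")
chunk_ms = 20 * 60 * 1000  # 20-minute chunks, usually well under 25 MB

full_text = []
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk_path = f"chunk_{i}.mp3"
    audio[start:start + chunk_ms].export(chunk_path, format="mp3")
    with open(chunk_path, "rb") as f:
        part = client.audio.transcriptions.create(model="whisper-1", file=f)
    full_text.append(part.text)

print(" ".join(full_text))
```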
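The prompt parameter and the GPT-4 post-processing step might be wired up roughly as below; the filler-word/jargon prompt text, the system instructions, and the gpt-4o model name are illustrative placeholders, not from the video:

```python
from openai import OpenAI

client = OpenAI()

# A prompt containing jargon, filler words, or well-punctuated example text
# nudges Whisper toward that style. When splitting long audio, the previous
# chunk's text can also be passed here for consistency.
with open("output.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="Umm, let me think, uh... we discussed the QX-9 flux coupler and the Zyntrix manifold.",
    )

# Optional post-processing pass for formatting and readability.
completion = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "Reformat this transcript into clean, punctuated "
                                      "paragraphs without changing the wording."},
        {"role": "user", "content": transcription.text},
    ],
)
print(completion.choices[0].message.content)
```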
I'm going to write some code so we can do all this with Python, but first I'm going to have to install some libraries. I'll create a requirements.txt file, and the libraries I'll need will be openai so we can access Whisper, python-dotenv to help load the API key, sounddevice to record the audio, scipy to save the audio file, and then another library called keyboard just so I can control when it's recording. I'll save this and open a terminal, and I can install the libraries all at once from the requirements file. But first, I'll create and activate a virtual environment so that nothing I do here will affect the rest of my system. Then I can install my libraries with pip install -r requirements.txt.

Once that's done, I can go back to the main.py file and build the audio recorder. I'll import some libraries, then I'll create a function called record_audio. This function will wait for the user to press the Enter key, then it will record the audio using the sounddevice library. I just chose standard numbers for the sample rate and number of channels parameters. The recording stops when the user presses the Enter key again, and then finally the audio is saved using SciPy's write method. Now we can test this. I'll save it, open the terminal, and run python main.py. Press Enter to start recording. "Hello, this is a test, one, two, three." Okay, and we can see up here that it created an output.wav file. I'll open this and test it. "Hello, this is a test, one, two, three." Perfect, it works.

Now we can start to integrate Whisper and the speech-to-text. Go back to the documentation and, in the QuickStart, copy the code for transcriptions. Then go to your main.py file and create a new function called speech_to_text. Just paste the code inside the function for now. We can move this import; I'll put it up here with the other libraries. And here as well, where we create the OpenAI client object, I'll just move that up here. We'll also need an OpenAI API key. I'm going to store that in a .env file. Go back to platform.openai.com. Log in, or sign up for an account if you don't have one already. Then go to the Dashboard and, on the left-hand side, API keys, and you can create a new API key for this project. Once you have that, paste your secret API key into your .env file and save it. Go back to your main.py, and we're going to use python-dotenv to get the API key from the .env file. I will just add that as a parameter here: api_key equals os.getenv and then the name of the key, which is OpenAI API.

Great, and that should all be set up now. So let's go back down to the speech_to_text function. We're creating a file called output.wav, so we need to be able to add that. Okay, so what's happening here is we are recording audio with the record_audio function, and we're saving that audio to a file called output.wav. Then, in the speech_to_text function, we open that audio file and pass it as a parameter to the Transcriptions endpoint, which returns a transcription object containing the transcribed text. I'll create another function called transcribe, and this will be in a while loop. We'll just run the record_audio function, then use the speech_to_text function to transcribe that audio, and then print that transcription to the terminal. Then I'll have to change the call here, underneath if __name__ == "__main__", to the transcribe function. Now I can save this and run it. "Hello, this is a test."
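A sketch of what the finished main.py described above might look like. It assumes requirements.txt lists openai, python-dotenv, sounddevice, scipy, and keyboard (NumPy comes along with those), and that the .env variable is named OPENAI_API_KEY; the sample rate, channel count, and exact structure are assumptions rather than the video's exact code:

```python
import os

import numpy as np
import sounddevice as sd
import keyboard
from scipy.io.wavfile import write
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # read the key from the .env file
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))  # variable name assumed

SAMPLE_RATE = 44100  # standard values for sample rate and channels
CHANNELS = 1


def record_audio(filename="output.wav"):
    """Record from the default microphone between two presses of Enter."""
    print("Press Enter to start recording...")
    keyboard.wait("enter")
    print("Recording... press Enter again to stop.")

    frames = []

    def callback(indata, frame_count, time_info, status):
        frames.append(indata.copy())  # collect each incoming block of samples

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=CHANNELS, callback=callback):
        keyboard.wait("enter")

    audio = np.concatenate(frames, axis=0)
    write(filename, SAMPLE_RATE, audio)  # save the recording with SciPy
    return filename


def speech_to_text(filename="output.wav"):
    """Send the recorded file to the Whisper Transcriptions endpoint."""
    with open(filename, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return transcription.text


def transcribe():
    while True:
        record_audio("output.wav")
        print("Transcription:", speech_to_text("output.wav"))


if __name__ == "__main__":
    transcribe()
```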
And I got an error here just because I didn't put the .wav on the file name. So I'll save that again and clear the terminal. "Hello, this is a test." And we can see the transcription here: "Hello, this is a test." So that worked. I'll press Enter again to test it one more time. "There are lots of applications for speech-to-text: transcribing different types of audio like meetings, lectures, interviews, or podcasts. There's language translation. I like the idea of voice assistants and being able to interact with AI in an actual verbal conversation." And I just had Whisper transcribe what I said, and it did a perfect job with excellent formatting.

That's all for this video. Please remember to hit the like button and subscribe if you want to see more of this content in the future. Thanks for watching. I'll see you in the next video.
