Comparing Google's Chirp and OpenAI's Whisper Models
Explore a detailed comparison of Google's Chirp and OpenAI's Whisper speech-to-text models, covering accuracy, flexibility, cost, and performance insights.
Google's Chirp AI vs. OpenAI's Whisper AI (Speech-to-Text)
Added on 01/29/2025

Speaker 1: Hello. In this video, we're going to compare Google's new Chirp speech-to-text model with OpenAI's Whisper speech-to-text model. Both of these models are very similar in that they use the latest AI and ML approaches to create a text transcription of audio data. Now, there are a few ways that we could approach this. The first would be to look at benchmarks. However, this is a little tricky in that it may not be relevant for our specific use case, and there are also potential issues with these models having inadvertently trained on data that's used in benchmarks, which would give inaccurate results. Another approach would be to look at the theoretical behavior and constraints of these models, and then use that information for decision-making. However, a purely theoretical understanding can miss important practical details. The last option, which is the one we will use, is a data-driven approach: running experiments to collect real data on which model works better for our use case. So we'll walk you through, step-by-step, how to test these different models, and we'll also comment on some important similarities and differences between these models, including accuracy, considerations around long audio files, flexibility, cost, and performance. This will give us a nice evaluation of these models, and we'll make some quick commentary in each category, as well as, when possible, explicitly note which model is better in each of these categories. So let's jump right in. We're going to use the Google and OpenAI APIs to minimize and simplify the setup. For running these experiments, we'll utilize Kubeflow Pipelines. If you choose to run your experiments in a different way, that's completely fine, and we think you'll find that the concepts that we go over here will be quite portable to other experimentation frameworks. So let's create a repo, and we'll share all the code in the description of this video. As is usual with Kubeflow Pipelines, we'll want a Python file that will compile our pipeline specification. So let's bring in some code for this, and then walk through the concepts. We'll also note that the Chirp model is just a couple of days old at the time of this video, and so the way that you consume that model is a little bit unusual and requires a bit more configuration code compared to some of Google's older speech-to-text models. So we're going to use two different files for our pipeline compilation. The first one is our pipeline.py file, which is our primary file for compiling the Kubeflow Pipeline specification. In the pipeline definition, we see that we're going to loop over some WAV file paths that will be in Google Cloud Storage, and for each we're going to run a transcription with Chirp and a transcription with Whisper. If we look at the functions for those, we're just going to be using the Python SDKs for those two APIs. Again, we're not running any models locally here, just consuming APIs for this transcription. In the Google Chirp case, note that we do need to use a regional endpoint at the moment for this model, so we're specifying that. With the V2 version of speech-to-text, we're creating recognizers with configurations that are then used for transcription tasks. In here, we're specifying that by default we want automatic punctuation, we want to use the Chirp model, and that the audio is in US English. We'll create that recognizer if it doesn't exist, and then call our recognize method with the config and the request.
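To make that Chirp setup concrete, here is a minimal sketch of what such a transcription step might look like with the google-cloud-speech V2 Python SDK. The project ID, region, recognizer ID, and file name below are placeholder assumptions for illustration, not the exact values from the video's pipeline code.

```python
# Minimal sketch of a Chirp transcription step (assumed names and paths).
from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import AlreadyExists
from google.cloud import speech_v2

PROJECT_ID = "my-project"       # assumption: your GCP project ID
REGION = "us-central1"          # Chirp currently requires a regional endpoint
RECOGNIZER_ID = "chirp-en-us"   # hypothetical recognizer name

# Point the client at the regional endpoint, as mentioned above.
client = speech_v2.SpeechClient(
    client_options=ClientOptions(api_endpoint=f"{REGION}-speech.googleapis.com")
)

parent = f"projects/{PROJECT_ID}/locations/{REGION}"
recognizer_name = f"{parent}/recognizers/{RECOGNIZER_ID}"

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp",
    features=speech_v2.RecognitionFeatures(enable_automatic_punctuation=True),
)

# Create the recognizer if it doesn't exist yet; creation is a long-running operation.
try:
    client.create_recognizer(
        request=speech_v2.CreateRecognizerRequest(
            parent=parent,
            recognizer_id=RECOGNIZER_ID,
            recognizer=speech_v2.Recognizer(default_recognition_config=config),
        )
    ).result()
except AlreadyExists:
    pass  # already created on a previous run

# Synchronous recognize: the audio bytes are sent directly in the request,
# which is why this path is limited to short files.
with open("example1.wav", "rb") as f:  # hypothetical local copy of the WAV
    audio_bytes = f.read()

response = client.recognize(
    request=speech_v2.RecognizeRequest(
        recognizer=recognizer_name, config=config, content=audio_bytes
    )
)
transcript = " ".join(r.alternatives[0].transcript for r in response.results)
print(transcript)
```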
Now, notably, this is going to read the audio content and send it directly in the API call, so we have limitations on the size of the audio file here. That said, there is a batch recognition currently available that we can use if we need longer files. Bringing up the Google Cloud documentation around this specifically, we see that this new Chirp model is available for both recognize and batch recognize, allowing for audio up to about 8 hours. We are going to use the short case here though. Once we run that transcription, we'll save it out to a text file. In the OpenAI case, we're going to get our key from a separate file and then run the transcription and again save that output to a text file. Now, note that this approach does actually put your OpenAI API key in your pipeline YAML specification, so consider hardening the configuration management here for sensitive production applications. With that, we'll get some text files for the audio that we provide across these two different transcription models and we'll look at the results there. Similar to the Google case, OpenAI does have limitations on the speech-to-text length, but unlike the Google case, this is not really something that we can configure or get around by using a different API endpoint. File uploads are limited to 25 megabytes, which is probably about 10 to 15 minutes for most audio files. You'll need to break apart your audio into chunks if you have more audio than this amount. Before we even start to run this model, we can make some observations in this setup. For long audio files, OpenAI Whisper doesn't work very well, as the API simply does not accept these long audio files. Meanwhile, the Google Chirp speech-to-text has better support for this and it has another endpoint for longer files. I'll also note that while the Whisper model can, in theory, be run yourself, we've noticed that there tend to be quite a lot of issues with hallucinations for those long files, so you probably want to break apart your audio in either case; even if you're running the model yourself and can technically run the transcription on the full audio, you'll probably have a lot of issues with hallucinations. In this regard, we think that Google Chirp is certainly better in the long audio category, but let's continue to explore more areas of the comparison matrix here. To compile our pipeline, we first need to install KFP. We're going to use KFP version 2 by specifying the --pre flag in the pip installation.
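For comparison, the Whisper side of the pipeline boils down to a single API call. Here's a rough sketch using the openai Python SDK; the key-file handling and file names are placeholder assumptions, and as discussed, whatever you upload needs to stay under the roughly 25 MB limit.

```python
# Rough sketch of the Whisper transcription step (placeholder key file and file names).
from openai import OpenAI

# Assumption: the key is read from a local file, as in the video.
# For production, prefer a secret manager over baking the key into pipeline config.
with open("openai_key.txt") as key_file:
    client = OpenAI(api_key=key_file.read().strip())

# The uploaded file must be under ~25 MB; split longer audio into chunks first.
with open("example1.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# Save the transcript out to a text file, mirroring the Chirp step.
with open("example1_whisper.txt", "w") as out:
    out.write(result.text)
```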

Speaker 2: Alright, and now we have our pipeline file. In Google Cloud, we're going to use a storage

Speaker 1: bucket for storing this specification. In our Google Cloud console, let's go to Cloud Storage where I've already set up a bucket for our pipeline specification as well as the outputs. If you haven't set one up already, just give it a name and then you can accept all the other defaults if you're following along. Let's take our pipeline demo file and copy it into this Cloud Storage bucket that we have here and we'll use that specification in just a moment to run our experiments. If we refresh, we'll see the pipeline.yaml file in our Cloud Storage bucket, so we're ready to go. We have some example WAV files that we're going to use, which are just audio clips from a recent video that we've created. Since we have a transcript to compare these against, and it's extremely unlikely that either of these models has seen this audio, these will make good test examples.
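If you'd rather script this than use the console, the compile-and-upload step can be done from Python as well. A short sketch, assuming a pipeline function named transcription_pipeline in pipeline.py and a bucket name of your own (both are illustrative assumptions):

```python
# Sketch: compile the pipeline and push the YAML spec to the Cloud Storage bucket.
from kfp import compiler
from google.cloud import storage

from pipeline import transcription_pipeline  # hypothetical import from pipeline.py

# Compile the Kubeflow pipeline definition into a YAML specification.
compiler.Compiler().compile(
    pipeline_func=transcription_pipeline,
    package_path="pipeline.yaml",
)

# Upload the spec to the same bucket used for pipeline outputs (assumed name).
bucket = storage.Client().bucket("my-pipeline-bucket")
bucket.blob("pipeline.yaml").upload_from_filename("pipeline.yaml")
```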

Speaker 2: Let's now jump into the Vertex AI pipelines to be able to run our experiment. When we create the run, we'll want to import from Cloud Storage and specify the path,

Speaker 1: or you can browse if you prefer. It's going to automatically extract some information and then we'll specify the output directory. Let's use the same bucket as before. In our paths, we're going to need to specify the files that we have in Cloud Storage and the project ID. That should be everything that we need. As you're running through this, keep in mind that there is caching for tasks that complete successfully. If you rerun your pipeline, it will try not to repeat tasks that haven't changed and have already run to completion. If you'd like, you can disable reading from cache to ensure all steps run, even if they're cached. Most of the time, though, you can take advantage of this cache for performance improvements. If we open up our Kubeflow workflow, we've imported the WAV files for each of the three cases: Example 1, Example 2, and Example 3. We're now running the transcription across both the Chirp and Whisper models via their APIs, provided by Google and OpenAI, respectively. We'll periodically click through these different interfaces to see how our models are progressing across the three different audio files. Our Example 3 is the shortest of the bunch, at about 8 megabytes, so we'll watch this one for completion first.
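The same run can also be kicked off programmatically with the Vertex AI SDK instead of through the console. Here's a hedged sketch; the bucket, WAV paths, parameter names, and project ID are placeholders for whatever your pipeline actually expects. Note the enable_caching flag, which mirrors the cache behavior just described.

```python
# Sketch: submitting the compiled pipeline as a Vertex AI pipeline run.
# Bucket names, parameter names, WAV paths, and project ID are illustrative.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="chirp-vs-whisper",
    template_path="gs://my-pipeline-bucket/pipeline.yaml",
    pipeline_root="gs://my-pipeline-bucket/outputs",
    parameter_values={
        "project_id": "my-project",
        "wav_paths": [
            "gs://my-pipeline-bucket/audio/example1.wav",
            "gs://my-pipeline-bucket/audio/example2.wav",
            "gs://my-pipeline-bucket/audio/example3.wav",
        ],
    },
    enable_caching=True,  # set to False to force every step to rerun
)
job.run()
```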

Speaker 2: We're starting to notice some failures across our Whisper files, so let's check that.

Speaker 1: It's not liking how the OpenAI key is specified in our pipeline, so this conveniently gives us a great example of how to update these pipelines. Let's change the key to be passed in as a string parameter from our pipeline,

Speaker 2: and then, if we specify it here when we create the run, that's where the required key will come in. Let's go ahead and compile again. We can use the console over here to run that compilation and upload with fresh authentication. All right, let's use the new YAML specification. To compare apples to apples on performance, let's avoid caching any of the Kubeflow components, since we did already have a successful run of the Chirp speech-to-text.
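In code terms, the fix amounts to threading the key through as a pipeline parameter instead of hard-coding it inside the component. A minimal sketch of that shape, with the component body elided and all names assumed:

```python
# Sketch: pass the OpenAI key in as a run-time pipeline parameter rather than
# embedding it in the compiled YAML. Names here are illustrative assumptions.
from kfp import dsl


@dsl.component(packages_to_install=["openai"])
def transcribe_whisper(wav_gcs_path: str, openai_api_key: str) -> str:
    # ...download the WAV, call the Whisper API with openai_api_key, return the text...
    return ""


@dsl.pipeline(name="chirp-vs-whisper")
def transcription_pipeline(project_id: str, openai_api_key: str):
    # The key now arrives as a parameter supplied when the run is created,
    # so it never gets baked into pipeline.yaml.
    transcribe_whisper(
        wav_gcs_path="gs://my-pipeline-bucket/audio/example1.wav",
        openai_api_key=openai_api_key,
    )
```

Keep in mind that a parameter value still shows up in the run's metadata, so for truly sensitive deployments you'd likely swap this for a secret manager, as noted earlier.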

Speaker 1: We want to see how the speed of transcription compares versus the OpenAI case. It's time for Take 2. It looks like we have a completed result with the Chirp model.

Speaker 2: Going across some of the other runs, let's see if we can get a better look at the Chirp speech-to-text. We can see that the Chirp speech-to-text is a little bit slower than the OpenAI case, so let's go ahead and run it again.

Speaker 1: OK, it's still running there, and in this case,

Speaker 2: we noticed the Whisper model completed a little bit quicker. Both models are now completing for the Example 1 file, with the Google model still running for the Example 2 and 3 files. It's almost complete and we just have one more transcription to finish out of the six

Speaker 1: total across the three different tests. We're done. So across the three cases, the Whisper model was quicker for two and the Chirp model was quicker for one. If we look back at the pipeline, there's a little bit more room to optimize on the Google side. Specifically, we're trying to create this recognizer every time, which is an extra API call we don't really need on every run; after the first run, we're always going to get a 409 "already exists" error when we try to recreate the recognizer.
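One way to address that would be to hoist recognizer creation into its own pipeline step, so Kubeflow's caching skips it on reruns and the per-file transcription components only call recognize against the returned resource name. A sketch of what such a component might look like; the component name, packaging, and configuration values are assumptions rather than the video's actual code:

```python
# Sketch: a separate, cacheable component that creates the recognizer once and
# returns its resource name for the transcription steps to reuse.
from kfp import dsl


@dsl.component(packages_to_install=["google-cloud-speech"])
def ensure_recognizer(project_id: str, region: str, recognizer_id: str) -> str:
    from google.api_core.client_options import ClientOptions
    from google.api_core.exceptions import AlreadyExists
    from google.cloud import speech_v2

    client = speech_v2.SpeechClient(
        client_options=ClientOptions(api_endpoint=f"{region}-speech.googleapis.com")
    )
    parent = f"projects/{project_id}/locations/{region}"
    try:
        client.create_recognizer(
            request=speech_v2.CreateRecognizerRequest(
                parent=parent,
                recognizer_id=recognizer_id,
                recognizer=speech_v2.Recognizer(
                    default_recognition_config=speech_v2.RecognitionConfig(
                        auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
                        language_codes=["en-US"],
                        model="chirp",
                    )
                ),
            )
        ).result()
    except AlreadyExists:
        pass  # created on a previous run; nothing to do
    return f"{parent}/recognizers/{recognizer_id}"
```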

Speaker 2: But in any case, it looks like the performance is pretty close between the two models.

Speaker 1: Let's mark that in our matrix here as sort of a toss-up between the two models for performance. The performance is pretty good across both of these. So, you know, it took about two minutes for the Whisper model and about a minute and 42 seconds for the Chirp model. If we look at the other iterations, we get a somewhat similar result of about two minutes for this audio file. Next up, let's look at the transcript outputs across the models. Oftentimes, transcription accuracy is the most important metric in our experimentation.

Speaker 2: So we'll look at that with side by side comparisons.

Speaker 1: If we look word for word, there's this odd non-word that the Chirp model generated where we actually said "prompts." Otherwise, the words here are pretty much identical. But as you can tell from some of the punctuation and capitalization, there is a difference in how the two systems have dealt with proper nouns and sentence structuring. Across our testing, it seems to be the case that the Chirp model uses the audio to dictate the punctuation. So, for example, when we have long pauses, it will put in a period. On the OpenAI side, it seems to insert punctuation based on a more statistical model. This creates a transcript through OpenAI that is a little bit more ready for use in publishing, but that may not be quite as reflective of the actual language use in the original audio. Let's note some of these differences in our evaluation and create another metric for punctuation. Our guess here is that OpenAI's is a more statistical method and Google's is a more audio-driven method. It seems, though, like OpenAI Whisper is probably going to be better in most cases. The punctuation just comes out clearer, its capitalization of proper nouns seems to be a little bit better, and it might have a win there. In terms of word accuracy, both are very good. However, so far, based on that one misspelled word on the Google Cloud side, OpenAI Whisper may have a slight edge.

Speaker 2: Let's look at a couple more examples real quick to see.

Speaker 1: Alright, so we've again got the Chirp Google model on the left side and the Whisper OpenAI model on the right. Again, we see important differences in the punctuation, but again, there's a little bit of an odd word misspelling, just a single one in this transcript. We have a misspelling of "inferences" on the Chirp side, whereas it's spelled correctly in the OpenAI case. So really similar observations, and across a number of different example files, these results tend to hold. Indeed, let's keep the winner on the OpenAI Whisper side, even though both models are very good, and again, your use case is really going to dictate how well each one of these two models performs. Now let's talk about flexibility. As we alluded to in the OpenAI Whisper API discussion, the Google Cloud model is highly flexible because it's part of a broader, long-standing speech-to-text suite. We could use some more traditional flagship models in our transcription, like the default or video-enhanced transcriptions, which could very well be better for this use case. Chirp is not always going to outperform some of these mature models that Google has already pioneered in their speech-to-text suite. Across all these interfaces, we have tons of flexibility to use different model adapters, recognizers, configurations, languages, and ultimately, different ways to structure our transcription input and output. Meanwhile, we're really constrained on the Whisper side, so there's very low flexibility there and very high flexibility on the Google side, with the clear winner being Google with their cloud offering. The last piece to touch on is cost. If we open up OpenAI's pricing, we can see that the audio model for Whisper costs about

Speaker 2: six tenths of a cent, or $0.006, per minute.

Speaker 1: If we compare that to Google's pricing for the Chirp model, which is included among the standard models in the V2 API, it starts at about 1.5 cents per minute. So just a little over double the price of OpenAI. However, at higher volumes it does eventually drop down to a price that's lower than OpenAI's. For most use cases, though, OpenAI is going to be a little bit cheaper, even though we're not talking about large percentage differences. We'll mark both as competitively priced, but with the winner leaning towards OpenAI. As you can see, the comparison of these two models is a bit nuanced, and hopefully we've highlighted how you can start to evaluate these for yourself, since benchmarks on generic speech-to-text tasks may have low relevance for your specific dataset. The nuances of your audio structure, the environment, and noise are important. You may see completely different performance characteristics for your own applications. With that, we hope this was a helpful introduction to these two models and want to thank you for watching.
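To put those per-minute rates in concrete terms, here is a quick back-of-the-envelope calculation using the approximate prices quoted above; both rates change over time and with volume tiers, so treat them as illustrative and check the current pricing pages.

```python
# Back-of-the-envelope cost comparison using the approximate rates quoted above.
WHISPER_PER_MIN = 0.006  # ~0.6 cents per minute
CHIRP_PER_MIN = 0.015    # ~1.5 cents per minute (first pricing tier)

for hours in (1, 10, 100):
    minutes = hours * 60
    print(
        f"{hours:>3} h of audio: "
        f"Whisper ~${minutes * WHISPER_PER_MIN:.2f}, "
        f"Chirp ~${minutes * CHIRP_PER_MIN:.2f}"
    )
```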
