Nova AI: Unmatched Speed, Accuracy, and Affordability
Discover Nova, Deepgram's top AI model, excelling in speed, accuracy, and low cost. Ideal for varied audio scenarios, offering easy, affordable access.
Nova: The world's most powerful speech-to-text API | Deepgram
Added on 01/29/2025

Speaker 1: Hello, Internet. Deepgram has just released a brand new AI speech recognition model. It's called Nova, and it's our best model yet. Here's why. First of all, let's talk about performance. Can it understand what you're saying even if there's a bunch of noise in the background? The answer is yes. Here are the stats. Today, we'll cover Nova's accuracy, speed, and cost. Then, we'll discuss how Nova works and why it's become the new benchmark for speech-to-text.

Alright, first things first, let's talk about accuracy. The main statistic here is word error rate. As a refresher, word error rate is the number of mistakes a speech recognition model makes when transcribing, divided by the number of words in the true transcription. Specifically, those mistakes are insertions, deletions, and substitutions. Word error rate has become the gold-standard measure of quality for speech recognition models industry-wide: a lower word error rate means a better model. And when you compare Nova's word error rate to that of other industry models, you'll find that it has them beat.

We tested Nova on hours upon hours of human-annotated audio from real-life situations. These audio files include phone calls, podcasts, IVR calls, and more. Not to mention, the dataset encompasses diverse audio lengths, accents, environments, and subjects, ensuring a practical evaluation of real-world performance. The result? An overall word error rate of 9.5% on the median file tested, a 22% lead over the nearest provider. Long story short, from OpenAI's Whisper to Google's Speech-to-Text, Nova far outperforms its competition.

Let's take a look at the stats as we stratify by domain. This box plot depicts the range of word error rates for each model, and the number in the middle of each box is the median word error rate for that model in that domain. For example, when transcribing podcasts, we see that Nova's median word error rate is a mere 4.3%, while models like Whisper, Google Video, and Conformer-1 have a median word error rate of nearly double that. The same story can be seen with phone calls, meetings, and videos. Nova makes fewer mistakes than its AI counterparts across all of these real-world use cases.

But what about speed? How long does it take for Nova to transcribe an hour of audio, and how does that compare with other models? Well, it turns out that Nova is faster than all the other speech-to-text models as well, as we can see in this graph. Here, we display the median inference time for each of these ASR models. And as you can see, at 12.1 seconds, Nova is anywhere from 23 to 78 times faster than comparable vendors. Oh yeah, and we diarized the transcript too, meaning we were able to distinguish each speaker in the audio. Note that, for the speed test, results are based on numerous trials of the same file for each vendor, so every vendor gets a fair opportunity to showcase its performance. So once again, not only is Nova more accurate than its competitors, it also produces transcriptions much more swiftly. An order of magnitude more swiftly, to be exact.

But alright, we've talked about speed and accuracy, so what about cost? Is Nova expensive? The answer is no. You won't have to break the bank to use Nova: it costs less than a cent to transcribe a minute of audio. Check out this chart for more information. As you can see, Nova is the most affordable ASR model out there, being 3 to 7 times less expensive than any other full-functionality provider.
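[Editor's illustration] To make the word error rate definition above concrete, here is a small, generic Python sketch of the standard edit-distance calculation (insertions, deletions, and substitutions divided by the number of reference words). This is not Deepgram's benchmarking code, just the textbook formula.

```python
# Word error rate (WER): minimum number of insertions, deletions, and
# substitutions needed to turn the hypothesis into the reference,
# divided by the number of words in the reference transcription.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # substitution (or exact match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown socks"))  # 0.25
```

Under this definition, a lower score simply means fewer mistakes relative to the human-annotated reference, which is the sense in which the 9.5% and 4.3% figures above are reported.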
This was not an accident. This was a design choice. We wanted Nova not only to be the best AI out there; we wanted it to be the most accessible and most affordable, too. So we made it happen. But perhaps that begs the question: how? How did we manage to pull this off? Well, here's the secret recipe. First, our transformer architecture is unique, far from the classic vanilla CNN-transformer model. Second, we applied advanced data preparation techniques, so not only does our model listen in a unique, state-of-the-art way, but what it listened to while training was also unique and state-of-the-art. And finally, we used a meticulous multi-stage training approach. Or, to simplify these punchlines: Nova has a unique build, Nova consumed a unique dataset, and Nova trained in a unique way. Let's break it down.

Our architecture and its underlying algorithms are proprietary, so we can't reveal any deep secrets here. But here's what we do. Given an input audio, we chop it up into small chunks. That is, we discretize the spectrogram. These chunks are then used to create an embedding, a numerical representation of the audio. This process is called encoding. We've taken a recording of, say, a phone call, and encoded it as a bunch of numbers. The computer can then treat this audio like a number, taking this numerical representation and decoding it into words. Again, we don't want to reveal our specific recipe, but that's how our encoder-decoder model works. You can read more about such transformer models and the math behind them in our blog post here. Link in the description.

Next, Nova consumed a unique dataset during training. This dataset contains carefully curated, high-quality, domain-specific data. That is, we don't just train on datasets in the public domain; we also train on real-world audio, from medical calls to meetings.

Finally, our multi-stage training approach distinguishes Nova from its alternatives. Initially, the model is trained on the aforementioned vast dataset of unique audio data in a weakly supervised manner. Then it undergoes fine-tuning in multiple stages with more domain-specific data. This training process is ultimately what leads Nova to outperform other ASR models in speed and accuracy. In fact, the transcript and subtitles of this very video were created using Nova.

Oh yeah, and as a cherry on top, Deepgram is also releasing a fully managed Whisper API. That is, you can use Whisper through Deepgram. Why would you ever do this? Well, here are a few rapid-fire reasons. On OpenAI, Whisper has a 25MB limit, meaning if you have an audio file that's over 25MB, aka roughly 23 minutes of MP3, you'll encounter an error. Yuck. Use Deepgram Whisper Cloud and you won't run into that issue. If you want diarization, Whisper doesn't offer that out of the box; Deepgram Whisper Cloud does. If you want word-level timestamps, same story. Deepgram Whisper Cloud also allows for on-prem deployments and access to all Whisper models.

So yeah, needless to say, we've been a bit busy. But we're doing this for you. You, who's building an app or website that requires speech-to-text. You, who records podcasts, videos, and social media content. You, the engineers and data scientists who just love AI. Deepgram has your back. Sign up on our site and you'll receive over 40,000 minutes of free transcription without even having to put a credit card down. We even offer pre-written notebooks that let you use Nova without having to copy-paste any code. Just take your API key and run.
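[Editor's illustration] To make the encode-then-decode flow described above a bit more tangible, here is a toy encoder-decoder sketch in PyTorch. Nova's real architecture is proprietary and explicitly not a vanilla transformer, so every dimension, layer count, and name below is an arbitrary placeholder that only illustrates the general spectrogram-frames-to-tokens pipeline.

```python
# Toy encoder-decoder ASR model: spectrogram frames ("chunks") are projected
# into embeddings, encoded, and then decoded into text-token logits.
# All sizes are illustrative placeholders, not Nova's actual configuration.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # placeholder token vocabulary
N_MELS = 80         # mel-spectrogram bins (a common choice, assumed here)
D_MODEL = 256       # embedding width, arbitrary

class ToyASRModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Project each spectrogram frame into the model dimension.
        self.frame_proj = nn.Linear(N_MELS, D_MODEL)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.token_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, spectrogram, tokens):
        # spectrogram: (batch, frames, N_MELS); tokens: (batch, seq_len)
        memory = self.encoder(self.frame_proj(spectrogram))      # encode audio
        hidden = self.decoder(self.token_embed(tokens), memory)  # decode text
        return self.out(hidden)  # logits over the token vocabulary

# Dummy forward pass: one clip of 200 spectrogram frames, 10 target tokens.
logits = ToyASRModel()(torch.randn(1, 200, N_MELS),
                       torch.randint(0, VOCAB_SIZE, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 1000])
```

In a real system the decoder would run autoregressively with causal masking during training and something like beam search at inference time; those details are omitted here for brevity.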
Further video tutorials on those notebooks are available on our YouTube channel. And as usual, follow us at Deepgram for more AI content.
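[Editor's illustration] As a rough idea of what "take your API key and run" looks like in practice, here is a minimal Python sketch that sends a local audio file to Deepgram's pre-recorded /v1/listen endpoint and requests the Nova model with diarization. The exact model names, query parameters, and response shape may have changed since this video was published, so treat this as an approximation and check the current Deepgram docs before relying on it.

```python
# Minimal sketch: transcribe a local MP3 with Nova via Deepgram's REST API.
# Endpoint, parameters, and response fields reflect the API around the time
# of this video and are assumptions to verify against current documentation.
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # from your Deepgram console

with open("phone_call.mp3", "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={
            "model": "nova",      # request the Nova model
            "diarize": "true",    # label each speaker, as in this transcript
            "punctuate": "true",  # add punctuation and capitalization
        },
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/mpeg",
        },
        data=audio,
    )

result = response.json()
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
```

With diarization enabled, the response should also carry per-word timestamps and speaker labels, which is what allows a transcript like this one to be split out by speaker.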
