Why WER Is Failing—and How to Benchmark ASR Better (Full Transcript)

As ASR improves, flawed truth files and WER’s equal weighting distort results. Learn correction workflows, semantic metrics, MER, and production A/B testing.

[00:00:00] Speaker 1: Hi everybody, just waiting for people to jump in. Hope everyone's having a good day. Well, it's 12:31, so we're going to go ahead and get started here. I'm Zach from the Applied AI team here at Assembly, and today we're going to be showing you some of the trends that we've noticed around benchmarking and how you can fix your benchmarking in production. I'm joined by my colleague, David. David, do you want to go ahead and intro yourself?

[00:00:49] Speaker 2: Hey, how's it going everyone? David, I've been with Assembly for a few months now, and I'm on the Applied AI team with Zach.

[00:00:57] Speaker 1: We're going to go ahead and do a few live demos, but first I'm going to share my screen and walk you through a presentation covering some of the market trends we've noticed. So as I said, this workshop today will really be about how you can fix your ASR benchmarks. We've noticed some pretty crazy trends since we launched our newest model, Universal 3 Pro, that have led us to believe that word error rate, the most popular benchmarking metric in the market, is probably broken. And not only that, your truth files are probably broken too. We're also joined by Ryan Seams, VP of Customer Solutions. Anyway, for those of you who haven't seen this metric, word error rate is a very simple, crude metric that shows how good your transcription from an AI transcription vendor is. Basically, it measures in the numerator all of the errors that occur within your transcription: substitutions, where a word is replaced with the wrong word; deletions, words that were in the ground truth but removed from the transcription; and insertions, or hallucinations, words that weren't in the ground truth but appeared in the AI transcription. For years, this has been the go-to metric for speech-to-text evaluation, and over the past 10 years, as AI transcription models have gotten better and better, we've seen a very steady decline in word error rate. Obviously, lower word error rate is better, because it means there are fewer errors in the transcription. Now, I want to walk you through how our customers typically run word error rate benchmarks. They'll take about 10 to 20 of their audio files that are representative of their use case and submit them to a human transcription provider to get the ground truth. Basically, that's what goes in the denominator of the metric. Then they'll submit those audio files to various speech-to-text models from different vendors and get the predictions. A really important step before submitting them for a word error rate evaluation is running a normalization model. What these models do is remove punctuation, capitalization, and formatting from the transcription, so that simple capitalization or punctuation differences don't impact the calculation of the metric. Then they'll run word error rate evaluations across all the vendors and compare the scores of all those models. So right off the bat, this word error rate metric is kind of flawed to begin with, because not all errors are treated equally. For example, there are very harmless errors that get dinged by a word error rate metric. The very obvious one is "OK" versus "okay." If these differ between the transcription and the human ground truth, it indicates to the user that an AI model is performing worse when in reality it's exactly the same. And this is treated equally to an incredibly dangerous error, something like a company name being wrong, or someone's name being wrong, or even worse, hallucinations, words the AI invents that were never spoken at all. If you read some of our old blog posts and research papers, Whisper, the model OpenAI released, is excellent, but it's famous for these types of hallucinations.
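
To make the mechanics concrete, here is a minimal, self-contained sketch of that workflow, assuming nothing beyond the description above: a naive normalizer (standing in for something like the open-source Whisper normalizer) plus a word-level edit-distance WER that splits out substitutions, deletions, and insertions, with WER = (S + D + I) / N where N is the reference word count.

```python
# Minimal WER sketch: normalize, then word-level edit distance with backtracking
# to split the total into substitutions (S), deletions (D), and insertions (I).
import re
from dataclasses import dataclass

@dataclass
class WERResult:
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        return (self.substitutions + self.deletions + self.insertions) / max(self.reference_words, 1)

def normalize(text: str) -> list[str]:
    # Lowercase and strip punctuation so pure formatting differences don't count.
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def word_error_rate(reference: str, hypothesis: str) -> WERResult:
    ref, hyp = normalize(reference), normalize(hypothesis)
    rows, cols = len(ref) + 1, len(hyp) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]
            else:
                cost[i][j] = 1 + min(cost[i - 1][j - 1],  # substitution
                                     cost[i - 1][j],      # deletion
                                     cost[i][j - 1])      # insertion
    # Backtrack to attribute each edit.
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            s += 1; i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1; i -= 1
        else:
            ins += 1; j -= 1
    return WERResult(s, d, ins, len(ref))

# All five "errors" here are harmless variants, which is exactly the problem:
# 3 substitutions + 2 deletions over 6 reference words -> WER ~0.83.
print(word_error_rate("okay I am going to check", "OK, I'm gonna check").wer)
```
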
So my favorite one is that one of our researchers actually recorded themselves eating an apple, and it produced some pretty wild hallucinations in the transcription. But those hallucinations are treated the same as the simple "OK" versus "okay" or "gonna" versus "going to" errors, which keep all the same context and content. So when we launched Universal 3 Pro, a really crazy thing we noticed from some of our customers who were testing the model with their existing benchmark evaluation methods is that they started to report that the word error rate was actually worse with Universal 3 Pro than with our previous Universal 2 model, which is also a very good, robust model. This was very confusing to us. Our whole research team took a good look at what was going on with this trend, because it was very clear to us that Universal 3 Pro was not only better, but state of the art compared to any other model we've seen in the market. So we, the Applied AI team, as the go-to-market-facing folks at the organization, took on the painstaking mission of going insertion by insertion through all of these audio files to see who was right. Were we wrong in these cases, or was our transcription actually more accurate than the human-labeled ground truths? And what we found is that the Universal 3 Pro model is so accurate that it was producing insertions in the transcription that the human transcribers were just not catching in the ground truth. One of the ways we identified this is in the metrics themselves: going back to the word error rate breakdown of substitutions, deletions, and insertions, we noticed an incredibly large number of insertions on the benchmarks our customers were sharing with us when they used Universal 3 Pro. When we dug into those insertions in production, we found that we were actually getting it correct and the ground truths were wrong. This is all covered in a blog post that we released recently. However, we went ahead and created a solution for this issue for our customers, in the form of a tool that launched in our dashboard last night ahead of this workshop. It's called the Truth File Corrector. Basically, you upload the audio files that you have human-labeled ground truth files for. You can also use open-source files that you might find on Hugging Face or other open-source corpora of audio with human-labeled transcriptions; that's actually what we're going to show you here in a second. You then transcribe those files with Universal 3 Pro and compare every insertion, deletion, or substitution, any of the word error rate errors you find between the two transcriptions. By clicking through them, you end up with a much more accurate word error rate for the specific files you're evaluating. And it has the added benefit that at the very end, you can copy and paste the corrected truth file into your existing evaluation and get much more accurate results through the benchmarking pipeline you've always used. So this tool is in the dashboard now.
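
Under the hood, a corrector like this starts from a word-level alignment between the reference and the hypothesis. Here is a rough sketch using the standard library's difflib; it omits the audio-timestamp linking the dashboard tool provides, and the sample strings are taken from the demo below.

```python
# Align human reference against ASR output and surface each insertion,
# deletion, and substitution for a human reviewer to adjudicate.
import difflib

def disputed_spans(reference: str, hypothesis: str):
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield op, " ".join(ref[i1:i2]) or "-", " ".join(hyp[j1:j2]) or "-"

truth = "i should know i want to talk a little bit about your family"
asr = "i should know i want i want to talk a little bit about your family"
for op, ref_span, hyp_span in disputed_spans(truth, asr):
    # A reviewer listens to this span and decides: keep truth, or accept ASR.
    print(f"{op:>8}: truth='{ref_span}'  asr='{hyp_span}'")
```
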
So I'm going to go ahead and jump over to the dashboard to show this. If you go to your dashboard, it's right at the bottom left: this Truth File Corrector tool. What I'm going to do is take a real open-source file to showcase the power of this tool: the truth file, and this audio file. Then I'm going to click Compare Transcriptions. The truth file pops up on the left side of the screen and the Universal 3 Pro transcript pops up on the right. This file is a fake medical conversation. Here you go. So here's the ground truth file, here's Universal 3 Pro, and you can see all the differences here on the right side of the screen. I might not be sharing my audio; can Ryan and David confirm if you can hear this?

[00:09:24] Speaker 3: I should know. I want to talk a little bit about your family.

[00:09:27] Speaker 1: Oh, perfect. OK, cool. So you can see this makes it really easy for users to listen back and see whether Universal 3 Pro is correct in this case, or whether the ground truth is. So the first issue that we see here in this file is this "I want": it's in the Universal 3 Pro transcription, but not in the ground truth. So if I click here again to listen back: I should know.

[00:09:50] Speaker 3: I want to talk a little bit about your family history. OK, I want what I want to talk.

[00:09:53] Speaker 1: So you see, right there, this "I want" was actually spoken, but the human transcriber missed it. So what I can do is click on the right side of the screen here to say the ASR was right. And as I click through these, we get an updated ground truth file that I can then copy right here and use in my later evaluations, with much more accurate truth files. I know this is a bit of a painstaking process, but it's really the only way to improve truth files for our customers, and it's much easier than the initial process we were going through of listening to the audio file and checking it back to back. So let's go through a few more of these. If I click here, you'll notice the user said "OK" very quietly, almost like a back channel. Back channels in ASR are when two people are talking and one person talks over the other for a moment, maybe cutting into their sentence. These are things that Universal 3 Pro is very good at grasping, that a human listening back to the file as they transcribe it in real time might miss. So I'll go ahead and say that's correct. Go here. Parents. Mm-hmm. OK.

[00:11:09] Speaker 3: All right. What I want to know is, do you have any concerns about any illnesses? No. No. OK. Any concerns? What can you tell me about any problems that your parents or your grandparents have in terms of their health?

[00:11:24] Speaker 1: So I'm going to skip going through the rest of this file, because, you might have guessed it, we get everything right in this file. So if I click through the rest of these with "ASR is right," you can see it updating in real time, and at the very end I get this updated truth file that I can then use in later evaluations to compare with other AI vendors. And this is in our dashboard right now. I would really encourage your team to use it, not only for assessing Assembly AI's transcription quality, but also because correcting your truth files with this tool improves your benchmarking pipeline for all the evals you're going to run in the future. An important point I want to note right here, and we're going to chat about this more with some of the other tools David is going to show everyone, is that right here we transcribed "alright," but in the original it was actually labeled "all right." Because of this difference, this is contributing to word error rate, which ties into the semantic word error rate piece that I mentioned earlier. So if I go back: what were the issues here? Number one, flawed ground truth. If the reference transcript is wrong, the better models are going to score worse, which is actually kind of a crazy thing, because when customers' truth files are wrong, it leads them to believe that worse models, the ones that aren't catching everything actually occurring in these audio files, are better than they truly are. The second piece is the semantic equivalence point I mentioned earlier, and these come up in the form of substitutions: all sorts of things that get labeled as errors when they're effectively the same, like "Mister Smith" versus "Mr. Smith," or "gonna" versus "going to." Now, that last one is a bit different, because it's not the exact same words, but if you think about it in terms of retaining context, and pretty much all of our customers are using the transcription downstream with various LLM processes, or otherwise need to retain the context of their conversations, these are essentially equivalent. And the key piece about the Whisper normalizer is that it actually does catch some of these cases. You can find it online and easily pip install it; it catches things like "don't" changing to "do not." But it doesn't convert things like industry jargon, domain-specific variations, or improved formatting of words like "healthcare" or "offsite", words that might be accurately combined for better formatting and grammar, but that a human labeling the file might write differently. So going through these: for flawed ground truth, you can use the Truth File Corrector to improve your truth files long term. For semantic mismatches, you can use a semantic word list to apply find-and-replace corrections before submitting your files for word error rate evaluations. And then the last piece I want to touch on here is really what I think is probably the best method I've seen from our customers.
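
For the semantic word list idea, a minimal sketch looks like the following: apply a curated equivalence map to both reference and hypothesis before scoring, so semantically identical variants stop counting as errors. The specific pairs here are illustrative assumptions, not a shipped list.

```python
# Find-and-replace normalization driven by a curated semantic word list.
import re

SEMANTIC_MAP = {
    "gonna": "going to",
    "ok": "okay",
    "alright": "all right",
    "dr.": "doctor",
    "health care": "healthcare",
}

def apply_semantic_map(text: str, mapping: dict[str, str] = SEMANTIC_MAP) -> str:
    out = text.lower()
    for variant, canonical in mapping.items():
        # Lookarounds keep us from rewriting substrings of other words,
        # and they tolerate variants that end in punctuation like "dr.".
        out = re.sub(rf"(?<!\w){re.escape(variant)}(?!\w)", canonical, out)
    return out

print(apply_semantic_map("OK, I'm gonna see Dr. Smith about health care"))
# -> "okay, i'm going to see doctor smith about healthcare"
```

Run this on both the reference and the hypothesis before computing WER; that way the replacement is symmetric and never favors one side.
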
And this is kind of a newer trend, I would say, that I've noticed in my time at Assembly AI: most of our customers are now A/B testing models in production. Typically it's a much better indicator of how these transcription models are performing, because ultimately, the way a lot of customers judge their speech-to-text is: how many customer tickets do I have? How satisfied are my customers with the performance of our transcription? How good do the transcripts look? Things like that. By A/B testing with a percentage of traffic and seeing how that impacts customer satisfaction rates, support tickets, and how often humans fix the transcription themselves, you get a much better picture of how well these models are performing in the real world. Another key piece is that a lot of our customers are using LLMs downstream to showcase how well these models are performing, and David's going to show that in the STT SDK that he's going to demo here in a sec. But yeah, just to quickly summarize what we've talked about: word error rate benchmarks can lie, and they're kind of maxing out in terms of their effectiveness right now. Really, the only way to improve them is to go in and improve your truth files, which is a bit of a manual process, which is not very fun. But the good news is that the AI tools we provide here make it much easier. And then using semantic word lists and A/B testing are also really important ways to improve the quality of your benchmarking. So I'm going to go ahead and throw it over to David to talk about the benchmarking SDK.
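
As a rough illustration of that A/B pattern, here is a hedged sketch under stated assumptions: the 10% candidate share and the human-correction-rate metric are illustrative choices, not a prescribed setup.

```python
# Deterministic traffic split plus a simple outcome comparison between arms.
import hashlib

def assign_arm(session_id: str, candidate_share: float = 0.10) -> str:
    # Hash-based bucketing keeps each session sticky to one arm.
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 10_000
    return "candidate" if bucket < candidate_share * 10_000 else "control"

def correction_rate(events: list[dict]) -> float:
    # events: [{"arm": "control", "human_corrected": True}, ...]
    if not events:
        return 0.0
    return sum(e["human_corrected"] for e in events) / len(events)

events = [
    {"arm": assign_arm("session-1"), "human_corrected": True},
    {"arm": assign_arm("session-2"), "human_corrected": False},
    {"arm": assign_arm("session-3"), "human_corrected": False},
]
for arm in ("control", "candidate"):
    subset = [e for e in events if e["arm"] == arm]
    print(arm, correction_rate(subset))
```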

[00:16:56] Speaker 2: Cool. Thanks, Zach. So I'm just sharing my screen. Can you guys see my screen? Yep. Cool. So I'm sharing my IDE here, but I have a little browser open so you guys can see the SDK. This is in GitHub, in our solutions repo. It's a Python SDK for benchmarking speech-to-text, and it comes with a bunch of different metrics you can leverage for your evals: standard metrics like WER, CPWER, and DER for speaker diarization benchmarking, plus semantic WER and missed entity rate, which are the two things we're going to dig into today. I think another interesting workflow that we've seen have some success is this concept of LLM as a judge, or LLM-based evals: using LLMs to infer semantic equivalence and really understand the various outputs from the models, and see where the differences and similarities are, so that you can make the best decision on which output is going to provide the best performance. So definitely check out the SDK; it has a bunch of instructions on how to use it, whether with multiple vendors or as a Python library. It should be pretty flexible either way. So I'm just moving the Zoom controls here, and I'm going to expand my terminal a bit and open up the file directory. You can see I have a couple of demos I'll step through. The first thing I want to do is showcase the example that Zach went through. So if we look at the audio that he used... can you guys hear that?

[00:19:01] Speaker 1: I cannot. Okay.

[00:19:04] Speaker 2: Well, it is the same audio that Zach used, so that's good, and you already heard a little bit of it. But first, let's go into the first eval using this audio. What we're doing here is using both the original truth file and the corrected truth file, which is the output of that Truth File Corrector tool. I ran this ahead of time just so we didn't have to sit here and wait for transcriptions to happen. What we can see is that when we compare Universal 3 Pro to an open-source model like Whisper on the original truth file, Universal 3 Pro actually had a higher WER than Whisper: we had more insertions, and they had a couple of insertions and deletions. So, you know, that's interesting, right? But when you go through the exercise of modifying your truth files and really making sure they match what you're hearing in the audio, you can see that the corrected truth file had a pretty significant impact on our performance as a model. We have a perfect WER, with no insertions or deletions; we got the audio as it was spoken, verbatim, which is really awesome to see. Then, as a next step, you can start passing those results to an LLM to act as a judge, a way to sift through the noise so you're not looking word by word, and start to dig into what actually drove the results. When we do that, we can see that the LLM, in this case we passed it to Claude, and this is actually using our LLM Gateway product. If you're not familiar with that, it's a passthrough that lets you send your transcripts and prompts to frontier LLMs through a single unified API, so it's a great way to tag on that additional functionality without having to use separate APIs. But you can see here that it picked Universal 3 Pro as the decisive winner. When it gets into the details a little bit, you can see some semantic things it picked up, some stutters or filler words that were in the corrected transcripts, that a model like Whisper is not going to pick up right away, and that human transcribers might not pick up either. This is why I think semantic WER and correcting your truth files are so important when you're running these evals. And you can imagine: this was a one-off, but if you're running this at scale over many, many files, you can see how the error compounds over time. So this practice is a good exercise. The next example I want to run through is missed entity rate. We spoke a little bit about what missed entity rate is, but just to set the stage: WER treats every word the same. It could be medical terminology that's misspelled; that's going to be a problem and impact your WER. Missed entity rate tries to solve for this. It lets you extract the named entities from your ground truth files, send those to an LLM, and then compare against the model's prediction: for all the entities in the truth, you check how many survived in the transcription. And that will-

[00:22:37] Speaker 1: Just adding on: this ties into the piece we were bringing up earlier, that word error rate can only take you so far. What's great about missed entity rate is it really focuses on the core things that you care about in your transcription. If you're building a voice agent that needs to pick up credit card numbers or email addresses and things like that, you can use a metric like missed entity rate to really hone in on how well a certain speech-to-text model is performing on the things that are most important to your use case. But, yeah, sorry, go ahead, David. Just wanted to add that piece.
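
As a minimal sketch of the metric itself, where MER = missed entities / total entities: the entity list is hand-written here, whereas in the SDK workflow described above an LLM extracts it from the ground truth.

```python
# Missed entity rate: what fraction of ground-truth entities failed to
# survive in the prediction? Exact substring matching is used here for
# simplicity; real pipelines usually need normalization or fuzzy/LLM
# matching for spoken-form variants ("type 2" vs "type two").
def missed_entity_rate(entities: list[str], hypothesis: str) -> float:
    hyp = hypothesis.lower()
    missed = [e for e in entities if e.lower() not in hyp]
    if missed:
        print("missed:", missed)
    return len(missed) / max(len(entities), 1)

truth_entities = ["metformin", "Dr. Smith", "type 2 diabetes"]
prediction = "dr. smith started her on metformin for type two diabetes"
print(missed_entity_rate(truth_entities, prediction))  # 1/3 missed
```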

[00:23:16] Speaker 2: No, that's awesome. Thank you. So when we run this missed entity rate eval with Universal 3 Pro, which we're using in medical mode, against Whisper, there are essentially a couple of things happening here. We can see that Universal 3 Pro doesn't really miss any of the entities. These are the entities that were found by the LLM and that we were looking to match against in the transcription. I think a really important thing to call out is that we got this drug name right. When you're doing medical scribing or anything that relies on accurate information like that, getting things like this right can make or break the performance of the model and the value it provides in the service. And when you look at the Whisper results, you can see that while it got most of the exact matches, the main thing it didn't get was the actual drug name, and that's going to be pretty critical when you're deciding which model to use. So you can see here that our MER was zero, because we got all the exact matches, whereas Whisper's was about 8.3%. And what we can also do here is pass this to an LLM, which we'll look at in the next example. The full recipe uses the same audio, but we're running everything: semantic WER, capturing insertions, deletions, and substitutions, calculating the missed entity rate for both Universal 3 Pro in medical mode and Whisper, and then having an LLM evaluate the results and provide a score and ranking. And we can see that across the board we performed better here: across WER, across missed entity rate, and across the insertions, deletions, and substitutions. You can see Whisper has many more substitutions. But I'd really like to dig into this a little more. So if I go into my results here, and again, this is available in the repo if you download it, I'm just going to open this one up.
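
The judge step can be as simple as a structured prompt over the reference and both transcripts. Here's a hedged sketch; `call_llm` is a hypothetical placeholder for whatever client or gateway you use, and the prompt shape is the point, not the transport.

```python
# LLM-as-a-judge sketch: pack both transcripts plus the corrected reference
# into one prompt and ask for a structured verdict.
import json

JUDGE_PROMPT = """You are judging two ASR transcripts against a reference.
Reference:
{reference}

Transcript A:
{a}

Transcript B:
{b}

Return JSON: {{"winner": "A"|"B"|"tie", "reasons": [...]}}.
Ignore punctuation/casing; weigh named entities, numbers, and meaning changes heavily."""

def judge(reference: str, a: str, b: str, call_llm) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(reference=reference, a=a, b=b))
    return json.loads(raw)  # assumes the model honors the JSON instruction

# Stub transport for illustration; swap in your gateway call.
print(judge("ref text", "transcript a", "transcript b",
            call_llm=lambda p: '{"winner": "tie", "reasons": ["stub"]}'))
```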

[00:25:46] Speaker 1: A key thing I want to add is that everyone here at Assembly is a heavy Claude Code user. So what you can do is take this benchmarking SDK, which we've open sourced, give the URL to Claude along with your data, and tell it to basically go nuts and start creating benchmarks. That's honestly one of our favorite ways to run some crazy experiments or do evaluations. So I definitely want to mention, and we're going to share all these resources after the workshop, feel free to build on this with some LLM workflows.

[00:26:30] Speaker 2: Yeah, and some of these demo scripts have LLM Gateway baked in, passing results to an LLM. You can certainly also generate results and interact with Claude Code or another coding agent to do that same kind of ad hoc analysis after the fact. But here's one of the outputs of passing to an LLM after you get your transcripts and compute your metrics. This lets you do some additional analysis and get insights into the nuances between the transcripts. You can see here that even though Whisper misses that drug name, which is honestly make-or-break for the transcription output, it also has a couple of things here that we did not miss that really help you understand the flow of the conversation for the person reading it. It's really "let's double check" instead of "it's prudent to double check"; something like that might not be as important. But "lifestyle changes low" versus "lifestyle changes alone"? That's fundamentally a different thing being transcribed. So picking up on those kinds of nuances in more ad hoc ways is really helpful for understanding the performance of your model, versus just a WER or some other computed metric. So again, the critical thing is really that drug name that we got. If you haven't tried medical mode and you have a medical use case, I definitely recommend you try it. It's a new feature of Universal 3 Pro, and it's a parameter that is available now in the API; you can see here, domain equals medical V1. That's what enables the post-processing step in the pipeline.

[00:28:27] Speaker 1: And this is huge for medical MER, missed entity rate, right? Because of the last-mile issues of the medical domain in particular. One of the really awesome pieces about Universal 3 Pro long term, as we continue to enhance the contextual piece with prompting, is that there is specific terminology for these different domains that makes or breaks whether the transcription is actually good. If you're using a speech-to-text model for a medical conversation and it's not getting any of the medication names correct, it's not getting the procedures correct, it's not getting the doctor's name correct, well, how is that at all usable downstream? These are all the kinds of last-mile issues that we're looking to solve here at Assembly. So, yeah, I'm going to share my screen again; there are a couple of last things I wanted to talk about, and then we'll throw it to questions. Actually, do you want to pause and answer the couple of questions that are there right now? I think they're kind of relevant. What was there?

[00:29:27] Speaker 4: Unless you want to wait. No, no, no. Let's do it. There was one that was: are there specific scenarios you've seen where semantic WER doesn't do well, beyond entities, with the ground truth source files?

[00:29:42] Speaker 1: It's a good question. So, basically: are there specific scenarios where using semantic WER can actually cause a negative impact downstream? I think it depends on how you're doing semantic WER. For example, if you look at the example David just showed, if you're using some sort of automated method for semantic WER, you might end up mapping a misspelling of a piece of domain terminology to something else entirely. But if you're doing a manual list, really curating the list yourself for semantic WER and building on a list that you keep long term, and in that SDK you can pass a list, reuse it, and add to it, it's essentially very simple. It's a find and replace. Take a word... yeah, exactly.

[00:30:37] Speaker 2: Sorry for not showing that before.

[00:30:39] Speaker 1: No, all good. But take a word and, over time, continue to add to it. If you do that and keep this curated list specific to your data set, then long term your benchmarks are just going to improve, and you'll have a better sense of how specific models and vendors are performing on your data in general.
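
A tiny sketch of keeping that curated list as a durable artifact you grow over time; the JSON file name is an arbitrary assumption.

```python
# Persist the semantic equivalences as JSON: load before each eval run,
# append new pairs as reviewers find them.
import json
from pathlib import Path

LIST_PATH = Path("semantic_word_list.json")

def load_list() -> dict[str, str]:
    return json.loads(LIST_PATH.read_text()) if LIST_PATH.exists() else {}

def add_pair(variant: str, canonical: str) -> None:
    mapping = load_list()
    mapping[variant.lower()] = canonical.lower()
    LIST_PATH.write_text(json.dumps(mapping, indent=2, sort_keys=True))

add_pair("Dr.", "doctor")        # found in this week's eval
add_pair("offsite", "off-site")  # domain-specific formatting variant
print(load_list())
```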

[00:31:01] Speaker 4: I think I would add that there are domain-specific things you want to look for. So, for example, the word "doctor" versus "Dr." Those types of acronyms, shorthand, formatting, et cetera. The way Universal 3 Pro works under the hood, because it has an actual LLM in the decoder, it's much more likely to produce the formatting you would expect as a human reading something, rather than a traditional ASR that literally outputs every word. So a lot of those nuances for semantic WER become domain-specific and formatting-specific things to look out for. But being able to add them one by one is great.

[00:31:45] Speaker 1: Are there any other questions?

[00:31:46] Speaker 4: Yeah, there was one other around experimenting with using the outputs of the SDK analysis we just looked at as inputs to key terms prompting in U3 Pro. Can you make a corrective model-in-the-loop given this ability? What are some of the early results you could share? Maybe we could talk a little bit about that.

[00:32:09] Speaker 1: Yeah, so we've done some testing around this process of using an LLM to take a previous transcription and use it to improve the overall accuracy of the next one. Ultimately, what we found in the short term is that this produces pretty mixed results. In the short term, you could have an almost human-in-the-loop version, where an LLM identifies certain words that are appearing again and again, you check with a human whether they're right, and based on that you update your key terms list with names or terminology that keeps recurring. But we have thought a lot about this piece in particular, and it's one of the main things we're focused on for the next versions of Universal 3 Pro that we're training: improving the contextual awareness of the model itself. Long term, we'd love to get to a state where you can pass previous context directly to the model and use it to improve the accuracy of future transcriptions. Even better is something like, say, transcribing cardiology appointments; I feel like I keep using a medical use case because we just launched medical mode, but cardiology appointments are a good example. There are certain words that are going to come up again and again: medications related to cardiology, different cardiology-related anatomy, things like that. The goal is that over time, the Universal 3 Pro model gets to a point where it's so smart that if you pass context like "this is a cardiology appointment," the rare words related to cardiology come out accurate in the transcription. And beyond that, we're thinking about improving transcription in general with that kind of approach, where at some point you won't even really need to pass this in the prompt, because the U3 Pro model has contextual awareness, since the decoder was built directly into the architecture. You can think about it like placing a really, really smart transcriber in the room at some random business meeting. At first they might think: what's going on here? I have no idea. But after a while, they'd pick up enough context from the conversation to realize, OK, this is a meeting about, say, tire sales, and then boost any terminology related to that with their LLM brain and improve the overall accuracy. Sorry, that was a long-winded way of saying it; I'll get off my soapbox. But these are definitely things we're thinking about for improving transcription based on context. Right now, though, there isn't really a good flow without a human in the loop for taking previous transcriptions and creating key terms lists. Ryan, David, I don't know if you have any additions

[00:35:04] Speaker 2: to add to that. I think that's fair. I mean, in the streaming context, it is nice that we can update key terms mid-session. So if you are detecting common terms that you might want to instruct the model to handle more appropriately as the conversation goes, that's a really nice feature and something to look into for iteratively improving things mid-session. But yeah, I do think there's something to be said for looking at the transcripts and what's actually getting returned, so you can modify the configuration over time as it relates to your specific use case. It's hard to generalize these things.

[00:35:50] Speaker 4: Nice. There are a couple more here, Zach. I don't know if you want to go through your stuff or keep answering, but we have a couple more if you want. Yeah? OK. So one was basically asking: what does your data show as a reasonable AI-transcript-to-human-review/edit split with respect to time spent? For example, does Assembly AI do 80 percent of the work and a human editor 20 percent, if you're using the corrector tool? What does that look like?

[00:36:19] Speaker 1: So basically, the question is how much time is really spent by a human versus the AI in the loop with transcription. So, with the traditional method, a lot of our customers have these existing data sets that they've had forever, so a lot of this has already been done and they just continue to iterate on them. But the typical process is that a customer will get together 10 or 20 audio files: either their own audio, which is the best way to see how transcription models perform on your data, or representative data, maybe something open source, or just data that's similar to whatever audio is going to be transcribed. Then, and a lot of times we'll help with this, they'll submit those for human transcription. And what a lot of the human transcription market will actually do is use AI as a first pass to generate an initial transcription, and then a human will go through, transcript by transcript, identify specific things the AI might have missed, and improve the transcript that way. But even with this process, the Universal 3 Pro model is only a month or two old. So any truth files that weren't made with Universal 3 Pro taking the first pass, or where a human transcriber was instructed not to transcribe things like false starts or stutters, things the Universal 3 Pro model is capable of capturing, really need a human in the loop to go back over them. And I think that's what's amazing about this tool: it makes it a lot easier to click through and end up with a good truth file you can use again and again. But yeah, AI is doing a lot of the transcription work even for human transcribers these days, and eventually AI will be doing the majority of it. That's kind of where things stand now. Ryan, David, not sure if you have any additions.

[00:38:35] Speaker 4: I almost feel like thinking about it as 100 percent of the time is the challenge. You don't even need to listen to the whole audio file anymore; you just need to listen to the parts where there's some disagreement between your ground truth and the model, or, as some folks are talking about further in the chat, between an ensemble of models and a certain model. The parts where there's inconsistency are the ones to listen in on and spend time on. The amount of time spent is not even close to 20 percent of the whole file. Next one: how do you handle accented speech in your experiments, where articulation may not necessarily be accurate? I'm happy to take it. Generally speaking, our model is trained on a large number of different voices and inputs, and ultimately we want the model to be as robust as it can be to accents. Now, of course, there are certain cases where accents, background noise, crosstalk, or how far I am from the microphone, for example, can influence the results you ultimately see on the other side. That's where things like providing key terms or providing a prompt can help steer the model. If it's unsure because it hears a word that's heavily accented or hard to make out, you can steer the model toward the right output in those scenarios. For the most part, you should find that the model generally handles these situations quite well, and key terms prompting and prompting just help you increase accuracy over time. A really good example of where you might use this: we were talking to a customer with a bunch of accents in the UK. There's a very specific word that's said over and over again in a lot of their conversations, so they've added that word to their key terms prompt so that it's boosted even in some of these harder scenarios with accents. Anything to add?

[00:40:32] Speaker 1: Yeah. The only piece I would add is that audio quality is an interesting part of the discussion, right? Sometimes with audio quality and accented speech, say a lot of our voice agent customers use eight-kilohertz mu-law audio, it's really hard to hear. You can prompt our model to tag things as inaudible or unclear in the audio, which a lot of times is a good thing to use in your benchmarking as well. Because if you ask the model to be as verbatim as possible and make a guess no matter what, it will, but the odds of being incorrect are high if you command the model to guess in every scenario. Whereas if you ask the model, hey, if the audio is truly unclear, just label it as unclear, you can then go back to the specific sections the U3 Pro model labeled as unclear, listen yourself, and go, wow, OK, I don't even know, as a human, how I would transcribe that. And this is actually a pretty funny thing, too, because a lot of times human transcribers will do the exact same thing in their ground truth files. So if the U3 Pro model is coming up with "inaudible" in the same sections as a human transcriber, I don't think anyone in the world is labeling that audio correctly. But yeah, that's all I have to add. Nice. And there's one last one here. So what is

[00:42:07] Speaker 4: the main STT streaming use case for you? Is it AI voice agents? In that case, evaluating STT becomes quite complex because certain words or entities may need to trigger tools or downstream actions. What do you think is the best way to evaluate ASR in that setting?

[00:42:25] Speaker 1: Yeah. And actually, I have a few slides, too. I'm happy to, because I wanted to talk

[00:42:30] Speaker 2: a little bit. I was going to say, is that a good segue to some of those? Yeah. Yeah.

[00:42:34] Speaker 1: Let me share my screen and show some of the slides I had on streaming benchmarks in general. Oh. Wrong button. Yeah. So doing streaming evaluations is just different, obviously, from async evaluation, and it's actually more difficult to benchmark streaming models than async models. The reason is that with async models, you can send an entire audio file up front and get the entire transcript back; it's a very quick process that transcribes a lot faster than real time. But with streaming models, you have to buffer the audio bit by bit. So if you have an hour-long file, you have to keep a WebSocket session open for an hour in order to stream the entire file. So things differ a lot in the metrics when you're evaluating latency. For accuracy metrics, though, they're exactly the same between async and streaming, and really the focus is on this missed entity rate metric. To recap, missed entity rate is the share of ground-truth entities that were not transcribed correctly, and you can be very specific about which entities those are. For voice agent use cases in particular, think credit card numbers, email addresses, phone numbers. If these things aren't captured in some of these voice agent workflows, the voice agent has basically failed the conversation, and the people building the voice agents don't get paid for it. So it's critical to see how well the speech-to-text component, the ears of the agent, is hearing what the user is saying. That metric is exactly the same between async and streaming; you just might want to focus on different things. For example, if you're running voice agent conversations and transcribing credit card numbers is by far the most important thing in your flow, you want a specific benchmarking process for credit card numbers, where you identify the numbers in both the ground truth and the AI-generated output and create scores based on that. You could do this as missed entity rate, missed entities over total entities, but honestly, if you have a large number of files to work with, it sometimes helps to just use a binary one or zero: did the model get it right in this scenario or not? You could run that through a hundred files, if you have them, and whichever model gets it correct in the majority of situations for, say, credit card numbers, is better. Because if you expand that out, say you're doing a million calls a day capturing credit card numbers, a difference of 10 percent could mean millions of dollars left on the table if you don't get it right. So the accuracy metrics are pretty much exactly the same between async and streaming. But something really important to note with streaming in general is that we've noticed a trend where a lot of customers don't really know how to evaluate streaming latency properly.
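
A minimal sketch of that binary scoring idea for credit card numbers follows; the digit normalization is an assumption, since spoken numbers often come back as words and need conversion to digits first.

```python
# Per-file binary scoring: did the transcript capture the target entity?
import re

def digits_only(text: str) -> str:
    return re.sub(r"\D", "", text)

def captured_card(truth_card: str, transcript: str) -> int:
    # 1 if the full card number survives anywhere in the transcript, else 0.
    return int(digits_only(truth_card) in digits_only(transcript))

files = [
    ("4242 4242 4242 4242", "my card is 4242 4242 4242 4242 thanks"),
    ("4242 4242 4242 4242", "my card is 4242 4242 4242 424 thanks"),
]
score = sum(captured_card(card, ts) for card, ts in files) / len(files)
print(f"capture rate: {score:.0%}")  # 50%
```
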
The time-to-first-byte, or time-to-first-token, metric is a very popular metric in the market, obviously owing to the popularity of large language models, and it's a really great way to understand how well LLMs are performing: you send off a prompt, and how quickly are you getting that first token back, because then you can begin to use it downstream. However, for streaming speech-to-text, time to first byte or time to first token is really not as valuable as some of the other metrics you can calculate with a streaming speech-to-text model. The two metrics we recommend our customers focus on instead are, one, emission latency. This is for use cases where you care a lot about partial transcription: for every word that you say, how quickly are you getting that word back in real time? That's what the emission latency metric is. A word was spoken; how quickly after it was spoken do I get it back? That's a much more impactful metric for your downstream use case, because it shows how quickly you can actually use that information in your pipeline. It's the analog of time to first token for large language models: you submit a prompt, you hit enter, and how quickly do you get information back? The same goes for audio data: if you submit audio data, how quickly are you getting usable information back? So emission latency is a much better metric for measuring the latency of a streaming speech-to-text model. And then, really, for voice agents, the time-to-complete-transcript metric is by far the most impactful in production. That's basically: after I'm done speaking, after I've completed my user turn talking to the voice agent and the turn is complete, how quickly do I get back all of the finalized text? Because that is really the piece that determines how quickly a voice agent can respond in production. So to sum up for streaming models in particular: the accuracy metrics are exactly the same between async and streaming, you just might want to focus on different things, but the latency metrics are very different. Yeah, and that's pretty much it.
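
Here's a sketch of both latency metrics computed from paired timestamps. The event shapes are assumptions: you need the time each word was spoken (from your test harness or forced alignment) and the time the API emitted it.

```python
# Emission latency per word, plus time-to-complete-transcript after end of turn.
from statistics import mean, quantiles

def emission_latency_ms(words: list[dict]) -> dict:
    # words: [{"spoken_at": 1.20, "emitted_at": 1.45}, ...] in seconds
    lats = [(w["emitted_at"] - w["spoken_at"]) * 1000 for w in words]
    return {"mean_ms": mean(lats),
            "p90_ms": quantiles(lats, n=10, method="inclusive")[-1]}

def time_to_complete_transcript_ms(turn_end: float, final_at: float) -> float:
    # How long after the speaker finished did the finalized turn text arrive?
    return (final_at - turn_end) * 1000

words = [{"spoken_at": 1.20, "emitted_at": 1.45},
         {"spoken_at": 1.60, "emitted_at": 1.78}]
print(emission_latency_ms(words))
print(time_to_complete_transcript_ms(turn_end=2.00, final_at=2.35))  # 350 ms
```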

[00:48:18] Speaker 4: I think, look, if you're really early on and you don't have a lot of users, there are great ways to get metrics like this, but simulating the use of your voice agent like you're a real customer is going to teach you a lot. We often have multiple instances of different voice agents in front of us that we're testing and comparing models with. So that's one way: actually do it yourself, simulate real customer scenarios, and see what happens. The other piece is that if you do have users, it's really easy to run an A/B test. Once you have some number of users, you can roll out changes to the agent, changes to the models, changes to the orchestration, as part of a rollout to your users, and actually see how some of your core metrics change. If you're really early, that's harder, but if you have actual scale, it's really easy to swap models in and out, make changes, and see the actual results on some sort of outcome metric. In the end, I don't think you really care which speech-to-text model has the best word error rate. You actually care: are users completing some goal more often? Are they making more bookings, having their support inquiries resolved, whatever that outcome is? Measuring the models end-to-end with a real A/B test, in my mind, is way more powerful than anything else you can do.

[00:49:38] Speaker 2: Yeah, I think, Ryan, to your point, it's much more outcome-based, right? If the user is getting interrupted, or there are a lot of barge-ins that aren't expected, that's a degradation of the experience. So measuring the performance of the voice agent is a little bit different.

[00:49:52] Speaker 4: I mean, the same goes even if you're using a post-call model, right? Similar story: if you have metrics that are your core business metrics, the model that drives that outcome is ultimately the one you should probably pick, rather than whichever model shows the best benchmarks or has the best vibes or whatever. It's really about what ultimately drives users to an outcome.

[00:50:15] Speaker 2: Yeah. And we have an example of emission latency. I think we should port that over into the SDK, along with time-to-final transcript, to make it easier for you all to build out evals against your voice agents. But that's coming.

[00:50:35] Speaker 1: Cool. Any more questions? An important piece of this whole discussion is that we are available. So if you have any questions that come up after the fact, or you're looking to better evaluate speech-to-text vendors, we're happy to share any of this information at any time. We're around, so let us know anytime if there's anything we can do to provide resources for your teams to better evaluate your pipelines.

[00:51:11] Speaker 2: Yeah, and I think it'd be great to hear your success stories with running evals and benchmarking, what's worked well for you all and what hasn't. Maybe there are ways that we can align better with what people are doing in the space. So feel free to let us know, reach out, and we can partner on these things together.

[00:51:33] Speaker 1: Cool. Well, anyway, great chatting with everybody today, and, you know, excited for people to correct their truth files. And let us know about any other tools; we're going to keep adding more to the Assembly AI dashboard to help our customers on their journey in speech-to-text. So if you have any good ideas for things to throw in the dashboard, or any fun GitHub projects related to speech-to-text, feel free to let us know. Have a great day, everybody. Bye.

AI Insights
Summary
AssemblyAI’s Applied AI team (Zach and David) presents a workshop arguing that traditional ASR benchmarking with Word Error Rate (WER) is increasingly unreliable as models improve. They explain how WER counts substitutions, deletions, and insertions equally, penalizing harmless formatting/semantic variants (e.g., “OK” vs “okay”) the same as critical errors (names, hallucinations). With the launch of Universal-3 Pro, some customers observed worse WER versus older models; AssemblyAI investigated and found many “insertions” were actually words humans missed in the ground-truth transcripts—meaning the truth files were wrong. To address this, they launched a Dashboard “Truth File Corrector” tool that aligns ground truth vs U3 Pro output, highlights differences, lets users listen to audio at each discrepancy, and mark whether ASR or ground truth is correct, producing an updated truth file for reuse in existing benchmarking pipelines.

They introduce an open-source Python benchmarking SDK (in AssemblyAI’s solutions repo) supporting WER, CPWER, DER (diarization), plus newer metrics: Semantic WER via semantic word lists (find/replace normalization for domain variants) and Missed Entity Rate (MER) focusing on key entities (drug names, credit cards, emails) that matter for downstream outcomes. They demo MER showing U3 Pro (including Medical mode) capturing a drug name that Whisper misses, and show using an LLM (via AssemblyAI’s LLM Gateway, e.g., Claude) as a judge to compare transcripts and surface meaningful differences beyond WER. They recommend A/B testing models in production and outcome-based evaluation (tickets, satisfaction, task completion) over offline WER.

For streaming/voice agents, they note accuracy metrics are similar but latency evaluation differs: time-to-first-token is less useful than emission latency (word spoken → word emitted) and time-to-final transcript after end-of-turn. They discuss handling accents with robust training plus key terms/prompting and optionally tagging unclear audio as “inaudible” to avoid forced guesses. Q&A covers when semantic normalization can mislead, using eval outputs to refine key terms (mixed results without human-in-loop), and practical approaches to evaluate agent performance end-to-end.
Title
Fixing Broken ASR Benchmarks: Beyond Word Error Rate
Keywords
ASR benchmarking, word error rate, WER, ground truth, truth files, Universal-3 Pro, AssemblyAI, Truth File Corrector, semantic WER, semantic word list, missed entity rate, MER, LLM as a judge, LLM Gateway, Claude, Whisper, medical mode, hallucinations, insertions, deletions, substitutions, A/B testing, production evaluation, streaming ASR, emission latency, time to final transcript, voice agents, key terms prompting, accented speech
Key Takeaways
  • WER can misrepresent model quality because it weights trivial formatting differences the same as critical mistakes and depends heavily on correct ground truth.
  • As ASR models improve, human-produced truth files can become the limiting factor; better models may score worse if reference transcripts miss words.
  • Use a truth-file correction workflow: compare model vs reference, listen to disputed segments, and update ground truth to improve future eval reliability.
  • Augment or replace WER with metrics aligned to business risk: semantic normalization for equivalent variants and Missed Entity Rate to prioritize crucial entities (names, drugs, numbers).
  • Use LLMs as evaluators to summarize meaningful transcript differences and semantic impact, especially at scale.
  • Prefer production A/B tests and outcome metrics (task completion, customer satisfaction, ticket volume) over offline benchmarks alone.
  • For streaming/voice agents, evaluate latency with emission latency and end-of-turn finalization time rather than only time-to-first-token.
  • Accents and poor audio can be mitigated with prompting/key terms; consider configuring the model to mark unclear audio instead of guessing.
Sentiments
Neutral: The tone is practical and technical, focused on explaining benchmarking pitfalls and demonstrating tools/metrics. It contains mild enthusiasm about new capabilities but no strong emotional language.