Universal 3.5 Pro boosts STT with richer context (Full Transcript)

How contextual prompting, dynamic mid-call updates, and agent conversation context improve transcription accuracy—especially in noisy voice-agent calls.

[00:00:00] Speaker 1: Hi, team. I want to quickly walk you through the amazing contextual awareness of our newest model, Universal 3.5 Pro. First, I want to showcase how amazing prompting has become with this new model. We've significantly improved the contextual awareness of this model. Then, I want to showcase a new feature that we're calling conversation context, or agent context. Essentially, we retain the previous speech-to-text transcriptions within the agent session, and allow users to also pass spoken messages, like a voice agent's LLM-generated TTS responses, to give the model current conversational context so it can more accurately predict what was spoken.

First, let's dive into prompting. There are two amazing things I want to highlight about the improvements we made to prompting. First, by passing information about the audio content, we can actually improve the overall accuracy of the model within a domain. For example, passing "medical consultation call" in the prompt improves accuracy in the medical domain. The more specific you get in the prompt, the better the accuracy will be in that category. For example, this prompt says, "cardiology consultation about chest pain symptoms," and finally, the most detailed one on the bottom says, "cardiology consultation between Dr. Smith and elderly patient regarding chest pain, ECG results, and medication adjustment for hypertension." The more contextual information you pass in the prompt, the more information the model has to make a more accurate prediction on what's actually being spoken in the audio.

In addition to this contextual prompting, prompting now also allows you to apply context on top of your key terms. Historically, key terms prompting has allowed users to pass key terms that will be in the audio so the accuracy of those terms can be boosted, and our key term prompting works extremely well. However, key terms are devoid of context. I'm going to demo what I mean in the playground.
So, take my last name, for instance, Klebanoff. If I add this as a key term, it will get accurately transcribed in the audio. "My name is Zachary Klebanoff." So, as you can see, it was able to transcribe that correctly. However, the model doesn't know if this is a person's name, a product name, a company name, etc. So, it just knows that this terminology will be present in the audio, but it doesn't have the contextual information to apply the key term correctly. So, take a sentence like, "take the club and cough." That's acoustically similar to my last name, Klebanoff, right? "Take the club and cough." So, you can see that the key term here was applied in the incorrect context.

So, what's really amazing about our contextual prompting now is that you can actually pass the contextual information of what the key term is to the model, right? So, let's say I pass this information: "The user's name is Zachary Klebanoff." So, again, it's going to transcribe my name. "My name is Zachary Klebanoff." However, look what happens when I use the example from before. "Take the club and cough." You see that it doesn't blanket-apply the terminology despite acoustic similarity because it has the contextual information of what's actually being spoken in the audio.

Another really cool thing that you can do with prompting now is you can actually dynamically, mid-call, change the prompt. So, this capability isn't currently in our playground, but it is available through our API. So, for example, let's say that this is a voice agent. The speech-to-text use case is a voice agent for transcribing bike shop customer service calls. So, it's transcribing bike shop customer service calls, and let's say that, mid-call, Lance Armstrong calls, and he says, "Hey, you know, my bike tire is popped. What can we do about this?" So, given this information, you can actually make a tool call, mid-call, to adjust the prompt to something like, "The caller's name is Lance Armstrong and his bike tire is popped."
You can dynamically update this midstream, and this will supply the model with even more information to more accurately transcribe and understand what is being spoken in real time.

Okay. So, those are all the updates I wanted to talk about specifically with prompting. However, we also have a new feature that's called conversation context. So, what this feature does is it improves the accuracy of the model by providing the model with situational context from the current conversation. So, for example, if the previous transcription is "what is your email," the model is now expecting the following transcription to be an email. If the previous transcription is "would you like to make that a large," the model is now expecting a Boolean yes or no response. It's important to note that the model isn't over-biased on this context. In the case where the previous transcription is "what's your email," if the next thing spoken isn't an email, the model won't try to make it an email. Essentially, the model is just nudged toward expecting an email the same way a human would be. If I ask you what is your email, I'd expect you to respond with your email, but if you respond with something completely different, I'll be able to understand that as well.

The memory of the speech-to-text transcripts is retained within the model session automatically, but you can also pass agent turn text directly to the model. This is important for voice agent use cases because you can pass the LLM-generated responses of a voice agent directly to the model.

Now, I just want to show you a few statistics that we've done on this. We saw a significant reduction in word error rate on these voice agent data sets, and we're going to be continuing to train the model to be even better with agent context, so you can expect these stats to improve even further.

Now, let me demo it in a situation with bad audio conditions. Here's a little playground I put together.
The first demo I'm going to show with this is a food ordering use case. I'm ordering food at the Krusty Krab, and basically everything that gets passed here as agent context, you would expect a voice agent to be speaking within the call, right? What's really amazing about this agent context feature is that by passing the agent context to the model, even if the audio conditions are pretty poor, and I'm going to try my best to replicate some pretty terrible audio conditions, the model is still very good at understanding what's occurring in the call based on the situational context of the conversation.

You can say this is the prompt. Now, actually before I do this session, I'm actually going to move these key terms over to the prompt. Let's say menu items include body and here we go. Now, the voice agent, this thing is sort of like responding something like, "For that kelp shake, what size would you like?" Then maybe the voice agent responds something like, "Will that be all?" So you can see, silly example, but the truth is that with poor audio conditions, by passing this agent context directly to the model, the model does typically do a much better job of transcribing in these really difficult conditions.

Now, let me try a more intensive configuration. So let's try something like, "You are an audio transcriptionist for Serocorp." Say the first thing that the agent responds with is, "New desk at Serocorp. This is Serobot. Who am I speaking with?" Okay, now let's give him a shot. "Hey, it's Dara Okafor from the Northwind account. Nice to meet you, Serobot." So the bot now responds with something like, "Hi, Dara, what's the account ID on file?" And maybe we want to do a dynamic update to the prompt to include something like, "The caller's name is Dara Okafor." So let's say that we sent that agent context over. "Yeah, it's A as in alpha 774-K29." Now let's send a voice agent that says something like, "What's your email address?"
Actually, before that, let's imagine a tool call is going on in the background that looks up this account. And then when the account is looked up, the model discovers they recently purchased a product known as Seroflow Halo with Vantix. So now the model is aware of the products that the user previously purchased and therefore can expect them to potentially come up in this conversation. So let's update the configuration for that. And then let's say, yes, that the next thing spoken in the conversation was, "What's your email address?" "Yeah, my email address is dara.okafor at northwind.com." Let's say the email address was transcribed successfully. And let's say the bot then asks, "Perfect, which product are you asking about today?" Now it's important to note that within your voice agent setup, obviously this would all be automated. "Yeah, I'm calling about the Seroflow Halo tier and the Vantix add-on." And because we pulled this up, obviously we're doing well on the transcription for that because we did that tool call mid-call. And then let's say that to finish the call, the bot says something like, "Want me to renew your plan for that product?" And then maybe the user says something like, "Just Seroflow Halo, hold on the Vantix one."

Okay, well, that was a quick little demo of the context carryover. We're really excited about the possibility of combining this contextual prompting and context carryover within voice agent use cases. Please reach out if you have any questions or are interested in these products. Really excited to see what you build.
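The Serocorp demo flow above can be sketched as a small piece of client-side bookkeeping. This is only an illustration under stated assumptions: the transcript does not show the API, so the `SessionContext` class, its method names, and the `agent_turn` / `update_prompt` message shapes are all hypothetical, and treating a mid-call update as appending to the existing prompt (rather than replacing it) is also an assumption.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionContext:
    """Hypothetical client-side record of the context sent during one call."""

    prompt: str
    agent_turns: list[str] = field(default_factory=list)

    def add_agent_turn(self, text: str) -> dict:
        # Record what the voice agent said (its LLM-generated TTS text).
        self.agent_turns.append(text)
        # Hypothetical message shape, not a documented payload.
        return {"type": "agent_turn", "text": text}

    def update_prompt(self, extra: str) -> dict:
        # Assumption: new facts are appended to the existing prompt.
        self.prompt = f"{self.prompt} {extra}".strip()
        # Hypothetical message shape, not a documented payload.
        return {"type": "update_prompt", "prompt": self.prompt}


ctx = SessionContext(prompt="You are an audio transcriptionist for Serocorp.")
messages = [
    ctx.add_agent_turn("This is Serobot. Who am I speaking with?"),
    ctx.update_prompt("The caller's name is Dara Okafor."),
    ctx.add_agent_turn("Hi, Dara, what's the account ID on file?"),
    # Result of the background account lookup (tool call) from the demo.
    ctx.update_prompt("The caller recently purchased Seroflow Halo with Vantix."),
    ctx.add_agent_turn("What's your email address?"),
]

for message in messages:
    print(json.dumps(message))
```

In a real voice agent these messages would be produced automatically by the agent loop and sent over whatever transport the API actually uses; the sketch only shows the order in which context accumulates during the call.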

AI Insights
Summary
Speaker introduces Universal 3.5 Pro’s improved contextual awareness for speech-to-text. Two prompting upgrades are highlighted: (1) richer domain/context prompts (e.g., specific medical scenario) improve accuracy; (2) key terms can now be given contextual meaning to avoid misapplication (e.g., distinguishing a surname “Klebanoff” from acoustically similar phrases). Prompts can also be updated dynamically mid-call via API (e.g., adding caller name and issue). A new “conversation/agent context” feature retains prior STT turns in-session and allows passing agent TTS/LLM responses as context so the model can better anticipate likely next utterances (email after “what’s your email?”, yes/no after “make that a large”), without over-biasing when the user says something else. Demos show improved transcription under poor audio for food ordering and enterprise support scenarios, including using tool-call results (account lookup, purchased products) to update context and improve recognition of entities like product names and email addresses. Word error rate reductions are mentioned, with ongoing training planned.
Title
Universal 3.5 Pro: Contextual Prompting and Agent Context for Better STT
Keywords
Universal 3.5 Pro
speech-to-text
contextual awareness
prompting
domain prompting
key terms
contextual key terms
dynamic prompt updates
API
conversation context
agent context
voice agents
TTS context
session memory
word error rate
bad audio
tool calls
entity recognition
medical transcription
customer service transcription
Key Takeaways
  • More specific domain/context prompts can materially improve STT accuracy within that domain.
  • Key terms alone can be over-applied; adding semantic context for key terms reduces incorrect substitutions.
  • Prompts can be updated mid-call via API to incorporate new facts (caller identity, issue) and improve real-time transcription.
  • Conversation/agent context retains previous turns and can ingest agent responses to nudge expectations (emails, yes/no) without forcing them.
  • Passing tool-call outputs (e.g., account/product history) into context improves recognition of entities like product names.
  • Agent context can boost transcription robustness under poor audio conditions, reducing word error rate in voice-agent datasets.
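For reference, word error rate is conventionally the word-level edit distance between a reference transcript and the model's output (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal sketch, assuming simple lowercase whitespace tokenization; the talk does not say how its figures were normalized or computed, and the example strings are illustrative rather than measured output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings based on the key-term example in the talk:
# one substitution plus two insertions over five reference words.
print(word_error_rate(
    "my name is zachary klebanoff",
    "my name is zachary club and cough",
))  # 0.6
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.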
Sentiments
Positive: The tone is enthusiastic and promotional, emphasizing “amazing” improvements, significant WER reductions, and excitement about what users can build, with demos showcasing benefits.