Universal 3.5 Pro Realtime boosts STT with context (Full Transcript)

Demo shows prompt steering, Conversation Context for voice agents, 19-language code-switching, and Voice Focus noise suppression for real-time transcription.

[00:00:00] Speaker 1: Universal 3.5 Pro is our latest and most powerful model to date. It has state-of-the-art accuracy across 19 languages with native code switching built-in. It's a fully promptable interface that allows you to give context to the model. Let's dive into the Realtime API so we can see this in action. The model was trained to use context about the audio, its domain, topic, or scenario, to better recognize the vocabulary that context makes likely. So we can provide a prompt that will help steer the model towards the right entities, the right product names, the right terms that would likely come up in that conversation. Let's start with this prompt here: "Transcribe this cardiology consultation call." So I'm going to start the session now. I'm going to simulate being a doctor and a patient. I'm Dr. Suarzy Llewellyn, your cardiologist, and your echocardiogram showed an ejection fraction of 35%. Your EKG picked up atrial fibrillation and a rapid ventricular response. We'll start metoprolol succinate, 50 milligrams once daily, and switch you from warfarin to apixaban, brand name Eliquis, so you can stop checking your INR. So you can see here that the model was accurately transcribing the different medical terms, cardiologist, echocardiogram, properly capitalizing EKG and INR, and it even picked up the brand name Eliquis and apixaban.
Let's try a different prompt. Think of an order status check-in call, where we need to mention our order ID and the product name we're going to buy. We can use this prompt, "Transcribe this order status check-in call," and once we start this session, I'm going to say that I have an order for the Bubble Gun 3000. So you can see here, because the model knows that this is an order status check-in, it takes that Bubble Gun 3000 and capitalizes it like it's a product name.
Now if I say the order status is AB underscore 703, you can see the model properly capitalizes it, applies the right formatting, so that an agent will be able to properly interpret this order ID. For real-time transcription, we know that voice agents really need the right kind of formatting on the right entities at the right time.
Doubling down on the prompting and contextual information that we can provide the model, we've also introduced a new feature called Conversation Context. Conversation Context is a really great feature for voice agents because it allows you to tie the TTS agent responses, the spoken agent text, directly back to the model. So out of the box, we're going to pass back the three previous turns from the STT, so what we've transcribed, and allow you to dynamically update the model configuration with the agent context. Let's take a look at that in action. So let's say I'm calling a voice agent to book a reservation or go through some sort of call center with some sort of menu. The agent might say, select an option out of the three. So options A, B, or C. So when I say C, as if I'm responding to one of these options, the model is going to infer that by default I'm saying "sí," like in Spanish. It's a multilingual model out of the box, and therefore is likely going to predict that that's what I'm trying to say. Now if I feed in the agent context "options A, B, or C," we can see the model now accurately predicts the letter C. Phonetically, this sounds the same, but for a voice agent, this is kind of a make or break situation. The agent would get caught in a loop asking me to repeat myself, hopefully getting to the right transcription. Being able to steer the model correctly with agent context gives the model the correct transcription from the start. Now the model won't get confused, and the agent can continue moving on in its sequence.
Let's dive into the multilingual and code switching capabilities of the model.
So like I had said before, the model can support 19 languages and code switches out of the box. Let's call my friend Abhi who can speak Hindi and English. Hey Abhi, thanks for picking up. I'm demoing Universal 3.5 Pro now, and we want to showcase Hindi and English. You up for it?

[00:04:15] Speaker 2: Yeah, let's do it.

[00:04:16] Speaker 1: Awesome. How are you doing today?

[00:04:20] Speaker 2: I'm great. How are you?

[00:04:24] Speaker 1: I'm doing great. Coming off the Knicks win over the weekend, did you catch the game?

[00:04:29] Speaker 2: I did. I hope we can run it back next year.

[00:04:36] Speaker 1: Yeah, I hope so. What a team, what a run. It was awesome. And we can see the transcription here is coming through perfectly. It's bouncing between the Hindi you're saying and the English words in the same turns, and the model is code switching out of the box. All we're using is a prompt that says, "Transcribe this recording with multilingual speech," and the model will know what language to pick up. Thanks Abhi, really appreciate it.

[00:05:02] Speaker 2: Yeah man, I can't wait to try it.

[00:05:05] Speaker 1: All right, so we just saw Universal 3.5 Pro natively code switch between English and Hindi, and a little bit of Hinglish. Let's bring in my colleague Nebo, who can speak native Hebrew, and we're going to code switch between English and Hebrew to see how the transcription plays out. Hey, how's it going? Where are you coming from in the city today?

[00:05:26] Speaker 3: Awesome. Did you watch the Knicks game over the weekend?

[00:05:36] Speaker 1: Incredible game, incredible team, incredible run, 15 points down every game. It was amazing to watch, amazing for New York City, and I know that this Thursday the parade is going to be absolutely crazy.

[00:05:57] Speaker 3: You're going to go to the parade? I'll try. Awesome.

[00:06:01] Speaker 1: Let's take a look at the transcription and see how it qualifies. So you can see here there's some native code switching involved, you can see areas where I'm talking. How's the Hebrew quality?

[00:06:11] Speaker 4: So this is perfect. You can see that it got, like, Mitsui, and it got it perfectly. And where I'm from, everything is just, like, right in order. It's perfect.

[00:06:22] Speaker 1: Amazing. So with a custom prompt, "Transcribe in English and Hebrew," we're telling the model to contextually steer transcription towards those languages. So while we support 19 languages out of the box, we're going to be able to use language steering to ensure that transcription is accurately going to the languages that you need for your use case. Awesome. Thank you. Thank you.
Another feature we want to highlight with Universal 3.5 Pro Realtime is Voice Focus. Voice Focus allows you to isolate the primary speaker and suppress background ambient noise that might be impacting accuracy of your transcription. This model is already our most accurate to date, and we want to allow you to get last mile gains by tailoring this to your audio environments. Let's take a look in the playground here. Under Voice Focus, we can see there's a couple new options. So you have Voice Focus near field, which helps with headsets, handsets, and other close-talking microphones. And you have far field. So think conference rooms, drive-thru speakers, laptop mics, other distant capture setups. Along with the two values for near field and far field, we also have a threshold parameter. And this allows you to control how aggressively background audio is suppressed. The higher values are more aggressive. So let's try an example now. I'm going to play a YouTube video here, 10 full hours of people talking. So we can see that even with this audio, it can still properly pick up the transcription as I'm talking in real time. It's suppressing this background noise and latching on to my voice. This ensures that when people are talking in the background, it could be background noise, people behind me in the office, it will only pick up my voice because I'm what's closest to the microphone. This is great for drive-thru orders. When you think of it, it's really bad quality audio, there's people talking in the car. There might be people talking on the phone.
A lot of different things are happening in that environment. And using something like Voice Focus will ensure that you're going to get the right audio to make sure your voice agents or your downstream processes are properly transmitted. All right, so that wraps up our demo for Universal 3.5 Pro Realtime. We're extremely excited to see what you build with the new model. We can't wait to see the new products and services that emerge in the market. Thanks.
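
The Voice Focus choices described in the demo (near field or far field, plus a threshold) could be captured in a small settings object like the sketch below. This is only an illustration: the demo shows these options in a playground and never shows the API's field names, so `VoiceFocusSettings`, `voice_focus_for`, the `near_field`/`far_field` strings, and the example threshold value are all assumptions, not the product's actual parameters.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical labels. The demo names the two modes "near field" and
# "far field" but does not show how the API spells them.
NEAR_FIELD = "near_field"
FAR_FIELD = "far_field"

# Capture setups mentioned in the demo, mapped to the mode it suggests.
CAPTURE_MODES = {
    "headset": NEAR_FIELD,
    "handset": NEAR_FIELD,
    "conference_room": FAR_FIELD,
    "drive_thru_speaker": FAR_FIELD,
    "laptop_mic": FAR_FIELD,
}


@dataclass(frozen=True)
class VoiceFocusSettings:
    mode: str
    # Per the demo, higher values suppress background audio more
    # aggressively. The valid range is not stated, so none is enforced.
    threshold: float


def voice_focus_for(capture: str, threshold: float) -> VoiceFocusSettings:
    """Pick a mode for a listed capture setup; reject anything unlisted."""
    try:
        mode = CAPTURE_MODES[capture]
    except KeyError:
        raise ValueError(f"unknown capture setup: {capture!r}") from None
    return VoiceFocusSettings(mode=mode, threshold=threshold)


if __name__ == "__main__":
    # 0.7 is an arbitrary illustrative value, not a recommended setting.
    settings = voice_focus_for("drive_thru_speaker", threshold=0.7)
    print(json.dumps(asdict(settings)))
```

Keeping the mapping in one table makes the near-field versus far-field decision explicit and easy to adjust once the real parameter names are known.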

Summary

The transcript is a product demo of Universal 3.5 Pro Realtime, highlighting state-of-the-art speech-to-text accuracy across 19 languages with native code-switching, prompt-based contextual steering, and new features for voice-agent scenarios. The speaker shows how prompts improve recognition of domain terms (e.g., cardiology vocabulary, drug names, acronyms like EKG/INR) and structured entities (product names and order IDs). A key addition, Conversation Context, feeds prior STT turns and agent TTS responses back into the model to disambiguate short utterances (e.g., recognizing the letter “C” instead of Spanish “sí”). The demo also showcases multilingual transcription with English/Hindi (Hinglish) and English/Hebrew, and introduces Voice Focus (near-field/far-field plus threshold) to isolate the primary speaker and suppress ambient noise for challenging environments like offices, conference rooms, or drive-thrus.
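
As a rough sketch of the Conversation Context idea, a client could keep a rolling window of the last three transcribed turns together with the agent's most recent spoken text. The three-turn window comes from the demo; the class name, method names, and payload keys below are hypothetical, since the demo does not show how this context is actually sent to the model.

```python
from collections import deque
from typing import Deque, Dict, Optional


class ConversationContext:
    """Rolling window of recent transcribed turns plus the agent's last line."""

    def __init__(self, max_turns: int = 3) -> None:
        # A deque with maxlen drops the oldest turn once the window is full.
        self._turns: Deque[str] = deque(maxlen=max_turns)
        self._agent_text: Optional[str] = None

    def add_transcribed_turn(self, text: str) -> None:
        self._turns.append(text)

    def set_agent_text(self, text: str) -> None:
        # Text the agent just spoke via TTS, e.g. a menu prompt.
        self._agent_text = text

    def payload(self) -> Dict[str, object]:
        # Hypothetical payload shape; the key names are placeholders.
        return {
            "previous_turns": list(self._turns),
            "agent_context": self._agent_text,
        }


if __name__ == "__main__":
    ctx = ConversationContext()
    ctx.add_transcribed_turn("I'd like to book a reservation.")
    ctx.set_agent_text("Select an option: A, B, or C.")
    print(ctx.payload())
```

With the agent's menu text in the context, a short reply such as "C" has something to be disambiguated against, which is the situation the demo walks through.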

Title

Universal 3.5 Pro Realtime demo: prompts, context, code-switching

Keywords

Universal 3.5 Pro
Realtime API
speech-to-text
STT
prompting
contextual steering
Conversation Context
voice agents
multilingual transcription
code-switching
Hindi
Hebrew
Hinglish
cardiology transcription
entity formatting
order ID
Voice Focus
noise suppression
near-field
far-field
threshold

Key Takeaways

Universal 3.5 Pro Realtime supports high-accuracy transcription across 19 languages with native code-switching.
Prompting with domain/scenario context improves recognition of specialized vocabulary, entities, and formatting (medical terms, acronyms, product names, order IDs).
Conversation Context lets voice-agent responses (TTS) and prior turns guide transcription, resolving ambiguous short inputs like menu selections (e.g., “C”).
Language steering via prompts can bias transcription toward desired languages even though multilingual support is built-in.
Voice Focus (near-field/far-field plus adjustable threshold) isolates the primary speaker and suppresses background noise, improving performance in noisy settings such as drive-thrus and offices.
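
The prompts used in the demo follow two simple patterns, one naming the scenario and one naming the languages. The helpers below are a minimal sketch that reproduces those wordings; `scenario_prompt` and `language_prompt` are hypothetical names, and nothing in the demo says the model requires this exact phrasing.

```python
from typing import Sequence


def scenario_prompt(scenario: str) -> str:
    """Build a prompt like 'Transcribe this cardiology consultation call.'"""
    cleaned = scenario.strip().rstrip(".")
    if not cleaned:
        raise ValueError("scenario must not be empty")
    return f"Transcribe this {cleaned}."


def language_prompt(languages: Sequence[str]) -> str:
    """Build a prompt like 'Transcribe in English and Hebrew.'"""
    names = [name.strip() for name in languages if name.strip()]
    if not names:
        raise ValueError("at least one language is required")
    if len(names) == 1:
        joined = names[0]
    elif len(names) == 2:
        joined = " and ".join(names)
    else:
        joined = ", ".join(names[:-1]) + ", and " + names[-1]
    return f"Transcribe in {joined}."


if __name__ == "__main__":
    print(scenario_prompt("order status check-in call"))
    print(language_prompt(["English", "Hebrew"]))
```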

Sentiments

Positive: The tone is enthusiastic and confident, emphasizing improved accuracy, useful new features for voice agents, and excitement about what developers will build.
