[00:00:00] Speaker 1: Universal 3.5 Pro is our latest and most powerful model to date. It has state-of-the-art accuracy across 19 languages with native code switching built in. It's a fully promptable interface that allows you to give context to the model. Let's dive into the Realtime API so we can see this in action.
The model was trained to use context about the audio (its domain, topic, or scenario) to better recognize the vocabulary that context makes likely. So we can provide a prompt that will help steer the model towards the right entities, the right product names, the right terms that would likely come up in that conversation. Let's start with this prompt here: "Transcribe this cardiology consultation call." So I'm going to start the session now. I'm going to simulate being a doctor and a patient. "I'm Dr. Suarzy Llewellyn, your cardiologist, and your echocardiogram showed an ejection fraction of 35%. Your EKG picked up atrial fibrillation and a rapid ventricular response. We'll start metoprolol succinate, 50 milligrams once daily, and switch you from warfarin to apixaban, brand name Eliquis, so you can stop checking your INR." So you can see here that the model accurately transcribed the different medical terms, cardiologist, echocardiogram, properly capitalizing EKG and INR, and it even picked up the brand name Eliquis and apixaban.
Let's try a different prompt. Think of an order status check-in call, where we need to mention our order ID and the name of the product we're going to buy. We can use this prompt, "Transcribe this order status check-in call," and once we start this session, I'm going to say that I have an order for the Bubble Gun 3000. So you can see here, because the model knows that this is an order status check-in, it takes that Bubble Gun 3000 and capitalizes it like it's a product name. Now if I say the order status is AB underscore 703, you can see the model properly capitalizes it and applies the right formatting, so that an agent will be able to properly interpret this order ID. For real-time transcription, we know that voice agents really need the right kind of formatting on the right entities at the right time.
Doubling down on the prompting and contextual information that we can provide the model, we've also introduced a new feature called Conversation Context. Conversation Context is a really great feature for voice agents because it allows you to tie the TTS agent responses, the spoken agent text, directly back to the model. So out of the box, we're going to pass back the three previous turns from the STT, that is, what we've transcribed, and allow you to dynamically update the model configuration with the agent context. Let's take a look at that in action. So let's say I'm calling a voice agent to book a reservation or go through some sort of call center with some sort of menu. The agent might say, "Select an option out of the three: options A, B, or C." So when I say "C," as if I'm responding to one of these options, the model is going to infer by default that I'm saying "sí," as in Spanish. It's a multilingual model out of the box, and therefore it is likely going to predict that that's what I'm trying to say. Now if I feed in the agent context, "options A, B, or C," we can see the model now accurately predicts the letter C. Phonetically, this sounds the same, but for a voice agent, this is kind of a make-or-break situation. The agent would get caught in a loop asking me to repeat myself, hoping to get to the right transcription. Being able to steer the model correctly with agent context gives the model the correct transcription from the start. Now the model won't get confused, and the agent can continue moving on in its sequence.
Let's dive into the multilingual and code switching capabilities of the model. So like I had said before, the model can support 19 languages and code switches out of the box. Let's call my friend Abhi, who can speak Hindi and English. Hey Abhi, thanks for picking up. I'm demoing Universal 3.5 Pro now, and we want to showcase Hindi and English. You up for it?
[00:04:15] Speaker 2: Yeah, let's do it.
[00:04:16] Speaker 1: Awesome. How are you doing today?
[00:04:20] Speaker 2: I'm great. How are you?
[00:04:24] Speaker 1: I'm doing great. Coming off the Knicks win over the weekend, did you catch the game?
[00:04:29] Speaker 2: I did. I hope we can run it back next year.
[00:04:36] Speaker 1: Yeah, I hope so. What a team, what a run. It was awesome. And we can see the transcription here is coming through perfectly. It's bouncing between the Hindi you're saying and the English words in the same turns, and the model is code switching out of the box. All we're using is a prompt that says, "Transcribe this recording with multilingual speech," and the model will know what language to pick up. Thanks Abhi, really appreciate it.
[00:05:02] Speaker 2: Yeah man, I can't wait to try it.
[00:05:05] Speaker 1: All right, so we just saw Universal 3.5 Pro natively code switch between English and Hindi, and a little bit of Hinglish. Let's bring in my colleague Nebo, who is a native Hebrew speaker, and we're going to code switch between English and Hebrew to see how the transcription plays out. Hey, how's it going? Where are you coming from in the city today?
[00:05:26] Speaker 3: Awesome. Did you watch the Knicks game over the weekend?
[00:05:36] Speaker 1: Incredible game, incredible team, incredible run, 15 points down every game. It was amazing to watch, amazing for New York City, and I know that this Thursday the parade is going to be absolutely crazy.
[00:05:57] Speaker 3: You're going to go to the parade? I'll try. Awesome.
[00:06:01] Speaker 1: Let's take a look at the transcription and see how it qualifies. So you can see here there's some native code switching involved, you can see areas where I'm talking. How's the Hebrew quality?
[00:06:11] Speaker 4: So this is perfect. You can see that it got, like, Mitsui, and it got it perfectly. And where I'm from, everything is just, like, right in order. It's perfect.
[00:06:22] Speaker 1: Amazing. So with a custom prompt, "Transcribe in English and Hebrew," we're telling the model to contextually steer transcription towards those languages. So while we support 19 languages out of the box, we're going to be able to use language steering to ensure that transcription is accurately going to the languages that you need for your use case. Awesome. Thank you. Thank you.
Another feature we want to highlight with Universal 3.5 Pro Realtime is Voice Focus. Voice Focus allows you to isolate the primary speaker and suppress background ambient noise that might be impacting the accuracy of your transcription. This model is already our most accurate to date, and we want to allow you to get last-mile gains by tailoring this to your audio environments. Let's take a look in the playground here. Under Voice Focus, we can see there are a couple of new options. So you have Voice Focus near field, which helps with headsets, handsets, and other close-talking microphones. And you have far field. So think conference rooms, drive-thru speakers, laptop mics, other distant capture setups. Along with the two values for near field and far field, we also have a threshold parameter. And this allows you to control how aggressively background audio is suppressed. Higher values are more aggressive. So let's try an example now. I'm going to play a YouTube video here, 10 full hours of people talking. So we can see that even with this audio, it can still properly pick up the transcription as I'm talking in real time. It's suppressing this background noise and latching on to my voice. This ensures that when people are talking in the background, whether it's background noise or people behind me in the office, it will only pick up my voice because I'm what's closest to the microphone. This is great for drive-thru orders. Think about it: it's really bad quality audio, there are people talking in the car. There might be people talking on the phone. A lot of different things are happening in that environment. And using something like Voice Focus will ensure that you're going to get the right audio, so the right transcription is passed on to your voice agents or your downstream processes.
All right, so that wraps up our demo for Universal 3.5 Pro Realtime. We're extremely excited to see what you build with the new model. We can't wait to see the new products and services that emerge in the market. Thanks.
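As a rough illustration of the Conversation Context idea described in the demo (keeping the three most recent transcribed turns alongside the latest agent text), here is a minimal Python sketch. Everything in it is hypothetical: the class `ConversationWindow`, its method names, and the dictionary keys are made up for illustration and are not the Realtime API's actual parameters.

```python
from collections import deque


class ConversationWindow:
    """Hypothetical helper: tracks recent turns plus the latest agent text.

    Names and keys here are illustrative assumptions, not a real API.
    """

    def __init__(self, prompt, max_turns=3):
        self.prompt = prompt
        self.agent_context = ""
        # A deque with maxlen drops the oldest turn once it is full,
        # so only the most recent `max_turns` turns are kept.
        self.turns = deque(maxlen=max_turns)

    def add_transcribed_turn(self, text):
        """Record a finished speech-to-text turn."""
        self.turns.append(text)

    def set_agent_context(self, text):
        """Record what the voice agent last said to the caller."""
        self.agent_context = text

    def snapshot(self):
        """Return the context that would accompany the next turn."""
        return {
            "prompt": self.prompt,
            "agent_context": self.agent_context,
            "previous_turns": list(self.turns),
        }


window = ConversationWindow("Transcribe this order status check-in call.")
for turn in ["Hi there", "I have an order", "Bubble Gun 3000", "AB_703"]:
    window.add_transcribed_turn(turn)
window.set_agent_context("Select an option: A, B, or C.")
context = window.snapshot()
```

After four turns, `context["previous_turns"]` holds only the last three, and the agent text travels with them, which is the kind of signal that lets a spoken "C" be read as a menu letter rather than "sí".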