[00:00:02] Speaker 1: Hello, hello, everybody. Thank you all for joining. I think everyone is actually entering from the waiting room now. We're going to give everybody a minute or two here to join and get set up. While we're doing that, just some quick introductions. My name is Ryan. I head up all the customer-facing teams here at Assembly AI. I'm joined by Zach and Griffin from our Applied AI Engineering team. We're excited to go over all things Universal 3 Pro prompting with you all. Some quick housekeeping. This is a Zoom webinar, so you all should be able to submit questions while we go. Please send those in. We will try to answer those in text. One of us will be talking, the other two won't, so we'll tag team some of those responses here. Send them through and we will answer them. Hi, nice to see you again, too. As those come in, since we're live, we can also answer some of them live and show a quick demo. Otherwise, of course, we'll answer them in text. If there are any at the end that we want to go back to, hopefully we're going to leave some time for that. Our goal is to leave at least 15 minutes for Q&A. This session is being recorded. We will send it out to all of the participants afterwards. Zach, Griffin, grade me. How'd I do? What'd we miss?
[00:01:29] Speaker 2: Great work, Ryan. We've all been prompting absolutely non-stop with this model and we're really excited to show it to all of you. I'm Zach, by the way. Nice to meet you all. So far, so good. I'm Griffin.
[00:01:43] Speaker 1: Cool. I think we've got about 50 people. Instead of just having you wait here awkwardly, let's get started. People can join late and they can catch up as we go. So for those of you who are unfamiliar with our new model, Universal 3 Pro, it's a promptable speech-to-text model, which means for the first time, next to your speech-to-text request, you can also include a natural language prompt to customize the results of the transcript that you're ultimately getting back from Assembly AI. Instead of doing a blog, slides, or a marketing pitch, we're going to jump right into the demo in this case, and we can send you some of those materials afterwards. What we're going to use for this particular demo is this tool right here that we've gone and built. It's going to allow us to do quick comparisons of Assembly AI speech-to-text models and customize some of the results that we see. I'm going to send this here to you as well if you want to load this up and play around with it while we're talking. But on the left, we're going to have our current production, sorry, our prior production model, Universal 2. This model really leads the market in terms of price-to-performance. You might find models that perform better, but you're not going to find models that perform better at a cheaper price. So this is kind of our old state-of-the-art model. And on the right, we're going to do Universal 3 Pro. For the purposes of this particular session, I'm going to go ahead and actually pick some files that I have stored locally that we want to use. I'm going to pick, to start, a meeting that's actually from GitLab. You can actually find this file on YouTube if you want to as well, and we'll send it in the show notes afterwards. But effectively, this is a YouTube file of a meeting, similar to a webinar like this, but with some folks chatting internally at GitLab. They have a bunch of these meetings available online that you can go and check out. And so, to start, we're going to compare Universal 2 to Universal 3 Pro without a prompt, just so you can quickly see what the model looks like before and after, and draw some comparisons to establish a baseline before we introduce prompts. On this, you'll see that all the API requests are logged as well. So if you're using this tool, you'll actually see what it's doing. It's uploading the file. It's actually going to go and do a transcription request for both Universal 2 as well as Universal 3 Pro, and it's going to output those into the actual UI so that we can start to do our comparison here. I also debated pulling all these up beforehand, but I feel like if you don't do it live now, people think there's some dark magic happening behind the scenes and you're replacing or changing things that are actually being shown. So if we are having challenges, we're just going to have to debug them and fix them together. But this had worked up until right before this demo. And of course, now it feels like it's slow. All right. While we're doing that, let's do some advertisements. So there's a prompt engineering guide. If you haven't seen this, I'd highly recommend checking this out. We will share this in the chat right now. This is a great resource for you to understand not just how prompting works, but a lot of the different things that you can do around prompting.
And so some of these capabilities like verbatim transcript, getting better entity accuracy, doing code switching and multilingual, we're going to demo today, but we're not going to be able to demo all of them. And so it's definitely worth you taking a look at that guide and seeing what you want to do. So once this audio is done, we actually do an AI evaluation with Claude Opus 4.5, and it's evaluating Universal 2 on the left versus Universal 3 Pro with the prompt on the right. And it's trying to actually pull out some of the differences in this audio file so that you can actually see like which of these is better. Now, in this case, it's picked Universal 3 Pro. That's great. There's some insertions, deletions, substitutions, no hallucinations. But I think where it gets really interesting is where you start to notice some of these nuances around, you know, some of the names that are in here, some of the actual like proper nouns, etc. I'm going to actually play the first little bit of this file, just so we can all look at it together. And you can kind of see it playback and hear some of that live and take some of your own conclusions. Hopefully you all will be able to hear this. If not, definitely just let me know.
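For anyone who wants to reproduce this comparison outside the demo tool, here is a minimal sketch of what the tool is doing under the hood with the public AssemblyAI REST API: reference an audio file by URL (the demo tool uploads local files first), submit one transcription job per model, and poll for the results. The endpoint and the audio_url, speech_model, status, and text fields are standard API fields; the exact model identifier strings and the example file URL are assumptions for illustration.

```python
import time
import requests

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"
BASE_URL = "https://api.assemblyai.com/v2"
HEADERS = {"authorization": API_KEY}

def transcribe(audio_url: str, speech_model: str, prompt: str | None = None) -> str:
    """Submit one transcription job and poll until it completes."""
    body = {"audio_url": audio_url, "speech_model": speech_model}
    if prompt:
        body["prompt"] = prompt  # natural-language prompt (Universal 3 Pro only)
    job = requests.post(f"{BASE_URL}/transcript", json=body, headers=HEADERS).json()
    while True:
        result = requests.get(f"{BASE_URL}/transcript/{job['id']}", headers=HEADERS).json()
        if result["status"] in ("completed", "error"):
            break
        time.sleep(3)
    if result["status"] == "error":
        raise RuntimeError(result["error"])
    return result["text"]

# Baseline comparison with no prompt. Model identifier strings are assumptions.
meeting_url = "https://example.com/gitlab-meeting.mp3"  # placeholder for the demo file
universal_2 = transcribe(meeting_url, speech_model="universal-2")
universal_3_pro = transcribe(meeting_url, speech_model="universal-3-pro")
print(universal_2[:300])
print(universal_3_pro[:300])
```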
[00:06:16] Speaker 3: So it is the SEC, meaning secure and govern growth and data science, meaning applied ML, ML ops, and anti-abuse team meeting. That's a big mouthful. We might get a better name over time. And that's our meeting for September 14th or 15th in APAC. And hi, Alan. Glad you're here. Why are you here when it's midnight? We can talk. Glad you're here.
[00:06:44] Speaker 1: So if you're listening closely, the first thing you'll notice is like the second word was actually wrong in Universal 2. He does say it is. As the speaker continues to talk, you actually heard probably some like stutters and hesitations happen in the way that they were talking. You don't see those in either of those, these audio files right now, right? You're not seeing like stutters, disfluencies, etc. I will say Universal 3 Pro actually punctuated here. It sounded like maybe Universal 2 might've been right on this like, and hi. But one that's really critical and interesting in this example is like, why are you here when it's midnight? We could talk. And then this example, Universal 2 is why are you here? When it's midnight, we can talk. It's a completely different meaning in our prior model than our new model. And so you can start to see really quickly how you can contextually look at one of these models versus the other, changing the meaning of some of these sentences. And so this is with no prompting out of the box, we're starting to correct some of these errors. We're able to actually go and like pick up some of these like context clues and fix the meaning of sentences. So now let's actually talk through, how are we going to go build a prompt for Universal 3 Pro? That chart starts to tease out some of those different things that were missed that weren't perfect on the first pass. So I'm going to pick Universal 3 Pro for each of these. And what I'm going to do is start prompting on the right. So the left is kind of like our baseline, which is the Universal 3 Pro we just looked at. And on the right, let's start to actually add in some of these prompt instructions to see how this is going to change what the model is going to transcribe. So the first prompt that we've added in here, and we'll zoom in because maybe it's a little easier, is mandatory. Preserve linguistic speech patterns, including disfluencies. And so what this is going to do is it's going to cause the model to look over the audio file. And whenever it hears a disfluency, it's going to try to try to enumerate it. That might be things like, um, uh, like, you know. But what's interesting is the model itself actually seems to characterize different types of speech into what it thinks are like patterns. And so disfluency is just like one of these patterns that you might go and look for. And so you'll be able to see kind of what it looks like in the results up there. So if I scroll down and we look at what we get coming out the other side, you'll very quickly see, okay, there's like some it may, there's a comma, there's some changes in punctuations. Down here, we've added like to the meeting this, but it's not really enumerating as many of these things as I might want to, right? You can see that it's added some more verbatim to the context, but it's not getting all of the like ums and uhs and stutters that actually exist there. And what we found in testing and playing around with this model is the key is how the model interprets, uh, these, these patterns that you're looking for within the actual transcript. And in this case, if we just write disfluencies, it's actually not enough. It's trying to determine what that means, but it's not really sure, like, what is a disfluency? Is it an um and uh, is it adding in this or to the, and it's trying to interpret that, but we're not being specific enough what we want to see. And so it's, it's kind of shy. 
It's like hiding back some of the capabilities because it's not exactly sure what to, what to do in this case. Yeah.
[00:10:05] Speaker 2: Ryan, just to add onto this piece. So one of the key skills that this Universal 3 Pro model has, which is kind of, uh, you know, pretty amazing is that it can contextualize the audio in a way that previous transcription models just can't. So what this kind of filler words, disfluencies piece that Ryan's showing now, and you'll see it as we, as we show more prompts and examples here is that the model is capable of like interpreting an audio event. And based on the context that we provided up front with the prompt, determining how to represent that information within the transcript. Right. So like one piece is like, you know, it might not transcribe ums if you don't tell it to, but there's other pieces like that as well that we'll, we'll show later on. So that's just a, that's a little trailer for what to come.
[00:10:54] Speaker 1: And so we'll quickly add in on top of this, right? Maybe disfluencies is the wrong word. The model's thinking that um and uh are actually filler words in this case, right? Maybe it thinks that they're hesitations. These are, again, some of those different features you're going to start to tease out. And the more specific you can be with these, the better the model is going to be able to look at all of that context and be like, oh, I should be transcribing um and uh, this is clearly what I'm looking for, rather than having a scenario where it's unsure, is this a thing? As we do that, I did see in the chat there were some questions. I can't actually see them on my end, so let me just take a look. Are they being answered somewhere else? All right, here we go. Let me open this one too. Okay, cool. Yeah, this is live today. So with the Q&A that we had, by the way, we'll go through at the end of this session how you can actually run your own evals, whether that's on your own dataset and you want to get the best prompt, or if it's a scenario where you actually want to take your files, for example, and run them against a bunch of different models that are out there. But I do think what you're going to see is, out of the box with no prompt, this model is going to be much better than the previous generation of models. And then if you start to prompt it, like if you have a dataset that's very unique and specific to your context, maybe this is something like medical, finance, legal, any sort of specific context, you can actually use this evaluation setup to go and find the best prompt for your specific use case. And I feel like, of course, when you're doing the live demo, something is excessively slow, which is great. But we do have a question, so we can answer this live. Actually, I don't know, Griffin or Zach, if you want to take it. Yeah.
[00:13:02] Speaker 4: So Adam asked, can it interpret and include in the transcript, diarized, non-verbal but audible signals like coughs, sniffles, throat clearing, et cetera? Yes, it does have audio tagging capabilities. It kind of depends on what we have seen most in the training data with these audio tags, but it does have the ability to pull out things like laughter, silence, noise, coughs, et cetera. We'll see later in this demo, we're actually going to insert certain tags on top of this that aren't even just speech events. Like if the audio is unclear or not, you can actually change the output so that it can note that it's unclear rather than guessing. But yeah, it does have those capabilities out of the box.
[00:13:49] Speaker 1: We're answering some of these while we're actually talking. So this will be the next one that we go to, and I'm going to go and stage it here so you can actually see it. Cool. So this one's back. What you'll actually see in this example, again, we're looking at disfluencies, filler words, and hesitations within our prompt. And so with this, it's going to try to pull out more of these different features in the audio, right? And so now you're seeing very quickly that um and uh have now showed up in this transcript. "It may," which is probably some sort of speech hesitation that's happening there, right? More ums, more uhs, just a lot more enumeration of all the little nuances that are currently occurring in the transcript. And we're teasing those out by continuing to add more and more of these different linguistic patterns into the actual prompt itself. If we go back and see this next one, what we're going to do here in this case is include even more. So previously we kind of stopped this prompt at hesitations, right? What if we add in repetitions, stutters, false starts, and colloquialisms, which is, most people say gonna versus going to, but a lot of times when you look at a transcript, it says going to, right? It's trying to guess the words. And so you can continue to layer on these prompts and you'll see that you keep getting different results around all these different speech patterns that you have within your audio. And what we think is super cool about this is, you know, because you have the control with prompting for different use cases, this may be good and this may be bad, right? If I was maybe evaluating my performance on this webinar, was it good or bad? You probably want to hear ums and uhs to understand if I was speaking clearly and well and professionally and whatever else you want to judge that on, right? In other contexts, maybe you actually prefer this original one where it's very human readable and it's gotten rid of all these ums and uhs, and you want it to read a little bit more like a transcript or a book or a novel, right? You can actually control this. And so depending on your use case, you can change the output that you ultimately want to see, which is really exciting. Okay.
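Here is roughly how that layered verbatim prompt would be attached to a request, reusing the transcribe helper sketched earlier. The prompt wording is the one built up in the demo; the prompt field name and the model identifier string are assumptions based on how the parameter is described later in this session.

```python
# The prompt built up so far in the demo: authoritative wording plus an explicit
# enumeration of the speech patterns we want preserved verbatim.
verbatim_prompt = (
    "Mandatory: preserve linguistic speech patterns, including disfluencies, "
    "filler words, hesitations, repetitions, stutters, false starts, and "
    "colloquialisms in the spoken language."
)

# Same file and model, with and without the prompt, so the effect is easy to diff.
# `transcribe` and `meeting_url` come from the earlier sketch; the model name is assumed.
baseline = transcribe(meeting_url, speech_model="universal-3-pro")
verbatim = transcribe(meeting_url, speech_model="universal-3-pro", prompt=verbatim_prompt)
```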
[00:16:11] Speaker 2: A key piece, too, of what Ryan was touching on with colloquialisms, right, going to versus gonna, is that because you can prompt Universal 3 Pro to be very verbatim and show exactly what was going on in the audio file, if you're using a traditional word error rate based evaluation, you should really pay attention to specific insertions. What we've found a lot of the time is that Universal 3 Pro will do a better job than a human transcriber on some of this audio data, particularly if it's very difficult. So something that we've been doing is, if we see insertions in Universal 3 Pro's output relative to the word in the truth file, go back and listen to the audio file. I know in an AI world it can be tough to want to go back and listen to the audio, but you can actually hear a lot of things that Universal 3 Pro will pull out that a human transcriber might've missed when they were initially transcribing the audio.
[00:17:08] Speaker 1: And I think that's, that's a good, good point to transition to our next, next prompt here. And so, so far we've been looking at teasing out these different linguistic patterns, right? But let's talk about now what Zach was talking about, which is how do we get the model to start guessing and specifically like, how do we control how it's going to guess what word should come next? And what we've done for this is in addition to the prompt that we already have, which is, I'll just read it out so everyone has it in case it's like small and whatnot, but it's mandatory. Preserve linguistic speech patterns, including disfluencies, filler words, hesitations, repetitions, stutters, false starts, and colloquialisms in the spoken language. We're adding this additional one now, which is always transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio. And so the model, as you might imagine, is generating predictions for any of the words that it's ultimately going to output into the transcript. And we've now added in a rule here where we're telling it, Hey, take your best guess based on context. It may or may not be right, but like take your best guess. I want you to guess and try to figure out what should be here. And this latter part is in all possible scenarios where speech is present. And this says, make a guess. And if you're unsure if it's even speech, make a guess on that too. And so it's trying to get the model to actually go in and just take a guess at what might be happening in the audio versus looking at some of the predictions that it has and saying, actually that's like out of domain or it's too low of a prediction. I should ignore putting that within the result set. And so if we go and start to look at this particular file, it's a little bit hard to see all the differences compared to the original, but this particular sentence, right? If you, if you remember this, why are you here when it's midnight? We could talk kind of pauses and then says, glad you're here. That's like actually what the speaker said in that scenario. And that's where this idea of like disfluencies and stuff comes into play. It really makes a difference when you're able to have this prompt, pull all those different linguistic patterns out because it really captures the meaning of what the person said. And I'm going to play it back really quickly, just so you can hear that specific segment. Cause I think it's, it's really cool to see how it, how it comes out in this audio.
[00:19:34] Speaker 3: And hi, Alan. Glad you're here. Why are you here when it's midnight? We can talk. Glad you're here. Don't make it a habit to come to this meeting since it's really late for you.
[00:19:43] Speaker 1: And if you were doing some sort of analysis or putting this into an LLM or kicking off a workflow, right, it's really important to figure out how that person spoke, not just what they said. And that's what you're starting to see when you get it to start guessing some of these different things and see that within the results that you ultimately have. So with that, I think what we've done now is we have a pretty good prompt for this specific audio file. I think if you go and listen to this, yes, there are probably some nuances of things that are different or wrong, but it's definitely headed in the right direction. What's challenging, though, is if we just keep making this prompt better and better and better, it's going to overfit to this file versus representing the diversity of audio that we might have from our user base. And so instead of continuing to dive deeper into this file, I'm going to go ahead and pivot to a completely different file. And so instead of a meeting, we're actually going to take a file from the Miami Corpus, which is a Spanglish dataset that's available, with recordings of folks in the Miami area talking in mixed languages. And so we're going to do the same prompt as before, just to see how it affects a file where there's actually code switching involved and you have users who are speaking English, Spanish, and Spanglish, all potentially within some of the same sentences here. Just to maybe clarify, Mark-Andre, the question you have is: if we split the conversation for asynchronous transcription, is there a way to preserve the overall context between the different parts? I presume you're talking about chunking the file when you go and actually send it through for transcription. There wouldn't be a way today for the model to know that these, say, five chunks are all part of the same file. However, what we found in testing and doing these prompts is that the actual context of the call is less important than the instruction that you give it. And what I mean by that is, let's take medical, for example. If you wrote, hey, this is a doctor-patient visit, and that was your prompt, that actually doesn't tell the model how to control what it should do in that scenario. It's like, cool, but how should I guess this word? The way to write that prompt would actually be something like what we have here, which is: this is a doctor-patient visit; you should prioritize accurately transcribing medications and diseases wherever possible. And that would tell it, hey, I should actually really be thinking about medications and diseases as I make guesses, rather than just, this is a doctor-patient visit, for example. So I'll mark those as answered live. Hopefully that helped. If you want to follow up, feel free. We will get to the next question actually in a second as we go through this example. Did I just delete everything? Okay. I deleted the prompt, so I'm sorry that I can't read it. So, same prompt as before. If we scroll down and look at what we have here, this actually looks pretty good, right? It looks like we're adding in some hesitations. It looks like, you know, there's some speech hesitation here around gauge, ums and uhs. So it looks like we've actually improved this file pretty well.
But what happens is, although it looks good, and this is why using an LLM as a judge can sometimes be misleading, even if it's still helpful to see what patterns it shows, you can actually end up in a scenario where something looks great, but the reality is it's not actually what the user said. And so I want to actually go and get a specific example from this file and allow us to listen to it together. So let me zoom out one more time so we can get right on the spot where we want to be. Okay. Cool. So let's play this part of the audio right now and look at it together. If you listen to that, they're actually talking in Spanish right around this particular segment here, right? There is some Spanish used before it says "with George," and there's some Spanish afterwards. And so while this looks right, if you're using an LLM as a judge, it actually translated this to English. It very clearly put this as English in the ultimate results that you're going to see. And so what we want to do here is actually now start to tell the model how it should handle these scenarios where different languages are spoken. And so I'm going to go ahead and move the current prompt we have to the left and I'm going to go in and add a new prompt on the right. And sorry, this is the next one. Let's not do that one yet. Let's do this one. And this prompt is the same, with all the same linguistic elements, always transcribe speech, but we're going to add one additional instruction to the model here, which is: preserve the original languages and script as spoken, including code switching and mixed language phrases. What this tells the model is, if you hear different languages in this file, transcribe them as spoken, right? Transcribe them as the user said them, not do some sort of translation or anything like that. And as we're talking, I think this question is particularly relevant: what are the different languages it supports, and will more be added soon? So Universal 3 Pro today supports English, Spanish, French, German, Italian, and Portuguese. We're well aware many more languages would be awesome, especially for this prompting capability. And so our research team is actively working on Universal 4, which would include all of these different feature sets. And so yeah, six languages today, but you are able to use our API for 99 languages, and you could use the prior model that we had on the left for those other languages as well. So let's go and actually look at the results since they're here. We've added in this code switching instruction, right? As we scroll down and look at this, you can actually see now, oh wow, there's legit Spanish being spoken right around that particular utterance. And by telling the model, don't translate this thing, we've actually caused it to fix it. Now there are pros and cons, right? I haven't even listened to this part of the audio, but is he saying gauge or date? I don't know. We should probably listen to that too. But you start to see how having these different prompts is going to change the way that it's transcribing. And it's picking up all of these small Spanish words within the audio file that previously the model did not tease out.
And so this allows you to really get specific with code switching, especially if you have like mixed language files, pick up some of these nuances of the different speakers saying different things across them. So we'll go ahead and do one more here, and then we'll switch to talking about like evaluations and answer any of the questions that we feel would be good to do live. So I'm going to again, move the current prompt to the left, just so we could see it. And then I'm going to do a new prompt on the right. And I will run this. And we've been talking a lot about how the model is going to ultimately like guess, right? And we're trying to guide it how to make these guesses. And so the current prompt, we're telling it to like take its best guess. What if you don't want it to guess? What if you want to be very specific and you only want to write a word if it has very high confidence, right? You can change the guessing methodology by writing different instructions. And so in this case, let's do the same thing. Transcribe speech with your best guess when speech is heard. Mark unclear when audio segments are unknown. So instead of trying to guess now, we're instructing it like, if you're not sure, be not sure. And go ahead and write that as your answer. I think, Adrish, you actually asked this. And I see like another reply on top of that. So in any case, let's scroll down and just kind of look at what this looks like. So what you'll actually see now is those specific Spanish segments we just teased out are actually where the model is most unsure about the prediction that it's making. And those are the things that you're going to start to see in these unclear tags that are coming in the audio. It's also actually picking up on potential background noise. And this is where we could start adding in audio tagging, for example, in other use cases. But you could use this very quickly to figure out like, OK, well, the model predicted the left. It was its best try. The model was unclear about the right. And you can even use both of these workflows or sorry, both of these transcriptions to kick off all sorts of workflows. You could imagine even like the left is like almost like a pseudo-labeled human transcription. And the right, you would just go have a human look at the parts that are unclear. And that's how you could get like a great human-labeled file, for example. Zach, Griffin, anything you would like to add? Yeah.
[00:29:34] Speaker 2: Sorry, I was diving into the Q&A here.
[00:29:37] Speaker 1: Yeah, I saw. So that's bringing you back.
[00:29:40] Speaker 2: Yeah. So what's really cool about this, you know, the ability to prompt for unclear language within the audio, is that we found it gives the most accurate representation of what's in the audio, right? Also, if you're doing word error rate-based evaluations and using a normalizer, like the Whisper normalizer, it's going to strip out those bracketed tags. So, you know, basically, if there's audio that a human wouldn't catch either, we're not being forced to make a guess. And that's kind of important for that piece of it. What I've also found is that this unclear tag is great, but we've also experimented with using a different tag in brackets: masked. I've found that using this tag in particular has led to tagging more areas of the audio where there's unclear language. However, this masked tag is also commonly used to tag profanity within a transcript. So it also has the negative effect of potentially removing profanity within a transcript. So that's why unclear. We've tested a lot, you know, Ryan, Griffin, and I have basically become mad scientists with this model. We've done tons of experiments, day and night. And, you know, there are certain things that we feel are just probably the best for these types of use cases, and we've landed on unclear as probably the best way to represent that data. Of course, if in your use case you don't care about curse words being covered up, then I'd probably recommend you go with masked for this specific use case.
[00:31:28] Speaker 1: Yeah.
[00:31:28] Speaker 4: And just to touch on that, if you have a style guide that you're using for human-level transcriptions, which is common, you know that your word error rate evaluations are stripping those things out. That's something this is great for, for kind of pseudo-labeling near-human-level transcription. You can just prompt based on your style guide and tailor to that. So whether you want masked or unclear or any of those different tags, that's something that's definitely possible with this prompting.
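As a sketch of the human-in-the-loop workflow described above: run the cautious variant of the prompt, then scan the output for the unclear markers and route only those spans to a reviewer. This reuses the transcribe helper from the earlier sketch; the bracketed [unclear] tag, the exact prompt wording, and the file URL are assumptions that should be adapted to your own style guide.

```python
import re

# Cautious variant: keep the verbatim and code-switching instructions, but ask the
# model to mark low-confidence spans instead of guessing. Tag text is an assumption.
cautious_prompt = (
    "Mandatory: preserve linguistic speech patterns, including disfluencies, "
    "filler words, hesitations, repetitions, stutters, false starts, and "
    "colloquialisms. Preserve the original languages and script as spoken, "
    "including code switching and mixed language phrases. Transcribe speech "
    "with your best guess when speech is heard. Mark [unclear] when audio "
    "segments are unknown."
)

miami_url = "https://example.com/miami-corpus-sample.mp3"  # placeholder file
cautious = transcribe(miami_url, speech_model="universal-3-pro", prompt=cautious_prompt)

# Route every [unclear] span, with a little surrounding context, to a human reviewer.
for match in re.finditer(r"\[unclear\]", cautious):
    start, end = max(0, match.start() - 60), match.end() + 60
    print("REVIEW:", cautious[start:end])
```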
[00:32:04] Speaker 1: So I'm just going to do this live. We'll see if it works. This is actually what we do a lot of times when we get questions like this. So I see this one: is PII redaction any better in Universal 3 Pro? And then Adam actually, I think, put, hey, here are some things that you might go and try to put into a prompt to pull this out. And that's kind of exactly the type of thought process we would end up on here. This is probably not going to be the best prompt on try one. We're just trying to one-shot this. It's not going to be perfect, but the idea is, if you give the instructions to the model of, okay, look for PII and personal information and tag that as private, it's going to go and try to do its best to do that. Now the nuance is, I don't actually know, does the model know the word PII? What does it call personal information? And this is where the prompt engineering comes into play, right? It's like, okay, it looks like George is still here. So that didn't really help anything in this case, but it's now tagging private for him. It's hard to control this if we don't give it very specific instructions. And so you would actually want to enumerate something exactly like what was in here, which is, okay, put all of these specific things as private. And this would give you a much better result, because now we're saying, you know, always transcribe. It's going to see names, addresses, contact information now and be like, oh, I can figure out what that is, right? And so it should actually now instead be like, okay, is this a name, an address, or contact information? Let's go ahead and mark that as private. And this is the kind of back and forth that you would ultimately get into trying to do this within a file.
[00:33:45] Speaker 4: Yeah. And outside of prompting for that information or redaction, we also have our speech understanding PII redaction suite that many of you are probably used to using at this point. And this model via prompting will have better entity detection. And that's ultimately what that PII redaction model is based on. So even if you are not even using prompting here, or if you're using prompting for entities, not necessarily PII redaction, our existing PII redaction feature should see increased performance from this as well.
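For reference, the existing PII redaction feature mentioned here is configured on the request rather than in the prompt. A minimal sketch with the AssemblyAI Python SDK; the policies listed are just a subset of what's available, and the audio URL is a placeholder.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.email_address,
        aai.PIIRedactionPolicy.phone_number,
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,  # replace detected entities with hashes
)

transcript = aai.Transcriber().transcribe("https://example.com/call.mp3", config)
print(transcript.text)  # detected PII comes back redacted in the text
```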
[00:34:17] Speaker 2: Ryan, I just sent over in the chat to this prompt, transcribe the audio verbatim, include speaker change markers. So what's really cool about this model is that like, obviously we're, we focus, we're focusing here on like, probably the most used use cases of this model. And we want to get the prompts perfect for those use cases. Right. But there's so much potential and so many possibilities of what you can transcribe with this, with this model. And this prompt in particular is really interesting. It's one that we've been experimenting with a little bit internally of like, because the model can understand the audio upfront, it does a really good job of like tagging the end of speech utterances between speakers. And you can actually see the events there. So there's a lot of stuff with the audio tagging that I think, you know, will be really fun to experiment with. And definitely encourage everyone to experiment with it. But yeah, you can see like, every time the speaker changes, it has a different speaker tag on there.
[00:35:23] Speaker 1: And something that we're working on is, this is pretty cool, it's like an experimental feature. Well, let me do this first. So you can do speaker change markers. And you can do this: try to guess the speaker's name or role based on context. This would actually address the ask in here as well from Joseph, or sorry, not Joseph, from Paul, which is around, can you actually label the speakers in the output? The answer is yes. But right now I would say it's experimental. And the reason I say that is because behind the scenes, when we actually process this audio, it's not processed all at once as all 30 minutes, right? There's some amount of chunking going on in our processing pipeline. And so we're actually working on how we persist some of these speaker changes across those different chunks. And the reason that speaker labels might be strange is that in this case, it doesn't even know, right? It's like, I don't exactly know who these people are. This is just two people talking. It actually picked up some additional changes down here. But in the case of a doctor-patient type visit, it could try to guess the doctor and the patient. But if the call is long enough, and a certain chunk doesn't contain the doctor or the patient, it's not sure who to tag. And so you might see the first chunk look good, the middle look bad, the last part look good. So all these things, we're learning together with you and experimenting with. And ultimately, we want to actually take the best of both worlds and combine the speaker tags you're seeing the model emit with our native speaker diarization feature. And so you're going to get much better performance on overall diarization, because it's not just going to be based on the embeddings of the speakers and the timestamps of the audio, but it will also include this information from Universal 3 Pro, which is based on the audio embedding itself, to understand if the speaker changed or not. And so that's more of a coming-soon teaser of what you would see in the future as well. Let me just look. So I do see a question around whether "mandatory" or "always" is required. Also, for medical accuracy, do you recommend prompting versus key terms prompting? So maybe we can just address both of those. And I will go into the prompt guide here, just to really add to this. What we've seen is, when you use authoritative language, the model is much more likely to follow your instructions than if you say something that is soft. And so if you're like, take a guess, the model is going to be like, yeah, whatever, that's not that important. Versus, you should always take a guess based on any audio that you see. It's like, oh, I guess anytime I see audio, I should try and take a guess here. And so it's not necessarily about those exact words changing the outcome. Ultimately, it's more about the fact that you're being very specific and pushing the model to do something, rather than telling the model just generally, try this thing. It doesn't respond well to instructions that are soft and don't have authority behind where you want to push it. So that's that. And then on the medical accuracy piece, I do think it's a good point to maybe bring up, you know, there are kind of two features within Universal 3 Pro. So we have prompting, which we've been discussing. We also have key terms prompting.
I know we brought it up earlier, but this allows you to, for instance, specify terms that you want to boost within the audio beforehand. I think really it depends on your use case. If we had, you know, a virtual meeting like a webinar like this, we actually know that I'm Ryan Seams from Assembly AI and that rseams at assemblyai.com is my email address. We could actually put all those things into the key terms prompt because we think they might come up, and that would help for sure. But where prompting is really useful is when you don't know the context of the audio file; it's going to be really hard for you to pick the right key terms to boost in your audio, right? And so ultimately, key terms prompting is great if you know the context. But if you don't know the context, prompting is going to get you so much further than key terms prompting, because it's going to allow you to be very general with the guidance that you want to give and allow the model to infer the context, rather than you providing it in the key terms prompt. Cool. Adam, I see you're talking about the emotion model and emotional labeling for this. Maybe we should go back and just quickly try to tag some of these things in. So let me just do this instead: include all audio tags for non-speech wherever encountered. Again, this is an experimental feature that, similar to speaker labels, we want to improve over time and make a native feature of the API. We're really excited that the model is exhibiting these capabilities, but the type of responses you ultimately get will potentially be very different from one file to another. So I'm going to run this and we'll see what we get. And I see that the next question here is actually pretty similar to this, from Turner, which is: how about tagging when media is played, when audio or video is played by the participants in the recording being transcribed? These are actually kind of similar. Some of these audio tags are things like noises, but some of these audio tags are actually things like sighs or silence or happiness or whatever. And so these audio tags are really going to be its attempt at any non-speech markers in the audio: what does it think that audio actually is? And it's going to try and go tag that specifically. Griffin, Zach, have you tested some prompts with this? I've seen it go crazy. I don't know what your experience is.
[00:41:10] Speaker 4: Yeah, it can be overeager sometimes potentially, but I've seen it be really accurate as well. I think giving it as many specifics as you can, in terms of the audio tags you want it to guess, as a guardrail is usually where I've seen the best performance. And Zach, I think I cut you off.
[00:41:30] Speaker 2: No, I was just going to say I saw Adam's question about like emotion detection and being stuck with basically the situation that you're in. Emotion detection is definitely something that the team has talked a lot about internally. And in the future, we definitely will be powering those use cases.
[00:41:52] Speaker 4: There's a great question from Dan, and I think, Ryan, you touched a bit on this. Yes, right now key terms prompting and prompting, just open-field prompting, are mutually exclusive. However, because the prompt parameter is open, you can actually include key terms in that open prompt if you'd like. And that way you can kind of combine the two. So at the parameter level, yes, they're mutually exclusive, but we've actually seen that if you say, hey, here are the key terms that I want you to look out for and boost, that's a way to get around that restriction. Yeah. Anything to add there, guys?
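A quick sketch of that workaround: since the two parameters are mutually exclusive at the API level, fold the key terms into the open prompt text instead. The terms and phrasing below are illustrative, and it reuses the transcribe helper from the earlier sketch.

```python
# Key terms folded directly into the open prompt, since the key terms parameter
# and the prompt parameter can't be combined on one request.
key_terms = ["Ryan Seams", "AssemblyAI", "Universal 3 Pro", "GitLab"]

combined_prompt = (
    "Mandatory: preserve disfluencies, filler words, and hesitations. "
    "Pay close attention to the following terms and transcribe them exactly "
    "when spoken: " + ", ".join(key_terms) + "."
)

# `transcribe` and `meeting_url` come from the earlier sketch; the model name is assumed.
result = transcribe(meeting_url, speech_model="universal-3-pro", prompt=combined_prompt)
```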
[00:42:27] Speaker 1: No, we're going to edit the docs right after this meeting so that you have the latest and greatest, and that'll be in the materials that we share as well. I do want to come back to this audio tagging, emotion tagging, et cetera. So we did run this prompt, just to be clear: this was include all audio tags for non-speech wherever encountered. And what you see here is, this was that part of Spanish that we heard earlier, and it tagged it as foreign language. We were also seeing other things further down in this audio file, like there's some sort of noise in this file before he says corrosive. It sounds like there's music later on. This file is actually really interesting because later on, there's a child singing. It's really low volume, actually. And you see all these different things starting to come out around music and applause, all these different things. And so again, I would call this an experimental feature in that you're going to get different performance on different files. And what we're trying to do is make the model more robust to these scenarios and then allow that to be an actual API-level capability where you're getting consistent performance, versus trying to provide this every time. So that one's done. Yes, Dane. So prompting currently costs an additional 5 cents an hour on top of the list price. That being said, I mean, yeah, we're more than happy to talk pricing or anything like that. We're pretty easy and friction-free. So if you want to talk to us afterwards, great. But that's what it is out of the box for the pricing. It's 21 cents an hour list and then 5 cents an hour on top of that. And then Turner, I see you're actually... yeah, go ahead, Zach. You nominated yourself. You're taking it. Yeah.
[00:44:12] Speaker 2: So I was about to type an answer, so I figure it's probably easier to just talk it through anyway. So what we've found with prompting the model is that providing some context upfront on what audio data it's going to see does help, but the model is really very instructional, right? So you can see that a lot of the prompts that we gave it were instructions on how to transcribe the audio or transcribe a word this way. You can attach names, you can use key terms prompts, stuff like that. When it comes to attaching domain-level context upfront, the thing about it is that because it's a contextually aware model, it's going to find out very quickly, just by listening to the audio, that it's a political conversation, for example, that this conversation is about politics. A minute into the audio, it's going to be aware of that. So attaching that domain-level understanding upfront doesn't necessarily have a huge impact on the accuracy. Now, let's say there are local politicians in the audio file you're transcribing, and you can attach their names upfront; obviously that's going to impact the accuracy there. Or if there are specific sounds that you want to transcribe within the audio, like shouting or claps or stuff like that, if it's a speech or something like that, that's going to improve outputs as well. But just attaching that kind of context upfront may have some effect, not a major effect. So I think we're all just reading Chris's question: generally speaking, a lot of the features, speaker diarization, speech understanding, et cetera, seem to be limited with the streaming method versus the pre-recorded method. Is the answer to have my developers record audio as well and send it for post-processing on their recording? Interesting question. So I'll let Ryan take this one.
[00:46:14] Speaker 1: I'll hop into there in a second. I did just want to say, I just put this up here to do the prior example that was about the context that we had here. This was what, Turner, right? So on the left, I was like, this is a conversation about a family in Miami hanging out and discussing their day-to-day. And then on the right, it was like, hey, transcribe mixed language phrases wherever you see them, pay close attention to transcribing wherever you can. This is like an ambient mic. I even had a typo. It was like low quality. And just to kind of show you the difference that comes out here, you're already getting more of the disfluencies and the grammar of what's said, at least some of the Spanish in what they're saying further down. And again, the reason is when the model sees the left, it's like, okay, cool. What do you want me to change? I'm transcribing this file. What should I do? This doesn't tell me what to do versus the right here is very specific. Like, okay, well, if we have a Spanish call and a family in Miami, what do we want to pay attention to? And that's kind of the difference of how you would prompt this model ultimately. Chris, I would go to your question now, answer live.
[00:47:26] Speaker 2: Yeah, I'll just, yeah. So two things. One, we are launching speaker diarization for streaming. That's coming very soon. And the other piece is that we are going to be launching a streaming version of this Universal 3 Pro model that is fully promptable. So be on the lookout for that. It's coming very soon.
[00:47:42] Speaker 1: Yeah. So I think speaker diarization, we actually tested like an alpha. Um, whatever, we're live, I don't care. It's in the API. You can try and find it yourself. I won't guarantee the performance, but Chris, we'll reach out to you one-on-one afterwards. If anyone else wants to try it, we can send you the information as well. It's actually in the live API, it's just, you know, we're still testing and verifying it. On the speech understanding piece, I totally hear you there. We need to make it easier to do things like redact PII out of the box in streaming. And so we are going to find ways to go and do that. I think the teaser that Zach was getting to: if you've paid attention, we've been using this the whole time. There is a streaming button actually in the top right corner, and this will actually allow you to play with Universal 3 Pro streaming as it exists today. What this model does, again, is you can add context and prompts, but I'm just going to record a quick clip to try to show you some of the nuances of where it performs versus what we have out there today. Let's hope the app actually works. All right. Hello, hello. Awesome. This is Ryan Seams. I'm here from Assembly AI. My email address is rseams at assemblyai.com. My phone number is 555-111-2288. Super excited to be here on this prompt engineering workshop. Yes. Yes. All those things that I just did were the hardest things to do in voice agents, right? And specifically here, we got my email right, directly head on. The phone number was correct. We didn't skip a number. You heard me at the end, I was whispering. Yes. I mean, that's pretty good, to start to pick up on those things. And so all of these things around the model being context aware and really getting this great live transcription are going to be coming to streaming as well. And of course you can start to prompt that too and say things like, the audio quality is terrible, always try to pull out anything someone says. Or, Zach, you just got off a call about this: you have a customer that always says very specific terms, right? And they're like, we just want you to really get these five words because it's 50% of what people say to us. How can we prompt that out of the model? Right.
[00:50:08] Speaker 2: Yeah. And so obviously the core of this was about the async part, but we're obviously very excited about the streaming piece as well. We're building out custom term detection for this model in particular. And obviously the promptability piece is really amazing, because if you have a voice agent, you can dynamically prompt it, right? So you can increase the accuracy of your voice agent on the fly. For example, the voice agent can ask a specific question and then you can say, these are the types of answers that are coming, and boost the transcription accuracy of those. So yeah, I know we're primarily focused on the async piece, though, so we'll jump to those questions. But yeah, y'all can all go test it.
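For anyone who wants to try streaming today, here is a minimal sketch using the Python SDK's existing real-time interface with a microphone stream. This shows streaming as it exists at the time of the session; the promptable Universal 3 Pro streaming model discussed above isn't shown here, no prompt is passed, and depending on your SDK version the streaming interface may differ.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

def on_data(transcript: aai.RealtimeTranscript):
    # Final transcripts arrive once an utterance completes; partials stream continuously.
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(transcript.text)

def on_error(error: aai.RealtimeError):
    print("streaming error:", error)

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=on_error,
)
transcriber.connect()
# Requires a microphone and the pyaudio extra for MicrophoneStream.
transcriber.stream(aai.extras.MicrophoneStream(sample_rate=16_000))
transcriber.close()
```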
[00:50:52] Speaker 1: I'll actually jump to Emil's question here about iterative transcription with prompts, how to actually get to the right prompt. I think that's a great segue, because we have eight minutes, we're way too excited to answer your questions, and we should talk about how you actually do this yourself at scale. And so I have a couple of GitHub repos here. You can go explore these. So this is from our head of product and technology, Alex. This is a command line tool that allows you to actually pull in datasets from Hugging Face and run different prompt simulations on those datasets and see how that compares to the ground truth files ultimately. And so using this, you could actually pull any public dataset, and we can share these as well. They'll be in the post-show materials. You could pull a public dataset, you could compare the word error rate, and you could just keep iterating through prompts to find the best prompt for that dataset. Now, I think this is really useful for quickly trying some of these simulations. Me personally, I've been much deeper in, Emil, like what you were talking about, which is, how do we find the best prompts for a certain scenario? And the reason I've been doing that is, we just talked about the streaming model, and we want to have the best system prompt possible for that model. And so we've been running simulations on our streaming model over all these different datasets that we have and actually pushing them into different eval sets. And so this particular one is the one that we've been using. It's actually public as well, and we can share this with you afterwards. But basically, all of these are using an optimization technique where it's going through and defining different prompt components. It's trying extreme versions of each component, positive, negative, middle of the road, and running a bunch of trials to try to figure out which types of components really influence the output, and then, within that component, what style of that component works. And when I say component, think of all the things we talked about earlier: there's a disfluency component, and then within the disfluency component, there was the really short one we did, there was the medium one we did, there was the long one we did. Those would be the features that we start to test when running these optimizations. You can really quickly start to converge on what's the best WER for this particular dataset. One additional thing to add: that first repo is using a traditional normalized WER, which could be great in your case. In this particular repo, we're actually doing something different, which is called semantic WER, and semantic WER is more nuanced. It's actually defining a bunch of rules around word error rate that you put up front and have the LLM interpret when judging. And so we're finding that with this model, most human-labeled datasets are actually wrong. Humans have been making transcription errors for a really long time, and it's only with this new model that you're actually starting to see those errors come out the other side. Zach, I know you've been working with a customer, you had one file you were working on with them, and you said you'd give the example.
[00:54:00] Speaker 2: Sorry, I was thinking about the custom formatting piece and coming up with, you know, conversation points on that. So sorry, what were you saying?
[00:54:10] Speaker 1: All good. I was saying like, you have a customer that you're working with where you're like, you actually looked at their human labeled files and they were just like straight up wrong.
[00:54:17] Speaker 2: Oh yeah. So it was interesting, because when we'd run kind of large-scale evals, and I hinted at this earlier, you might see, if you have human-labeled data that you're running these on, that the word error rate from Universal 3 Pro is actually higher than Universal 2, and you're like, what, why is that the case? This is our state-of-the-art model. So for one of our customers, I actually went through, one by one, every single insertion or difference between Universal 3 Pro and the human-labeled truth file and identified the differences within it. And pretty much, I think, 95% of them came back where Universal 3 Pro was right and the human-labeled truth file was actually incorrect. So as you evaluate the model, this is kind of a crazy thing that we've been thinking about a lot internally, how we explain this to customers, because it honestly breaks a lot of the traditional human-truth-file-based word error rate evaluations. So yeah, the better your truth file is at transcribing all the audio data available within it, the more accurate your word error rate based evaluation is going to be. That's also why, when you're doing these word error rate type benchmarks and evaluations, including something in the prompt like, label unclear or inaudible audio data as unclear or masked, actually improves the word error rate, because in that case the model's less likely to transcribe audio data that it actually does hear but that a human just isn't going to transcribe at all. So as you're evaluating, feel free to reach out to our team and we can help with these types of evaluations and explain what might be going on. But it's definitely something to look out for. We're hitting a pretty crazy point with these AI models where they're doing better than a human transcriber. So definitely be on the lookout for that.
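A stripped-down version of the iteration loop described above: run a handful of candidate prompts over a small labeled set, apply a light normalization, and compare word error rate with the jiwer package. This reuses the transcribe helper from the earlier sketch; the dataset, prompts, and normalizer are placeholders, and a production eval would use a proper normalizer or the semantic-WER, LLM-as-judge approach mentioned in this session.

```python
import re
from jiwer import wer

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting differences don't count as errors.
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

# Hypothetical labeled set: (audio_url, human reference transcript) pairs.
dataset = [
    ("https://example.com/sample1.mp3", "reference transcript one ..."),
    ("https://example.com/sample2.mp3", "reference transcript two ..."),
]

candidate_prompts = [
    None,  # baseline, no prompt
    "Mandatory: preserve disfluencies, filler words, and hesitations.",
    "Always transcribe speech with your best guess based on context wherever speech is present.",
]

# `transcribe` comes from the earlier sketch; the model name is an assumption.
for prompt in candidate_prompts:
    scores = []
    for audio_url, reference in dataset:
        hypothesis = transcribe(audio_url, speech_model="universal-3-pro", prompt=prompt)
        scores.append(wer(normalize(reference), normalize(hypothesis)))
    mean_wer = sum(scores) / len(scores)
    print(f"prompt={str(prompt)[:60]!r} mean WER={mean_wer:.3f}")
```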
[00:56:27] Speaker 1: And I don't know if you follow Artificial Analysis; they do benchmarks with tons of models. And I thought this was super interesting: they had to create their own proprietary dataset that providers don't train on. They went through Earnings-22 and VoxPopuli and manually corrected them because the ground truths were wrong. And you should watch out for that when you're picking datasets off the shelf; you need good labels to do good optimization. You can't use something like this and do a statistical analysis if your labels are bad, and they even removed datasets that were just wrong as well. And this last thing, this improved normalization, is getting closer and closer to semantic WER. And so we're starting to see the market converge on those things. I did want to address one quick thing for the folks in here trying out speaker labels in Universal 3 Pro. Speaker labels are for sure an experimental feature right now. If you're going to play around with this, very clearly, your mileage may vary. This is not something you should put in production. We have a speaker diarization feature. This is what you should be using on your API requests if you want stable, correct speaker labels today. We're going to be folding the power of the model and the acoustics that we showed earlier into this API, and so that'll be coming soon. But if you want consistent results, you should be using speaker diarization, not the speaker tags from the prompting right now. So that's something to note.
[00:57:51] Speaker 4: And then on top of that, it kind of says it right there in the doc, but you can use speaker identification on top of that, which is an additional feature that, based on the context of the call, will use the speaker labels to assign actual names to them. So while this model is really good, via prompting, at identifying speaker boundaries, it's still not quite there yet for full speaker labels. So just something to look out for there. Definitely use the speaker diarization feature first with this model at the current point.
[00:58:18] Speaker 1: And that would address some of the comments around there, around like the diarization being wrong, as well as like the speaker labeled hallucinations. You won't see those in the diarization and identification features. Those are just in Universal 3 Pro for the time being.
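A minimal sketch of the stable path recommended here: request speaker diarization through the transcription config and read speakers off the returned utterances, rather than prompting for speaker tags. The audio URL is a placeholder.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

config = aai.TranscriptionConfig(speaker_labels=True)  # the stable diarization feature
transcript = aai.Transcriber().transcribe("https://example.com/meeting.mp3", config)

# Each utterance carries a speaker label (A, B, C, ...) and its text.
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```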
[00:58:33] Speaker 4: Ryan, we were mentioning how to iterate on prompts here, and I think that'd be a good time to mention the prompt repair wizard that we have as a new tool in the dashboard. We've been doing a lot of this testing ourselves, iterating on prompts, and it becomes kind of hard at scale. These repos we've shown are our way to do that. But this is another tool that is now available on the dashboard, where you can essentially put in your prompt, describe what issues you're seeing in the output, and it will analyze it and, based on what we know about prompting best practices and our docs, output ways you can improve your prompt.
[00:59:10] Speaker 1: So we'll run this just for everybody. I know we're at the top of the hour. I don't think we got to like every single question that's in here. We're more than happy to jam on this with you. If you want to have another follow-up session one-on-one, if you want to come and join us in Slack, we're going to send a means for you to jump in and join us afterwards. Email, live chat, whatever you want to do to contact us, we're more than happy for you to try this. Again, this model is very new. We're only two weeks in, but it feels like two months. We are so excited about the capabilities. You can see some of them as we're doing some of this prompting. And as we learn more and more, we're going to be bringing these features to market as like full-fledged features in our API. I appreciate everyone for the time today. We're here for you if you have questions. Thanks for coming out there. And there we go, Griffin, right on time. You can actually see all the different things it's recommending for you to go try to get a better prompt. And so definitely try this out, play around, and we can't wait to hear your feedback. Share feedback, let us know how it goes. Let us know what you find. We're constantly learning with you. And yeah, excited to see what you prompt. So thank you all. Bye, everybody.
[01:00:18] Speaker 4: Great questions.