What It Takes to Ship Voice Agents in Production (Full Transcript)

Panel insights on deploying voice agents: redundancy, scripting, latency targets, QA metrics, and why outbound vs. inbound priorities differ.

[00:00:00] Speaker 1: All right. Awesome. Well, great to see the turnout tonight. My name is Ryan. I'm here from AssemblyAI. If you have not heard of AssemblyAI, we build voice AI infrastructure for developers and builders like yourself. So tonight we have an exciting panel here to talk about voice agents. You may have seen some posts from our team on LinkedIn, on social last week, but we released a state of voice agents report. There is a QR code, not behind me. Oh, yeah, there it is in front. Or is this the playground? There's QR codes for our voice agent report over there if you want to grab a copy of that online. This discussion will focus more, though, on what people are actually building in production today around voice agents. And so we're going to spend about 25 minutes now doing a little like fireside chat here with the group, and then we will open it up for questions from the audience after that. So with that, let's go ahead and do some introductions from the panel. Maybe, Blessin, do you want to start us off?

[00:01:04] Speaker 2: Yeah. Evening, everyone. I'm from Chicago, by the way, so I brought the snow and I'm not bringing the cold yet. I'm going to go back to the cold tomorrow. But great to meet you guys. Great to be here. I'm the CEO and co-founder of Aviary AI. What we do is we provide voice agents for the financial services space. So credit unions, banks, and life insurance companies. We're specifically focused in on outbound right now. We have not gotten into inbound.

[00:01:28] Speaker 3: I guess that's my cue. My name is Craig Bedoin. I'm the co-founder of Trellis. We're a YC Winter 22 company. We're really kind of a voice company, so we've done an outbound parallel dialer for many years, and we started recently introducing voice agents as well, primarily inbound as well as some kind of web-based experiences for folks who wanted to practice or do other non-telephony applications.

[00:01:52] Speaker 4: Hey, I'm Luca. I'm head of real-time here. I lead research, engineering, and product at AssemblyAI. That's all. I'll keep it short.

[00:02:04] Speaker 1: And just to give some context on myself, I head up the customer-facing teams at AssemblyAI. I'm based in San Francisco, and it's quite cold here, so I had to bring some specific clothes just to get here this week. So I am going to do a little bit of a cheesy thing and give some nods to this voice agent report as we go, but there's just going to be some intros to the questions to kind of set the stage on what we've been seeing in the market. And so in that report, something that's really interesting is we had 87% of the people respond saying they've actually deployed a voice agent to production, but just as interesting, 75% of those respondents said that they're not satisfied with the voice agent that they have, which leaves us with what? Something like 12% of the people who actually have a voice agent they're happy and proud of. That number's pretty low, right? I think we're all here tonight to get that number much higher over time. And so maybe to kick things off, I'd love to hear what are some of the lessons you've learned about deploying a real production voice agent that would be worth sharing with the audience? And maybe, Blessin, you want to start us?

[00:03:02] Speaker 2: Yeah. To give you guys a little bit of metrics behind the industry we service, only 18% of them make outbound calls with human beings today, right? So the opportunity for us is they're doing nothing today, right? So from us introducing outbound voice agents for these banks and credit unions, think about your brand new customer doing a welcome call so that you get your card set up the right way. You have gone inactive on your account, we're reaching out and making sure you reactivate your account. We're doing collections calls, right? So those are kind of the series of use cases that we end up doing. So for us, what's worked has been, one, again, we're going from zero to one for them because they're not doing it today anyway. So we're teaching them, we're introducing really a new function for them beyond just AI, just the idea of making outbound calls. But the secondary thing that we've really established is, hey, what does success actually mean to you guys when we make these calls? Some of them are informational. We have a great amount of alignment when we do our onboarding with them to say, what does success actually mean to you? So that we can measure it during these calls, we can do post-call actions, we can do grading and auditing to determine whether or not the call actually ended up the way that we intended. So that's how we've been able to start to see success, where most of our clients start off with one to two use cases originally, and then now they're starting to expand into seven, eight different call types per client. So that's the thing that results in thousands of calls going out per month, right? So we're building up confidence with them by, number one, getting agreement on what success looks like, two, doing something that's brand new for them that they've never done before, and they're seeing real ROI as quickly as possible. That's why we're starting to see the growth that we're seeing in our space.

[00:04:41] Speaker 3: Yeah, I guess there's two things I would add to that. So there's kind of always a business perspective of what's a good application for this. Is your customer bought into actually doing this at scale, such that it's going to be material for folks? I think the best analogy is really just whether they're really serious about doing this beforehand. We have some customers who come to us, and maybe they want to do something, but they're not currently doing it at a scale which would make sense to automate; that deployment's probably not going to go very well. But if somebody's got a call center full of 50 people doing exactly this thing, and they're pretty motivated to offload this to AI, that's much more likely. I think, to what Blessin said, that also tends to align with folks who have more precise metrics and are being metric-driven in their decision-making. So that's my main flag on the business sense. On a technical sense, I think that the industry as a whole has gotten a lot better. I think for folks who maybe have tried this in the past, it's probably easier now. Otherwise, I would say that we work pretty hard on redundancy. I think that there's so many failure points across these stacks. And what we've seen at least is that you're going to hit one of these: every one of your components at some point, at some part of your load, will go down, or will just be a couple seconds latent. And so we do a lot of work to make sure that we have a fallback, ideally one that's running at the same time and is almost immediately available for those situations.

[00:06:01] Speaker 1: I think that's super interesting, and maybe worth digging into a little bit more. Can you give some more detail to the audience? How does it actually work technically? And what does it look like end to end?

[00:06:09] Speaker 3: Yeah, sure. So some folks here may be running voice-to-voice. We're running a staged pipeline. So you're doing transcription. You're receiving audio from your telephony provider. You're doing transcription. You're doing something that's going to ultimately produce text. And then you're generating some speech, and then you're dispatching that to your telephony vendor. I think that in general, everything before bytes go over your outgoing WebSocket that you deliver to your telephony vendor, you have full control of it. You can run that in parallel. So if you take those stacks, transcription, Assembly is great. I'm happy to be here. I love these guys. There are also other great transcription vendors. You should totally use Assembly, but also maybe use these other guys in addition. It's okay.
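
To make the shape of that concrete, here is a minimal sketch of such a staged pipeline in Python. The vendor calls are hypothetical stubs standing in for the real streaming STT, LLM, and TTS integrations; it illustrates the flow Craig describes, not any particular provider's API.

```python
# Minimal sketch of the staged (cascaded) pipeline described above:
# telephony audio -> transcription -> LLM -> TTS -> telephony.
# The vendor calls are hypothetical stubs; real integrations would use
# each provider's SDK or WebSocket API.
import asyncio


async def transcribe(audio_chunk: bytes) -> str:
    """Placeholder for a streaming speech-to-text call."""
    await asyncio.sleep(0.05)          # simulate network/model latency
    return "caller said something"


async def decide_reply(transcript: str) -> str:
    """Placeholder for the LLM (or scripted flow) that produces the reply text."""
    await asyncio.sleep(0.1)
    return f"Reply to: {transcript}"


async def synthesize(text: str) -> bytes:
    """Placeholder for text-to-speech; returns audio bytes to send back."""
    await asyncio.sleep(0.05)
    return text.encode()


async def handle_turn(audio_chunk: bytes, send_to_telephony) -> None:
    transcript = await transcribe(audio_chunk)
    reply_text = await decide_reply(transcript)
    reply_audio = await synthesize(reply_text)
    await send_to_telephony(reply_audio)   # bytes over the outgoing WebSocket


async def main():
    async def send_to_telephony(audio: bytes):
        print(f"sending {len(audio)} bytes to telephony vendor")

    await handle_turn(b"\x00" * 320, send_to_telephony)


if __name__ == "__main__":
    asyncio.run(main())
```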

[00:06:49] Speaker 1: It's not a sales pitch. You can be honest.

[00:06:51] Speaker 3: Please pay me pizza. That's what you get for one slice of pizza.

[00:06:53] Speaker 4: Little hurtful, but it's okay.

[00:06:56] Speaker 3: But so I mean, you can, right. So I mean, obviously there's a question of what you want to do if you think one of your vendors is latent or if they disagree. But I think fundamentally, if for instance, one of them is latent or just gives you an error status code, it's pretty obvious that you want the other one. For us, I think, again, in higher volume applications, we see a lot more scripting. So I think when folks are kind of naively enthusiastic, they're like, oh, yeah, anything it says is awesome. Just give me that. But those people are generally not very serious. The people that you want to do business with have really strong opinions on every single word you say, because that matters to their business, and they've been doing this for a decade with 50 people. And so when you look at that, I think we try and drive out the margin for creativity, sort of. And so we, as much as possible, would prefer to kind of script things out for them and play exactly the words they said. So if you do that, if you basically have a way, and I can tell you about our solution, but if you essentially have a way of realizing what you need to say, then you can kind of get redundancy prior to that. And then you're just left with the speech generation part. That, I think, primarily, again, if you knew what you were saying ahead of time, you can make that offline. And then you don't have to be there at the last minute scrambling to be like, why is my text-to-speech vendor latent right now? Obviously, if you're doing things live, you're kind of out of luck. But at least you've got some of the time, hopefully, that you could hit a cache. And then I think the telephony vendor, again, you're kind of single-threaded on. The telephony vendors are pretty good. So the likelihood that they were able to deliver you, for instance, an incoming call or take your outgoing call, and then they drop your WebSocket doesn't seem too high to me. When we had the big AWS outage, it's kind of like everything with your telephony vendor broke. So I don't have any redundancy tricks up my sleeve for that one. But I can't say you need one.
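
As a rough illustration of the redundancy idea, the sketch below races two hypothetical transcription vendors and takes the first one that returns without an error; the vendor functions and the one-second timeout are made-up placeholders, not real SDK calls.

```python
# Run two transcription vendors in parallel and take whichever answers first
# without an error. Vendor names and calls are hypothetical stand-ins.
import asyncio


async def vendor_a(audio: bytes) -> str:
    await asyncio.sleep(0.08)              # pretend this one is healthy
    return "transcript from vendor A"


async def vendor_b(audio: bytes) -> str:
    await asyncio.sleep(3.0)               # pretend this one is latent today
    return "transcript from vendor B"


async def transcribe_with_fallback(audio: bytes, timeout: float = 1.0) -> str:
    tasks = [asyncio.create_task(v(audio)) for v in (vendor_a, vendor_b)]
    try:
        for completed in asyncio.as_completed(tasks, timeout=timeout):
            try:
                return await completed     # first vendor to succeed wins
            except Exception:
                continue                   # error or timeout: try the other
        raise RuntimeError("all transcription vendors failed")
    finally:
        for task in tasks:
            task.cancel()                  # don't leak the slower request


if __name__ == "__main__":
    print(asyncio.run(transcribe_with_fallback(b"\x00" * 320)))
```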

[00:08:34] Speaker 1: Yeah.

[00:08:35] Speaker 2: I'll just echo what Craig said. He described almost our entire tech stack right there, right? We do the same. Copyrights. Yeah, copyrights. That's our moat. So we do the same kind of flow. I would say two things that we've learned through this entire process. One is, redundancy really, really matters, because concurrency is an issue vendor to vendor, especially when you have high-volume calls that are happening all at one time, where we've run into issues with Deepgram, right? Where concurrency was an issue and we had to scramble to go find somebody else in the early days to figure out what we were going to go do. Fortunately for Assembly, here's the sales plug, we don't have that concurrency issue, right? So that's a great thing. But overall, we try to have redundancy across the board, whether it's using Twilio and Telnyx, whether it's using Cartesia, ElevenLabs, even Deepgram for voice, using Deepgram and Assembly for the transcription side of the world too. The secondary thing that I would say, too, is just kind of going over to speech to speech, and I know we have a topic around that. I don't know about everybody else here, but we've really struggled with speech-to-speech models. One is, it's kind of dumb. They have really dumb responses and task adherence is really, really poor. The performance on it is really, really poor. Like as much as I would love to shift over to using speech to speech, especially when we're doing outbound calls, the fact that it doesn't follow instructions is a pretty big deal because of the industry that we're in. Like for us, it's pretty simple. None of our customers care about our voice agents until it gets deployed. When it gets deployed and it's calling their customers, they give a shit. Like that's when they care. They care about the quality of what's actually happening. To back up what Craig was saying, they all think they're unique and they all need to have different things, but one of the big wins for us to get consistency has actually been caching responses too, so that we're not having to go back to the LLM each time, because if we're calling you about card activation, we have probably 70% confidence that we know what the responses are going to be. Why wouldn't we cache those responses and determine whether or not it's a good enough answer rather than going to the LLM each time? That's how we've been kind of building it, but Craig already described our tech stack.
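
A minimal sketch of that caching idea, assuming a narrow call type where utterances can be matched by simple normalization (a real system would more likely match by intent); the cached answers and the LLM fallback are hypothetical.

```python
# For a narrow call type like card activation, most caller utterances map to a
# small set of known answers, so check a cache before paying the LLM round trip.
import re

RESPONSE_CACHE = {
    ("card_activation", "how do i activate"): "You can activate your card in the mobile app under Cards.",
    ("card_activation", "is there a fee"): "No, activation is free.",
}


def normalize(utterance: str) -> str:
    return re.sub(r"[^a-z ]", "", utterance.lower()).strip()


def reply(call_type: str, utterance: str, llm_fallback) -> str:
    key = (call_type, normalize(utterance))
    cached = RESPONSE_CACHE.get(key)
    if cached is not None:
        return cached                      # skip the LLM entirely
    answer = llm_fallback(call_type, utterance)
    RESPONSE_CACHE[key] = answer           # reuse it next time, ideally after review
    return answer


if __name__ == "__main__":
    print(reply("card_activation", "Is there a fee?", lambda c, u: "LLM answer"))
```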

[00:10:44] Speaker 1: I'm kind of curious, maybe pulling on that thread a little bit more. One of the things that we talk about is these guardrails that you might have to put in place as you're going and building these voice agents. You kind of described the full stack. How many guardrails are actually in between each of those steps and how often, I guess, are you changing them? Is this an ongoing whack-a-mole battle? Describe it to us.

[00:11:04] Speaker 3: I would just say that I try and get my more serious customers to script more. I think that any time you let an LLM decide what to say, you're at risk. Most of the time it's going to be fine, but we had a customer start talking about suicide. I didn't want them to do that, but then of course the LLM's going to respond about suicide and then the client's like, why are you talking to this person about suicide? I have no interest in playing whack-a-mole. I think that categorically the way to do this is to, if you can find your application in a truly high volume standardized regime, you're going to be set up for success. If you need to speak to everyone about everything, you're going to be fighting an uphill battle.

[00:11:41] Speaker 2: Guardrails for us, it's a little bit different because we don't do as much of the scripting. Obviously the opening line is scripted, the ending line is scripted, voicemails are scripted. Part of what we do is, one, when you're doing outbound calls, it's a very pointed reason why you're calling. We provide a knowledge base for it to actually happen. When I say knowledge base, it's not going and doing a tool call to pull from a knowledge base, just because you don't want it to be a wide trough and latency obviously matters in this. The big part of it is feeding in the FAQs. It's really about prompting at that point to make sure that it doesn't go outside of the rails of whatever you've given it within the context window. That's how we've been keeping it working. The second thing that we've invested a decent amount into, again going back to my earlier statement about when customers actually care about these voice agents, which is when the calls are being made, is monitoring. We've decided as a team, again with the legacy-based industry that we're servicing, that we're actually not trying to focus on deployment, like making it so that it's self-service deployment. We'll handle that. We'll take care of that because it's the 80-20 rule, but where we're going to push it back onto the client to manage and take care of is monitoring and QA. Making sure that we build up as much on that front as possible so that they can do natural language querying against their data to see, hey, what was said during calls? What were the tendencies of calls? They can do all that. That's freely available. They can go ahead and monitor it. We add those in and they can point out pieces to us. They do call reviews too, independently, to tell us if a call went great. Did it meet their standards? Did it not meet their standards? We haven't had anybody, knock on wood, complaining about hallucination or anything. We try to put that back onto the customers more and more now.

[00:13:28] Speaker 1: Luca, we heard a little bit about reliability, latency, redundancy. I'm curious, any perspectives from your side around how some of these things go into the way that we're training and thinking about bringing these models to market at Assembly?

[00:13:39] Speaker 4: Good question. From the R&D perspective, it's very easy to get things wrong. We still get some things wrong, and that's why we have a close partnership with you guys, to iterate over them really quickly. I feel like the biggest part is just spending a lot of time with customers and understanding how your models perform, and just going back to first principles and then designing systems. The thing is, sometimes we really want to have some very elegant and cool solution. Hey, why don't we do speech to speech right away? We wanted to do that in the beginning because it's pretty cool. But at the end of the day, it always just comes down to what is the simplest way I can answer this problem. We have published a few papers. The technology we used, some of it is very, very novel. Some of it is actually from around 2015 to the early 2020s, because those methods are very reliable and it's easy to make sure that we know the performance of the systems, but we also know all the bad things they can do, and that amount is very, very small. It's a very iterative approach we have to take, and unfortunately saying no to cool things, that's the way we can ensure reliability from our perspective.

[00:15:24] Speaker 1: I'll switch gears a little bit. One of the things from the voice agent report was around what are people prioritizing when they're actually building voice agents and some of the things that they're evaluating. Some of the things people talked about were speech to text accuracy, conversational understanding, latency, integration capabilities, background noise detection, human sounding voices, accents and dialects, the list could just keep going on and on. But I think what's interesting is one of you is an outbound use case, one of you is an inbound use case. Maybe you could talk to the audience a little bit like how for your particular use case, some of those different characteristics matter more for your type of customer at the end of the day.

[00:16:01] Speaker 2: So outbound, there's really two things that I think our customers really care about from a voice perspective: voice quality and latency. I remember our early days, so just for previous history, we were a consumer app. We actually did a voice agent for consumers to call collectors back in March of 2023 when we came out of YC, and it was a seven-second latency. It was actually kind of great, because collectors were getting pissed off because they're like, hello? But whatever, we were a consumer app, right? But as we progressed through this, I remember when we pivoted over and we launched in B2B for banks and credit unions, we had like a three-and-a-half-second latency and customers were pissed. They were angry because we were at three and a half seconds. Now we're sub-1.6 seconds all in, going through Twilio, getting it all the way through, not just from a model execution perspective, and they love it, right? So they care about latency, one. Within that, they care about time to first token, right? Like when's your bot actually going to reply back, because they care about that first initial pickup as part of it. The second thing that they care about is the voice quality. They want it to sound very, very conversational. So I think it does matter who you pick from a voice provider perspective. Some have performed better than others. I will say Deepgram has a voice that has done really, really well for us, and we kind of give our options to our customers of who they want to go and utilize. We are starting to now leverage Rime a little bit more too, just to kind of give an idea of the vendors that we're looking at and working with. But really those are the two key things that they care about: how quickly is the voice agent replying back, from an overall perspective but also from that first part of the conversation, and how good is the voice, so that it actually sounds conversational. Then there's all the other stuff too, right? Like we were having a side conversation around background noise and how it's been impacting some of our results lately. I mean, those are things that clients don't even think about during calls. Those are all kind of added things that we care about. Clients don't even realize it or care about it or know about it.
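
For readers who want to instrument this themselves, here is a small sketch of one way to break per-turn latency into time to first token and time to first audio; the stage functions are placeholders for the real pipeline steps, and the thresholds mentioned in the panel (for example sub-1.6 seconds all in) are goals, not anything this code enforces.

```python
# Instrument one conversational turn, splitting out time to first token from
# total turn latency. The stage callables are hypothetical stand-ins.
import time


def measure_turn(stt_final, llm_stream, tts_first_audio):
    """Return a latency breakdown (seconds) for one conversational turn."""
    t0 = time.monotonic()                  # caller stopped speaking (end of turn)

    transcript = stt_final()
    t_stt = time.monotonic()

    first_token = next(llm_stream(transcript))
    t_first_token = time.monotonic()       # "time to first token"

    tts_first_audio(first_token)
    t_first_audio = time.monotonic()       # first audio ready to play

    return {
        "stt_final": t_stt - t0,
        "time_to_first_token": t_first_token - t0,
        "time_to_first_audio": t_first_audio - t0,
    }


if __name__ == "__main__":
    breakdown = measure_turn(
        stt_final=lambda: "hello",
        llm_stream=lambda text: iter(["Hi", " there"]),
        tts_first_audio=lambda token: None,
    )
    print(breakdown)
```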

[00:18:05] Speaker 3: I agree with all that. I think that we tend to show our clients a transcript. They can always go listen to the call, but most of them don't. I think, again, when you're very early on and your client tests your agent for the first time, they're going to be hypersensitive to latency and to how the voice sounds. But when you're thousands, tens of thousands, hundreds of thousands of calls in, they're not going to be listening to these calls. They're going to look at the transcript and they're going to ask, did this, sorry about this echo, did this, like, do my business objective? Did it have the right conversation for my business? So again, you can absolutely tackle both sides. If anything is really wrong, like if your latency is too long, the conversation is going to be off the rails, because the person's always going to be like, hello, hello, hello, are you there or not? Right. I mean, if the voice is really bad, they're going to be like, you sound like a robot. But if they're not saying those things, then we try and keep our focus on, again, whether we conformed to their requirements and had the conversation they would want a human to have.

[00:18:59] Speaker 1: Maybe diving a little bit deeper on like this idea of like human QA, what measures success? I'm sure like your customers define this differently maybe than you internally. Like maybe how are you using different like qualitative and quantitative metrics to measure the success of these calls and improve your voice agents over time?

[00:19:16] Speaker 2: Yeah. This is, it's actually really interesting when you're going into a market where, again, and we were just having a pre-conversation about this, but when it's the CFO of a bank that's never done outbound calls, he certainly wants all calls to be a hundred percent perfect, and you can't even get that with humans. Right. So that's kind of a little bit of an uphill battle, where he doesn't even care about the quality. He cares about the fact that there was a double-talk, or he cares about the fact that the voice agent didn't reply because it couldn't hear the person because of poor cell service. Right. Like those are things that they don't really care about. So one part goes back to what I said earlier: it's really important to set the baseline of, what do you guys define as success? What is the adherence of what you consider poor? Like for us, we try to tell them, if you think a poor call is because it didn't sound the way that you wanted it to sound, it shouldn't be considered a poor call. A poor call is, is this a reputational risk for your business or not? Having those definitions matters so that you can set the expectations with them, especially if they've never done these before. I'm not even talking about AI calls, just outbound calls in general. That's one part of it. Right. A lot of the qualitative stuff that we started to do now is, or I'm sorry, the quantitative stuff that we've actually started to do now, has been measuring out how many technical issues we're dealing with on calls. Now my dev team, including Julian, who's in the back, wants to kill me because of that, but we want to measure it so that we can see, of all of our connected calls, how many of these are things that we can actually address versus not address, so that we can go in and say, for every single dev cycle that we do, here's how we're going to attack this, because quality matters now, right? More than ever. Like it's still sexy. There's a lot of space where people still don't know enough about this, but we're focusing in on how many of these are actually quality calls. The other big measure that we have internally, and we share these with clients and they love this, is how many of these calls are ending in a natural goodbye. So regardless of whether there was a double-talk or a long delay, from that perspective, you can go out and show them: hey, this call ended in a natural goodbye, like a human being had that same conversation. So really from a quality perspective, it met the bar. And so we've been measuring that, and that's been a big, big sticking point to prove that these calls are actually going well. That's pretty smart. I wish I could take credit for it, but it was Julian and Jay.

[00:21:49] Speaker 3: One of the coolest ones I ever saw was we had some calls where the customer would thank us at the end. And I was reading through this call like, this is horrible, we've got to fix this bot. And then they're like, oh, thank you so much. I'm like, really? So I totally echo that as a good one. Um, I have a terrible non-answer here, which is that honestly I look at revenue. If you're getting kind of all their calls and they're paying you more money, they're happy. And if they're paying you less money or you're not getting all their calls, they're not happy. So I think that all these other things can help and are, like, more sophisticated answers, and you should totally go look at those. But you also kind of got to start from that truth too.
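
As a sketch of the call-level metrics described above (natural-goodbye rate and technical-issue rate), assuming the per-call flags come from some post-call grading step that is not shown here:

```python
# Aggregate two call-level QA metrics: what fraction of connected calls ended
# in a natural goodbye, and what fraction hit a technical issue. The per-call
# flags are assumed to come from a post-call audit (LLM or rule based).
from dataclasses import dataclass


@dataclass
class CallAudit:
    call_id: str
    connected: bool
    natural_goodbye: bool
    technical_issue: bool


def qa_summary(audits: list[CallAudit]) -> dict:
    connected = [a for a in audits if a.connected]
    if not connected:
        return {"connected_calls": 0}
    return {
        "connected_calls": len(connected),
        "natural_goodbye_rate": sum(a.natural_goodbye for a in connected) / len(connected),
        "technical_issue_rate": sum(a.technical_issue for a in connected) / len(connected),
    }


if __name__ == "__main__":
    audits = [
        CallAudit("c1", True, True, False),
        CallAudit("c2", True, False, True),
        CallAudit("c3", False, False, False),   # never connected, excluded
    ]
    print(qa_summary(audits))
```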

[00:22:24] Speaker 1: All right. I'll end with one last one before we open up to the audience for questions. I'll start with you on this one, Luca. What are you most excited for in 2026 around voice AI?

[00:22:37] Speaker 4: Making our customers happy. Obviously. Well, um, honestly we are really close to solving a lot of the foundational issues that we have identified so far, and it takes a lot of time and a lot of resources to do so. But again, we're really thankful for all of our partners because they're working with us and helping us identify those things. I would say the most exciting part is definitely just having a system that is really tailored for specific use cases, like voice agent use cases, for example, or what we call conversational intelligence, which is more note-takers or medical transcription, NBN devices and so on. Um, yeah, we are putting a lot of resources in that direction and we are working on some cool stuff as well. Hopefully it works. If it doesn't, you know, we're going to make existing systems work better. Um, so yeah, I know that didn't sound extremely exciting when I look back at it, but yeah, everybody's really excited about this.

[00:23:52] Speaker 1: Yeah. I mean, the reality is there's so many of these use cases, right? Like building models that are purpose built for some of these specific use cases means they're more context aware, better accuracy, all that fun stuff at the end of the day.

[00:24:02] Speaker 4: So exactly. And unfortunately this field, as you know, comes with a lot of good things, but also some hard parts: if you want to improve on one thing, you kind of have to trade off something. An important question we have is, are we okay with trading off something? And what is the thing we want to trade off to perform really well in a specific task?

[00:24:29] Speaker 2: I can. Have you guys all seen the meme by Andrej Karpathy about how technology has kind of been embraced? Like the internet was started by the government, then B2B, then finally consumers did it. And then it's the reverse order with generative AI and LLMs. So I think what's really cool about voice is, number one, whether people like it or not, there's this undercurrent that's being driven by consumers that are embracing voice more and more. Right? So if we think about it, every commercial now that Android does is talking directly to Gemini, right? Apple, whatever, like whatever they're doing with Siri, whatever's going to happen with it. But like, crazy for me, three months ago, walking into our New York office and seeing one of our founding engineers talking while coding, like, come on. Like I wasn't expecting that. Right? I fundamentally believe in the idea, biased as it is, that voice is going to become more and more part of how consumers, especially in the banking world, are going to interact. Like there's no reason that the UI should be the traditional mobile banking app for how you actually communicate with your bank or try to get an answer. It's going to be through voice, right? So I just think, just beyond 2026, voice is going to be more and more a part of regular consumers' lives. I mean, look at Alexa commercials now, right? With Pete Davidson. They're pumping this down consumers' throats. So consumers are going to demand that this is the way that businesses interact back. So I'm just really excited about all the different interaction points that are going to happen as a result of voice with AI included in it.

[00:26:07] Speaker 4: I want to add like a quick thought on that. Like, you know, probably everybody has a like annoying friend who always sends the voice notes. It's like, just text me. I feel like everybody wants to be like that when it comes to interaction with computers, including myself. It's just simpler than, you know, we're lazy to type. That's where it leads to. Yeah.

[00:26:29] Speaker 2: I mean, Mark Zuckerberg said this to, I forget the CEO, I think it was the Databricks CEO, at their conference last year, LlamaCon, where he said, I think, that 97% of all interactions right now, of their communications, are done via typing. And he's like, that's not the way that we normally communicate with one another. Obviously texting still happens from time to time, but when it's important, you call, you talk to people. So they're making a fundamental bet that voice is going to be a bigger factor in people's daily lives. So that's what I mean: the undercurrent is that consumers are going to force businesses to change the way that they're able to interact.

[00:27:07] Speaker 3: That was really deep insight, Paul.

[00:27:08] Speaker 2: I've been thinking about this answer for a long time.

[00:27:10] Speaker 3: I don't got that. Sorry. What I do think is pretty cool is that two years ago, only this visionary would have been in this room, right? Like none of the rest of us were making voice agents, and now there's like a hundred people here. And so that's really neat. I mean, I hope we continue to build cool things. I hope you all continue to build cool things. And I think it's going to just be really fun in the next year, to bring it back to your question, trying to see what everyone in this room and elsewhere is able to build. And it's all still so in its infancy.

[00:27:37] Speaker 1: Any questions from the audience?

[00:27:41] Speaker 5: What approaches do you take to minimize interruptions from the agent and to adapt to different kinds of callers?

[00:27:51] Speaker 3: So I would say that if the caller interrupts, there's a couple of things I would say here. So the ones that get you in trouble are kind of like the false starts or the utterances where you probably shouldn't have talked. And I think that in general there, you know, the trade-off is really that you have to wait longer, right? Obviously, if your latency is bad, like if it actually had been long enough that you should have known they had stopped talking, but your transcription was late, then you're in trouble. But if you just start being too aggressive in how quickly you talk after they talk, the only option is really to, like, talk slower. The converse of that is if they interrupt you, that one's pretty easy. You just kind of shut up. I mean, let them go.

[00:28:35] Speaker 1: Yeah.

[00:28:36] Speaker 2: I mean, most models are pretty good at the, hey, if they start talking, interruption takes place. To Craig's point, it's really more the end of thought. Like it's the ums, the uhs. I think a lot of models have gotten better at being able to embrace it. What Craig said is spot on. It's like, you've got to find the right timing, whether it's, I'm making it up, but 500 milliseconds versus a full second, before you actually allow the voice agent, if it thinks that the end of thought has actually taken place, to actually send a response back. But it really is a moving target. Like we've seen it vary client to client, where, you know, we've got life insurance companies that we work with. That's an older crowd. They talk very, very, very slow. Right? So you've got to adjust the end of thought on that so that it doesn't interrupt and confuse the individual too. So it's kind of a moving target.

[00:29:28] Speaker 3: It often can be slower than you think, right? I mean, if you look at just natural human speech over the phone, you know, I think when people say very quick things, like if you ask me a question and I say no, you're probably going to start talking pretty fast. But if we had like a full, fully coherent thought, a second, even up to two seconds, is not uncommon for just live human beings. So I think, you know, we all kind of contort ourselves sometimes trying so hard to be so fast, and again, this is one of the things that people when they're piloting are probably very sensitive to, but I feel like I never read a conversation where the person, like, churned out of there early because it was two seconds. So I think it's almost like you have room to chill out, and you just got to take it.

[00:30:10] Speaker 2: I think the other thing too is, it's easy to over-index on the idea that you have to have it perfect. Ultimately, for every single call, whether it's inbound or outbound, did you complete the task that you were meant to go do? That's the ultimate side of it. Right? So I think that's one thing to just remember as you guys are building this piece out: we all want perfection, that's what we all strive for, but level-set the expectation that, you know, again, did it accomplish what you intended it to do? Yeah, maybe there was some redundancy in the conversation, and that was okay, but it accomplished what you needed it to do. So it is a moving target. I can just tell you that we're still playing around with it on a constant basis.
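
One way to picture the adjustable end-of-thought threshold being discussed here: a configurable silence window per audience, with a longer wait for slower speakers. The threshold values below are illustrative only, not recommendations from the panel.

```python
# Wait for a configurable silence window before letting the agent respond,
# with a longer window for audiences that speak more slowly.
import time

END_OF_THOUGHT_MS = {
    "default": 700,
    "life_insurance_older_callers": 1500,   # slower speakers: wait longer
}


def should_respond(last_speech_ts: float, audience: str, now: float | None = None) -> bool:
    """True once the caller has been silent longer than the audience's threshold."""
    now = time.monotonic() if now is None else now
    threshold_s = END_OF_THOUGHT_MS.get(audience, END_OF_THOUGHT_MS["default"]) / 1000
    return (now - last_speech_ts) >= threshold_s


if __name__ == "__main__":
    t = time.monotonic()
    print(should_respond(t - 0.8, "default", now=t))                       # True
    print(should_respond(t - 0.8, "life_insurance_older_callers", now=t))  # False
```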

[00:30:50] Speaker 6: Hi, I'm Yua Liu. We're building an in-store salesperson. So one of our hypotheses is that we want to test different voices to see how they drive conversion results. We're close to getting into the stores. We're launching our pilot sometime next month. But I'm just curious if you guys have experience or have done A/B/C/D testing of different voices on driving your business results, and also personalizing the voice, right? Basically it's not one right voice for all of your customers, but different voices for different customers, male or female, right? Someone who sounds younger, sounds older, all that stuff. And also related is the vocabulary you use in your conversation, right?

[00:31:53] Speaker 2: I do. It's an idea I keep on floating to our dev team, but nobody will pick it up from the grab bag. So thank you for stating this, because he's right there. But I do think it's a big piece, because we have clients that are specifically in the South, right? And there's a Southern drawl that they may have, or in the Midwest. I'm a fast talker, right? Versus somebody who's here in New York, right? Like there's different ways that we kind of communicate. So one of the things that I want to accomplish is exactly what you said, which is using some baseline demographic data to determine, based on age even, the voice that's actually going out there. We haven't tried it yet, but our customers have said that this is something that they would certainly want to adopt and want to do, just from a baseline perspective. We've tested out male versus female, just at a very, very baseline level, and female voices have been performing way better for us than male voices have, just quality-wise, response-wise, for all of our clients. Yeah. And just the quality of the voice, but then even the length of the conversation tends to be better with the female voices that we have. Again, no science really behind it. We've done a few different A/B tests of voice types, but beyond that we haven't really gotten too far yet with it.
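
A toy sketch of what such voice A/B testing could look like: assign each callee to a voice bucket deterministically (so repeat calls stay consistent) and tally conversion per voice. The voice names and the success criterion are made up for illustration.

```python
# Deterministic A/B assignment of voices by callee, plus per-voice conversion.
import hashlib
from collections import defaultdict

VOICES = ["voice_female_a", "voice_female_b", "voice_male_a"]


def assign_voice(phone_number: str) -> str:
    digest = hashlib.sha256(phone_number.encode()).hexdigest()
    return VOICES[int(digest, 16) % len(VOICES)]


def conversion_by_voice(call_results: list[tuple[str, bool]]) -> dict[str, float]:
    """call_results: (phone_number, converted) pairs from completed calls."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for phone, converted in call_results:
        voice = assign_voice(phone)
        totals[voice][0] += int(converted)
        totals[voice][1] += 1
    return {voice: wins / n for voice, (wins, n) in totals.items()}


if __name__ == "__main__":
    results = [("+15551230001", True), ("+15551230002", False), ("+15551230003", True)]
    print(conversion_by_voice(results))
```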

[00:33:12] Speaker 7: Hello. My question is, you mentioned that right now you're working on the outbound calls, but what does the inbound future look like? Or can you map it out, if there's already a map?

[00:33:28] Speaker 2: Yeah, there is. So that's actually the thing that we want to focus in on this year, because, again, funny enough, Craig and I were talking about this: in our space, at least, there's about 35% saturation in the banking and credit union space for traditional IVR AI-based solutions that are focused in on inbound. We took that purpose-built approach of saying, we're going to go do outbound because nobody's doing outbound. So we're kind of laying that groundwork. The beauty of this for us is we're showing ROI quickly. So there's a belief system that is being built by us. Now we're getting asked by our clients, why aren't you going into inbound? So that's kind of a beautiful thing. The door is open right now. They're asking us to say, hey, why aren't you doing this for us? So it is the best question.

[00:34:17] Speaker 1: Yeah.

[00:34:17] Speaker 2: So it's been great because, again, the results are pointing towards the inbound side. I do think one of the fundamental approaches that we have taken, though, is to say that a lot of the reason why traditional IVRs, beyond the technology piece, have kind of failed has been that there's been no solid knowledge base, knowledge center, knowledge management. Obviously with gen AI now it gets much, much better. And it's funny, because Julian and I were just talking about this on the train ride up here, the idea that one of the issues is you can use gen AI for knowledge management for your inbound agent to reference back to, but if the document is out of date, if the information is not relevant, or if it doesn't tie back to the answer that the consumer wants, how do you know that and how do you fix it? So the reason why we haven't jumped full-fledged into inbound, this is a long-winded answer, is we've now introduced the knowledge base for their internal teams to use in their contact centers so that we can see how solid their documentation is before we introduce inbound. Then we train it off of that.

[00:35:21] Speaker 1: Yep.

[00:35:21] Speaker 2: So for outbound, how do you guys determine if it's a voicemail or a human? And how good are you at detecting that? This was like our entire conversation over there, right? Like, first off, voicemail detection sucks right now. Like everywhere, every vendor has been bad at this, right? So we've struggled with it. We have false positives all the time with it. Honestly, I wish I had a good answer for you on this. I really don't. I don't know, Craig, you guys have done the...

[00:35:59] Speaker 3: We run this as a business. So, like, we're at 3 million calls a month of mostly outbound and mostly hitting voicemails. So I think if you see enough voicemails, you can tell a voicemail, but I think that, yeah, absolutely, to Blessin's point, the out-of-the-box AMDs are pretty naive. They generally just, like, wait for N seconds, and then if it has spoken that long, declare it a voicemail. They're pretty bad.

[00:36:20] Speaker 2: I will say that the way that we see it is more that it thinks it's a human. So it does its regular conversation, because the only scripting that we do is on voicemails too. And so we don't really run into the issue, at least from a QC perspective, quality check perspective, where it thinks that it's a voicemail and it's leaving a voicemail message. It's the other way around, where it thinks that it's a human, and you can hear it, and it's like, hey, you know, Jake, blah, blah, blah, blah, how are you doing? You know, so it does that rather than leaving the voicemail. So I would say the false positive is it thinks it's a human rather than a voicemail, which you would probably rather have than the other way around.

[00:36:58] Speaker 3: But you see that if you call into a business, because you'll get, like, hello, this is Craig with AssemblyAI, how can I help direct you? And then the thing will say it's a voicemail. The thing you get for a false human is like, if I say Craig Benoit, beep, your transcription vendor doesn't tell you beep. And so you hear two words and you're like, oh, that sounds pretty human to me.

[00:37:15] Speaker 4: It's pretty funny, actually. I remember, I think it was you, right, Ryan? So I'll let Ryan elaborate on it a little bit more, but he was showing us a demo a few days ago, and it was like, oh, actually a lot of people have the problem of classifying, is this a voicemail or is this not? And he kind of orchestrated our transcription and the product we have, LLM gateway, where we can just hit a bunch of LLMs with a lot of requests, to get a pretty decent accuracy on classification. But obviously, if you see 3 million of them, you're probably going to be pretty good at voicemail detection. Is it based on just the text that you transcribed, or are you doing actual sound processing?

[00:38:03] Speaker 1: Yeah.

[00:38:04] Speaker 3: So we've done a little bit of both. I mean, I think you can do okay with both sides. I think the temporal information is pretty, it depends, like if you want to be fast, you want the temporal information, right? And the LLM and the transcription timestamp alignment is so-so, let's say, charitably. So I would try and, like, align those.
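
To illustrate the transcript-plus-timing approach to voicemail detection, here is a toy classifier over the opening utterance; real systems would more likely use an LLM or a trained model, and the phrase list and thresholds here are invented for the example.

```python
# Classify the opening utterance of an outbound call as voicemail vs. human
# using its text plus simple temporal features, rather than a fixed N-second wait.
VOICEMAIL_PHRASES = (
    "leave a message", "after the tone", "not available", "voicemail",
    "leave your name and number",
)


def classify_greeting(text: str, duration_s: float, ended_with_silence: bool) -> str:
    lowered = text.lower()
    if any(p in lowered for p in VOICEMAIL_PHRASES):
        return "voicemail"
    # Long, uninterrupted declarative openings are typical of voicemail greetings;
    # humans usually say something short ("Hello?") and then wait.
    if duration_s > 6.0 and not ended_with_silence:
        return "voicemail"
    return "human"


if __name__ == "__main__":
    print(classify_greeting("Hello?", 0.8, True))                                       # human
    print(classify_greeting("Hi, you've reached Craig, leave a message.", 4.0, False))  # voicemail
```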

[00:38:23] Speaker 8: So on the technical side, how do you manage that? So let's say a call dropped in between, right? So how do you know, on the technical side, what percentage of calls went through, or how do you check the overall accuracy?

[00:38:40] Speaker 3: For calls dropping specifically or for which, which failure mode are you thinking about?

[00:38:48] Speaker 1: So let's say a transcript is generated, right?

[00:38:50] Speaker 8: So do you do analysis of that and check the accuracy, or do you just look at the calls as well?

[00:38:57] Speaker 3: We don't do transcript-level accuracy checking. I think that, you know, spot-checking-wise, these folks are probably 90, 95% accurate, I would say, just off the cuff. I mean, it's pretty good. And honestly, when I listen to these calls, sometimes people have weird accents and I have no idea what they're saying either. So, I mean, everyone's doing their best. It's a noisy environment. So in general, I think I have never seen it go so haywire that I feel like I need, you know, QA. It's also not so reliable that I think you should build your system assuming it's perfection. But I haven't found value in quantifying that. So yeah, I don't really do much robust QA there.

[00:39:36] Speaker 2: Yeah. We're not doing any of the transcription accuracy pieces either. I mean, they've been really good with it. Actually, just today we were doing some testing and Julian was purposely speaking fast and I couldn't even understand what he was saying, but you guys were able to pick it up. So kudos to you guys on that. Yeah.

[00:39:51] Speaker 8: The call can go off the rails, right? So let's say a call is going on. How do you check that it's going on perfectly and not just going somewhere else?

[00:40:05] Speaker 1: Yeah.

[00:40:05] Speaker 2: During the live call, there's really nothing. I mean, we have alerts and things to know if a volume of calls is outside of our averages, to alert the team to say, hey, this needs to get looked at, or to alert the customers on it. We do post-call grading and auditing of every single call by bringing in the transcriptions to review a few different things. One is, again, we're in a highly regulated space, so we do check up against certain regulations to see whether or not it went off kilter. The second thing, which I kind of mentioned around technical issues, is based off of the transcriptions that we receive, and we run them through another LLM that does the auditing and the grading so that we can provide defined reports back out to our clients. And then the third thing is, again, we open it up to our clients, where we've got a natural language querying tool on our dashboard where they can ask a question and say, hey, how many calls ended up where the customer was starting to swear? You know, if it's 15 calls, here's what it is, here's all the calls that are being pulled out. So we do the post-call grading, nothing live. There's the alerts that go out to the team if something's going haywire.
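
A minimal sketch of that post-call grading loop: every transcript goes through an LLM with a fixed rubric and the structured result is stored for reporting. `call_llm` is a hypothetical placeholder for whatever model endpoint is actually used; the rubric fields echo the metrics discussed above.

```python
# Run each call transcript through an LLM with a fixed rubric and return a
# structured grade suitable for dashboards and client reports.
import json

RUBRIC_PROMPT = """Grade this call transcript. Answer in JSON with keys:
met_objective (bool), compliance_issue (bool), technical_issue (bool),
natural_goodbye (bool), notes (string).

Transcript:
{transcript}
"""


def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an actual model here.
    return json.dumps({
        "met_objective": True,
        "compliance_issue": False,
        "technical_issue": False,
        "natural_goodbye": True,
        "notes": "Caller confirmed card activation.",
    })


def grade_call(call_id: str, transcript: str) -> dict:
    raw = call_llm(RUBRIC_PROMPT.format(transcript=transcript))
    grade = json.loads(raw)
    grade["call_id"] = call_id
    return grade


if __name__ == "__main__":
    print(grade_call("c42", "Agent: Hi ... Caller: Thanks, goodbye."))
```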

[00:41:09] Speaker 3: What I'd add is, I think in general when you have kind of non-terminal conversations, it's kind of the customer's fault. Like they probably were engaging this bot in a weird way. It does totally happen. I think that, you know, your customer is concerned about this from, almost like I was also saying, a reputational perspective: what are you guys talking about? Like, you were not supposed to be having this conversation. In general, our preference, what we honestly do, and this is what we kind of sell to our customers so they can sell it to their customers, right, we kind of run this as middleware, but it's really to try and be more structured or directed, so that this thing is not, like, responding with anything ever. It's like, I'm going to try up to twice to get this information, and if I don't get it, I'm just going to kind of move on, because at the end of the day, you don't really want to engage in free-flowing conversation often; you have, to this point, a very concrete business goal you're trying to accomplish, and you kind of want to be on the guardrails of that. And I think that being explicit about that in what you ask the LLM to do, and what you even allow the LLM to do, can be helpful in reducing that variability.
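
A small sketch of that "ask a bounded number of times, then move on" structure, with `ask_caller` standing in for one agent turn plus the caller's reply; the field names and the two-attempt limit are illustrative.

```python
# Collect each required field with a bounded number of attempts, then move on,
# so the conversation stays directed rather than free-flowing.
MAX_ATTEMPTS = 2


def collect_fields(fields: list[str], ask_caller) -> dict:
    collected: dict[str, str | None] = {}
    for field in fields:
        value = None
        for _ in range(MAX_ATTEMPTS):
            value = ask_caller(f"Could you give me your {field}?")
            if value:
                break                      # got it, stop asking
        collected[field] = value           # may be None: move on rather than loop
    return collected


if __name__ == "__main__":
    replies = iter(["", "1990-01-02", "last four 1234"])
    print(collect_fields(["date of birth", "account number"], lambda q: next(replies)))
```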

[00:42:11] Speaker 1: Time for one more? Anyone? Going once, going twice. All right. Wait, there's one in the back. One more, last one. And then we'll move on.

[00:42:21] Speaker 9: So, one of my questions is, it seems like a lot of the RLs that you implement have a linear approach, but have you looked into using a secondary agent or model in real time in a more high-stakes environment? So maybe you have additional filters or additional methods in place that might have slightly higher latency, and then you can buffer the audio on the base model? Or is that an approach you guys have looked into? Because we're looking into speech-to-speech for the lower-stakes part of the call, but maybe we can do something below, like I know we've discussed it up front, and try having it route to a model that might have, I guess, less risk, like more of a larger architecture, because there is a risk of it going to the speech-to-speech model. Because there are certain ways you can use it, kind of buffering audio in cases when you're looking at speech-to-speech. And the other piece was, have you looked into multiple ASR models in conjunction in real time? Yeah, so, sorry, are you saying transcription models? Yeah, so in real time, one of the approaches we're looking into is finding a secondary model that's almost like a supervisor agent, so in real time, oftentimes there might be certain limitations of the base model that we're leveraging. So the thought here is that there are certain compliance interactions that we can fine-tune the secondary model for, if you don't want the base model to be overwritten.

[00:43:45] Speaker 3: So, I have certainly not built a system like that. I mean, I think that we have a lot of redundancy for transcription, things like that, but I think that in general, in my experience, it's just been to use the best model with the best guardrails, and if I'm too worried about something going haywire, I will try and just not admit that as a possible thing that could happen, rather than trying to orchestrate a second independent system on top of it. But it's not that I've tried it and it didn't work; I just have not tried it.

[00:44:11] Speaker 2: Yeah, I mean, the way that we kind of do the fixing is not during the call itself or having a fallback or any of those pieces. I mean, beyond what we've talked about from a parallel redundancy perspective, it goes back to that post-call auditing, grading, providing that feedback and providing input back into the original voice agent is kind of the loop that we've started to build. The secondary thing that we've started to do is providing a coach, we call it internally. I don't know what we call it externally yet, but it's going to be a bird name. It's an aviary, right? So, it basically provides coaching on what can the voice agent do better during those calls. So, for us, it's less important to fix it during the call itself. It's more to identify and then say what are the plans to fix it later on. I don't know if that helps.

[00:45:03] Speaker 1: Well, I appreciate all of the great insights. Thank you to our panel. We'll give them a round of applause. Everybody, we have pizza, drinks in the back. We'll be around until about 9. And yeah, network, enjoy. Thank you again for coming out.

AI Insights
Summary
A panel hosted by AssemblyAI discusses real-world lessons from deploying production voice agents. Aviary AI focuses on outbound voice agents for financial services (welcome calls, reactivation, collections), emphasizing alignment on what “success” means, rapid ROI, and scaling from a few to many call types. Trellis, a voice company with outbound dialer roots, highlights that successful deployments require serious, high-volume use cases with clear metrics and strong customer opinions about wording; they prefer scripting to reduce LLM variability. Both companies stress reliability via redundancy across ASR, TTS, and infrastructure, caching predictable responses, and recognizing that speech-to-speech models currently struggle with instruction adherence and “dumb” responses. Key priorities differ by context but commonly include low latency (especially time-to-first-token) and natural voice quality; at scale, customers care more about transcripts and business outcomes than voice aesthetics. QA approaches include post-call grading/auditing, compliance checks, measuring technical issues, and metrics like “natural goodbye” rates; live-call intervention is limited beyond alerts. The discussion covers turn-taking and interruption handling (waiting longer, adapting end-of-thought thresholds by caller demographics), A/B testing voices (female voices performing better in one case), challenges in voicemail detection, and perspectives on 2026: tailored, reliable systems over flashy approaches, and increasing consumer-driven adoption of voice interaction.
Title
Building Production Voice Agents: Reliability, Guardrails, and ROI
Keywords
voice agents, production deployment, outbound calls, inbound calls, financial services, latency, time to first token, redundancy, ASR, TTS, caching, scripting, guardrails, post-call QA, call auditing, compliance, voicemail detection, turn-taking, interruptions, A/B voice testing, speech-to-speech models
Key Takeaways
  • Pick use cases where the business already operates at scale and cares about metrics; “serious” customers have precise requirements.
  • Define success during onboarding (business objective, compliance, reputational risk) and measure against it.
  • Reliability requires redundancy across vendors/components; assume failures and latency spikes will happen.
  • Prefer scripting/structured flows for high-volume standardized interactions to reduce LLM risk and whack-a-mole guardrails.
  • Caching common responses can reduce latency and variance while improving consistency.
  • Speech-to-speech is appealing but currently suffers from weak instruction adherence for many production needs.
  • For outbound, customers strongly prioritize low latency (especially time-to-first-token) and natural voice quality.
  • At scale, customers often review transcripts and outcomes more than audio; business goal completion matters most.
  • Post-call auditing/grading (including compliance checks) plus dashboards for client monitoring are practical QA loops.
  • Turn-taking thresholds (end-of-thought timing) vary by audience; slower demographics may need longer pauses.
  • Voicemail detection remains unreliable; false positives/negatives are common and need careful handling.
  • Long-term success favors tailored, use-case-specific systems and iterative improvement over “cool” but fragile architectures.
Sentiments
Positive: The tone is optimistic and pragmatic: panelists acknowledge current shortcomings (latency, instruction-following, voicemail detection) but emphasize concrete strategies—redundancy, scripting, caching, and metrics—to achieve ROI and improve reliability, with excitement about broader adoption of voice.