Why AI Voice Agents Are Surging in 2025 (Full Transcript)

A live demo and discussion on real-time voice agents, low-latency transcription, workflow-based control, enterprise handoffs, and what’s next.

[00:00:04] Speaker 1: I'm Smitha, a Developer Advocate here at AssemblyAI, and today we've got an exciting session all about building AI voice agents, which are AI-powered assistants that can listen, understand, and respond in real time. And to dive into this, I'm joined by Jordan, the Founder and CEO at VAPI. Jordan, thanks for coming on.

[00:00:24] Speaker 2: Yeah, thanks so much for having me on. I've been excited for this ever since someone on the team told me that you guys were doing these live podcast sessions. So yeah, it's great to meet you.

[00:00:35] Speaker 1: Cool. For those who may not be familiar, VAPI is a platform that is making it easier than ever for users to build, test, and deploy voice agents quickly. And today we're going to talk about why AI voice agents are taking off, the biggest challenges in the space, and of course, we'll do some live demos of VAPI's workflows platform, as well as AssemblyAI's streaming API. But first, let's talk about why AI voice agents are such a big deal in 2025. Jordan, what are you seeing in the industry? Why do you think more companies are investing in AI-driven voice interactions?

[00:01:13] Speaker 2: Yeah, it's really because the models, all of a sudden, and I mean, we saw this trend start maybe a year and a half or two years ago, where the models across the board, the transcription, LLM, text-to-speech stack, they were all getting faster, they were all getting cheaper, and they were all getting more performant over time. And when we started the company, we extrapolated and we were like, damn, if this keeps happening, and all three of these numbers keep moving in these directions, eventually we'll have models that achieve human performance when orchestrated together. And so that was our bet early on. And we knew that if people can talk to stuff like it's human, of course they will choose to talk to stuff like it's human. And so we're now in a place where models can talk like humans. And so that's really the reason why there's been such a wave, because all of a sudden, it's kind of reached the point where it can pass that human Turing test across the entire model stack. And so that's why we're kind of seeing a lot of demand right now.

[00:02:09] Speaker 1: That's actually super interesting. I've also seen a lot of companies using AI voice agents for things like appointment scheduling or hands-free control in smart devices, even real-time language translation. Are there any use cases that you found surprising in the last couple of years?

[00:02:29] Speaker 2: Surprising? Yes. I mean, when we started this company, we never expected customer service and phone calls to kind of be an area where there would just be a ton of value to unlock. But lo and behold, there are like billions of phone calls every year in the U.S. alone. That is a lot of time and money spent and a lot of suffering for people who have to stay on hold for millions of hours. And so, I mean, that's one area where we've seen a lot of interest. But I guess the more niche interesting stuff has actually been for training and coaching people. So think like role play training for call center agents to get them ready to be on the phones, or for salespeople to get them ready to sell, or even just for more entertainment-like applications, like you want to talk to your favorite manga character or something like that. So the whole consumer and training side, that was quite unexpected. But I guess there's obviously a much more obvious value to unlock in that whole enterprise customer service angle. So we kind of serve a bit of both.

[00:03:39] Speaker 1: Nice. I think that definitely makes a lot of sense. And another big thing is that for an AI voice agent to be effective, it needs to not just respond but also listen and process speech in real time. So I'm really looking forward to the demo that you have today, and we're also going to be demoing AssemblyAI's streaming speech-to-text API, which transcribes live audio with high accuracy and low latency, usually within a few hundred milliseconds. So that's a game changer. And let's actually jump into your demo and then we can have a lot more questions later on.

[00:04:15] Speaker 2: Happy to. Yeah. Let me quickly share my screen and I'll kind of walk through the platform a bit. Forgive me, as our dashboard is going to have a lot of demo stuff all over the place. But you'll kind of get the point. So this is the VAPI platform. The way to think about us is we are everything in between all of these models, like the new AssemblyAI streaming models, and actually turning them into voice agents, getting them into production, and then seeing how they're actually performing. So I've made this AssemblyAI assistant using the AssemblyAI streaming API. And for this one specifically, you'll see we have how much it costs across the stack, so how much the transcription costs from AssemblyAI, how much the model costs, and how much the text-to-speech model costs. Same thing with our latency budget. So in a real-time conversational application, you want that latency to be as tight as physically possible so that people don't have time to think after they finish their statement, so it doesn't break the fluidity of conversation. So in this case, across the stack, we're looking at roughly 1,400 milliseconds, and, I think this might be a little off, but roughly for transcription, we're looking anywhere from 100 to 300 milliseconds. For the model piece, in this case GPT-4o mini, about 300 milliseconds. Text-to-speech from a company like ElevenLabs might take another 300 milliseconds. So across the stack, we're trying to shoot for like 1,200 to 1,500 milliseconds. I put together a prompt for this agent, where it's going to role play being an assistant for a dental office. It's going to first ask for my full name, maybe say something funny, request a date and time, and it's going to push all these details from the conversation live to a spreadsheet that I have open. Can you see my screen okay? Yeah, you can. Perfect. All right. So I can configure the transcriber. Like I said, we're using AssemblyAI under the hood, this is the streaming model with English as the language. For the voice, ElevenLabs, and I can pick any model from them as well. And I can also configure tools. So a tool that I've set up for this is the book appointment tool, which is going to hit an automation that then hits my spreadsheet. Here's the spreadsheet that I have open. It's essentially just going to collect my first name, last name, date, time, and the reason for the visit. So I'll just call that now, and we can try following along. Are there any questions from your end before I call the thing?
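For reference, the ballpark numbers Jordan quotes can be sketched as a simple budget. The component figures below come from the demo; the "overhead" line (telephony hops, endpointing, orchestration) is an assumed placeholder added to make the totals line up, not something stated in the session.

```python
# Rough end-to-end latency budget for one voice-agent turn, using the
# ballpark figures from the demo. The "overhead" entry (telephony hops,
# endpointing, orchestration) is an assumed placeholder, not a number
# quoted in the session.

LATENCY_BUDGET_MS = {
    "streaming_transcription": (100, 300),  # AssemblyAI streaming STT
    "llm_response": (300, 300),             # e.g. GPT-4o mini
    "text_to_speech": (300, 300),           # e.g. ElevenLabs
    "overhead": (300, 600),                 # assumed: network + orchestration
}

def total_budget(budget: dict) -> tuple:
    """Sum best-case and worst-case latencies across the stack."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

if __name__ == "__main__":
    low, high = total_budget(LATENCY_BUDGET_MS)
    print(f"Target voice-to-voice latency: {low}-{high} ms")  # ~1000-1500 ms
```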

[00:06:35] Speaker 1: Oh, this looks super awesome. And the fact that you can also select different text to speech providers. Can you also select different large language models? I'm not sure if you covered that.

[00:06:46] Speaker 2: Yes. Yeah, for sure. So we support every LLM from any underlying provider. So for example, if we go to Anthropic, we have all the Claude models, 3.5 Sonnet. Soon we'll have the new 3.7 model on there. For OpenAI, we have all their models, even the real-time API in here as well, so you can talk with the native speech-to-speech models.

[00:07:04] Speaker 1: And for users who are looking to use this, would they have to get the individual APIs for each of these providers, or is that something that they can do within VAPI?

[00:07:14] Speaker 2: They can configure it. So essentially, they can either just use our accounts with the underlying providers, like with ElevenLabs or whatever, where we have preferred pricing, or they can just bring their own API keys. Or if they have a fine-tuned model or something with one of these, or a custom voice, they can then bring their own accounts and then use those voices or models from here.

[00:07:33] Speaker 1: Awesome. Maybe you could run the demo.

[00:07:36] Speaker 2: Yeah, yeah, for sure. Let me just give it a call. So I'll kind of follow along on the prompt as I'm walking through a phone call. Let me make sure I have the right number. Hey, this is Crosshill Dental.

[00:08:18] Speaker 1: Would you like to book an appointment?

[00:08:21] Speaker 2: Yeah, I would.

[00:08:21] Speaker 3: Oh, hey, are you there?

[00:08:22] Speaker 2: Oh, I may have hooked up the wrong model.

[00:08:24] Speaker 3: Give me one sec.

[00:08:26] Speaker 2: Sorry about that.

[00:08:29] Speaker 3: Okay. Sorry about that. No worries.

[00:08:34] Speaker 2: All right.

[00:08:39] Speaker 3: Ah, gotcha. So, like, maybe just a check-up to figure things out. When were you thinking for the appointment?

[00:08:46] Speaker 2: Date and time? Can we do, like, maybe tomorrow at 4 p.m.?

[00:08:51] Speaker 3: Tomorrow at 4 p.m.? Sweet. Let me just get that locked in for you. One sec.

[00:08:57] Speaker 2: Great, thank you.

[00:09:01] Speaker 1: Wow, that was incredible. The fact that you could actually interrupt him as well.

[00:09:06] Speaker 2: Yeah, yeah. So it's kind of gone back and forth. You can see it actually pushed that data now to the spreadsheet. It misunderstood my last name, unfortunately. But you kind of get the point where I can actually interact with tools. I can, you know, talk more naturally. And that was actually a clone of my voice that I made in ElevenLabs. I don't know if you could notice that.

[00:09:22] Speaker 1: Oh, nice.

[00:09:23] Speaker 2: But, yeah. And this demo in particular maybe took me a few minutes to put together. But there's one thing I do want to highlight. In this demo specifically, it just sort of had the prompt and was running the whole conversation based off of this prompt. The problem with prompts as they get longer and a lot more complex is that we usually see them start to go off the rails, because these models tend to hallucinate, especially if you want to use smaller, low-latency models. And so what we've been investing a lot of time in is this idea of workflows. So instead of having a prompt run the conversation, you can now have these step-by-step conversation flows. So, for example, this is that same prompt, but now modeled out in a step-by-step fashion. So it's guaranteed to first confirm that this is Crosshill Dental, then gather this information, then make that API request, then confirm the details. Whereas, like, with a tiny model like GPT-3.5 or whatever, it would usually go off the rails. So it's a much more intuitive and secure way of allowing an agent to actually run business logic. Does that make sense?

[00:10:24] Speaker 1: Yeah. And can users actually build these workflows, or is it a template which is built after they build that initial prompting and, you know, selecting the models?

[00:10:34] Speaker 2: Yeah, so users can build them themselves. We just launched this maybe last week. So all they need to do is create a workflow, attach it to an assistant, and then from scratch they can design their own using our different block types. So, for example, it can say something, it can gather information, like in this case it gathered first name, last name, et cetera. It can make an API request, like pushing to the actual spreadsheet. It can transfer the call to a human, end the call, or use a condition: if this, then that, go to this block, or go to that block. And so these are the primitives that we're starting with, but we're planning on coming out with a whole suite of different kinds of blocks and different integrations that are native in the platform.
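To make the block primitives concrete, here is a minimal, hypothetical sketch of a step-by-step flow like the dental-office demo. It is a toy data model for illustration only, not VAPI's actual workflow schema; the block names and fields are assumptions based on what Jordan describes.

```python
# Toy data model of the workflow primitives described above (say, gather,
# API request, condition, transfer, end). Hypothetical -- NOT VAPI's real
# schema -- but it shows how a fixed, step-by-step flow keeps a small,
# low-latency model "on rails" instead of free-running one long prompt.

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Block:
    name: str
    kind: str                  # "say" | "gather" | "api" | "condition" | "transfer" | "end"
    prompt: str = ""           # what the agent says or asks
    fields: list = field(default_factory=list)   # slots for "gather" blocks
    action: Optional[Callable] = None             # callback for "api" blocks
    next: Optional[str] = None # name of the block to run afterwards

def push_to_sheet(state: dict) -> None:
    """Stand-in for the spreadsheet automation shown in the demo."""
    print("POST /book-appointment", {k: state.get(k) for k in
          ("first_name", "last_name", "date", "time", "reason")})

WORKFLOW = [
    Block("greet", "say", prompt="Hi, this is Crosshill Dental. Would you like to book an appointment?", next="gather"),
    Block("gather", "gather", fields=["first_name", "last_name", "date", "time", "reason"], next="book"),
    Block("book", "api", action=push_to_sheet, next="confirm"),
    Block("confirm", "say", prompt="You're all booked. Anything else?", next="end"),
    Block("end", "end"),
]

if __name__ == "__main__":
    for block in WORKFLOW:
        print(f"{block.name}: {block.kind}")
```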

[00:11:13] Speaker 1: This is awesome. So how can users actually start testing this out? Is this available for free, or what type of pricing is available?

[00:11:21] Speaker 2: Totally. We're actually making some big changes to our pricing soon, which I will not reveal now, but it will mean that a lot more developers will get to use it for longer without paying money. So that's the short version. But users can just log on right now. There's a free credit on the account, so they can just jump in and make hundreds of calls without even having to put down a credit card.

[00:11:43] Speaker 1: Awesome. And do you have any documentation that developers need to get started, or is this very much, like, plug and play?

[00:11:50] Speaker 2: Yeah. So the way to think about our product is, like, this dashboard is 20% of the actual product. The actual product is the incredibly complex platform that we built under the hood. So our dashboard is just one app that's built on top of the VAPI API. The VAPI API has maybe hundreds of points of config so that entire products, like the VAPI dashboard, can be built. So this is how we see people building products, like entire platforms for home service professionals to accept inbound calls, or collections teams to actually do outbound collections calls, that kind of thing. And so it's not limited to just one API call to send a phone call. It's an entire API platform to build voice AI products on top of.

[00:12:37] Speaker 1: Is this also going to be, like, easy to scale in terms of, like, bigger customers who are handling a lot more volume?

[00:12:44] Speaker 2: Yes. Yeah. So there's this idea of concurrency. And concurrency just means how many calls can I take on at the same time? I think Assembly probably has a similar idea for you guys, for how many streams you have live. So similar for us: how many agents can be live talking to real humans at once? At the moment, I believe it's limited to maybe 100 or so. But we're working on some stuff to allow for unlimited concurrency hopefully sometime soon, so we'll be able to actually scale up to whatever the actual customer demand is without any limits or blocks.
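The concurrency idea Jordan describes maps onto a standard pattern: cap how many calls are live at once and queue the rest. The sketch below uses Python's asyncio semaphore purely as an illustration; it is not VAPI's implementation, and the limit of 100 simply mirrors the figure mentioned above.

```python
# Generic concurrency cap of the kind described above: how many calls can
# be live at the same time. This is a standard asyncio pattern, not VAPI's
# actual implementation; the limit of 100 mirrors the figure mentioned.

import asyncio

MAX_CONCURRENT_CALLS = 100

async def handle_call(call_id: int, slots: asyncio.Semaphore) -> None:
    async with slots:                     # wait for a free agent slot
        print(f"call {call_id}: agent live")
        await asyncio.sleep(0.1)          # placeholder for the real conversation
        print(f"call {call_id}: finished")

async def main() -> None:
    slots = asyncio.Semaphore(MAX_CONCURRENT_CALLS)
    # 250 inbound calls, at most 100 handled simultaneously.
    await asyncio.gather(*(handle_call(i, slots) for i in range(250)))

if __name__ == "__main__":
    asyncio.run(main())
```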

[00:13:16] Speaker 1: Awesome. Thank you so much for giving this demo. I would like to ask a couple more questions on, you know, your thoughts on voice agents. Based on what you've seen with VAPI and, like, the type of users you have, who are some of your biggest customers in terms of use cases, who are building with VAPI today?

[00:13:35] Speaker 2: It's a mixture, right? There's kind of those two buckets. The way I like to think about it is, like, new call volume and existing call volume. So existing call volume are companies that currently accept their own phone calls for whatever reason. So think, like, insurance companies, health care companies, travel companies, like, anywhere where you have a slightly older population to serve as well as a younger population. So people are kind of used to speaking on the phone natively. Outside of that, in the other realm, which is, like, powering a bunch of voice products, it's, like, kind of all over the map. You'd be surprised the kind of big name companies right now that are planning on launching, like, voice agent products. I can't name them, unfortunately, because it's all, like, under NDA, et cetera. But there are some very big platforms, including, like, public companies right now that are working on, like, deploying voice agent products for their actual underlying users as well. And it's, like, a mixture of, like I mentioned, platforms for home service professionals or platforms for customer support in your software app to walk you through, like, how to use your app. They'll actually have a voice agent in there to, like, point and click, like an onboarding guide, that kind of thing. So it's kind of a mix right now.

[00:14:47] Speaker 1: It seems like there's a really varied set of use cases that companies are building with this. In my experience with developers, I think one of the trickiest things, which I've seen workflows handle right now, is when people actually interrupt the bot as it's talking back to you mid-sentence. If it doesn't handle that really well, that can lead to a really frustrating user experience. So how does VAPI actually help to solve that? And, like, how did your team navigate that?

[00:15:19] Speaker 2: So there's two components. One is you need something called a VAD, or a voice activity detection model. And it kind of looks at the audio that's coming in and says, hey, does this look like speech? And if it looks like speech for long enough, we're like, okay, time to back off a little bit. So we might lower the volume slightly. Then if you have a transcription model that's fast enough to rely on, like, for example, the new AssemblyAI streaming model, within, I think, what's the latency on that? 300 or so? Yeah.

[00:15:47] Speaker 1: Just under a couple hundred milliseconds, yeah.

[00:15:50] Speaker 2: A couple hundred. That's great. So within a couple hundred milliseconds, if I can then confirm that that sound is a word, then we actually will, like, back off and be like, okay, the user's talking now. And because we have the actual transcription, we can even do super quick additional analysis to see if the user's going, like, uh-huh, right? Where that means, like, you should continue. Or, like, wait, wait, wait, wait. And that means it should actually, like, back off and stop. So that's why, like, super fast transcription for interruptions is actually super critical.
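A rough sketch of that barge-in logic, with assumed thresholds and word lists: the duck/stop behavior and backchannel handling follow what Jordan describes, but none of these values are VAPI's production settings.

```python
# Toy barge-in handler: a VAD flags possible speech and ducks the agent's
# audio; once a fast streaming transcript confirms real words, we decide
# between a backchannel ("uh-huh") and a true interruption ("wait, wait").
# Thresholds and word lists are illustrative assumptions only.

BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "right", "okay"}
INTERRUPT_MARKERS = {"wait", "stop", "no", "actually"}

class Playback:
    """Minimal stand-in for the agent's TTS playback channel."""
    def duck(self, volume: float) -> None: print(f"ducking to {volume:.0%}")
    def unduck(self) -> None: print("restoring volume")
    def stop(self) -> None: print("stopping agent speech, yielding the floor")

def on_vad_speech(duration_ms: int, playback: Playback) -> None:
    """VAD thinks the caller is speaking; back off a little but keep going."""
    if duration_ms > 200:                   # assumed threshold
        playback.duck(volume=0.5)

def on_partial_transcript(text: str, playback: Playback) -> None:
    """A streaming STT partial arrived while the agent was still speaking."""
    words = text.lower().split()
    if not words:
        return
    if all(w in BACKCHANNELS for w in words):
        playback.unduck()                   # caller is just acknowledging
    elif any(w in INTERRUPT_MARKERS for w in words) or len(words) > 2:
        playback.stop()                     # real interruption: stop and listen

if __name__ == "__main__":
    pb = Playback()
    on_vad_speech(250, pb)
    on_partial_transcript("uh-huh", pb)
    on_partial_transcript("wait wait wait", pb)
```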

[00:16:19] Speaker 1: Interesting. And I definitely see where Assembly AI plays a huge part in that. Another challenge that I can imagine is keeping the conversations contextual. Like, if a user references something that they said earlier in a conversation, how do you make sure the agent actually doesn't forget what's happening? Is that something that can be built on top of the LLM?

[00:16:40] Speaker 2: That is tricky, actually. So in the prompt-based assistant that I showed you first, in that one, like, it's a long-running conversation, and so the context is always in the conversation. Just maybe after, like, five or ten minutes, there's so many tokens that the model might get confused and miss things. But if you have something like workflows, which I showed, we're actually now working on the ability to have, like, global state or global memory. So even though that specific step doesn't have any context in what was talked about before, it'll actually have the ability to save things to its memory to then pull up, like, as, like, quick snapshot context later in the conversation.
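As a toy illustration of that global-memory idea, the sketch below shows an early step writing facts to a shared store and a later step pulling a compact snapshot into its prompt. The feature is described as in progress, so this is purely a hypothetical model, not a real VAPI API.

```python
# Hypothetical sketch of "global memory" for workflow steps: each step only
# sees its own local context, but can save facts to a shared store that a
# later step pulls in as a short snapshot. Not a real VAPI API.

class ConversationMemory:
    def __init__(self) -> None:
        self._facts: dict = {}

    def remember(self, key: str, value: str) -> None:
        self._facts[key] = value

    def snapshot(self) -> str:
        """Compact context string to inject into a later step's prompt."""
        return "; ".join(f"{k}={v}" for k, v in self._facts.items())

if __name__ == "__main__":
    memory = ConversationMemory()
    memory.remember("patient_name", "Jordan")      # saved by an early step
    memory.remember("visit_reason", "check-up")

    # A later step builds its prompt from the snapshot instead of the full
    # (and possibly very long) conversation history.
    later_step_prompt = (
        "Confirm the appointment details with the caller.\n"
        f"Known so far: {memory.snapshot()}"
    )
    print(later_step_prompt)
```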

[00:17:19] Speaker 1: Nice. That's super cool. And as I'm building with voice agents, and we've created a lot of tutorials here at AssemblyAI on building voice agents on our YouTube channel, another big challenge I hear a lot about is handling noisy environments. So AI voice agents often have to deal with noisy backgrounds. Like, maybe there's multiple speakers or music playing. And even as a human, that's hard to decipher. So how do you see AI voice agents tackling that?

[00:17:49] Speaker 2: Yeah. I actually would hope that Assembly tackles it for us so we don't have to. Because essentially what we found is, like, background noise cancellation is a very solved problem. They've had, like, 20 years on this thing. Every iPhone has an awesome background noise cancellation model inside of it. The problem arises when you want background voice cancellation. Because these transcription models are all tuned to listen in on every single thing that sounds like voice and transcribe that voice, including the kid in the background or the TV in the background. And so that's why we actually had to deploy a custom background voice cancellation model that works somewhat well but obviously can't catch everything, because it's kind of indeterminate. What is background speech and what isn't? It's kind of hard to tell. So we try to put, like, a filter on it before it goes to Assembly for transcription. And that tends to help. But ultimately, we do need smarter, more promptable transcription models that can be told, hey, there might be background noise in this clip, please ignore it. We need a bit more intelligence on the transcription side.
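The pipeline shape Jordan describes, filtering audio before it reaches the transcriber, might look like the sketch below. The real background-voice-cancellation model is proprietary; a crude energy gate stands in for it here just to show where such a filter sits relative to the streaming STT call, and the threshold is an arbitrary assumption.

```python
# Pipeline-shape sketch: run incoming audio frames through a foreground
# filter before forwarding them to the streaming transcriber. A trivial
# RMS energy gate stands in for the proprietary background-voice model,
# and the threshold is an arbitrary assumption.

from array import array

ENERGY_GATE = 500   # assumed RMS threshold for 16-bit PCM; tune per device

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit signed PCM frame."""
    samples = array("h", frame)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def stream_to_stt(frames, send_to_transcriber) -> None:
    """Only forward frames the filter considers foreground speech."""
    for frame in frames:
        if rms(frame) >= ENERGY_GATE:
            send_to_transcriber(frame)

if __name__ == "__main__":
    quiet = bytes(320)                            # 160 samples of silence
    loud = array("h", [3000] * 160).tobytes()     # a "speech-like" frame
    stream_to_stt([quiet, loud], lambda f: print(f"forwarded {len(f)} bytes"))
```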

[00:18:56] Speaker 1: Yeah, that's really interesting, because real-world environments are rarely quiet. And if an AI voice agent can actually successfully filter out those distractions and focus on the main speaker, that's a huge game changer. Next, I also want to talk about some use cases and adoption. So you've talked a lot about the type of use cases that a lot of VAPI customers are building. But have you also seen a lot of companies adopting AI voice agents and then blending them with, like, live human agents?

[00:19:33] Speaker 2: That's actually more common than the rip-and-replace model. For the most part, companies, especially enterprises today, are not comfortable deploying voice to replace their entire voice operations. It's more like, let's use voice to replace our, like, IVR system that sits in front of the human agents. Or this one thing that the human agents hate doing, which is, like, this one very transactional call that they have to do 10,000 times a day. Maybe we can replace that and put them on higher-value work or higher-value calls or escalations. And so it's more of a pairing with the voice agents than not. And so that's why, within our platform, we have many different ways to allow voice agents to escalate to humans or transfer to humans or even do warm transfers to humans. Which means, like, as the call is transferring, it'll quickly, like, whisper in the ear of the human to tell them, hey, by the way, this is, like, a person. Here's a quick summary. Are you ready to transfer? They say yes, and then it transfers the call. So, like, we kind of invested in this, like, handoff mechanism to make it smooth.
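The warm-transfer flow Jordan outlines (summarize, whisper to the human agent, confirm, then bridge) could be sketched like this. Every function and channel object here is a hypothetical stand-in, not VAPI's telephony API; only the ordering of steps reflects what was described.

```python
# Ordering sketch of the warm-transfer / "whisper" handoff: summarize the
# call, whisper the summary to the human agent, confirm they're ready, then
# bridge the caller in. Every channel method here is a hypothetical stub,
# not VAPI's telephony API.

class Channel:
    """Minimal stub standing in for one leg of a phone call."""
    def __init__(self, name: str) -> None:
        self.name = name
    def play_whisper(self, text: str) -> None:
        print(f"[whisper to {self.name}] {text}")   # heard only by this leg
    def wait_for_yes(self, timeout_s: int) -> bool:
        return True                                  # pretend the agent accepts
    def say(self, text: str) -> None:
        print(f"[{self.name} hears] {text}")

def summarize_call(transcript: list) -> str:
    # Placeholder: in practice an LLM would generate this summary.
    return "Caller wants a check-up tomorrow at 4 p.m.; contact details captured."

def warm_transfer(caller: Channel, agent: Channel, transcript: list) -> None:
    summary = summarize_call(transcript)
    agent.play_whisper(f"Incoming transfer. {summary} Ready to take the call?")
    if agent.wait_for_yes(timeout_s=10):
        print(f"bridging {caller.name} with {agent.name}")
    else:
        caller.say("Please hold while I find someone who can help you.")

if __name__ == "__main__":
    warm_transfer(Channel("caller"), Channel("human agent"), transcript=[])
```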

[00:20:34] Speaker 1: I can see that being super useful for, like, customer service agents especially.

[00:20:40] Speaker 2: Yeah. Yeah. It's pretty critical. But, yeah, I mean, like I said, they do tend to move slow. But over time, you can capture more and more workflows. And so that's why, you know, a big focus for us is workflows, because we want to be owning as much of the business logic as we can over time and eventually have that call tree represented as a VAPI workflow entirely.

[00:21:01] Speaker 1: Awesome. We've also seen AI voice agents being used in, you know, a lot of, like, meeting transcriptions, scheduling, customer service. Do you think we'll see them in mass adoption in more critical applications, like, for example, telehealth, in a setting where what's being said is much more sensitive and privacy-focused? Do you see AI voice agents playing a bigger role?

[00:21:25] Speaker 2: Definitely. I think costs are usually higher to staff phones in these more regulated industries. But there are many more concerns around customer data, and specifically, like, patient data, if you think about healthcare examples. And so they all want guarantees of, hey, is this data going to be persisted anywhere? Is it going to be trained on anywhere? And so we try to provide guarantees, contractually and in how we actually build our technology, to make sure that no data is stored or trained on, so that we can serve even the most sensitive applications. Because those very sensitive use cases are such high ROI for these companies that we want to put in the effort to be secure enough to serve them.

[00:22:11] Speaker 1: Very nice. I think I'm going to wrap it up with the final question on maybe, like, a future aspect, which is what excites you the most about the future of AI voice agents and VAPI? Where do you see this technology going in the next three to five years?

[00:22:24] Speaker 2: Yeah, three to five years is a very long time horizon with how fast everything is moving right now. I would say in the next year, I'm excited to see speech-to-speech models pick up. We've been waiting a long time for this new model architecture, where instead of having three separate, disparate models that all have to play telephone with each other and kind of miss each other's context, we move to one that can hear audio natively and produce audio natively. It'll cut down on latency across the board. It'll make it so it can actually hear that a customer's frustrated and then produce a sympathetic response, end-to-end, instead of each model guessing how it should sound or what it should do. So that's exciting. Obviously, progress on that model architecture has been slower than we've wanted. But we're looking forward to, I think, end of this year, more calls being served by speech-to-speech models. So that's super exciting. Then beyond that, I have no idea. AGI, and then we'll all go to heaven, I guess.

[00:23:20] Speaker 1: I think there's definitely a lot which is going to be happening in the next two years or three or four years. This has been an awesome conversation, Jordan. Thank you for coming on and showcasing workflows. Before we wrap up, where can developers go to learn more about VAPI and get started with workflows?

[00:23:35] Speaker 2: Yeah, so they can just go straight to VAPI.ai. That's our website. VAPI, short for Voice API, so it's super easy to remember. And then they can just log in, make an account, and then you can spin up a voice agent like the one I made on my phone in probably, like, 30 seconds. So, yeah.

[00:23:49] Speaker 1: Thank you. And for those of you who want to learn about building AI voice agents with real-time transcription, check out the AssemblyAI Streaming API. We have a playground where you can test it out and great documentation to get started. You can also use the AssemblyAI Streaming API directly in VAPI's workflows as well.

[00:24:05] Speaker 2: Great.

[00:24:06] Speaker 1: Awesome.

[00:24:07] Speaker 2: Thank you. Thank you so much. All right.

[00:24:10] Speaker 1: Have a good one.

[00:24:13] Speaker 2: Sweet. Right? Is that the whole thing?

[00:24:15] Speaker 1: Yeah. Do you just mind staying on until the upload is complete? It's complete. Okay.

AI Insights
Summary
Smitha (AssemblyAI) interviews Jordan (VAPI) about the rise of AI voice agents in 2025 and demos building a real-time phone-based dental-office assistant using AssemblyAI streaming STT plus an LLM and TTS. Jordan explains adoption drivers—faster/cheaper/better transcription, LLMs, and TTS—enabling near-human conversations when orchestrated together. He highlights major use cases (customer service/phone calls, training/coaching roleplay, entertainment), and key technical challenges: latency budgets (~1.2–1.5s end-to-end), interruptions/barge-in (VAD + fast streaming transcription), hallucinations and prompt brittleness (solved with deterministic workflow blocks), long-context memory (planned global state), noisy environments (background voice cancellation and need for more “promptable” STT), and enterprise handoff patterns (IVR replacement, transactional calls, warm transfer/whisper). VAPI’s platform supports multiple LLM/TTS providers, BYO keys or managed accounts with pricing benefits, tool integrations (e.g., pushing extracted fields to a spreadsheet), scaling via higher concurrency, and stronger privacy guarantees for regulated industries. Jordan is excited about emerging speech-to-speech models to reduce latency and improve emotional prosody; developers can start at vapi.ai and integrate AssemblyAI streaming within VAPI workflows.
Title
Building Real-Time AI Voice Agents with VAPI + AssemblyAI Streaming
Keywords
AI voice agents
real-time transcription
AssemblyAI Streaming API
VAPI
VAD
barge-in interruption
latency budget
workflows
LLM orchestration
text-to-speech
customer service automation
warm transfer
concurrency scaling
noise/background voice cancellation
privacy and compliance
speech-to-speech models
Key Takeaways
  • AI voice agents are taking off because the STT–LLM–TTS stack has become fast, cheap, and performant enough to feel human when orchestrated together.
  • End-to-end latency targets for natural conversation are roughly 1.2–1.5 seconds across transcription, reasoning, and TTS.
  • Reliable interruption handling requires VAD plus very low-latency streaming transcription to distinguish backchannels (e.g., “uh-huh”) from true interrupts.
  • Prompt-only agents can drift or hallucinate; deterministic workflow blocks (say, gather, API call, conditions, transfer, end) keep business logic on rails.
  • Tooling integrations can extract structured fields (name/date/time/reason) during calls and push them to systems like spreadsheets/CRMs in real time.
  • Noisy environments are less about generic noise and more about separating background speech; pre-processing/voice cancellation helps, but “promptable” STT is desired.
  • Enterprises often adopt voice agents as IVR/frontline augmentation rather than full replacement, with smooth escalation via warm transfers and agent ‘whispers’.
  • Privacy guarantees (no storage/training) are crucial for regulated sectors like healthcare and telehealth.
  • Multi-provider flexibility matters: users can choose among LLM/TTS vendors, use managed accounts, or bring their own keys/voices/models.
  • Next wave: speech-to-speech models to reduce latency and better capture emotion and prosody end-to-end.
Sentiments
Positive: Upbeat, optimistic tone focused on rapid progress in model speed/cost/performance, excitement about demos, and confidence in near-term improvements like workflows and speech-to-speech models.