Build a Voice Booking Agent with AssemblyAI + Claude (Full Transcript)

Workshop recap: scaffold a browser voice agent with temp-token auth, tools, UI events, latency tuning, and Railway deployment using AssemblyAI’s Voice Agent API.
Download Transcript (DOCX)
Speakers
add Add new speaker

[00:00:00] Speaker 1: All right, so let's get started. So for intros, my name is Dan. I'm a product manager here at Assembly for the Voice Agent API. And in this workshop, we're going to be running through building a voice agent together on the Voice Agent API with Cloud Code. And yeah, it should be really fun. This is my first webinar, so do bear with if there are any issues. But yeah, I'm hoping it's going to be almost as if I'm simulating being in a customer. And we're going to run through the docs and building it together so you can follow along. And we're going to be building an appointment-setting voice agent with a front-end application. So you can kind of take this code and run with it if you have an idea. But this is just showing you how you can get started with the Voice Agent API. I think we can get started now. So in terms of agenda, we're going to run through the setup. And oh, Robert, yeah, this is being recorded. And thanks for all the questions. Definitely keep firing them in as we're going. Yeah, this is going to be definitely more of a collaborative session rather than Q&A just being at the end. But I'll run through, what is Assembly for the people that might be new to us? What is the Voice Agent API? We'll then do the live build together with Cloud Code. And then I'll show you afterwards how you can iterate with that coding agent mental model that we'll get into. And then at the end, if we have time, we'll do a deployment. And we'll deploy it to Railway. So by the end of this session, you can have a appointment-booking agent that you can share with your colleagues or friends. So to begin, I want to demo what we're going to be building. So this is an agent I created before this session. We're going to try and recreate this live on this workshop. So let's talk to it and see how it goes. And yeah, it will be an appointment-setting agent. Hey, David. I'm looking to book a haircut. Yeah, I was thinking this Friday. Oh, sorry. You can't hear the responses. Let me reshare my screen. This is typical first webinar experience. OK, let's rerun this conversation. Thanks for calling that out.

[00:02:54] Speaker 2: Hi. Thanks for calling Aura Studio. This is David. What can I book for you?

[00:03:00] Speaker 1: Hi, David. I'm looking to get a haircut, please.

[00:03:08] Speaker 2: Sure thing. We can definitely do a haircut for you. What day were you thinking of coming in?

[00:03:14] Speaker 1: Yeah, I was thinking this Friday.

[00:03:24] Speaker 2: Let me check that for you. One sec. OK. For this Friday, we have openings at 10 AM, 11.30 AM, or 1 PM.

[00:03:35] Speaker 1: OK, anyway. We got the gist. I'm guessing a lot of you guys are hopping on this link. So I think this railway deployment is slowing down a bit. But this is the agent we'll be building. You can see the tools are executing live in the UI. And I think what's really great about this is you can go and build your own application, your own UI. You display the transcripts the way you want. You can create an interactive UI experience with the tool calling. And the voice agent API makes this easier than if you're plugging into some other orchestrator or some other application where it's more embedded into their systems. The voice agent API is really meant to fold into your application. So you can go and build your own unique voice agent application that's for your business and displays the way that you want it to. So yeah, that was the demo of what we're going to build. So let's go back to the slides. And so what you need to work along. So if you haven't already, definitely do sign up from Assembly. Our account and API key is going to be a requirement for the voice agent API. You do get $50 in free credits for signing up. So no worries about adding your credit card. And you will need Claude code. This could be like the Claude app where it has the code section in the app. Or I'll be using VS code with the Claude code plugin. So if you have either of them, it should work. We won't really be digging into the actual code itself. It's just more about having an interface where you can chat to Claude, give it like the documentation link and tell it what you want to build. So make sure you've got that set up. Oliver. Yeah, it is indeed possible to tie into a phone call process. If you connect it to like Twilio rather than connecting through the UI, you can then have the tool execute and like book an appointment, let's say on cow.com. We're not going to be getting into like actually executing business logic in this session, but we will be doing follow-up sessions. And you can check out our docs for instructions on how to do that. So firstly, what is Assembly UI? For those that are new, we are a speech model company. So we create, we do the research, engineering, design for the API of like building transcription models that can like take the audio and turn it into text. So let's say if you're building a meeting recorder, you can summarize a meeting and get notes afterwards. Or if you're a call center company, maybe you want to be transcribing all of those calls so you can store them and make sure everyone's following on the script. But you can also build voice agents that don't require any human in the loop and use the transcription to pass to an LLM and then to like a speech generation model. And that's what we'll be focusing on here today. So we're using Assembly's models under the hood and it's powering the voice agent. And for the voice agent API, it's one API that's using our models. We provide the full loop. So rather than needing to set up a speech to text model, a large language model and a text to speech model on your end, we handle the orchestration because we're like a model company as well. We can be vertically integrated. So we're making sure that the model plays well with our custom orchestration. And we're also designing the API layer on top. You might know from our speech to text APIs, we definitely focus on more like simplicity, like developer friendly APIs. And we want to carry that into the voice agent API. We really felt like there was a gap in the market, but a nice developer friendly API for building voice agents that like, as I was describing before, rather than like attaching an agent to your application, you're like building your application within the agent. And we'll get into more of what that looks like. But in terms of the pricing, we're also trying to be competitive. So at $4.50 an hour, we're cheaper than a lot of other options on the market. And this is because of us being vertically integrated. So like we rerun the speech to text model, the text to speech model, the LLM. That's why we're able to offer like a really competitive price. And I think it creates for a stronger product where we're making sure that the orchestration is using the model's strengths to create like a better voice agent experience. Yeah.

[00:07:52] Speaker 3: Hey, Rohan. Yeah, great to see.

[00:07:55] Speaker 1: And yeah, this is being recorded. So at the end I can share this slide deck and also the recording definitely on like LinkedIn or it'd be on our YouTube. Moritz. Yeah, certainly. So the voice agent API, it's not necessarily designed just to be for like phone agents or just for web agents. It's meant to be like a developer tool that you can go and build kind of any use case where you need voice input and then like an AI like responding back to you or doing some kind of action. So definitely like a voice coding agent can be created. Yeah, we will be sharing this recording. And Moritz, regarding the coding agent, like to get it to be more of a human in the loop answer only, it would be that the tool result, like you would have a tool that is executing the code logic for you. And then in the tool result, you only pass back what you want the agent to respond with. So you keep it more conversational rather than... Oh, I see. Oh, the chat doesn't display to everyone. Thanks, Meredith. Malik, yeah, you certainly can. If I could answer live, it's not going to... There we go. Yeah. So you can indeed do this enterprise scale with thousands of calls. I would say if you're looking to do thousands of calls, definitely do reach out to us so we can make sure that concurrency limits and maybe you do need some legal and security requirements met like a BAA. So do reach out to us. But certainly this API is designed that you can use and this API is designed that you can go and build an application and scale and not have to worry about concurrency limits being here or pre-purchasing usage like for the month. It's just usage-based pricing. And yeah, so in this workshop, the voice agent we're going to be building. So the step-by-step will be... First, we'll be designing a backend in Python using Ford code. And we're also going to have an HTML frontend. So the backend is used to generate a temporary token to pass to the frontend for authentication. And this is because we don't want to be leaking our API key to the client. So this is more of just security and designing the architecture in a way that you can share with people safely. And then the UI will then be displaying the transcripts and tool calls like we saw on the demo. So now we can get into the live build. I'm going to bring up my IDE and we'll just run through this together. And yeah, we're going to try our best to recreate the demo. It might not be exactly perfect, but that's definitely the goal. So I'll stop sharing my screen and let's bring up my IDE. Okay, so here I'm starting with a blank slate in VS Code. And we'll create a new folder in the downloads. And let me reshare my screen onto the new tab. So if you're following along, yeah, make sure you set up a new directory. And once you have Claude Chat brought up, this is where we can get started. So this might take some Zoom switching because I'm going to have to share my browser and then the code. But first thing I'm going to do is I'm going to bring up our documentation. And let me share my screen here.

[00:12:25] Speaker 3: This is going to be fun.

[00:12:34] Speaker 1: So now that we have the docs brought up, if you're following along, all we're going to do is we're going to copy this page. Actually, we'll even just copy the link. And we're going to pass this to Claude. And effectively, that is all we need to do to give Claude the context of how to work with the API. So if you browse to the docs and just copy this link, this is all we'll need from the docs. And then from there, we can just go all in code or all in Claude code. So to start with, we'll say, here are the docs or the voice agent API. And as we discussed, what we need to do first is set up a Python backend server, which is going to do all of the authentication flow. And it's going to mean that we're not passing our API key in the browser. So this is just the first kind of architecture setup. So let's say here are the docs for the voice agent API. And you set up a Python backend with a HTML frontend that uses the Python backend to generate a temp token for auth in the frontend.

[00:13:47] Speaker 4: So we'll let that run.

[00:13:52] Speaker 1: And generally, this was the idea with the voice agent API and the way we set up the docs is we really wanted to enable coding agent iteration for voice agents. So you might notice with other voice agent providers, it's primarily a UI-driven experience, or it's very deeply in SDKs and low-level programming. So we found that there was this middle ground of almost like how easy Superbase is to use for databases. We thought, why couldn't there be something like that for voice agents? And, yeah, this was where the idea came about of just having like a really easy-to-use API. You don't have to worry about picking the models. It's just purely like I'm giving Claude my business context. And that business context is driving how Claude uses the API and to set up my application. Let me just make sure there are no questions. Awesome. So, Meredith, this prompt, this isn't in the docs, but what I really want to show here is that you don't necessarily need like an MCP server or a skill to use the voice agent API. Just with Claude web search and like passing it the docs URL for the voice agent API, it can crawl the rest of the docs. And just with like blank Claude code, it's going to be able to scaffold this back end for me and the front end just from reading the docs, really. So like I'm not going to be touching any of this code or doing any like custom prompt magic. It's purely me just telling Claude what to do, and it's able to just find the docs on its own. So we can see Claude is setting everything up, and I'm just going to accept permissions.

[00:15:54] Speaker 2: Let's see here.

[00:15:56] Speaker 1: If assembly I serves TTS, when can we expect TTS feature to be live for customers like us as well? So this is coming in the future. I can't say when, but like it is in the voice agent API. So if you need TTS for a voice agent, we do recommend using the voice agent API. But TTS is definitely on our radar. And like right now, we're not offering it publicly yet because it's built in as part of our voice agent orchestration, how the TTS is working. But yeah, more to come there. Do watch our changelog and be sure to check in on our LinkedIn and our YouTube for latest updates. Oh, yeah, Mark, of course, let me paste the URL, and sorry about that.

[00:16:58] Speaker 4: Let me know if you can see that chat to everyone.

[00:17:13] Speaker 1: Marius, so is the Claude code you're using customized to build voice agents or can I use my personal Claude code? This is default Claude code. So I haven't got any NTP servers or skills set up. I just use like plain Claude code and I just give it the URL. So you can definitely use your personal Claude code. You can also use the Claude app. So rather than being in your like development environment, you can just give the link to the Claude app, like the Claude code app, and it can go and scaffold the backend and frontend for you. So now that Claude code has gone and built our backend and frontend, and it's done the temp token authentication, which is key for like browser voice agent applications, we now have like our local host application ready. So let's run this and let's see what happens. The only thing we will need to add is our API key. So I'll stop sharing my screen so I don't leak my API key, but Claude should be returning you a .env as well. So make sure you pass your API key there. So we don't need pipe cat or live kit if we're using assembly. Correct. Because we're passing or like we're providing the WebSocket transport, you can just plug this directly into your application. You don't need to have a separate voice agent worker. Not sure if this is covered already. Yeah, yeah. You should also follow this as a first step. The reason why there's no business context shed is because this is just like a getting started guide of like scaffolding and getting the application ready. But the idea is that you would then take this application that we're building together. And let's say you are doing appointment bookings for dentists, like dentist receptionists. You can then give that information to Claude Code and have it like customize and create the UI and the prompt for that business context that you're working with. Do we need assembly AI for voice agent? Yeah, you would need an assembly AI account to access our voice agent API. But we are providing like the full pipeline like speech to text, LLM and TTS. But you can indeed go and just take our speech to text models and build your own voice agent on another platform if you do wish to. But this workshop is for the voice agent API. Mark, question on retention. We will be publishing that very soon. That's on me really to get out. So thank you for the reminder there. And do feel free to email me at dint at assembly AI.com. And I can follow up there. But I'll follow up with you after the session as well. Oliver, your question about security and privacy of the data. Currently, we do not store any data. So it's just from rule during the session. But we are going to be adding session history. So for now, there's no data stored. But we will be storing that and adding a retention policy soon. But thanks for the question. Okay, great. So I'm going to add in my API key. And if you're following along, do the same. And we will then run the application together. Let me know if anyone's encountering any trouble with that first prompt. And we can tackle it together. So now I have my API key added in my application. We can now let me just rename this. We can now run this application. And we'll see what it has started with. Classic, we encounter a first error on run. But for those who maybe aren't into vibe coding yet, the recommendation here is just paste that error into Claude. And it's going to fix it. Marish, are there best practices or user guides that you have for prompting Claude on your use case? Yeah, this is something we'll get into later, more about like contextual prompting based on your use case. To give a quick answer, the best way to go about this is if you have existing human-to-human calls of the use case you're going to do. So if you have recordings of appointment setting calls, transcribe them with our speech-to-text model. And then passing those transcripts to Claude as context for how to prompt the voice agent to speak like the humans do on those calls is the way we recommend doing it. And definitely we can dive more into that later on in the session. I'll save that question and we can come back to that.

[00:23:10] Speaker 4: I see.

[00:23:20] Speaker 1: Okay, here we go. So Claude did set up a Python virtual environment for me. If you're encountering any issues with that, do just say to Claude, like, I'm having this error, I'm feeling stuck right now, what should I do next? This is the same way that I pretty much just brute force my way through coding an agent like this. And building web applications is difficult, so you might be encountering errors, but just paste them back into Claude. Or if you're really having trouble, paste it into the chat and we can break it down together. But currently what we've done, to reiterate the steps, is we've prompted Claude to set up the back end and the front end. And now we're just going to run that locally so we can see what is created. So let's see here now. So as we can see, Claude has created the front end, which is looking okay now. And I can click to start and begin speaking. So why don't we bring this into our browser?

[00:24:34] Speaker 4: We'll bring it into the browser first.

[00:24:36] Speaker 1: Okay. Let me share my screen. I will show you the local application we have running. So here is where it's currently at. I can connect. And let me make sure I'm sharing my audio here for you all. Here we can see we've already got a pretty similar application to the Aura Studio demo that we started with. So we're making very quick progress just from a couple prompts with Claude. So let's start a conversation.

[00:25:44] Speaker 3: Hi, how can I help you today?

[00:25:46] Speaker 1: Hey, can you hear me?

[00:25:50] Speaker 3: Yes, I can hear you loud and clear. How can I help you?

[00:25:57] Speaker 1: I don't know. But there we can see just with one prompt and passing the links to the docs, Claude was able to kind of crawl the docs, bind our connect from browser guide. And this is really showing the power of coding agents and using APIs. I think it's just able to like read all the docs so then you don't have to worry about even reading docs nowadays. And especially when it comes to like audio transport, like sending audio over a WebSocket can be really complex. But if you have a coding agent handling it for you, it really abstracts away that complexity. And fortunately, it's only one link you have to create from sending audio from the browser to our API. So it's definitely manageable more with a coding agent than trying to kind of manually code it. So a question from Marish. Which languages do you support? Most languages have dialects in EU like Portuguese and Spanish. What has been observed is that English or other common languages work great with voice agents. So the languages that we support are English, Spanish, French, German, Italian, and Portuguese. So these are the same languages that are supported by our Universal 3 Pro speech-to-text model, which we're using in the voice agent API. But we are expanding these languages in the next couple of weeks, which will also include for the voice agent API. And it's a great point about dialects. Definitely something we've learned since launching this is that like we have a couple of Spanish voices, but the feedback we've been getting is that it's a very generic Spanish voice. It's not like it's not a locale or an accent that is like recognizable to an area. So definitely an interesting learning we found is that like it's really important to have a voice with an accent. And that's definitely feedback we're taking on board to drive like future product development for TTS. Brazilian Portuguese and European Portuguese is also another one. And like Latin Spanish versus European Spanish. So yeah, I'd say right now generic, but the vision of where we see this going is we want to have voices like with locale-based accents. So just to check in, we've got our application now scaffolded. Let me know if anyone's having any trouble setting this up. But we are pretty much 80% of the way there. Now let's get on to the final 20% of adding the tools to our agent and making it about appointment setting and just generally making it look a bit nicer. So let's go back to Cloud Code and we can carry on. So because we've already given the link in the first message that we did to the docs, we can now say, can you look in the voice agent API docs how we can add a create appointment tool, just simulate the tool results for now. So we will be simulating the results in this demo just because we are short on time. But in our docs, we do have code examples for how you can handle the tool call event in your code and have that cool real services like Cal.com or Calendly or maybe a database or a CRM that you're using. So you can definitely plug this into your real systems and run like business use case calls. But for now, we will just be simulating it just to have an easy getting started guide. So now, Claude, as you can see, is reading through the events. Tool call required or started. Moritz, great question. Can you correct and improve the model so that I learned your custom words and way to speak? So one feature we do have in our documentation for this is key terms. So you can provide key terms to your voice agent, which is going to be like the key terms that you would want to give are rare words that aren't really in any dictionary. So it's not a word that like any typical speech text model could have learned. So if it's like a new product SKU, like for example, if you've created some bug repellent and it's called like bug gone, it might be quite difficult for a speech to text model to like recognize out of the box. But if you can provide the key term, which is also supported by the voice agent API, it's going to buy us the speech to text model to recognize that term. And like when it has something of that nature is going to like select the key term that you provide. So that's how you can like train the model to understand terms in your business that aren't generally understood. Aways, this is the assembly documentation link. I want to create a voice agent. Yeah, that prompt absolutely should be enough for Cloud Code to go and create a voice agent and to kind of explain why this is. So Cloud Code is a really powerful tool that isn't just giving the documentation link to the model. And, you know, as you know, like LLMs only have memory from like 2024. So it's kind of weird how it can like, if we give a link, it can go and understand how to use the API and go and build a voice agent. But because Cloud Code is like this harness on top of the model. So it's built for coding. So it can like web search to look up the docs. It can like scrape the rest of the documentation to understand what it needs to do. Just a prompt like that saying, I just want to create a voice agent is enough context for Cloud to then go search the documentation and build an application. So, yeah, I recommend pasting that prompt in and do let me know how it goes. It should be all good. Thought Opus and Fable have training data up to 2026. Oh, awesome. Thanks. Thanks for the heads up. Does this feature only support integration by API or with SIP and WebSocket as well? So, yeah, currently it is WebSocket API integration, but we do have a SIP pass-through. If that's something you're interested in, definitely give me a message or an email and we can help set up a SIP pass-through. So for anyone on the call who is looking to deploy this to telephony, we are working on that right now. And for those who want early access, we'd be happy to work with you on setting that up. I won't be able to cover this in this workshop just yet, but we can definitely attach it to a phone number and we can host it for you. In your experience, are some Claude models better than others in building voice agents, Claude versus Codex? Generally, I would say there's no specific model that's better for using the voice agent API. Now, if it was for like building a voice agent in Pipecat or LiveKit where you're going and creating on your own, I would say that maybe there is some chance the coding model you're using can matter. But for the voice agent API, because we've abstracted away a lot of that complexity, even if you're using, let's say, GPT 4.1, but it was able to read our docs, it would be able to go and set it up. For complex voice agents, I do see what you mean, how the model can matter. I think in terms of configuring the voice agent API, it wouldn't be a big difference, but maybe if it's more about how do I take 1,000 hours of human-to-human calls and then analyze those calls to learn what prompt should I give my voice agent that is going to perform the best, then maybe a model like Fable, which is the most powerful model available, would be better there. But I think that's something we have to explore with and we should do some more research there. Okay, so now that we're back in the code, let's continue with setting up the... Oh, no, sorry, we already set up the tool calls. What do I mean? So they run client-side. Is the tool call being shown in the transcript output box? So we're just prompting Claude now to display the tool call in the UI like we did on that original demo. And this is what I think is really fun and powerful about the voice agent API is because it's so simple to handle these events, we can create a more interactive experience. Right now, a lot of agents are just built to answer the phone and just do basic tasks. But I think what's really exciting is kind of like Siri on an iPhone. Siri isn't just speaking back to you and doing actions. It can actually move things on your screen or open an app or while it's thinking, it can do that vibration of the phone and it makes it a much more intuitive experience. We definitely think for voice agents to take off, it has to be this really cool experience where it's event-driven and the voice agent can not just... Oh, Robert, that's hilarious. I triggered the Siri. That's hilarious. And I'm probably doing it again. But it's really cool how it can see everything on your screen and do things on your screen as well as be able to speak back. I think the iPhone just has so many modalities that if a voice agent is able to use all of them, it definitely makes a much cooler app. I think a great example, if anyone has used the Portola Tolens app, which is this companion voice agent friend, I think it's one of the best designed apps because it's not just a voice agent you talk to, but things are happening. And for consumer apps, I think it definitely has made it a lot more fun to use than other companion apps just for the fact that it can change things on the screen alone. So now that we've set up the tool, I think this UI isn't looking very pretty. So for the original use case, it was like a salon booking a haircut. But let's say we are going to build a... MOT booking agent. So that is if your car has to go through a yearly pass here in the UK. So we can do an MOT, like a car mechanic booking agent. And we'll just make the UI look nice as if this is a hackathon project. So can you make this a MOT slash car mechanic booking voice agent? UI. So as you can see, I'm pretty much prompting Claude as if I'm like texting a mate. I definitely don't try and do any kind of prompt engineering. I just let the context do its work from searching the documentation. Funny enough, I think if you try and direct Claude too much on what to do and how to go about it, it can end up like really over focusing on that task when really Claude knows how to do things better than we do. And it's more about giving a direction than telling it what to do. So as you can see, I'm just giving a one-liner. Make this a car mechanic UI and it can take that context. I'm sure it's going to build a lovely looking UI. How do you manage latency, especially if the voice agent is configured with multiple APIs? So firstly, I think making sure that your APIs, that you're executing in the tools, making sure that you get as much performance out of that as you can. And I think as well when it comes to like, it would depend on the multi-step workflow. If it's something like checking availability and then booking an appointment in one go, I feel like you can reorganize how you do those tools in a more efficient way to cut out the latency. So rather than doing all these steps, when someone is saying to book an appointment, all that you should be triggering in terms of tool tools is booking it. And any background tasks that can be done on their own that don't necessarily need the human input, or they don't need to know that, like, I don't need to say my email is this. So then book the appointment with this email. Like you really, instead of trying to optimize the latency, try and simplify the tools that you're doing. And we found that that has resulted in like better latency and overall better design of your voice agent. I definitely think there is a lot of problems by trying to add too many tools where minimizing the tools is going to result in better tool calling accuracy and latency. Moritz asks, how would you get the voice agent to work on your iPhone, but still read and write on your local harness and context located on your PC? So in this case, you would need to deploy it to either like, deploy it as a web application or to a phone number. So once we finish up this UI, we can run through deploying it to railway, which is what I use to deploy the demo. So then on your phone, you can just bring up like Safari or Google and then go to that link and then you can use it on your phone. AR, how do you spell the companion app? I can write it in chat for you. It is Oolans. You search that on the app store. If you're interested in voice AI like me, and you're just obsessed with any kind of voice AI companies and apps, I definitely would recommend trying out Oolans. It is really cool. And I think it's one of the best examples of like a UI driven voice agent experience, which I think is a market that's going to only grow as time goes on. It won't be just phones. It will be like in WhatsApp or in a phone app. So now that we've refreshed the UI, we have a really cool professional looking voice agent application here. So let me bring this into my browser so we can see this a bit better. And it's pretty crazy. This is only from three or four prompts. If I scroll back up, this was one prompt.

[00:42:38] Speaker 2: And two prompts.

[00:42:41] Speaker 1: Yeah, four prompts, essentially. And I have like definitely a pretty nice looking demo that I can go and deploy and share with a colleague or a friend or even try and start my own startup from. So let me make sure I'm sharing my audio and let's run through a live example with our new MOT agent.

[00:43:12] Speaker 3: Thanks for calling Apex Auto. Are you booking an MOT, a service or a repair today?

[00:43:18] Speaker 1: I am looking to book my yearly MOT.

[00:43:23] Speaker 3: No problem at all. Could I have your vehicle registration, please?

[00:43:28] Speaker 1: Yeah, it's ELV44AGC.

[00:43:36] Speaker 3: Got it. That is ELV44AG. And when would you like to bring the car in?

[00:43:45] Speaker 1: Maybe the Saturday morning.

[00:43:49] Speaker 3: We are open on Saturday from 8 until 1. What time on Saturday morning would you like to drop the car off?

[00:43:56] Speaker 1: 9 a.m.

[00:44:00] Speaker 3: That is all booked in for you. Your MOT for ELV44AGC is all set for Saturday, May 24th at 9 a.m. The total will be £54.85.

[00:44:15] Speaker 1: Amazing. Thank you very much. Now that we see that was a successful tool call, I actually wanted it to fail because what I want to go into very quickly next is because we're working with tools and tools are notoriously difficult for voice agents, something I just wanted to touch on was if you have an agent where first it needs to check the availability and then it needs to book an appointment using the context from that previous tool. So you wouldn't want the agent to book an appointment without checking that it's first available. A design pattern that we recommend is that in the handling of that check availability tool, you then add in the agent to be able to book the appointment. So let me bring up in the docs here, we call this progressive tool reveal. So now that I've checked my availability, I can do a new update to my agent configuration and I can remove the availability tool and then add it back to have the book appointment tool. And that way you're going to see less hallucinations where the agent might just decide to book the appointment without first checking. And you increase the accuracy because you're only giving it one tool. It's going to be able to call that tool much more frequently when it should do. AR, does it matter that you're using headphones? I've been struggling with noise when using laptop speakers. One thing that you might want to check you have enabled is acoustic echo cancellation. We do recommend enabling it if you're not using headphones. And this would be because the speaker is going to be playing out the audio and then the mic is going to be picking it back up. But if you enable acoustic echo cancellation, which is available in the browser. So maybe here I should say, and you make sure AEC is enabled. Please. And let me post this into the chat as well. For anyone who is not using headphones, make sure you do have echo cancellation enabled. It will completely solve those issues. Yeah, that's the prompt I would recommend giving to Claude to enable echo cancellation. Tan, what are some of the out-of-box ways you would recommend to decrease AI response latency? So for the voice agent API, you can control the turn detection. By default, we do have it relatively conservative. One second min silence and three seconds max silence. And this is because we are focusing purely on accuracy for our out-of-the-box default experience. There's nothing more annoying than the agent interrupting you while you're saying, you know, a phone number and you want to give it your phone number to do a booking. But if you're building a different kind of application where maybe you don't need to take down someone's phone number or email and you just want really quick back and forth conversation, you can reduce our min silence and max silence parameters to your needs. So maybe you want to reduce min silence to 500 milliseconds. That will cut the response time in half to what it is currently.

[00:47:42] Speaker 3: So that's definitely a quick way to increase the, or decrease the latency.

[00:47:48] Speaker 1: But do be aware that because you're decreasing the minimum amount of silence for the agent to, like, so it can claim that it's a turn end, it does increase the, like, potential that the agent ends the turn too early. So if someone is speaking too slow or is spelling something too slow, there is that risk. So make sure that it matches your use case. And we generally recommend, and we've seen that more conservative settings do result in better customer outcomes. And sometimes trying to make the agent too fast isn't really worth it, but it might depend on your customer base. This is what Claude has come up with. Can you check? Yeah, absolutely. Let's bring up this application. Let me see.

[00:48:47] Speaker 4: Looks fantastic.

[00:48:52] Speaker 1: I can start a conversation with it. Let's choose Ivy.

[00:49:00] Speaker 3: Hi there, I'm your voice assistant. What can I help you with today?

[00:49:04] Speaker 1: Hey, I don't know. What can you help me with?

[00:49:12] Speaker 3: I can help you manage your schedule, set reminders, check the weather, or even find information on a topic you're curious about. Is there anything specific on your mind right now?

[00:49:23] Speaker 1: Awesome job, always. Yeah, you fully set up the scaffolding for the voice agent API. So now you've got a great base to start with. And yeah, you can just prompt Claude to iterate on this and tune it for your use case, but it's really cool to see that in this session you're able to deploy to CloudFlare a voice agent application along with me. But what I want to go into next, as Awaze has deployed it to CloudFlare, I want to show you all how you can go and deploy the application on your own. And what I recommend to use is an app called Railway. So let me bring up the link here. I will post this into chat for everyone. You can sign up free and they have a really nice free tier. And I'll show you how, with the power of Claude, we'll double check that our acoustic echo cancellation is enabled, but I'll show you how you can go to deploying this to a website link so that you can share it with a friend. Any advice about using the voice agent to do interviews and analyze the data? So the voice agent API can definitely handle interviews. Some recommendations that we would give there is to make sure that your turn detection settings are conservative because typically in interviews someone is going to take a lot of pauses, especially if you ask an open-ended question like, tell me about your career history. Someone might take a couple of minutes to answer that. So we definitely recommend updating the session like mid-session to slow it down when asking an open-ended question. In terms of documentation, we do have an interview agent sample app. I will remember to send you a link to that. And definitely we can exchange messages on email or LinkedIn to set up an interview agent. But yeah, we don't have anything in our documentation there, but I really appreciate the call out and we should add a sample app or an interview agent in there as soon as possible. So acoustic echo cancellation is now enabled. So if you were previously having issues and you're not using headphones and the agent was hearing itself, you should now see this fixed if you're following along and have enabled echo cancellation. Can the voice agent do a call summary after the call? So in terms of it being built into the voice agent API, we currently don't support this, but we do have a product called LLM Gateway which you can, let's say, store the, you know, we can set that up right now actually. If I bring up the LLM Gateway documentation. So if you're following along, I will post, this is the link to the docs, LLM Gateway, that if you go to Claude and you say, can you store the transcripts and at the end of the call, send them to LLM Gateway and it can generate a summary. That's going to give the coding agent the context of how to use LLM Gateway and then to summarize the call afterwards. Yeah, we don't have this built into the product, but it is something we're looking to do as we kind of mature and we do have a pretty strong existing like speech understanding product suite that we'll be using to understand like, you know, like the sentiment of the call or whether a tool call should have been called or not, but definitely recommend LLM Gateway for all your post-processing needs. So to round things out here, after we set up the LLM Gateway summary, we will finish our railway deployment. So I'll share what we need to do next. So in order to deploy to railway, we need a GitHub repository of this agent that we've created. So I might just pause on the LLM Gateway summary, but say, can you get this ready to deploy on railway and create a GitHub repository and upload it there so I can deploy? And yeah, railway is great for just, if you have your GitHub repository set up and you ask Claude to make it ready to deploy on railway, it's going to just be super easy to then in the railway UI, we can deploy that repository and we'll run through that as soon as Claude has finished. Marish, how do you measure the success slash performance of your voice agents? So this would definitely be with your LLM Gateway like scoring at the end of the call, but I really do think there's no better judgment than like reviewing the call yourself. So one feature that we have coming out this week is session history in the dashboard. So soon, any call that you've made to the voice agent API, let's say today, or like in the future, when you're calling it, you'll be able to go into your assembly UI dashboard and listen back to the audio of the call, so of the user's channel and the agent's channel, see the tool call events and the past transcripts. And then you can make a judgment there on if the agent is performing well or not, or you can take that session history and pass it to LLM Gateway to score for you. So we'll be posting documentation about the session history endpoint when it's out, but that would definitely unlock being able to score your calls at scale. And thank you for the brilliant questions. Tan, by the way, I created a note spot for Discord. We use assembly for our STT. Amazing product from you guys. Looking forward to seeing the company grow. That's awesome to hear, and that's great. I would really use a note-taking bot for Discord. That's awesome. In the past, I built a mini tool which was transcribing Discord voice messages, so it's really cool that you got calls working. I couldn't get that to work before myself. Moritz, could the agent detect the person interviewing you as not being your voice and answer a question in text but does not react to you answering interviewer's question? That would not be supported now to answer the question in text, but you can kind of simulate that by just muting the agent when it's responding because it does provide that text transcript at the end. But I'll put my email in chat for you, Moritz. If you would like to send me an email, I'd be happy to discuss that further with you of how we can support that. As Paul is running through getting it ready to be deployed on Railway, it should be finished here in almost a minute, and that will take us right to the end. Thanks to everyone who stuck around for the 58 minutes. This was really fun. What I will do is I will show how to deploy to Railway on the original project I did. So when you're in Railway and you've connected your account, if you just go to New Project and then you have your GitHub connected, this Aura voice booking app was the original demo, but I just told Claude to get it ready for being able to be deployed on Railway, and it's going to start the application for me. I think GitHub is having some issues, which is frustrating, but once you have the repository in Railway, you can just one-click deploy and then you can generate a URL like this one here with the Railway.app, and now this is shareable and you can share with a friend, share with a colleague, it's going to stay running 24-7, and you could start a business from it, which I think is really cool. But we are at time, everyone. I really appreciate everyone asking questions and following along. It's really cool to see the agency built. The recording will be available. We'll be posting it, I think, on LinkedIn or YouTube or Twitter. We'll make sure that it gets out there and you can see it. I'm really excited that you found this great. Certainly, if you have any questions or just want to chat or want to carry on from this session, please do email me, message me on LinkedIn. I'd love to help you all build something with the VoiceAgent API and do follow along as we continue building out the product and making it even better. Hope you have a great rest of your day. If you're in London, see you at the Hackathon tomorrow. Thank you very much, everyone. Have a great rest of your day.

[00:59:47] Speaker 4: See you later. Bye-bye.

ai AI Insights
Arow Summary
Dan (Product Manager at AssemblyAI) runs a live workshop showing how to build an appointment-booking voice agent using AssemblyAI’s Voice Agent API and Claude Code. He demos a haircut-salon booking agent with live tool execution in the UI, then explains required setup: AssemblyAI account/API key, Claude Code (app or VS Code plugin), and a simple architecture with a Python backend that generates temporary tokens so the browser frontend never exposes the API key. Claude Code is used “vibe-coding” style: paste the docs URL, ask for a backend+HTML frontend, and let Claude scaffold the app, including WebSocket audio transport, transcripts display, and tool-call event handling. The session covers iterating the agent: adding simulated tools for booking (with mention of integrating real systems like Cal.com/Calendly later), improving UI, and switching the agent to a UK MOT/car-mechanic booking scenario. Dan answers audience questions on telephony (Twilio/SIP pass-through), scaling/concurrency, languages (EN/ES/FR/DE/IT/PT with expansion planned), dialect/voice accent needs, security/privacy (no storage yet; session history and retention policy forthcoming), echo cancellation, latency tuning via turn detection parameters, and best prompting practices (use real human call transcripts; key terms for rare words). He introduces “progressive tool reveal” to reduce tool hallucinations by updating available tools after prerequisite steps. Finally, he shows how to deploy the app to Railway via GitHub for a shareable URL and mentions future work like post-call summarization via AssemblyAI’s LLM Gateway and upcoming dashboard session history.
Arow Title
Building a Voice Agent with AssemblyAI Voice Agent API + Claude Code
Arow Keywords
AssemblyAI Remove
Voice Agent API Remove
Claude Code Remove
voice agents Remove
WebSocket audio Remove
browser voice UI Remove
temporary token authentication Remove
Python backend Remove
HTML frontend Remove
tool calling Remove
appointment booking Remove
MOT booking Remove
progressive tool reveal Remove
turn detection Remove
acoustic echo cancellation Remove
Railway deployment Remove
GitHub Remove
Twilio Remove
SIP Remove
LLM Gateway Remove
session history Remove
languages and dialects Remove
key terms Remove
Arow Key Takeaways
  • You can scaffold a browser-based voice agent quickly by giving Claude Code the AssemblyAI Voice Agent API docs URL and a clear goal.
  • Use a Python backend to mint temporary tokens so you never expose your AssemblyAI API key in the frontend.
  • The Voice Agent API provides an integrated STT+LLM+TTS loop and WebSocket transport, avoiding extra orchestrators like Pipecat/LiveKit for basic setups.
  • Display transcripts and tool-call events in your own UI to create richer, app-native agent experiences.
  • Start with simulated tools, then later connect real business systems (Calendly/Cal.com/CRM) via tool call handlers.
  • Use “progressive tool reveal” (update agent tools mid-session) to enforce multi-step workflows and reduce tool hallucinations.
  • Reduce echo issues by enabling browser acoustic echo cancellation (AEC), especially when not using headphones.
  • Tune perceived latency by adjusting turn-detection silence thresholds, balancing speed against interruption risk.
  • Improve recognition of domain-specific terms using “key terms” (rare words, SKUs, names).
  • For best prompting, use transcripts of real human calls as behavioral exemplars for the agent.
  • For scale (thousands of calls), coordinate on concurrency/security requirements; usage-based pricing is highlighted.
  • Deploy demos easily via GitHub + Railway to generate a shareable, always-on URL; telephony via Twilio/SIP pass-through is discussed as an option.
  • Post-call summaries aren’t built-in yet but can be implemented via LLM Gateway; session history and retention policy are forthcoming.
Arow Sentiments
Positive: Overall tone is enthusiastic and collaborative, emphasizing rapid progress, ease of use, and excitement about developer-friendly voice agents; questions are addressed constructively with candid notes on upcoming features and current limitations.
Arow Enter your query
{{ secondsToHumanTime(time) }}
Back
Forward
{{ Math.round(speed * 100) / 100 }}x
{{ secondsToHumanTime(duration) }}
close
New speaker
Add speaker
close
Edit speaker
Save changes
close
Share Transcript