Latency and Voice Quality: Keys to Voice Agent Adoption (Full Transcript)

Why outbound voice agents succeed: fast time-to-first-token, low end-to-end latency, and natural voice quality—plus hidden factors like background noise.

[00:00:00] Speaker 1: Outbound, there's really two things that I think our customers really care about from a voice perspective is voice quality and latency. I remember our early days of, so just previous history, we were a consumer app. We actually did a voice agent for consumers to call collectors back in March of 2023 when we came out of YC and it was a seven second latency. It was actually kind of great because collectors were getting pissed off because they're like, hello, but like whatever, we were a consumer app, right?

As we progressed through this, I remember when we pivoted over and we launched in B2B for banks and credit unions, we had like a three and a half second latency and customers were pissed, right? Like they were angry because we were at three and a half seconds. Now we're sub 1.6 all in, like including Twilio, getting it all the way through, not just, you know, from a model execution perspective and they love it.

So they care about latency one. Within that, they care about time to first token, right? Like when's your bot actually going to reply back? Because they care about that first initial pickup as that part of it.

The second thing that they care about is the voice quality. Like they want it to sound very, very conversational. So I think it does matter who you pick from a voice provider perspective. Some have performed better than others. I will say DeepGram has a voice that's done really, really well for us. And we kind of give our options to our customers of who they want to go and utilize. We are starting to now leverage Rhyme a little bit more too, just to kind of give an idea of the vendors that we're looking at and working with. But really those are the two key things that they care about is how quickly is the voice agent obviously replying back from an overall perspective, but also from that first part of the conversation, how good is the voice so that it actually sounds conversational.

Then there's all the other stuff too, right? Like we were having a side conversation around background noise and how it's been impacting some of our results lately. I mean, like those are things that clients don't even think about during calls. Those are all kind of like added things that we care about. Clients don't even realize it or care about it or know about it.

Summary

Speaker 1 explains that for outbound voice agents, customers primarily care about two factors: low latency (especially time to first token) and high conversational voice quality. The product initially had very high latency (~7s) in a consumer use case, improved to ~3.5s when pivoting to B2B (which customers disliked), and now achieves sub-1.6s end-to-end latency including Twilio, which customers love. Voice provider choice matters; DeepGram has performed well, and the team is also starting to use Rhyme, offering customers vendor options. Additional technical concerns like handling background noise affect results but are mostly unnoticed by clients.

Title

What customers want in outbound voice agents: latency and voice quality

Keywords

outbound voice agent
latency
time to first token
voice quality
conversational speech
Twilio
DeepGram
Rhyme
B2B banks
credit unions
background noise
ASR/TTS vendors

Key Takeaways

Customers judge voice agents largely on latency and voice quality.
Time to first token is a key sub-metric within latency because it shapes the initial pickup experience.
A 7s latency was tolerated in the consumer app, but in B2B (banks and credit unions) even 3.5s drew angry complaints.
Sub-1.6s end-to-end latency (including telephony like Twilio) materially improves customer satisfaction.
Voice provider selection impacts perceived conversational quality; DeepGram performed well and Rhyme is being explored.
Engineering factors like background noise handling can impact outcomes even if clients don’t notice them explicitly.
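
The two latency measures in these takeaways can be made concrete with a small sketch. This is only an illustration, not the speaker's implementation: the event names, the `TurnTimestamps` type, and the sample timings are all hypothetical. The only figure taken from the transcript is the 1.6-second end-to-end target, which the speaker says includes telephony rather than just model execution.

```python
from dataclasses import dataclass

# Target taken from the transcript ("sub 1.6 all in, like including Twilio").
END_TO_END_BUDGET_S = 1.6


@dataclass
class TurnTimestamps:
    """Hypothetical event times (in seconds) for one conversational turn."""
    caller_stopped_speaking: float
    first_token_generated: float  # first token produced by the model
    first_audio_heard: float      # first agent audio reaches the caller's phone


def time_to_first_token(t: TurnTimestamps) -> float:
    """Delay between the caller finishing and the model's first token."""
    return t.first_token_generated - t.caller_stopped_speaking


def end_to_end_latency(t: TurnTimestamps) -> float:
    """Delay the caller actually experiences, telephony included."""
    return t.first_audio_heard - t.caller_stopped_speaking


def within_budget(t: TurnTimestamps, budget_s: float = END_TO_END_BUDGET_S) -> bool:
    """True when the turn's end-to-end latency is under the budget."""
    return end_to_end_latency(t) < budget_s


# Made-up timings for illustration only; they are not measurements.
fast_turn = TurnTimestamps(10.0, 10.4, 11.5)
slow_turn = TurnTimestamps(0.0, 1.0, 3.5)

print(within_budget(fast_turn))  # under the 1.6 s target
print(within_budget(slow_turn))  # 3.5 s, the level the speaker says drew complaints
```

Measuring from the moment the caller stops speaking to the moment audio reaches the phone is what separates an "all in" number from a model-only number, which is the distinction the speaker draws.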

Sentiments

Neutral: The tone is practical and performance-focused, describing past shortcomings (customer frustration with latency) and current improvements without strong emotional language beyond noting customer reactions.
