Latency and Voice Quality: Keys to Voice Agent Adoption (Full Transcript)

Why outbound voice agents succeed: fast time-to-first-token, low end-to-end latency, and natural voice quality—plus hidden factors like background noise.

[00:00:00] Speaker 1: On outbound, there are really two things that I think our customers care about from a voice perspective: voice quality and latency.

I remember our early days. Just for some history, we were a consumer app. We actually built a voice agent for consumers to call collectors back in March of 2023, when we came out of YC, and it had a seven-second latency. It was actually kind of great, because collectors were getting pissed off, because they're like, "Hello?" But whatever, we were a consumer app, right? As we progressed, I remember when we pivoted over and launched in B2B for banks and credit unions, we had about a three-and-a-half-second latency and customers were pissed, right? They were angry because we were at three and a half seconds. Now we're sub-1.6 all in, including Twilio, getting it all the way through, not just from a model execution perspective, and they love it.

So they care about latency, one. Within that, they care about time to first token, right? Like, when's your bot actually going to reply back? Because they care about that first initial pickup as part of it.

The second thing they care about is the voice quality. They want it to sound very, very conversational. So I think it does matter who you pick from a voice provider perspective. Some have performed better than others. I will say Deepgram has a voice that's done really, really well for us. And we kind of give our customers options for who they want to utilize. We're starting to leverage Rhyme a little bit more too, just to give an idea of the vendors we're looking at and working with. But really those are the two key things they care about: how quickly the voice agent is replying back, both from an overall perspective and in that first part of the conversation, and how good the voice is, so that it actually sounds conversational.

Then there's all the other stuff too, right? Like, we were having a side conversation around background noise and how it's been impacting some of our results lately. Those are things that clients don't even think about during calls. Those are all kind of added things that we care about. Clients don't even realize it or care about it or know about it.

AI Insights
Summary
Speaker 1 explains that for outbound voice agents, customers primarily care about two factors: low latency (especially time to first token) and conversational voice quality. Latency was about 7 seconds in the original consumer product, about 3.5 seconds at the B2B launch for banks and credit unions (which drew complaints), and is now under 1.6 seconds end to end, including Twilio, which customers love. Voice provider choice matters: Deepgram has performed well, the team is starting to use Rhyme, and customers can choose among vendors. Additional technical concerns, such as background noise, affect results but go mostly unnoticed by clients.
Title
What customers want in outbound voice agents: latency and voice quality
Keywords
outbound voice agent
latency
time to first token
voice quality
conversational speech
Twilio
Deepgram
Rhyme
B2B banks
credit unions
background noise
ASR/TTS vendors
Key Takeaways
  • Customers judge voice agents largely on latency and voice quality.
  • Time to first token is a key sub-metric within latency because it shapes the initial pickup experience.
  • Latency that was tolerable in the consumer product (about 7s) is unacceptable in B2B; even 3.5s drew complaints from banks and credit unions.
  • Sub-1.6s end-to-end latency (including telephony like Twilio) materially improves customer satisfaction.
  • Voice provider selection impacts perceived conversational quality; Deepgram performed well and Rhyme is being explored.
  • Engineering factors like background noise handling can impact outcomes even if clients don’t notice them explicitly.
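
The distinction the speaker draws between time to first token and "all in" latency can be sketched in a few lines of Python. This is only an illustration, not the team's actual instrumentation: the `TurnTimings` fields, the helper names, and the sample timestamps are hypothetical, and the reading of "time to first token" as the gap between the caller finishing and the model's first token is an assumption. Only the 1.6-second figure comes from the transcript.

```python
from dataclasses import dataclass


@dataclass
class TurnTimings:
    """Timestamps in seconds for one conversational turn (hypothetical fields)."""
    caller_stopped_speaking: float
    first_token_generated: float  # model emits its first token
    first_audio_heard: float      # caller hears agent audio, after TTS and telephony


def time_to_first_token(t: TurnTimings) -> float:
    # Model-side responsiveness only: excludes TTS and telephony delay.
    return t.first_token_generated - t.caller_stopped_speaking


def end_to_end_latency(t: TurnTimings) -> float:
    # "All in" latency: everything between the caller finishing and hearing a reply.
    return t.first_audio_heard - t.caller_stopped_speaking


def within_budget(t: TurnTimings, budget_s: float = 1.6) -> bool:
    # 1.6 s is the end-to-end figure quoted in the transcript.
    return end_to_end_latency(t) < budget_s


# Illustrative numbers only, not measurements from the talk.
turn = TurnTimings(
    caller_stopped_speaking=10.0,
    first_token_generated=10.5,
    first_audio_heard=11.5,
)
print(time_to_first_token(turn), end_to_end_latency(turn), within_budget(turn))
```

Measuring both numbers separately shows whether a slow turn comes from the model itself or from the speech and telephony legs around it.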
Sentiments
Neutral: The tone is practical and performance-focused, describing past shortcomings (customer frustration with latency) and current improvements without strong emotional language beyond noting customer reactions.