What It Takes to Ship Voice AI Beyond Transcription (Full Transcript)

Builders from CodeOop, EdgeTier, and Granola share pipelines, diarization and multilingual challenges, real-time tradeoffs, eval strategies, and privacy.

[00:00:00] Speaker 1: All right. Cool. So, nice to meet you, everybody. My name is Ryan. I'm here from AssemblyAI. We build voice AI models for developers. I'm here today to moderate this amazing panel with people who are actually building in voice AI and who are going to answer some of the questions that you all put on your forms when you registered for this event, as well as some that we've set up in advance, just to make sure we cover all our bases. A quick bit of housekeeping: we'll do this for about 20-25 minutes, and then after that we'll open it up to Q&A. Again, we don't have mics, so we'll talk as loud as we can; if you can't hear us, let us know. And with that, maybe we can kick things off. Adrian, if you want to get us started with who you are, what your company does, and what you're building in voice AI, that would be a great place to kick us off.

[00:00:46] Speaker 2: Cool. I'm Adrian, co-founder at CodeOop. We help companies understand their customers better by taking all of the interviews that they do with their customers and helping them analyze them and get to the insights they need to make better decisions. The way we use voice AI is to transcribe all of that qualitative data, so interviews, video and audio files, so that we can actually, you know, make sense of them. Yeah. Great.

[00:01:16] Speaker 3: Hi, I'm Shane Lin, often criticized for my loud voice, so finally tonight is the night. So yeah, I'm co-founder and CEO of EdgeTier. We're a conversational intelligence platform for high-volume contact centers. We ingest the conversations that everyone has with customer care centers and high-volume brands, ingest all of that data, and effectively process it and help those companies improve the experience, find out where customer frictions are, find out where there are agent performance issues. And we're operating at a scale where there are too many conversations to read; often our clients would have five, ten, twenty thousand conversations a day, so it's just a massive slew of information. And we, I suppose, work across every channel: calls are in there, but emails, chats, surveys, WhatsApp messages, you name it, if it comes into a contact center we process it. We've been partners with AssemblyAI for a few years now. Voice was a big unlock for us; we used to work exclusively in chat, and then customers were just saying, well, this is great, but I want to see my calls. So it actually just widened our ICP. We used to focus on digital-only contact centers, and now we use that for voice. So for us, it's about ingesting all that data at scale and actually having the result very quickly, so we differentiate on the real-timeness of our insights and the kind of proactivity of those alerts.

[00:02:38] Speaker 4: Cool. I'm JK. I work here at Granola. We build a product to take meeting notes from your meetings. Talking of which, I'm actually gonna Granola this meeting now and I will share it later with everyone. But yeah, obviously we use voice AI for transcription, capturing what people said on meetings, and yeah, it's a huge part of our product and it's great to chat more about it today.

[00:03:04] Speaker 3: Awesome. As an Irishman, I appreciate you had Guinness in the fridge. Oh, hang on.

[00:03:09] Speaker 1: Well, I think that's a great place to start. Maybe you all can just walk us through: what does your end-to-end pipeline with voice look like today? And then maybe help give the audience some insight into where you spend most of your time within that pipeline to really make your product and your experience great, just to give them some flavor. I mean, I work at Assembly, so obviously I'm biased that our models are good, but the reality is the model is just one piece of the big stack that you're probably managing around voice to make a great experience for your customers. Does anyone want to take it and kick things off?

[00:03:42] Speaker 2: You wanna go? Yeah, it's funny, because Oscar over here is actually rebuilding the whole thing from scratch, just because it's gotten very messy over the last couple of years building it. From the source, we take audio and video, and Assembly is one of our key providers, actually. We deal in qualitative research, where the precise wording, and who you assign words to, is really important, and so terminology, especially in complex domains like pharma, is a real challenge for us. Assembly works really well for our CPG companies who do consumer research, but where it sort of breaks apart a little bit is in those really technical domains. And so a lot of the work that we do is around using the semantic context that we have from the setup people do when they bring their project into CodeOop to actually augment the transcription. One of the things we do is figuring out, based on the interview structure, when speaker diarization maybe goes a little bit wrong, how we can adjust the transcripts so that what a participant said is actually assigned to that participant and what a moderator said is assigned to the moderator. And then in very complex domains, we take terminology and we can actually look at the phonetics, and where there are transcriptions that look like certain terms that should probably be those terms, we apply corrections to the transcript, to make sure that users feel confident in the source material that we're analyzing for them. So that's a bit of an overview of the sort of things that we do.
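
A minimal sketch of the kind of terminology-correction pass described above: match transcribed words against a project's known terms by string similarity and substitute when the match is strong. The glossary, cutoff, and function names here are illustrative assumptions, not CodeOop's actual implementation, which also leans on phonetics and LLM context.

```python
import difflib

# Hypothetical domain glossary, e.g. pulled from a project's discussion guide.
GLOSSARY = ["pembrolizumab", "formulary", "biosimilar"]

def correct_terms(transcript: str, glossary: list[str], cutoff: float = 0.8) -> str:
    """Replace words that are close matches to known domain terms.

    Toy pass only: a real system would also use phonetic encodings and the
    surrounding context before rewriting anything.
    """
    corrected = []
    for word in transcript.split():
        stripped = word.strip(".,!?").lower()
        match = difflib.get_close_matches(stripped, glossary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)

print(correct_terms("the trial moved pembrolizumag onto the formulery", GLOSSARY))
# -> "the trial moved pembrolizumab onto the formulary"
```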

[00:05:23] Speaker 3: You can go, go ahead. Yep. Yeah. So our pipeline is simple, if I had a whiteboard. When we actually started the company, we thought we would build two or three integrations and then just churn it out, you know, land the customers, no problem. That has not been the case. So you can imagine us having an ingestion layer at the bottom that takes raw data from tons of different sources. You know, we've a Salesforce integration, an Intercom integration, a Zendesk integration on the chat side. Then on voice we've got Five9, Genesys, 8x8, a ton of providers there. On the survey side we've got Qualtrics and SurveyMonkey and all that kind of stuff. They're all maintained to ingest data, and that all goes into this central API that we have; effectively the integrations layer cleans that into our, I suppose, unified format for contact center data. The API stores that in a big massive Postgres database. Effectively that's nice and neat then, and on top of that Postgres database we trigger off a number of queues, through Amazon, that do post-processing on the data and layer additional semantics onto each conversation. So when we take in a single conversation, which might be the chat you had with Vodafone last week, we'll label every message from the agent, every message from the customer, the context of the entire conversation, some further stuff around emotions detected, keywords, all that type of stuff. That'll go back into the API, into the database. And then there's both a UI that users can use to explore that data very easily, there are a few proactive alerting services that find things proactively and bring them to your attention via email and Slack, and now there's a more agentic approach, where there's almost an agentic data scientist that also roots through the data for you, and because we've done all that work to make it nice and queryable, that's quite effective at finding those kinds of things. So in terms of time spent: Assembly fits in just above the integration layer, so where we take voice recordings from those systems, it triggers AssemblyAI jobs for the diarization and the recognition of the multilingual aspect. Most of our time is really spent on what additional ancillary signals we can add to the data once we have it, and they're often industry-, language-, and market-specific. So you know, we'll have a very specific Italian or Spanish frustration model, because the way Italians get frustrated is very different to the way English people get frustrated. We tried with generic models out of the box first, but we ended up with quite industry- and language-specific things. So we spend a lot of time there, and then a lot of time really in the UI. It's all well and good having all your data looking nice in a database, but our use cases are so spread. People just come to customer operations directors and say, oh, there was an error last week with the promo code, were people annoyed and why? So you have to be able to just flexibly find that data, query it, summarize it, get the answers, and the questions can be anything. So our UI is probably the bit that's always underestimated, I think, in terms of making it both easy to use and flexible enough to answer all those use cases. That's me.
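
A minimal sketch of the shape Shane describes: a unified per-message record, plus a queued post-processing step that layers extra semantics onto it. The field names, the toy frustration check, and the in-memory queue are assumptions for illustration, not EdgeTier's actual schema or models.

```python
from dataclasses import dataclass, field

# Illustrative unified record for one contact-center message, whatever channel
# it arrived on (call transcript, chat, email, survey...).
@dataclass
class Message:
    conversation_id: str
    role: str        # "agent" or "customer"
    channel: str     # "voice", "chat", "email", ...
    language: str    # detected per message, e.g. "en", "it"
    text: str
    labels: dict = field(default_factory=dict)   # enrichment results land here

def enrich(msg: Message) -> Message:
    """Stand-in for the queued post-processing step: layer extra semantics onto
    each message. A real pipeline would call language- and market-specific
    models here rather than keyword checks."""
    lowered = msg.text.lower()
    msg.labels["frustrated"] = any(w in lowered for w in ("annoyed", "ridiculous"))
    msg.labels["keywords"] = [w for w in ("refund", "promo", "cancel") if w in lowered]
    return msg

# One message coming off the ingestion queue.
incoming = Message("conv-1", "customer", "voice", "en",
                   "This promo code is ridiculous, I want a refund")
print(enrich(incoming).labels)
# -> {'frustrated': True, 'keywords': ['refund', 'promo']}
```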

[00:08:36] Speaker 4: Cool. Yeah, so at Granola we have, I guess, two main products that use voice. We have the desktop app, which runs on your Mac or Windows, and we have the iOS app, which runs on your phone. They're slightly different in terms of how we use voice there. For the desktop app, we do real-time transcription. The desktop app listens to your microphone and your system audio, we create connections to Assembly and transcribe your meeting in real time. You can see that in the app, and then once the meeting's over, we'll send that transcription plus your own personal notes to an LLM provider to summarize it and give you some beautiful notes at the end of it. On iOS, we actually do an async batch job: we'll upload the file after the meeting, and that's mainly just because you might be in a coffee shop or something with your phone, so you can't really be real-time streaming there. In terms of where we spend our time, transcription is interesting for us because it's the base of everything we do. Obviously, it captures everything that's said, but users don't fixate on it. We find people look at the notes, and actually the transcription is the means to an end there. But obviously, that means we need to make sure we do a good job with transcription, because it is the basis of everything else. We've been spending a lot of time trying to figure out how we gain insights across lots of your meetings, and we're finding that if you don't have good transcription on those meetings, then you end up with just not-great insights either. So yeah, for us the challenges are around capturing good-quality transcription. Specific keywords are really important there, too: if you're on a call with someone, you want to make sure their name is transcribed correctly, otherwise your notes might look quite bad. Or technical industries as well, that's super important. And to the UI point, obviously, trying to make the app as delightful as possible, and kind of hide away all of that complexity, so that you as the end user just have a good experience.
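
A minimal sketch of the async batch path JK describes for iOS, written against AssemblyAI's Python SDK as commonly documented; exact parameter and attribute names should be treated as assumptions here, and this is a sketch rather than Granola's implementation.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# speaker_labels turns on diarization for the single-channel phone recording.
config = aai.TranscriptionConfig(speaker_labels=True)
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("meeting_recording.m4a", config=config)

# Each utterance carries a speaker label that downstream note generation can use.
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```

On desktop the same provider is used over a streaming connection instead, which is what makes the live "what did I just miss?" features mentioned later possible.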

[00:10:39] Speaker 1: Yeah, I'm curious, how do you actually measure if the transcript is good or not today? Are you looking at support tickets from users? Are you looking at, I don't know, maybe at Granola, people around the office vibe testing this? What does that look like?

[00:10:54] Speaker 4: I mean, definitely the vibe testing, we do a lot of. We obviously dogfood our product a lot, and we make sure that it works for us, and so on. We do a lot of user calls, user testing. We're spending more time on actual evals and trying to get more quantitative analysis here. But yeah, it's a challenge. We find it hard because there's such a wide range of meetings people can be in, lots of audio environments people are in: in person, 10 people on a call, only one person, two people on a call. So it's difficult to nail down what makes a perfect transcription. And we also don't store any of the audio. I guess one of our key points is that we only store the transcription, and that helps a lot with people feeling safe and secure with the product. The downside is that we don't have anything to test and benchmark off. So yeah, there are trade-offs there.

[00:11:50] Speaker 3: Yeah, it makes sense. Yeah, it's hard. And we have a similar thing. And obviously, vibe testing is internal AssemblyAI terminology, and I've heard it twice now, but we definitely do a load of that as well. Because we work with a lot of brands that have kind of complex names to say. Imagine in our system you see like 100 calls; the first thing you open, the first thing you see is, oh, hello and welcome to Abercrombie & Fitch, and I have seen so many combinations of words that sound like Abercrombie & Fitch, or TUI Travel, come across. But that vibe thing, it is hard to quantify what quality looks like, because if you actually listen to the way I'm talking now, and I'm sure you do this all the time, there are not very many complete sentences, and people interrupt, and on the phone there's a lot of yep, yep, yeah, gotcha, OK, and what about, oh, yeah. And the different models I've found separate the messages in the diarization format in different ways, where even though someone said yep in the middle of a large monologue, it'll put the yeah at the end, which is actually much nicer to read than seeing a yeah halfway through a sentence with the timestamp being perfect. So we get a timestamp for the message of someone starting to talk, and the yep might have happened actually in the middle of that, but if it comes at the end, that's a much nicer reading experience. But it's very hard to encapsulate that in a test. So we spend a bit of time just reading through and kind of, do we like this? Is that, yeah. It falls apart when someone asks us, my Greek transcription looks a bit odd. And we're like, it does. Yeah, so that gets a bit harder. So we blindly trust the WER numbers that you publish as well.

[00:13:26] Speaker 2: We made a beautiful collage of all of the different ways that CodeOop was mistranscribed. But I am happy to announce that Universal 3 Pro now gets it correct. Good, good.

[00:13:40] Speaker 1: Maybe on the point around just like post-transcription, post-processing, maybe give folks in the audience an idea of like, what are you actually doing to like customize that output per domain? Are you like boosting certain terms? Are you just like running multiple LLM workflows? Like maybe just walk us through what that looks like.

[00:13:58] Speaker 2: Yeah, yeah. So we get the transcript out from Assembly. I guess, to understand this: when users go into CodeOop and create a project, in qualitative research you almost always have a discussion guide that basically underlies the interviews that you do. And within those discussion guides, there's a lot of context about what the research is about, who's involved, what the key terms are, what questions we're asking, what the objectives are. So there's a lot of really rich context in there that can support transcription. What we do is take that context and get out various bits of structured information, like keywords, that we can pass into Assembly in the first instance. But also, secondarily, once the transcription is done, we effectively have an LLM pass that goes over the transcript, and basically, like an AI coding agent, it makes edits in the transcript: passing in the original string and then the string it wants to replace it with, in order to insert an edit at a specific point, based on various instructions around phonetic mistranscriptions, or words that maybe should be joined together, or all sorts of things. And in that way we're basically able to correct the transcript, and we've set up a number of evals to test that. We sort of work backwards from a clean transcript and insert problems into it to validate that it works quite well. So we built up an eval data set in that manner. That's roughly how that works.
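
A minimal sketch of the eval approach Adrian describes: work backwards from a clean transcript, inject known corruptions, and score how many of them the correction pass repairs. The corruption map and the stub correction pass are hypothetical; the real pass would be the LLM edit step described above.

```python
import random

# Hypothetical corruption map mimicking mistranscriptions we want the pass to fix.
CORRUPTIONS = {"CodeOop": "code loop", "discussion guide": "discushion guide"}

def inject_errors(clean: str, rate: float = 1.0) -> str:
    """Build an eval case by corrupting a known-good transcript."""
    noisy = clean
    for good, bad in CORRUPTIONS.items():
        if random.random() <= rate:
            noisy = noisy.replace(good, bad)
    return noisy

def score(correction_pass, clean: str) -> float:
    """Fraction of injected corruptions that the correction pass repaired."""
    fixed = correction_pass(inject_errors(clean))
    return sum(good in fixed for good in CORRUPTIONS) / len(CORRUPTIONS)

# Stub correction pass that only knows about one corruption; a real pass would
# be the LLM edit step, and this score would be tracked across a whole eval set.
stub_pass = lambda text: text.replace("code loop", "CodeOop")
print(score(stub_pass, "The CodeOop discussion guide covers pricing."))  # -> 0.5
```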

[00:15:33] Speaker 1: I think I heard from basically everybody that who said what is really important in the transcript at the end of the day. Can you elaborate on that? Is it more important to know who said what, or that the transcript is perfectly accurate at the end of the day?

[00:15:51] Speaker 3: No, for us that's imperative, right? Obviously we're looking at customer utterances and agent utterances, and we're analyzing those differently: the customer has more content on what the conversation is about, the agent has more on the agent quality assessment side. So, you know, are they using the right tone of voice? In more modern call systems we often get a call file that'll be stereo, two channels, left and right. Often they're not labeled, but we're dealing a lot with inbound calls, so originally we had an assumption that the first person who speaks is the agent, because they go, hello, you're through to, you know, TUI Travel, spelled wrong. But that only worked for a while, because then we started to see that in every contact center there's a proportion of calls that end up being outbound. So: you emailed in about something, someone answers, oh, hello, it's Ryan, yeah, you emailed us earlier on about the thing, and then we had it all arseways. So we started using some of the LLM stuff in the API, actually, to take the full call transcript and say: this is a call transcript from a contact center, your job is to work out, from what they say, who is the agent and who is the customer. That works quite well. There are still some complexities where the call gets transferred to a second agent, and then we've got three speakers, or the odd time some contact centers have three-way calls, where the supervisor gets on board as well. Something's going really badly, something's going down, and it's like, one of my supervisors is on there. But then there's a third voice in there, and it's like, is that inbound? Is it outbound? So we do have to separate all those and make sure we label them. So for us, it's really key, yeah. What about Granola? I'd love to hear that.
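
A minimal sketch of the role-assignment step Shane describes: hand the full call transcript to an LLM and ask it to decide which channel is the agent and which is the customer. The prompt wording and the `call_llm` helper are hypothetical stand-ins, not EdgeTier's actual prompt or LLM API.

```python
import json

def build_role_prompt(turns: list[dict]) -> str:
    """Render the diarized call and ask the model to assign roles."""
    rendered = "\n".join(f"[channel {t['channel']}] {t['text']}" for t in turns)
    return (
        "This is a call transcript from a contact center. From what the speakers "
        "say, work out which channel is the agent and which is the customer. "
        'Reply as JSON like {"agent": 0, "customer": 1}.\n\n' + rendered
    )

def assign_roles(turns: list[dict], call_llm) -> dict:
    # `call_llm` is a hypothetical stand-in for whatever LLM client is in use.
    return json.loads(call_llm(build_role_prompt(turns)))

# Outbound example: the customer answers first, so "first speaker = agent" fails,
# but the content makes the roles obvious. The LLM response is stubbed here.
turns = [
    {"channel": 0, "text": "Oh, hello, it's Ryan."},
    {"channel": 1, "text": "Hi Ryan, you emailed us earlier about your booking."},
]
print(assign_roles(turns, call_llm=lambda prompt: '{"agent": 1, "customer": 0}'))
```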

[00:17:28] Speaker 4: Yeah, I mean, speaker identification is really important. You can imagine in a meeting scenario, Ryan, you say you're gonna do something, and the notes say, no, actually, JK's gonna do that, and then you're like, okay, I have no idea who said they were gonna do this. So yeah, it's an area that we're actually spending quite a lot of time trying to figure out. I think we get away with it quite a lot at the moment because we actually create two audio streams. We use this microphone and system audio split, so we can confidently say it's the microphone or confidently say it's the system audio. What do you mean, the system audio? System audio is the audio from the computer. So it's like a multi-channel, two-channel setup.

[00:18:09] Speaker 3: So you have two mics.

[00:18:10] Speaker 4: Basically, we have two mics running at the same time. And that means that for a one-on-one video call, great. We know exactly who said what. But for a one-on-one video call? For this? For this, absolutely. And like, you know, we can do, to be fair, the iOS app does do diarization through assembly. So we'll have different speakers labeled.

[00:18:32] Speaker 3: And from the intros, where we all said our names?

[00:18:34] Speaker 4: Yes. Essentially. Feature request. Yeah, I think it is a feature from Assembly. But yeah, so there are scenarios, but again, there are large gaps where we don't do a good job. That's hard; figuring out how to do that is tricky.

[00:18:51] Speaker 2: Yeah. Yeah, I mean, as I said, in research there are generally two roles, moderators and participants. Sometimes translators are also in there. And one of the first things that we did when we started using Assembly, I think two or three years ago, was to build that name and role identification: scanning the transcript and getting an LLM basically to discover the names of participants and the names of moderators and assign the role based on what they were doing in the conversation. And this was very important. We actually then gave users the option to verify whether that was true, so, you know, they have to confirm if there was some ambiguity to that. But nowadays you guys just support that out of the box, which is, I think, really, really useful and sort of makes it easier for people to adopt transcription.

[00:19:42] Speaker 3: We have it easy with our two participants. Now that I think about it, it's easy. Yeah, sorry about that.

[00:19:49] Speaker 1: We've got channels right here. I mean, we've talked a little bit about real-time versus post-call. I mean, there's a number of questions, I think, from the audience just around building with real-time, voice agents, et cetera. I don't think, at least as far as I know, anyone has a voice agent live today. But maybe you could talk a little bit about how do you make those design decisions between post-call versus real-time? And how do you weigh off some of the trade-offs that you see there?

[00:20:16] Speaker 3: Yeah, so we have some demand from customers for real time, but we haven't really gone there. That layer down at the bottom that I talked about, the integration layer, is a complex mess of horrible integrations to APIs on old legacy call systems that are pants. They go down, they fail. I was a technical founder, so I was involved in building a lot of this stuff at the start, and it's horrible. And asking the team now to connect to the stream, transcribe it in real time, and then actually get it into our API in real time, instead of downloading the call after the call... Normally we receive an event, or we can run a report every two minutes for how many calls ended in the last two minutes, and say okay, just give me those transcripts. That's a much easier loop. So for us, the gain wasn't enough to go down that road. But what is important for us, and one of the large wins for us in high-volume contact centers, is that we assess every call that's happened in the last 30 minutes, and then we compare it to every 30-minute interval over the last 30 days, and we look at what themes are appearing and whether there's anything unusual. So we know that at half seven on a Wednesday evening, there'll be this many calls about cancellations and this many calls about password issues. But, oh hey, there aren't normally calls around this payment problem. And then we can build an alert around that and quantify it and get action fast. But that requires us to have the transcripts in the system as soon as possible after the call ends. So for us, the real-time is more around, we call it near-time, we call it real-time on our website, but really it's post-call: we hammer the API and we need a response really quick. And often in those circumstances there's an influx of calls, so then we need to scale that up very quickly to transcribe all of those calls and get that result back to the user,
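
A minimal sketch of the near-real-time alerting idea just described: compare theme counts in the latest 30-minute window against the same window over previous days and flag anything far above its baseline. The thresholding, data shapes, and numbers are illustrative assumptions, not EdgeTier's production logic.

```python
from statistics import mean, stdev

def unusual_themes(current: dict[str, int],
                   history: dict[str, list[int]],
                   z_cutoff: float = 3.0) -> list[str]:
    """Flag themes whose count in the latest 30-minute window sits far above the
    counts seen in the same window on previous days. Purely illustrative; a real
    system would handle seasonality, sparse themes, and brand-new themes."""
    flagged = []
    for theme, count in current.items():
        past = history.get(theme, [0, 0])
        spread = max(stdev(past), 1.0)          # floor to avoid zero-variance blowups
        if count > mean(past) + z_cutoff * spread:
            flagged.append(theme)
    return flagged

# Counts of calls per theme in the same Wednesday-evening window on previous days.
history = {
    "cancellation": [40, 35, 42, 38],
    "password reset": [12, 15, 11, 14],
    "payment error": [0, 1, 0, 0],
}
current = {"cancellation": 41, "password reset": 13, "payment error": 19}
print(unusual_themes(current, history))  # -> ['payment error']
```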

[00:22:06] Speaker 5: so we're real-time-ish, yeah.

[00:22:09] Speaker 3: But we do get a good bit of demand for it, and I think there's a product opportunity for us in the future in connecting to the stream and delivering some of the alerts live to either supervisors or agents within their systems, so that while they're on the call, they're actually receiving either alerts or warnings or information. But the call systems are also moving in that direction. So it's a little bit of a should-we-do-it, should-we-not feeling.

[00:22:33] Speaker 4: Yeah, from Granola's point of view, the desktop app, as I mentioned, is real-time. It was quite an important product decision we made early on. The decision was focused a lot around just reassuring people, because you can see the words coming in. We see that people, when they start using Granola, will open the transcript and look and make sure it's okay, and then they get confident and they don't have to look at it anymore, which is great. But I think it just reinforces that, okay, cool, it's working, it's great. We do have some neat features in the app. You can ask the chat, like, what did I just miss? That's my favorite feature. Yeah. Yeah. There's no doubt for me. It's actually the first thing I built when I joined Granola, like a year ago now. It's saved me in a lot of meetings. So that's really nice. And obviously, that builds on the fact it's real-time; you couldn't do that if it was batch. But then, as I mentioned, the iOS app is batch because of other constraints, the technical constraints and network constraints. So yeah, for us, it's really driven by the product, it's really driven by what makes the most sense for a user. And I think we're pretty happy we chose real-time to begin with, and yeah, so does Will.

[00:23:48] Speaker 2: Nice. For us, at the moment we're not really using any real-time, but there are three product use cases that we're thinking about where it really makes sense. One of those is providing a backroom for clients and stakeholders to basically be present during the calls, get transcription of what's going on, and be more involved in that process. That's often quite important for relationship building when you're conducting research between, say, an agency and an end client, and getting the stakeholders for whom the research is actually done involved in that process so that they can sort of tune the direction in which things are going. The other thing that we're thinking about a lot is using real-time transcription to actually help support the moderators themselves as they're doing those interviews: leveraging the transcription to get insights like, okay, maybe you should focus more on this question, something interesting was mentioned here, go deeper into that. There are many tools for this in sales, right, where you want to support your salespeople as they're doing sales calls so they can learn on the call, and we're thinking about similar things for interviewing. Then the last thing is AI moderation: getting an AI interviewer going, with AI transcription happening live, so that you can support a speech-to-text and then text-to-speech pipeline for voice agents conducting interviews fully autonomously. I think there's a big opportunity there, especially in doing real qualitative research with AI agents that get you the depth that you need to really understand your customers.

[00:25:32] Speaker 1: Yeah, nice. I'll ask one more and then we'll just open it to the audience for questions, so maybe we can close this panel with: what are you most excited about in voice? We can leave that as a parting thought and then anyone can start asking questions. Anyone want to go first? Are you going to copy the same answer as earlier?

[00:25:56] Speaker 2: I think for us, the most exciting thing is global language support across all languages.

[00:26:05] Speaker 1: It's funny. I'm also excited about it.

[00:26:09] Speaker 2: Should have gone first. We do research for companies that operate globally, and so one of the key challenges that we have is in those lower-resource languages where there's a very big, important user base and customer base for these customers, and right now we really struggle with that. It's really important for us to be useful across everything, because ultimately we want to be the customer context layer for all of the research that you do, and so if you only get to do part of it with us, that's not great. And so, yeah, we hope that someone will crack the nut of solving transcription for low-resource languages.

[00:26:52] Speaker 3: Global language support, also. Yeah, no, I was thinking about this briefly. So the language stuff, actually, is also applicable for us. A lot of the contact centers we work with... we work a lot in Europe, and actually the rest of the technology works really well multilingually. So search, summarization, searching through summaries: if you summarize everything and you search through it, it becomes really powerful. If you've got 15 languages, you can kind of search for stuff. We do get a lot of, oh, this one's in Flemish, but it's being detected as Belgian. OK, well, it's close enough. It's like magic in a computer, what do you want? But obviously, the better that is, the better. The other piece for us, then, where I think there's opportunity, and I was speaking to someone about it earlier on: if you imagine what we do on voice, we apply the same pipeline that we have for chat and written communication. So the detection of frustration, gratitude, confusion is based on models that look at the sentence and sentence vectors, that kind of stuff. Someone saying, like, I'm confused, obviously that's confusion, but actually using the voice recording to detect tonality and stuff is an area that I'm kind of keen to go on. So when someone says, like, it's fine, you know, I said it's fine, we'd be like, excellent. Whereas I think there's a layer of maybe contextual understanding that we're missing there. Now, no one's complaining about it, but I think it would be a powerful thing to show the agents how they're doing. We do get a lot of people trying to show their agents where they're performing well: where customers come in with an escalated kind of emotional state, they're annoyed about something, but the agent actually handles it quite well, and they leave the conversation with a thanks very much, you sorted that out. I think that's a real skill for customer service, which is a hard job in those situations and getting harder. So that ability to detect that more reliably, I think, would be great for us. That's the sort of thing we're looking at. Obviously doing that in Flemish would also be good.

[00:28:50] Speaker 4: Yeah. I mean, for us, it's like speaker identification and diarization is probably the most exciting thing for us. I think it's like, as I mentioned, like because we do real time, the downside of that is that real time diarization is hard. It's like a hard thing to do. And, you know, if we can, if we can crack that, if we can get properly, like this was said by this person, it just unlocks so many things that we can do downstream, so. Yeah, nice.

[00:29:14] Speaker 1: Cool, well, questions from the audience? Go ahead.

[00:29:19] Speaker 6: I just had a question for Adrian. I was curious about the low-resource languages, because there are, let's say, five that you could have tomorrow. Which languages would they be?

[00:29:27] Speaker 2: We would like Telugu. We have a lot of problems with that. It's a good question. I don't know all of them off the top of my head. There's Telugu, there are a couple in basically South Asian countries that are quite challenging, Telugu is one of them, and there are a couple from the Philippines that I don't remember, yeah. Tagalog?

[00:29:56] Speaker 3: Yes. Tagalog, I suppose. Tagalog, yes. Better Arabic would be good in that area; we get a good bit of that in queries, and we kind of turned them down. And then the Asian markets: we were surprised it actually worked quite well in Japanese. Some of the Chinese dialects, we get a bit of, oh no, this is Taiwanese, no, this is Cantonese, but then when I look at our database, it doesn't actually have a language code for it, it's just Chinese. So we have some improvements to do there as well. Flemish does come up a lot, actually, for the European travel operators. They don't want us to speak to them. People ask us to do things in the African markets, but we just say no, because we know it's not just AssemblyAI that would fail there; every other downstream process would also fail. Our large language models would struggle, and we do a bit of semantic tagging on meaning within sentences, so not keywords, but all that would fail as well. It would just be really difficult. So we kind of stay out of that market for now.

[00:30:56] Speaker 2: No results is better than. No, yeah.

[00:30:58] Speaker 3: It's just that there are a ton of languages where we don't feel we could serve those customers.

[00:31:02] Speaker 4: I think for Granola, there are some languages, obviously, but actually the interesting thing for us is that we really want multi-language in the same model, because from a user experience point of view, we don't want people having to choose, especially if you're a multilingual speaker: oh, I'm speaking in French today, or, no, I'm speaking in English. Actually, you combine them often in the same meeting.

[00:31:24] Speaker 3: The new one plays that game, doesn't it? What? The new model does something like that.

[00:31:27] Speaker 4: Yeah. Yeah, yeah, yeah. And I think the challenge there is that you want that, and you have a certain set of languages, but then you start expanding that language set, and it gets harder and harder for the model to understand the differences between languages. And so, yeah, there's a trade-off there around how good they can be.

[00:31:42] Speaker 3: It's amazing how fast our customers', people's, expectations of technology have changed. You're showing them things that would have been literally magic only three years ago, and they're like, well, I mean, it's Flemish, it's okay. Yay.

[00:32:01] Speaker 1: Yeah, right.

[00:32:01] Speaker 7: So yeah, my question is about getting emotions from it. You mentioned it earlier. So I'm trying to imagine, assuming someone's angry: if you're talking to an AI via chat, they're just like, why are you talking to us? You are wrong. And it's like, OK, sorry. But if you're doing this with voice, how is that going to work? Is it a harder problem? How is that going to work?

[00:32:26] Speaker 2: There's a company working on this, I think, called Hume.ai. They're basically looking at both facial expressions and your voice, so not only incorporating the semantics of the things that you're saying, but also how you're speaking, and basically, are you being loud.

[00:32:44] Speaker 8: There's someone here I was talking to doing this, yeah, hello, it's Defano, it's Demia,

[00:32:52] Speaker 9: I have a question for Shane. In the early days, what specifically unlocked traction for your business: was it distribution, positioning, or timing? And what were the biggest objections around trust and reliability with voice AI? I'd love to understand how you use this data.

[00:33:20] Speaker 3: You're assuming that we're working well. The early days: we had some credibility from our backgrounds and things we had worked a little bit on. We had built a chat platform before, and we had a few customers there that we leveraged to get in. What unlocked scale for us was actually narrowing our ICP a lot. We went through a period where, if you talked to me in like 2017, 2018, we would have sold to anyone, anyone who would talk to us. We would be like, yeah, we've got what you need, we've got it. It's a telecoms provider, yeah; it's an insurance company; it's a small shop down the road; let's go, chat system. And a big change for us was probably in '21, '22: we got the idea of ICPs, and someone really sat us down and said, lads, you're messing. Focusing in on a really tight ICP, you know, you're going after contact centers with more than 30 seats, that work in Europe, that are in the travel sector, that are an OTA provider, that have recently hired, like, a head of customer experience, and you keep narrowing that down until you get to, like, 70 on a list. And then you narrow it further and say, okay, I'm going to get a meeting with every one of these 40 customers, rather than this. That was one big unlock. And the other was that we moved away from having a product that required you to completely rip out your existing stack and retrain everyone. We were like 10 people in an attic, and we were asking these customer operation centers of 200 people to retrain their entire organization. And we were like, this is great. But then we'd get to their security guy and he'd be like, get out. So changing to a product where we could lower the buyer risk changed a lot there. Being able to say to them, you can try this for a month, it's not gonna affect the rest of your team, and you can turn it off if you don't like it: that changed it a lot. And then we also had the messaging for that very specific persona: when we know that you're a person with 40 customer service agents, here are your problems, here's what we do. Those two things enabled an outbound motion to start to flow, and that was a real change for the pace of our business. But in saying that, I've been really shocked at the variance in customers' security posture. We've had large customers that we go live with where we have to go through, you know, three months of security stuff, because we are taking data that is, they say, what personal data? Well, whatever the customer says. And then at the same time, I've also had customers where we email them, they agree to go live, and then they email us their Salesforce admin password the next day in a plain-text email: yeah, yeah, cool, it's up there. So you can just get a bit of luck in. That was lucky, right? We were live with that customer a day later, with zero effort, and for the same value, so there's just a bit of luck to that there.

[00:36:12] Speaker 5: I think we've talked about UI and UX for all three products, and we also touched upon the product side. My question is, I guess everyone must be spending a lot of time on this; a lot of discovery is happening. So for your product or roadmap, what are you thinking about, or what are you building, for agents? We've built enough for humans; now what are you building for agents?

[00:36:35] Speaker 2: That's interesting. Maybe last year we started thinking about, as the world moves more towards agents, how does that impact the way our product will be used? This hasn't changed enough yet to the degree that we're comfortable building for it. One of the things that I've found really useful is, for example, using our docs in Claude Code, and I often wonder, we should have a CodeOop MCP, so that effectively all of the customer research we have about our own customers could be used as we're developing. And in a similar vein, bringing CodeOop into Slack, into email, so that we're where the work happens. So I think as it becomes more clear what tools are being used by humans, or indeed operating autonomously, we want to make sure that we're there to provide that context that we've gathered about your customers.

[00:37:38] Speaker 3: We've started this journey; we're currently having healthy debates on how much we should do. We've built an MCP server, we have that working internally, we haven't given it to customers yet. We're also building an agentic system in the app that allows you to kind of free-text access the data: you know, why were customers annoyed yesterday, and it queries the database effectively. There are probably a load of really interesting pricing questions about where our value is, but one consistent piece of feedback we've had is that with a really large interface to find your way through the data, there are some customers that use it all the time and they're really good at it, and there are some that just don't learn, they won't learn, and they're not arsed, and I get it: it's a big enough interface, it's a lot of different pieces, we're nerds who made an interface, you know, we've got some product guys, but it's not perfect, there's loads of stuff I'd like to improve. And that learning thing is a real adoption risk, or adoption blocker. So we see the agentic approach as a way to gain more usage, but now it becomes: do we build more to expose to the MCP, or do we keep investing more in the UI, and at what point will there be things you can do in the MCP that you can't do in the interface? And we haven't hit that. At the moment the MCP is catching up with the interface, but I can already see there are some questions you can ask that you couldn't achieve in the UI, because it'll do multiple queries and then aggregate all the results together, and I'm looking at it like, we just won't build that in the interface, there's no way we'll do that, because people ask mad things. One thing we're challenged with there, though, and it'd be interesting to hear, actually, as a show of hands, how many people use MCP servers here. So this is a tech-enabled room, right, all talking about voice AI, and I'm not sure how common that is in, you know, customer service teams. So we're toying with: if we release the MCP, the expectation is that it'll be free, but we have a really good agent in our system that's using tokens from our system. If you set up your MCP using your tokens, we don't really care. So where's the value then for us? How do we make our agent better than the MCP? Because that's where we add value. But I think there's a reductionist argument for us that the value actually becomes our data processing and cleanup and ancillary information that makes it really good for agentic use cases. But we're hotly debating how we go to market with this at the moment, because it's really confusing. Do we build more UI? Do we build MCP? Do we forget about it all and just tear the whole thing up? Yeah, it's challenging.

[00:40:22] Speaker 4: Yeah, I mean, it's a similar vein, I guess. We have an MCP. We released a personal API today, actually, that you can use. So we're definitely trying to meet people where they are right now, but also trying to think about how we embed and incorporate these kinds of agentic flows into our products more natively as well. And yeah, I mean, Claude Code is just over a year old at this point. It's such a new thing; I don't think anyone really understands where this is going to go. We've seen it be really valuable from an exploration point of view, from being able to grab the right context and pull in really interesting insights across your meeting notes. But ultimately our goal is to make work, I guess, easier for people and less stressful and more of a nicer thing. So we don't want to make more work for people, and we need to be careful and thoughtful around how we build these things in a way that lets people do the things they want to do and gets out of the way otherwise.

[00:41:36] Speaker 3: Yeah, but I can see a world where all your work is done in either Slack or in cowork and you just get all your MCP servers there. So that's a potential future for this type of tool.

[00:41:46] Speaker 8: Yeah, deviating a bit from the conversation on UIs and integrations, more from a problem-solving perspective, we've got two of these problems. How do you handle mixed-language conversations? And how do you handle a choppy, noisy conversation with a lot of background noise? Because I think one of you said you delete the audio and just keep the transcript. That kind of totally keeps you away from going back to the tape and saying, where did it go wrong? Did that person say that, or did something fall on the ground?

[00:42:18] Speaker 4: No, yeah, that's right, we don't keep any of the audio components.

[00:42:21] Speaker 8: So how are you solving both of these, or any of these problems?

[00:42:25] Speaker 4: Uh, I mean, realistically, it's still a problem. I'm not sure we have a good solution there. I guess the two things are around how to deal with noisy environments and mixed language. For mixed language, yeah, we're very much relying on these multilingual models from Assembly doing a bit of the heavy lifting for us. We do have the added benefit at Granola that, as I said, the transcript is the first layer, but the notes are actually what people are mostly interacting with and looking at. And LLMs are really good at understanding and papering over, you know, mistranscriptions or misfits in conversations. I think that helps us as well with the second thing, around dropped words and noisy environments: you're able to kind of get the actual core thing that people are trying to convey, regardless of whether you've captured every single word. So yeah, it's a challenge for sure. And I think there are lots of different approaches there, but the approaches are mainly, I guess, one step above the transcription, rather than trying to fix the layer below.

[00:43:30] Speaker 3: We don't have a good solution for that. And I am surprised how many customer service calls have recordings of entire movies in them, where someone just left the phone. We get these alerts for this weird language; it's like, why are they talking about weapons? And then you listen back, and they're like, oh, yeah, I'll help you in one second, and then they just leave the phone there for a second, and there's a radio, or a TV, or the news, or something on, and we just see the whole news transcription appearing. We don't have a solution for it yet. I actually like the idea of running an LLM to kind of reduce that stuff. Because when you listen back to the call, again, this is magic, but to our customer, they're like, well, obviously don't transcribe that. And you're trying to say, well, it is in the call, you can hear it. And again, the multilingual stuff comes up quite a bit. A lot of customer service conversations start with, you know, hola, a bit of Spanish, and then straight into English. And we'd detected the language in the first few sentences, and so we'd just switched then, because we've got a translate button that translates it, and we didn't have the right labels. So we have had to change that recently from a language per interaction to a language per message, so that we can accurately transcribe and move between languages.

[00:44:40] Speaker 8: Some languages borrow from, or use, other languages.

[00:44:44] Speaker 3: Yes, yes.

[00:44:45] Speaker 8: What language is that? That language or the other?

[00:44:50] Speaker 2: That's even more difficult. You guys got this?

[00:44:53] Speaker 1: Yeah. I was just going to say, right, some of the traditional speech-to-text models, when they try to handle mixed languages, are just going to try to predict the whole sentence in one language, and so you're going to end up with some nonsensical sentence. Some of the research we've been working on, in the model that we brought to market in the past month and a half, it actually has an LLM in the decoder, and so it uses the LLM's context to bias the transcription. And so if you're doing things like Spanglish, for example, if you're actually doing multiple different words from different languages in the same sentence, it's not going to try to predict it all as English; it's going to actually use the context to do that prediction. So we've found some really interesting behavior that we didn't even really intend out of the box. One example a lot of customers bring up is that Quebecois, which is French-Canadian, is a totally different dialect and almost like a different language, and it's super interesting: with the LLM in the decoder, it actually perfectly transcribes that. And so you see some of these emerging phenomena in the research that we're doing. This is going to get better, but of course it's hearing from customers like yourselves how important it is that helps drive that direction. It's very exciting. And on the noise side of things, you try to make the model as robust to noise as possible, but really at the end of the day, transcribing the TV is actually, like,

[00:46:13] Speaker 9: maybe it's amazing, it's actually what's happening.

[00:46:16] Speaker 1: What we're trying to do there is make it so that you as the end user can kind of control a little bit more of that transcription. And so our most recent model is also promptable. And so you can do things like ignore background noise or transcribe everything, don't miss a single word, I want everything in this transcript. And the idea being that you as the end user could get those results rather than you just get our transcript and you have to figure it out later and sort out whatever result we give you, you have a little bit more control over that output ultimately. And so, yeah.

[00:46:45] Speaker 8: because that helps, like, to say this is television.

[00:46:49] Speaker 1: Yeah, so you can do non-speech tags as well, yeah. So non-audio tags, that kind of stuff. It's not as advanced right now as, like, "television." It would say something more like "noise" or, you know, "speaker," or something like that. But that's the kind of place we want to get to eventually.

[00:47:07] Speaker 3: Yeah. Cool. Hold music has some good lyrics as well.

[00:47:12] Speaker 10: Yeah, I can start there. I think, yeah, models getting smaller on device is really interesting.

[00:47:39] Speaker 4: I think for us, a lot of that still feels like it's in its infancy. We still rely on really high-quality transcriptions, and I guess we don't have the constraints. iOS is probably the exception, but if you're on the desktop app, you're probably on a video call, you probably have good internet, and transcription is not going to be the main bottleneck there, so running a local model is not as important for us. But I think there are definitely interesting avenues to explore around running a local model in parallel, maybe focused on a different part of the transcription, or being able to complement the two and trying to combine the two at the end. Yeah, not sure. It's interesting.

[00:48:24] Speaker 3: So I think models are gonna, we're gonna hit, like, a law of diminishing returns eventually, maybe. Maybe we'll see infinite gains and we'll all be screwed by the AI overlords, whom we love. But, you know, there's a likely scenario where it starts to kind of taper off. And I think there's going to be a big rush to cheaper tokens then, right, where you can get a proportion or most of the way there, which you already can with open source, and companies will start to offer tokens at a much lower cost. Another big change that I think will change things maybe more than cheaper tokens is faster tokens. I did see some research recently where a company had taken the weights of a model, a fairly basic model, it was a Llama 3, one of the older ones, and instead of putting it in RAM and going through the kind of large compute in memory, they had encoded each weight into a hardware chip. So they had a single chip that was just this model. But the speed then, you kind of have to... It's 14,000. It's like, yeah. So you can try it; I think the website is chatjimmy.com. But if you try it out, you kind of have to see it to believe it. You go on that on your phone and you type, write me some code to get a voice call to AssemblyAI in Python, and press enter, and all of that code is generated before you take your hand off. It's shocking. And I think we're so used to the, I'm on co-work, I give it some work, I press enter, I watch the spinny thing, that there's some mad transformation going to happen when you press go and the PR is generated before you look away. And then that'll happen locally, and then, you know, there are just so many more use cases going to come out of that. And the other thing is that, at that speed, the review is done, the comments are done, the merges are done, the testing is done, all within a minute. And I think that's where it's going to be the biggest change. I don't know if that'll move local, but I'd say if the API prices remain high, then there'll be a push to local; but the API prices remain OK.

[00:50:31] Speaker 10: I suppose that's for your use case, but I've seen models running in the browser that do sentiment analysis, so I just think it's going to be an interesting world opening up where it's like, OK, fine, maybe you don't do the really hardcore stuff. One of the alternatives to making one big model do everything was basically breaking it up into microservices of fine-tuned, specialised, smaller models.

[00:50:55] Speaker 3: I know Fergal in Intercom talks a lot about this, if you follow his blog, and he covers a lot of it: they took some of the Qwen models and fine-tuned them and got better, cheaper results running on EC2 servers by focusing on individual tasks. Now, at their scale, that could make quite big swings. That hasn't been worth the effort of fine-tuning for us yet, but I think that's part of the way it'll go. So at the moment, actually, to keep costs down, we still use traditional, like, ancient five-year-old, you know, normal machine learning models to do emotion detection, because they work fine, you know?

[00:51:30] Speaker 9: Yeah.

[00:51:31] Speaker 2: I would say we're definitely in, like, the dial-up era of AI, right? It's like, it's so slow, and I think we'll look back on this and just be like, why, why were we okay with this? Yeah. I think for us, actually, I think we have a slightly different view in that quality is paramount for us almost above all else. The people that use our AI transcription are used to paying hundreds of dollars for a transcript. What they care about is, they don't really care about cost, they care about the quality of that transcript basically being as close to perfect as possible. I think it's going to be quite a long time for us before transcription is going to be at a point where we can deliver on that promise effectively and repeatedly, and so yeah, we'd be happy to pay more for a transcription model that did way, way better, basically, I would say. Yeah. Awesome.

[00:52:29] Speaker 1: Well, we'll do one more, yeah. Appreciate it. Yeah, one more question. Go ahead.

[00:52:33] Speaker 4: I think there's a chance that society changes here and it becomes acceptable, but I also think that people want to be able to have privacy, and I think that's kind of always been the case. And I think it's an interesting problem for Granola, because obviously as a product we get better with more context, and so we want to be recording everything, but as a product we are very human-centered, and therefore we should be following and being opinionated on what we think is okay and what's not, and in a work setting that's different to a personal setting as well. So yeah, there are trade-offs. We're not planning on doing any kind of wearable at Granola in the near future, but, you know, who knows? This might change. People might get more accepting of it. We're not sure.

[00:53:45] Speaker 3: I probably sit in the same bell chair. Bit weird, isn't it? Yeah. I mean, like if I met someone here and I was recording the chat we had, I think it, yeah. At the same time, do I want something that I could ask, Did I agree to bring the kids to the basketball practice last week? Yes, I want that. But would I wear that around the house? Probably not. I don't know if this is a work context or a personal thing there. Is it expected that if you're on a call with someone now, they record that call and put it into their large language model notes?

[00:54:18] Speaker 7: Probably is kind of acceptable, is it? Is that just kind of OK now?

[00:54:22] Speaker 3: So maybe that just spreads slowly and we all become censored.

[00:54:27] Speaker 2: Yeah, it's often the case where I'm like, I just wish I'd recorded that in the office. I think definitely in a work environment, as long as it's super clear where that's happening, it could be quite useful. Obviously there would be exceptions to that, like private meetings, one-on-ones, et cetera, where maybe you want more privacy. But I feel like we're getting into this world now where we can operate a lot more, I think as you guys have on the wall, be a lot more present in things, because we know that the context is being gathered for us and we can go back to it later, and I find that really compelling. But equally, I think there's a big question around how you do that while making sure that people have the privacy they need. And what does a general solution for that look like? I'm not really sure, but I think that's a neat problem to solve.

[00:55:32] Speaker 1: Cool. Well, that will conclude the panel. Thank you to our awesome panelists.

AI Insights

Summary
Panel moderated by Ryan (AssemblyAI) with founders/builders from CodeOop (customer interview analysis), EdgeTier (conversational intelligence for high-volume contact centers), and Granola (meeting notes). They discussed end-to-end voice AI pipelines, emphasizing transcription as a base layer and the large amount of work in ingestion/integrations, post-processing, domain adaptation, and UI/UX to make insights usable.

CodeOop augments transcripts using project discussion-guide context, term boosting, LLM-based correction passes (including phonetic/terminology fixes), and evals built by injecting errors into clean transcripts; speaker role accuracy (moderator/participant) is critical. EdgeTier runs a large ingestion layer across many channels/providers, normalizes into a unified schema, stores in Postgres, and runs queued post-processing to add semantics (topics, emotions, keywords) and proactive alerts; most effort goes into ancillary, language/market-specific models and building flexible UI for diverse queries. Granola performs real-time streaming transcription on desktop (mic + system audio split for implicit speaker separation) and batch transcription on iOS; transcription quality matters most as a foundation even if users focus on summaries, and privacy constraints (no audio storage) complicate benchmarking.

They debated how to evaluate transcript quality beyond WER, including “vibe testing,” readability vs perfect timestamps, and speaker diarization. Real-time vs post-call decisions depend on product value vs integration complexity: EdgeTier focuses on near-real-time post-call alerting, Granola values real-time to reassure users and enable “what did I miss?” features, and CodeOop sees future uses in backroom stakeholder viewing, moderator assistance, and AI-led interviews.

Excitements: better global/low-resource language support, robust mixed-language transcription, improved diarization, and adding vocal-tonality-based emotion signals. Audience Q&A covered low-resource languages (e.g., Telugu, Tagalog), mixed-language/noise handling, emotion detection approaches (including prosody), agentic/MCP/API access vs UI, on-device/faster models, and privacy/acceptability of pervasive recording.
Title
Building Voice AI Products: Pipelines, Quality, and Real-Time Tradeoffs
Keywords
voice AI, speech-to-text, AssemblyAI, transcription, speaker diarization, mixed-language transcription, low-resource languages, domain adaptation, post-processing, LLM correction, contact center analytics, meeting notes, real-time transcription, near-real-time alerts, emotion detection, prosody, UI/UX, integrations, MCP server, agentic workflows, privacy
Key Takeaways
  • Great voice AI products require substantial work beyond the STT model: integrations/ingestion, normalization, post-processing, and UI/UX.
  • Transcript evaluation is hard; teams rely on dogfooding/vibe testing plus quantitative evals, and readability can matter as much as strict timestamp fidelity.
  • Speaker attribution/role labeling is often as important as word accuracy for downstream analytics and meeting action items.
  • Domain adaptation commonly uses contextual metadata (discussion guides), term hints, and LLM-based transcript correction passes, validated with synthetic error injection evals.
  • Real-time transcription is chosen when it unlocks user trust and live features; otherwise post-call/near-real-time can be simpler and sufficient.
  • Multilingual and mixed-language conversations remain a major need; low-resource languages and dialects are frequent blockers for global products.
  • Noise and background media can pollute transcripts; mitigation may involve non-speech tagging, prompts/instructions, or downstream summarization to ‘paper over’ imperfections.
  • Emotion detection benefits from combining text semantics with vocal cues (prosody/tonality); generic models often need language/culture-specific tuning.
  • Agentic access (APIs/MCP) can reduce UI learning curves, but raises pricing/value and product focus questions.
  • Privacy constraints (e.g., not storing audio) improve trust but reduce debugging and benchmark capabilities; clear consent norms will shape adoption.
Sentiments
Positive: Upbeat, pragmatic builder tone with excitement about multilingual advances, diarization, speed, and agentic tooling, balanced by candid discussion of messy integrations, evaluation difficulty, and privacy concerns.