Exploring Ultra-Fast AI with Advanced Models
Discover how integrating the latest LLM and text-to-speech models results in high-speed AI performance with minimal latency, featuring Deepgram's innovations.
World's Fastest Talking AI: Deepgram + Groq
Added on 01/29/2025

Speaker 1: What happens when you combine the fastest LLM with the fastest text-to-speech model? Hey, my name is Greg. Hey Greg, nice to meet you. How can I help you today? Well, you get a really fast AI. I teamed up with the team at Deepgram, who sponsored this video, to pressure test their new text-to-speech model. But we're not just going to use any old LLM. Not even GPT-3.5 Turbo will do. We're going to use the new Groq API with its insanely fast tokens per second, because I want to find out: with latency, how low can you go?
Let's review the pieces that we'll need. We're going to need three pieces to build our conversational AI, and it all starts with the audio. This is the audio coming out of my mouth into my computer's microphone. So piece number one is a speech-to-text model, or a transcription model, or an STT model as the cool people like to call it. Let's just say I say, "I like cookies." Well, it's going to tell me, hey Greg, "I like cookies," and it's going to give me that as a string. Then with that string, we're going to go ahead and pass it to our language model. This is the LLM we're going to work with. From there, we're going to get a response out the other end, and it's going to say, "Great, me too." It likes cookies. That's awesome. And then finally, we're going to wrap it up with a text-to-speech model, or a TTS, which is going to turn that text we just got back from the language model back into beautiful audio. Of course, this process doesn't just run once. There's not just one command. We're going to loop this all around until we say an exit word, or something that says, hey, this conversation's over. Then we can exit the program.
So diving in on the first piece, for the speech-to-text model we're going to use Deepgram Nova-2. This is their latest model, and I found it to be the fastest and the most accurate for our use case. They also support a whole bunch of different variants, which I thought was pretty interesting. They have the base Nova-2 model, but then they also have Nova-2 for meetings, phone calls, finance, conversational AI. These are all Nova-2 models that have been trained for different scenarios. For example, if I'm doing a drive-through app like that one down at the bottom right there, well, maybe I want to use the Nova-2 drive-through model because the audio is a lot more choppy. Who knows? All right.
They also support streaming, which is very cool. The other really cool thing Deepgram does when you do streaming is endpointing. Endpointing is when it notices a natural break in the conversation and says, hey, it looks like the person is done talking here. So again, I'm going to start speaking, I'm going to finish speaking, and then Deepgram is going to notice a pause. When it notices a pause, it's going to say, hey, I think Greg just stopped speaking, and I'm going to give you an endpoint, meaning I'm going to set a flag that says I think this is done. What it does is set a flag within the response called speech_final. So while you're speaking, speech_final will be false. But when you actually get the final data back after it's determined an endpoint, speech_final will be true, and then you know you can continue on with the rest of your app, which is nice. All right.
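For reference, handling that speech_final flag might look something like the sketch below. It assumes the JSON shape of Deepgram's live-transcription messages (a transcript under channel.alternatives[0] plus a top-level speech_final flag); the TranscriptCollector helper is just illustrative, not part of Deepgram's SDK, and the websocket wiring is omitted.

```python
# A sketch of the endpointing check, assuming the JSON shape of Deepgram's
# live-transcription messages (channel.alternatives[0].transcript plus a
# top-level speech_final flag). TranscriptCollector is an illustrative helper,
# not part of Deepgram's SDK; websocket wiring is omitted.

class TranscriptCollector:
    """Accumulates partial transcripts until Deepgram signals an endpoint."""

    def __init__(self):
        self.parts = []

    def add(self, piece: str):
        if piece:
            self.parts.append(piece)

    def full_sentence(self) -> str:
        return " ".join(self.parts).strip()

    def reset(self):
        self.parts = []


collector = TranscriptCollector()

async def on_message(message: dict):
    # Each message carries only a chunk of the utterance, not the whole thing.
    sentence = message["channel"]["alternatives"][0]["transcript"]

    if not message.get("speech_final"):
        # Still mid-utterance: keep collecting pieces.
        collector.add(sentence)
    else:
        # Deepgram detected a natural pause: flush the full utterance.
        collector.add(sentence)
        print(f"Human: {collector.full_sentence()}")
        collector.reset()
```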
I want to show you the single part that just does this transcription of my voice so you can see how this works. The important part here is going to be this async function called on_message. on_message means, hey, Deepgram just sent us a little chunk of data, and we should go do something with it. Right, what are we going to do with it? Well, we're going to pull out the sentence that it gives us. Keep in mind, this isn't the full thing that I said, it's just a chunk of it, because Deepgram sends it to us in little pieces. So we're going to get our sentence. If I'm not done speaking, so if it doesn't find that endpoint we just talked about, then I want to add the sentence to a little transcript collector I made, which just combines all these pieces together. Now, if I am done speaking, well, add that last piece of the sentence, go get the full thing that I just said, print it out, and then reset the transcript that we had in here. By the way, all this code is available. Links in the description. Go check it out.
All right, let's try this out here. Hi. Okay, cool. It seems like it's working for us. All right, let me try saying a really long sentence. Problems, output, debug console, terminal, ports. Yes, it still has it for us. See, that's pretty cool, because after I'm done talking, it notices a natural break in the conversation and then it does the endpointing for us. You just need to make sure you don't talk too slowly, or else it'll pick up the pause as an endpoint, and that won't be good for us.
Now, for the LLM piece, what we're going to do is use Groq. Groq is a new model provider. They don't actually make models. They're not in the open source model game or the foundation model game. They're getting really good at serving models. They're making custom chips that they call the LPU, and they're really, really good at doing inference. So think of these as custom-designed chips that speed up these open source models. They don't make their own models, but they're going to serve them to us really quickly. The cool part is you can go try these models right now. Now, the API, I believe, is still under a waitlist, but you can go and try them in the UI. So let's go check that out. Tell me a long poem about trees. Let's see what we have here. And you can see, it just blows through all these tokens. 526 tokens per second. That is pretty insane.
All right, then let's check out what this looks like on the API side. So I've got two functions here. One does batch, and one does streaming. This is just so we can compare the two. Either way, I'm going to say, explain the importance of low latency LLMs. Let's see how quick this does in batch here. All right, so it's pretty quick, which is really nice for all this text. This would probably take a while if you did GPT-4. Now I want to show you the streaming side. So let me go ahead and clear this, do streaming, and see what it looks like when we stream it out. So: write me a very long poem about a topic. And that is an insane amount of tokens. This is really, really cool. Let me pause that, though. Either way, that's the LLM side.
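That streaming call might look roughly like the sketch below, assuming Groq's Python SDK (which mirrors the OpenAI chat-completions interface) and a GROQ_API_KEY in the environment; the model name is an assumption, not necessarily the one used in the video.

```python
# A sketch of streaming a completion from Groq. Assumes the `groq` Python
# package and GROQ_API_KEY in the environment; the model name is an
# assumption, not necessarily the one used in the video.
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

stream = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # assumption: any chat model Groq serves
    messages=[{"role": "user", "content": "Explain the importance of low latency LLMs."}],
    stream=True,
)

# Tokens arrive as small deltas; print each one as soon as it shows up.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```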
And so for the final piece, we're going to go from text to speech. The way we're going to do this is with Deepgram's Aura streaming. This is a new model that is just coming out. And if you think about it, Deepgram has been in the transcription game for a long time. They have access to a lot of audio data, which is really, really cool. So now they're starting to train their own models to go from text to speech, not just speech to text.
So what we're going to do is get our response from Groq. Let's just say it says, "I went to the park." We're going to pass that entire thing up to Deepgram, and then they're going to start doing their processing. Now, this is where the cool part of streaming comes in, because I want it to give me its data in chunks, one at a time, because I really care about the time to first data. I don't want to wait until it's done with all of its processing at the end just to get my first data. So the cool part is that it's going to do its processing and then send it back to my app, and I'm going to measure the distance between when I send the data and when it sends me its first chunk back. This is the streaming piece, and this will be the time to first byte. Then, right when I get that first chunk, I'm going to start playing the audio.
Now, you may say, well, Greg, what if it doesn't send us the chunks fast enough? We won't have the audio to play. Well, luckily, their models are so quick that they run faster than real time: they can generate one second of audio in less than one second of processing. So either way, you're going to play your audio back slower than it gets generated, and it all works out in the end. It's going to keep sending us chunks, and then finally, we're going to play the last little bit of data.
All right, let's go ahead and see what this looks like on the other end. So now we're on the text-to-speech side, and we're going to do some cool stuff here. This is our streaming request. You can see here, we're going to make a POST request, we're going to give it the Deepgram URL for the streaming text-to-speech, and we're going to say stream=True. Then the cool part here is for chunk in r.iter_content(), where r is the response to that request. iter_content gives us all the different chunks of data coming back to us. We're just going to do a little bit of timing here to see when our first byte comes in, and then we're going to say, hey, write this chunk of data to the ffplay process that we set up above, and that is what does the actual playing for us. So it writes out each chunk, which is really, really nice. Okay, all right, let's go ahead and try this out here. The returns for performance are super linear.
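For reference, the streaming request just described might look roughly like this. It's a sketch that assumes Deepgram's /v1/speak REST endpoint, the requests library, and ffplay from FFmpeg on the PATH; the aura-asteria-en model name is an assumption.

```python
# A sketch of the streaming request described above. Assumes Deepgram's
# /v1/speak REST endpoint, the `requests` library, and ffplay (from FFmpeg)
# on the PATH; the aura-asteria-en model name is an assumption.
import os
import shlex
import subprocess
import time

import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
headers = {
    "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {"text": "The returns for performance are super linear."}

# Start ffplay reading from stdin so playback can begin on the first chunk.
player = subprocess.Popen(
    shlex.split("ffplay -autoexit -nodisp -loglevel quiet -"),
    stdin=subprocess.PIPE,
)

start = time.time()
first_byte_time = None

with requests.post(DEEPGRAM_URL, headers=headers, json=payload, stream=True) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            if first_byte_time is None:
                first_byte_time = time.time()
                print(f"Time to first byte: {(first_byte_time - start) * 1000:.0f} ms")
            # Hand each chunk straight to ffplay so audio plays while we stream.
            player.stdin.write(chunk)

player.stdin.close()
player.wait()
```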

Speaker 2: The returns for performance are super linear.

Speaker 1: Awesome. And as you can see, we get a time to first byte of 272 milliseconds, so that's pretty quick.
All right, so now let's take a look at the entire process. We have a ConversationManager class here, and I have all the rest of the code up here. Again, links in the description if you want to go check it out. With our ConversationManager, I'm first going to start off by making a language model processor. The reason I do this is because I added just a little bit of memory to this conversation via LangChain. So as I'm speaking to the bot, it's going to keep track of those chat messages so we can actually have a fruitful conversation. Then what we're going to do is: while "goodbye" not in transcription_response.lower(). So goodbye is my exit word, and whenever I say goodbye, it's going to exit the program. First, we're going to get the transcription, and this is me actually talking. Then we're going to take what was just said, the transcription response, and give that to the LLM. Once we have that LLM response, we're going to give it to our text-to-speech model, which is what actually does the talking for us. And then we just reset the transcription response. We're going to have this run basically forever until we want to stop and say goodbye.
So let's go ahead and run this and see what it actually looks like. Hi, how's it going? My name is Greg.

Speaker 2: Hello, Greg. I'm here to chat. How can I help you today?

Speaker 1: Yeah, totally. Could you tell me a story about the San Francisco weather?

Speaker 2: Sure, San Francisco weather can be foggy in the morning and sunny in the afternoon. It's quite unpredictable.

Speaker 1: Oh, thank you for telling me. And one more question. What is my name again?

Speaker 2: Your name is Greg.

Speaker 1: Wonderful. Thank you very much. Goodbye.
Cool. As we take a look at how this went, you can see the language model latency. We're talking a little over a second here, about a second, and then half a second. So these are pretty slow in terms of latency compared to the time to first byte. Deepgram was 230, 278, and 300 milliseconds, and staying under 300 is pretty snappy, so I'm happy with these results. This just goes to show that if you're going to optimize anywhere, the language model is probably your first spot to look, and that largely comes down to the number of tokens being produced. The other thing I want to highlight is that this latency metric for Deepgram also includes the network latency. So if my Wi-Fi is slow, that's going to slow it down just a little bit. It won't be a ton, but it goes to show that the Deepgram processing number is probably a little bit faster than this, too.
Cool. Then to finish this off, I just wanted to leave you with a few considerations. What a lot of people will do is put filler words in their response to users in order to buy themselves more time, basically artificially putting in words to disguise the latency. So as the LLM is thinking, the voice might go, "Yeah, well, okay," and then, once it has your answer, it just goes for it. That's a way to disguise the latency, as opposed to just having silence there.
The other thing I didn't show here was the interruptions piece. You'll notice that I never interrupted the AI as it was talking. The reason I didn't show that is, one, it's pretty difficult and it would take a lot more code, and two, it's more of a software engineering problem to interrupt a stream of audio than an AI problem, which is what I wanted to highlight here. But just keep that in mind when you're building yours.
The last thing I'll say is, I posted a tweet about this, and Yohei actually said something pretty interesting: "This got me thinking: when I'm listening to someone talk, I can start forming a response before they are done talking. Applying it to this: somehow stream the speech into the model." So as I'm talking, give that to the LLM, then have the model predict the rest of the user's speech. As the first half of my sentence sits with the language model, have it predict what the second half of the sentence is going to say. Then, once it has that, form a response based off the expected user speech. So once it knows what it thinks I'm going to say, it can go ahead and start generating its response. Now, this is probably pretty expensive, but keep in mind, the cost of tokens and the cost of intelligence are all going down toward zero. So this would be a pretty cool bet to make if you start to implement this within your own applications.
To start building with text-to-speech, head over to deepgram.com/tts to get started. And as a reminder, all this code is linked in the description; you can go ahead and grab it. I'll see you out there on Twitter, and please show me your speech models. I would love to see them. We will see you later, my friends.
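A minimal sketch of the ConversationManager loop described earlier: listen, think, speak, and repeat until the exit word. The listen, think, and speak callables here are hypothetical stand-ins for the Deepgram transcription, Groq, and Aura pieces shown above (wired to a console mock so the loop runs on its own), and memory is kept as a plain message list rather than LangChain.

```python
# A sketch of the conversation loop described earlier: listen, think, speak,
# and repeat until the exit word "goodbye". The three callables are
# hypothetical stand-ins (wired to a console mock below) for the Deepgram
# transcription, Groq, and Aura pieces shown above; memory is a plain message
# list rather than LangChain.
from typing import Callable, Dict, List


class ConversationManager:
    def __init__(
        self,
        listen: Callable[[], str],                     # speech-to-text: one utterance
        think: Callable[[List[Dict[str, str]]], str],  # LLM: history in, reply out
        speak: Callable[[str], None],                  # text-to-speech: play the reply
    ):
        self.listen, self.think, self.speak = listen, think, speak
        self.history: List[Dict[str, str]] = []        # chat memory between turns

    def run(self):
        transcription = ""
        while "goodbye" not in transcription.lower():
            transcription = self.listen()                          # 1. hear the user
            self.history.append({"role": "user", "content": transcription})
            reply = self.think(self.history)                       # 2. ask the LLM
            self.history.append({"role": "assistant", "content": reply})
            self.speak(reply)                                      # 3. say it out loud


if __name__ == "__main__":
    # Console mock so the loop itself can be exercised without any APIs.
    ConversationManager(
        listen=input,
        think=lambda history: f"You said: {history[-1]['content']}",
        speak=print,
    ).run()
```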
