Speaker 1: Hello everyone, welcome to the AI Anytime channel. In this video we are going to work on a very interesting project: we are going to build a multimodal voice assistant. We will combine a couple of models — a generative model and a speech-to-text model. Specifically, we will combine LLaVA, which is a multimodal LLM, with Whisper, an open-source speech-to-text model by OpenAI. We will combine both of these and build a voice assistant for multimodal data. If you have images (or videos, from which you can extract frames) and you want to retrieve information from that multimodal data through a voice assistant, how can you do that? That's the ultimate goal of this video. We will do this in a Colab notebook, but we will build a Gradio app so you can play around with the models and the features, see whether it makes sense to scale this further, and of course build a proper app on top of it. We will rely on a T4 or V100 GPU in this video — we'll see which in a bit — but you can do this with a consumer GPU as well, because we are going to load the model in 4-bit using bitsandbytes. So let's build this voice assistant with LLaVA and Whisper. If you look at my screen, I am on Google Colab; you can see the notebook is called "LLaVA Whisper", but you will be able to use any other notebook as well. First we have to install a few things, but before that, go to Runtime and change the runtime type. I have Colab Pro, so I'll go with V100 high-RAM, but a T4 high-RAM will also work — so if you don't have Colab Pro, you can do this on a T4 GPU. Now let's install a few things. First, transformers — I'll just pass the quiet flag here since I don't need all the logs — and we need a specific version, so let's get transformers 4.37.2. Then we need a few other libraries: pip install bitsandbytes, which helps you load the model in lower precision, for example 4-bit; accelerate (version 0.25.0) to complement bitsandbytes; and then Whisper. The best way to install Whisper is to get it from its GitHub repository — if you are not aware of the repo, let me show you: this is the Whisper repository, and I'll copy its URL. I'm going to install it from source with Git, so you prefix it with git+, giving git+https://github.com/openai/whisper.git. Then we need a couple of other things, so let me also install Gradio. Gradio is a library that helps you build a simple UI in Python to showcase capabilities, demos, and proofs of concept. Let me install Gradio over here.
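For reference, the install cell looks roughly like this in Colab (the gTTS install mentioned just below is included here as well). The transformers version is the one named in the video; the accelerate version 0.25.0 is my reading of the audio and may differ.

```python
# Colab install cell (run once per session)
!pip install -q transformers==4.37.2
!pip install -q bitsandbytes accelerate==0.25.0
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio
!pip install -q gTTS
```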
Then I also need gTTS, because I want to respond to the end user with voice as well — so you get both speech-to-text and text-to-speech capabilities; we'll see that. So let me do pip install -q gTTS. And guys, the same thing will also work for multimodal RAG, because many of you want to use LLaVA for multimodal RAG: if you can answer from the context of a single image, you will probably be able to do it with a vector database as well — it's not rocket science. We'll see that later. Anyway, once the installs are done, I'm going to import a few things. I'll write import torch, and then from transformers import BitsAndBytesConfig — and I also need pipeline from transformers, because I'm going to use the image-to-text pipeline when inferencing with LLaVA. As you can see, the imports are successful. Now we have to create a config for loading the model in 4-bit — that's basically the quantization config we have to write. So I'll say quant_config = BitsAndBytesConfig(...), with load_in_4bit=True — it's a Boolean value — and you can also set the compute dtype. I'm going with torch.float16, not bfloat16, because bfloat16 is supported on Ampere-architecture GPUs like the A100. So here I have bnb_4bit_compute_dtype=torch.float16 and load_in_4bit=True. The quant config is done. Now, this is the model I'm going to use: the official LLaVA 1.5 weights at the 7-billion-parameter size. Let me copy the repo path. It's a multimodal LLM and one of the best open-source options right now — of course GPT-4 Vision is the best one out there, but this also does the job. Now let me define the model ID — you give it the repo path — and then I create the pipeline. In the pipeline I first pass "image-to-text" as the task (lowercase, since that's how the pipeline name is written) — transformers has many pipelines you can use — then the model ID, and then model_kwargs, where quantization_config is nothing but our quant_config. Now let's load the model. This will take a few minutes depending on your internet speed and the compute you have, but we will wait for it and keep writing the next lines of code. You can see it's downloading — it has to download around 15 GB of sharded model weights, configs, and so on.
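Putting that together, here is a minimal sketch of the quantization config and pipeline setup. The repo path llava-hf/llava-1.5-7b-hf is an assumption for "the official LLaVA 1.5 7B weights"; substitute whichever checkpoint you actually use.

```python
import torch
from transformers import BitsAndBytesConfig, pipeline

# 4-bit quantization config so the 7B model fits on a consumer GPU
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed repo path for LLaVA 1.5 7B

pipe = pipeline(
    "image-to-text",
    model=model_id,
    model_kwargs={"quantization_config": quant_config},
)
```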
So let's wait for that. Meanwhile, let's start writing the next cell of code. I'm going to write import whisper, because I'm going to use OpenAI's Whisper. People have this perception that OpenAI only works on closed-source models, but they have a lot of open-source models you might not be aware of — please go and have a look. Whisper is one of them; they also have Shap-E, and other models like CLIP, which helps you with vision embeddings. So OpenAI has made a significant contribution to the open-source community as well, guys — not with the newest LLMs, but don't forget about earlier language models like GPT-2 and others. They are one of the teams making a real impact with generative AI. Anyway: import whisper. Let me also get Gradio — you can use Streamlit instead, it's your choice — so import gradio as gr. I also need a few utilities: warnings, os, json, and things like that. Then from gtts — gTTS is a Python library that gives you text-to-speech capabilities — and from PIL import Image. Once I run that, you can see our imports are done. And our model has finished loading, so if you print pipe it will show you the pipeline object from transformers. Now let's bring in an image to inference on. Let me upload an image here. Remember, we want to build a voice assistant for a lot of use cases — healthcare, finance, insurance, mainly customer-centric use cases — where plain text-in/text-out is no longer enough and you are looking at the multimodal dimensions of data: images, videos, audio, and so on. Now let me load this image: image_path, pointing at the uploaded file 1.jpg, and then we use Pillow's Image to show it — Image.open(image_path) — and then just evaluate image to display it in the Colab notebook. Imagine we are building a voice assistant where we want to see if the multimodal LLM can help a doctor get insights and findings from a skin-related image; let's take that as the goal here. Now let me add a few cells: I need the Natural Language Toolkit, so import nltk, call nltk.download to fetch the tokenizer data, and from nltk import sent_tokenize — that's what I need. People have forgotten about NLTK, by the way. That is done. Now let's set some inference parameters (not hyperparameters, strictly speaking), starting with max_new_tokens.
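These are the imports and the image cell, roughly as dictated. The file name 1.jpg is just whatever you uploaded to the Colab session; nltk.download("punkt") is the tokenizer data sent_tokenize needs.

```python
import whisper
import gradio as gr
import warnings
import os
import json
from gtts import gTTS
from PIL import Image

import nltk
nltk.download("punkt")          # sent_tokenize needs the punkt tokenizer models
from nltk import sent_tokenize

image_path = "1.jpg"            # the uploaded test image
image = Image.open(image_path)
image                           # displays the image in a notebook cell
```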
Let's keep it small, like 250. Then we have the prompt instructions. Let me write prompt_instructions as a triple-quoted docstring so we can format it nicely: "Describe the image in as much detail as possible. You are a helpful AI assistant who is able to answer questions about the image. Now generate a helpful answer." And let me ask a question: "What is the image all about?" You can of course write better prompts than this. Now, this is how the LLaVA prompt template works — this is the structure of the prompt: you start with USER, then you bind your image with the <image> token followed by a newline, then the prompt instructions, and then the ASSISTANT part comes in, so you end the prompt with "ASSISTANT:". That's how you do it. Prompting is done; now let's quickly get the output for this image. outputs = pipe(...), and in the pipe call I pass a few things: first the image, which goes in as input, then prompt=prompt, and then the max new tokens, which you pass via generate_kwargs as a dictionary — if you want more inference parameters you can add them there as well — so generate_kwargs={"max_new_tokens": max_new_tokens}, reusing the value from above (we can also use it in a function later). Let's run it. When you print outputs you will see a generated_text field, so you have to parse the output and pick out the part after ASSISTANT. Here is what it says: "The image features a young girl with a skin condition, possibly a skin rash or a skin disease. The girl has a visible bump on her ear, which is a noticeable feature of the image. The skin condition appears to be affecting her ear, and it is likely that the bump is the result of the skin condition. The girl's face is also visible and it seems that she is looking at the camera." She is of course not looking at the camera — it's a side view — but that much of an outlier is expected. Now let me add a few more cells and use sent_tokenize: for sent in sent_tokenize(outputs[0]["generated_text"]): print(sent). The first time I got a "list indices must be integers" error because outputs is a list, so you have to index it with [0] first. Now it prints nicely, sentence by sentence — we still have the ASSISTANT prefix in there, which you could strip out as well, but it's not a big deal.
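Here is a sketch of the prompt template and the inference call; the exact wording of the prompt instructions is paraphrased from the video.

```python
max_new_tokens = 250

prompt_instructions = """
Describe the image in as much detail as possible.
You are a helpful AI assistant who is able to answer questions about the image.
Now generate a helpful answer.
What is the image all about?
"""

# LLaVA 1.5 chat template: USER, the <image> token, the instructions, then ASSISTANT
prompt = "USER: <image>\n" + prompt_instructions + "\nASSISTANT:"

outputs = pipe(
    image,
    prompt=prompt,
    generate_kwargs={"max_new_tokens": max_new_tokens},
)

# outputs is a list of dicts with a "generated_text" key
for sent in sent_tokenize(outputs[0]["generated_text"]):
    print(sent)
```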
Now, this is how it's working. But how can we combine Whisper here and wrap everything in Gradio so we have a voice assistant? Let's do that. First, let me add a few more cells. I'll add warnings.filterwarnings("ignore"), because I want to ignore a few warnings here. Next, from gtts — I think I already imported that above, so we're fine — but we do need NumPy, so import numpy as np. Now let's write a small utility for the GPU: check torch.cuda.is_available(), and set device = "cuda" if torch.cuda.is_available() else "cpu". Fantastic, the device is done, and then you can also add a print: print with the torch version (torch.__version__) and the device, just to show which version and which device we are using. It says we are using torch 2.1.0+cu121 — so if you are doing this locally and hit any errors, make sure you have a matching torch and CUDA version. Now let's work on the Whisper part. We already did import whisper — one of the best speech-to-text models I have seen in recent years; it was revolutionary. Whisper ships in five sizes: tiny, base, small, medium, and large. Feel free to use any of them depending on how much compute you have. I'm going to go with medium — large would be too big for us for now — so model = whisper.load_model("medium", device=device), binding it to the GPU since that's what we're using. You can see it downloads 1.42 GB for medium. I thought the smaller one was about the same — let me check the table on the Whisper GitHub page... here it is: base is only 74 million parameters. Ah, I was thinking of small; we could probably have gone with base. We'll see if we get any errors and come back here if needed. Now let's add a print: it reports whether the model is multilingual (model.is_multilingual) or English-only, and the number of parameters using np.prod over the parameter shapes — the Colab autocomplete suggestion here was really good.
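The device setup and Whisper loading look roughly like this; the multilingual/parameter print is the standard snippet from OpenAI's Whisper Colab that the autocomplete suggested.

```python
import warnings
import numpy as np

warnings.filterwarnings("ignore")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using torch {torch.__version__} ({device})")

# medium is about a 1.42 GB download; swap in "base" or "small" for less compute
model = whisper.load_model("medium", device=device)

print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)
```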
Okay, and you can see the model is multilingual and has that many parameters. What I'm trying to tell you is that you can build a voice assistant in all the languages Whisper supports. If you go to the GitHub repository you can see the list of supported languages — I'll try to link it somewhere — and it supports a lot of languages; I've used that before, but that's not the focus here. Let's keep going. Next is import re for regular expressions, since we're going to build our Gradio application — or we'll skip it if it's not required. Then import datetime, because I want a log file. For that I'll create a timestamp: tstamp = datetime.datetime.now(), then convert it to a string and call .replace to make it filename-safe. The timestamp is done; now let's create the log file: logfile = f"log_{tstamp}.txt". That's fine; let me run it. Now let me write a function, def writehistory(text) — by the way, you could build a multimodal RAG on top of this as well. Inside it: with open(logfile, "a", encoding="utf-8") as f — appending, with the encoding set to UTF-8 to avoid encoding errors — then f.write(text), f.write a newline, and close the file. Cool, that's done. After that, let me also import requests, since we're going to use Gradio. Now I'm going to put all the logic into functions, because that's what Gradio needs. I've created a few gists just to save time instead of writing the repetitive code again, and I'll paste them in, but I'll explain what we're doing. The first one is the image-to-text function: it takes an input text and an input image, loads the image, and then uses writehistory to log the interaction. Then there is the prompt description — I'm just going to reuse my prompt from above; let me copy it, delete the placeholder, and paste mine in. It's in a docstring, so the indentation doesn't matter much, but let's make it a bit more readable for the end user.
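A sketch of the logging helpers and the img2txt function from the gist follows. The exact prompt wording, history strings, and fallback branching inside img2txt are not read out in the video, so treat those parts as an approximation.

```python
import re
import datetime
import requests

# timestamped log file, e.g. log_2024-01-20_10:15:30.123456.txt
tstamp = str(datetime.datetime.now()).replace(" ", "_")
logfile = f"log_{tstamp}.txt"

def writehistory(text):
    """Append one line of interaction history to the log file."""
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(text)
        f.write("\n")

def img2txt(input_text, input_image):
    """Ask LLaVA about an image; prompt wording here is approximate."""
    image = Image.open(input_image)
    writehistory(f"Input text: {input_text}")

    prompt_instructions = (
        "Act as an expert that describes and answers questions about images. "
        "Respond to the following prompt about the image: " + input_text
    )
    prompt = "USER: <image>\n" + prompt_instructions + "\nASSISTANT:"

    outputs = pipe(image, prompt=prompt,
                   generate_kwargs={"max_new_tokens": 250})

    # keep only the text after "ASSISTANT:"
    match = re.search(r"ASSISTANT:\s*(.*)", outputs[0]["generated_text"], re.DOTALL)
    return match.group(1) if match else "No response generated."
```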
Then we have the prompt instructions, with an else branch for when no question is given, the writehistory call, and then the outputs, where I set the maximum new tokens to 250 for now. If the output is not empty, we extract the reply with match.group. Let's run this; now we have one function. The next function is the transcription. I've also put that in a gist just to copy-paste a bit faster, but transcribe is fairly easy — it's not that complex to understand. In this function we take an audio file and first check whether the audio input is None; if it's None, there is no transcription. Then we have a default language of English, which is commented out because the model is multilingual, and then we get the result text. We also do language detection using the log-Mel spectrogram, which you can see over here — this is essentially the example code from the Whisper repository; you can find the same snippet there, the Mel spectrogram, detect_language, decode, and so on. We are doing nothing else. Let me run this. Going back from transcribe, the next function we need is the text-to-speech part, so let's get the TTS function here. Text-to-speech is again fairly easy: the language is English and we are using gTTS — you could also use pyttsx3 if you prefer that library. Let's run this. Now there is one more thing to do: we need an ffmpeg command to create the temporary audio file. On Ubuntu (and here in Colab) it's very easy; if you are on Windows you will need to set it up differently. When I run it, it says "A UTF-8 locale is required" — I have solved this problem recently, so let me check my notes. Let me add a cell: import locale, and print locale.getpreferredencoding() to see what it returns. It says we already have UTF-8, but I still get the error, which is strange. The fix, at least on Colab, is to override locale.getpreferredencoding with a lambda that returns "UTF-8" — odd, because getlocale already shows UTF-8, but that override is what makes ffmpeg work. Now, the next step: we go back to the gists, where I have my Gradio part ready. Let me copy the Gradio code and explain it — Gradio is fairly easy.
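Here are sketches of the transcribe and text-to-speech helpers plus the Colab locale fix. transcribe follows the language-detection example from the Whisper README that the video points at; the exact ffmpeg command that creates the placeholder MP3 is an assumption about what the original cell contained — any short silent MP3 works.

```python
import locale
locale.getpreferredencoding = lambda: "UTF-8"   # Colab fix for "A UTF-8 locale is required"

# create a placeholder MP3 so the app has an audio file before the first run (assumed command)
!ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 10 -q:a 9 -acodec libmp3lame Temp.mp3 -y

def transcribe(audio_path):
    """Speech-to-text with Whisper, including language detection (README-style)."""
    if audio_path is None or audio_path == "":
        return ""

    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)

    # log-Mel spectrogram on the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language (the model is multilingual)
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    return result.text

def text_to_speech(text, file_path="Temp.mp3"):
    """Text-to-speech with gTTS; English only in this sketch."""
    audioobj = gTTS(text=text, lang="en", slow=False)
    audioobj.save(file_path)
    return file_path
```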
If you look at this, what I'm doing is defining a function to handle the audio and image input, because the user is going to upload an audio recording and an image on the UI — we'll show that. It handles the image input and, since transcribe also returns a result we turn into speech, it writes Temp3.mp3, which relies on the ffmpeg step you saw earlier. Then we create the interface with gr.Interface: fn=process_inputs, then the inputs, which are your audio and the image, and then the outputs, which are the speech-to-text transcription, the model output — I'll rename the "ChatGPT Output" label to "AI Output" — and the Temp.mp3 audio. I'm also changing the title: it said "Learn image processing with Whisper", and I'm going to call it "LLM-powered voice assistant for multimodal data" instead. Now let's run this. Once you run it, Gradio gives you a link you can open, and I'll show you. When you open it and click the record widget, it will ask you for microphone access; you have to allow it on every widget. Now you have a record button. Let me explain what you have to do: you upload an image and also record what you want to do with that image. So let me upload the same image and record this: "Can you analyze the image and tell me what's wrong with this image?" and then click Submit. That's probably not the right way of asking the question — I said "what's wrong with this image"; there's nothing wrong with the image itself, I should have asked what's wrong in the image, whether there is an anomaly it can point out, like a health condition. You can see we've got our output — let me make it a bit bigger. Let me first play the audio, because we are also using text-to-speech, which is the interesting part: we take speech in and give speech out. If you want to build a voice assistant for healthcare, for a customer-service chatbot, for finance, for legal — anywhere you have a system that expects voice input and has to give voice output — you could see how fast it was. You should also think about the GPU infrastructure you would need to build this kind of capability; I'm running it on a V100, which is not that costly compared with an A100 and the like. Now let me play this.
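A sketch of the Gradio wiring described here: the component names and the sources=/type= arguments assume Gradio 4.x, the label and title strings match what is said in the video, and the rest is approximate.

```python
def process_inputs(audio_path, image_path):
    # 1) speech -> text with Whisper
    speech_to_text_output = transcribe(audio_path)

    # 2) image + question -> answer with LLaVA
    if image_path:
        ai_output = img2txt(speech_to_text_output, image_path)
    else:
        ai_output = "No image provided."

    # 3) answer -> speech with gTTS
    processed_audio_path = text_to_speech(ai_output, "Temp3.mp3")

    return speech_to_text_output, ai_output, processed_audio_path

iface = gr.Interface(
    fn=process_inputs,
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath"),
        gr.Image(type="filepath"),
    ],
    outputs=[
        gr.Textbox(label="Speech to Text"),
        gr.Textbox(label="AI Output"),
        gr.Audio(label="AI voice response"),
    ],
    title="LLM-powered voice assistant for multimodal data",
)

iface.launch(debug=True)
```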
Speaker 2: So you could see how fantastic it was.
Speaker 1: I mean, in gTTS there are different types of voices, and if you use pyttsx3 there are different ones again. You can use Azure voices as well, or AWS voices — it depends whether you are building this kind of capability for your organization or as a hobby or college project. So this is one way. Now let me close this and upload a new image. Let's download one and see how it can help in healthcare — I'm mainly focusing on healthcare because I believe that's an industry where LLMs have a huge impact; they have so much potential to solve bigger problems, from writing medical notes to reviewing clinical notes, synopses, medications, recommendations to doctors, secondary opinions, and so on. So let me get another image of the same kind — let's search for dandruff issues. I'll go to the image results and pick one; I want to test the capabilities of LLaVA, because we have to see whether it's a good model to work with. I'm not getting the right pictures — maybe this one, I'm not sure; it needs to be a clear picture, because these are the kinds of pictures an LLM might not interpret correctly. I need something with more visible dandruff. This one makes sense, so I'll take this image — except it's a WebP file, and these are all WebPs, so let me convert a WebP to JPEG. I'll use an online WebP-to-JPEG converter, convert it, download the converted image, and there it is in my downloads as a JPEG. Now let me upload that image in the app — it's a big image, it seems — and ask the question: "Can you tell me what is wrong with the lady here in this image?" and submit it. Keep an eye on the latency too: it will not take more than about 15 seconds on average to generate a response, and you can see it took around 12 seconds. First look at the speech-to-text output: it says "Can you tell me what is wrong with the lady here in this image?" — perfect, right? We are using the medium Whisper model, but you can also use base or small; if you want to run it on a mobile device or a Raspberry Pi, you would go with tiny. Now the answer: it says "In the image, a woman is combing her hair with a comb. However, there is a noticeable issue: the hair on the back of her head is covered in a fine white substance, which appears to be a type of powder or dust. This unusual appearance might indicate that the woman has experienced an unexpected event or situation, such as an accident or an unforeseen circumstance." So what I'm trying to tell you is that it's not able to identify the dandruff specifically, but it does say that something is wrong.
So you also have to be very careful with false positives when you work in the medical industry, which is heavily regulated — you cannot return a wrong output. That concludes it, guys. This will be available on my GitHub repository, and most of the code pieces are adapted from a Packt book I was reading, so credit goes to the author; I have also added quite a few things of my own. This is the project I wanted to explore, and you can build on it: if you are building your own voice assistant, this entire logic can carry over to whatever project you are working on. The code — the entire notebook — will be available in the GitHub repository. Maybe you can build a RAG pipeline on top of it now; it's pretty simple, as you can see — you just need a vector database to do that — but we'll cover that as well. If you have any thoughts, feedback, or comments, please let me know in the comment box. If you like the content I'm creating, please hit the like icon, and if you haven't subscribed to the channel yet, please do subscribe; it will motivate me to create more videos in the near future. Thank you so much for watching. See you in the next one.