Speaker 1: Virtual assistants like Google Assistant and Amazon's Alexa seem very useful to me. I think they would be even more useful if they had more information about me, but obviously I'm not going to hand over more information about myself to any company or third party, so I decided to build my own virtual assistant. This project is very large and I'm going to split it into several videos. In this specific video we are going to see the start of the project, which is some rough planning and the implementation of Whisper from OpenAI to convert audio to text, which is just about the most basic functionality of a virtual assistant. In this video I mention all the configuration and commands so that you can follow along. Likewise, in the description of this video there is an article with all the documentation of what I am doing for this project.

One of the challenges of this project is that I don't want to have to go online to use my virtual assistant, so I'm going to have to run everything locally. For this I'm going to be using the Jetson AGX Orin, a small computer with a GPU that has 32 GB of RAM and more than 1,800 cores, which is perfect for running all the machine learning models we are going to need for our virtual assistant. What I have vaguely planned is to have a lot of cameras and microphones in my apartment sending all their information to the Jetson, which, depending on the situation, will trigger different functions, whether that is controlling a smart device through Home Assistant or doing something with my to-do list or calendar. For this we are going to need computer vision models, audio processing, natural language processing, and all the software that ties them together. Right now I don't know exactly how I'm going to integrate all these models. For the moment, all I know is that each model is going to live in its own Docker container and these containers are going to communicate with each other.

The first thing I did was start investigating different models that transcribe audio to text. The first one I wanted to try was Riva. Riva is an NVIDIA platform that is optimized to run audio processing models on NVIDIA graphics cards and process audio in real time. After spending hours trying to install Riva on the AGX, digging through forums and modifying configuration files, I found in the documentation that Riva is still in beta for ARM processors, and the Jetson AGX has an ARM processor. So for now I'm going to leave this one alone. The next thing I tried was Jetson Voice, which is also from NVIDIA, has a lot of capabilities for audio processing as well as natural language processing, and, best of all, is optimized to run on Jetson devices. Here, after hours of trying to run it on my Jetson AGX, I realized it has not yet been updated to run on JetPack 5, and JetPack 5 is exactly the version that ships with the Jetson AGX Orin.

Well, third time's the charm. The third model I tried is one I had been wanting to try for a few weeks: Whisper from OpenAI. This is a transformer-based encoder-decoder model with very impressive capabilities for transcribing audio to text and also for translating speech from many languages into English. There are two really impressive things about Whisper. One is that this model can produce a very good transcript even from low-quality audio with a lot of noise.
And the other is that it comes in different model sizes, so you can adapt it to the machine you want to run it on. I'm going to run it on a Jetson, but if you want to run it on a regular computer or on a server, you can do so without any problem, since the smallest version of Whisper runs in about 1 GB of memory and the largest version needs around 10 GB.

To install and run Whisper, I'm going to do it inside a Docker container, but if you want to do it directly on your computer's operating system, you can; just skip the command I use to download the container. First I downloaded and ran a Docker container that has PyTorch optimized to run on Jetson boards. I did this with this command. Then I installed FFmpeg, first running an update and then installing the package. And finally, all we have to do is run pip install with git+ followed by the URL of the Whisper repository. Done, with this we have Whisper installed on our machine.

The first thing I did was run a test with a file I have called test1.wav. To run this test, we run this command, where we tell Whisper the file, the language, and the size of the model we want to use for the transcription (there is a small Python sketch of the same thing below). We can see that it generates a transcript of the audio inside this file. Now, the problem is that Whisper does not process audio streams; it processes files that were already recorded and saved. And for a virtual assistant, we do need this to be in real time. So the way I'm going to work around this, at least for now, is to write some code that records audio and, every 10 or 15 seconds, cuts the recording, saves the file, processes it with Whisper, and generates a transcript. So, although it will not be truly real time, at least it will be processing in blocks of 10 or 15 seconds.

To record audio from Python, honestly I did not know exactly what to do. Looking around, I found a package called sounddevice, which gives us this capability. All we have to do is run these commands: the first installs sounddevice, the second installs libportaudio2, which sounddevice needs, and finally we install SciPy. Once this was installed, I wrote a small piece of Python code that starts a recording, cuts it every 10 seconds, sends it to Whisper, and generates a TXT file with the transcript. For this I also wrote a class called Recorder, which manages when the recording starts and how it is saved (a rough sketch of this loop is also shown below).

As this is an MVP, right now it is running capturing audio from a fairly old Logitech webcam connected to the Jetson AGX. The idea is that later there will be several devices, such as ESP32, Arduino, or Raspberry Pi Pico boards, each with a microphone, sending this audio to the Jetson AGX. When we run the code, we can see that a file called sample.txt is generated, where each line is one batch processed by the Whisper model. So, iteratively, we see lines being added to this TXT file with the transcript of what was said in each audio chunk. I ran several tests with the model size, and Tiny and Base both process the audio faster than real time, so those two sizes were the ones I liked the most. But I think I'm going to stay with Base, since Base gives a bit more structure, adds punctuation, and adapts better to certain words that the smaller model sometimes fails to pick up. Done, with this we have the most basic piece of a virtual assistant.
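
To make the file-based test a bit more concrete, here is a minimal sketch of the same thing using the Python API of the openai-whisper package instead of the command line. The file name matches my test, and the language code is just an example; swap them for your own:

import whisper

# Load one of the pretrained checkpoints: tiny, base, small, medium or large.
model = whisper.load_model("base")

# Transcribe an already-recorded file; the language hint is optional,
# Whisper can also detect the language on its own.
result = model.transcribe("test1.wav", language="es")

print(result["text"])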
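
And here is a rough sketch of the chunked recording loop described above, putting sounddevice, SciPy, and Whisper together. The Recorder class, the chunk length, and the file names are just illustrative of how I set it up, so treat this as a starting point rather than the exact code:

import sounddevice as sd
import whisper
from scipy.io.wavfile import write

SAMPLE_RATE = 16000   # Whisper works with 16 kHz audio
CHUNK_SECONDS = 10    # length of each recording block

class Recorder:
    """Records fixed-length audio chunks from the default microphone."""

    def __init__(self, sample_rate=SAMPLE_RATE, seconds=CHUNK_SECONDS):
        self.sample_rate = sample_rate
        self.seconds = seconds

    def record_chunk(self, filename="chunk.wav"):
        # Blocking recording: capture N seconds of mono audio, then save it as a WAV file.
        audio = sd.rec(int(self.seconds * self.sample_rate),
                       samplerate=self.sample_rate, channels=1, dtype="int16")
        sd.wait()
        write(filename, self.sample_rate, audio)
        return filename

model = whisper.load_model("base")
recorder = Recorder()

# Record -> transcribe -> append to the text file, one 10-second block at a time.
while True:
    wav_path = recorder.record_chunk()
    result = model.transcribe(wav_path, language="es")  # language hint; drop it to auto-detect
    with open("sample.txt", "a") as f:
        f.write(result["text"].strip() + "\n")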
There is still a lot left to do. We need to add computer vision models and natural language processing so that it knows whether we are asking it something and what we are asking. We will also need a lot of APIs to connect to different things. For example, if I want my lights to turn on or off, I need to connect to my Home Assistant server to tell it what it has to do (there is a rough sketch of what that call could look like below). So there are a lot of things left to do, but I will be covering that in other videos. What I would like is that, if you have an idea you would like to see integrated into this virtual assistant, you leave it in the comments of this video, so we can see what to add and make it as complete and useful as possible, not only for me but for more people in the community. And well, that was all for today. If you liked the video, hit like, subscribe if you want to see more of this in the future, and have a very good day.
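
For anyone curious about the Home Assistant piece mentioned above, the call the assistant would end up making looks roughly like this. This is only a sketch against Home Assistant's REST API; the URL, access token, and entity id are placeholders for my own setup:

import requests

# Placeholders: your local Home Assistant URL, a long-lived access token,
# and the entity id of the light you want to control.
HA_URL = "http://homeassistant.local:8123"
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

def turn_on_light(entity_id="light.living_room"):
    # Home Assistant exposes services over REST: POST /api/services/<domain>/<service>
    response = requests.post(
        f"{HA_URL}/api/services/light/turn_on",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"entity_id": entity_id},
        timeout=5,
    )
    response.raise_for_status()

turn_on_light()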