Running LLMs Locally on Raspberry Pi: A Step-by-Step Guide
Learn how to deploy large language models on small devices like Raspberry Pi, ensuring privacy and efficiency. Discover common AI pitfalls and future innovations.
I Ran ChatGPT on a Raspberry Pi Locally
Added on 01/29/2025

Speaker 1: This is an LLM, except it isn't running on one of OpenAI's servers in the cloud somewhere. In fact, it's actually running locally right over here on this Raspberry Pi, which fits in the palm of my hand. Which means it's 100% private, secure, and doesn't even need to be connected to the internet in order to work. And if that scares you, join the club. Today, I'm going to show you step-by-step how to run massive large language models on the tiniest of computers. We'll also discuss the most common mistakes technologists make when looking to incorporate Gen AI, and how this approach will fix them. We'll also cover the hidden issues of ChatGPT that most people ignore. By the end of this video, you'll be able to deploy state-of-the-art LLMs on basic computers like laptops and even SBCs. You won't need to download the weights off of sketchy bit torrents, I have them for you here. Nor will you need to send your ChatGPT queries to a private company in the cloud somewhere. You won't even need internet. You also won't have to set up accounts or generate API keys with OpenAI. But this video is much deeper than just the tech. It's about the future, and how AI is poised to transform the world as we know it. It's probably an understatement to say that Gen AI is hot right now. It's literally everywhere. The technology is not just a game-changer, it's a revolution in the making. And it's disrupting industries left and right. This isn't just another fleeting trend, it's the future. In fact, those last two sentences, yeah, I didn't write those. An LLM that's baked directly into my note-taking app Notion did. Pretty cool, right? But there's a problem. See, the compute resources and money needed to bring this to life was fairly substantial. In fact, it took nearly $5 million to train ChatGPT. And it's estimated that GPT-4 is roughly 800 gigabytes and requires something like 50 gigabytes of VRAM in order to load the model. 
Speaker 1: And these technical characteristics mean it needs to run on specialized hardware in a data center in the cloud, usually requiring several A100-class GPUs, each of which retails for over $8,000. Not to mention, the model is proprietary and closed-source. Which is fine, except it means that to leverage these models, we need to send our queries to a private company. Which could be a problem. But here's where things get interesting. See, back in February, Facebook shared its collection of LLMs dubbed LLaMA, many of which outperform GPT-3. So this was essentially an open-source version of ChatGPT. But even though the weights were now open, the models were still big and unwieldy. Enter Georgi Gerganov, who heroically ported the LLaMA model to C++ and applied quantization, which greatly reduced the size of the model. Quantization is when you take something perfectly beautiful, like this photo of a cat, and then delete a bunch of pixels in an effort to make it smaller, hoping it still resembles the original image. And we can do this with the model files themselves. And it kind of works. And people are taking this advancement to run models on tiny devices like Raspberry Pis. And by people, I mean me and this guy here who tweeted about it. So I couldn't help giving this a shot. This is my Raspberry Pi 4 Model B, featuring a quad-core CPU and 8GB of RAM. And I'm going to run an LLM on it just to prove the haters wrong. First, we're going to load stock Ubuntu Server 23 64-bit onto this microSD card. The Debian-based Raspberry Pi OS should work, but I don't really trust that all the packages will be available. So I'm just going to go straight to a more tried-and-true Linux distro. I'm going to go ahead and open up Raspberry Pi Imager. Then for operating system, we're going to use Ubuntu Server 23 64-bit. So let's go ahead and add that. For the storage, we're going to select the 32GB microSD card.
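The cat-photo analogy maps directly onto the weights: quantization snaps each continuous value to the nearest of a small set of levels. As a toy illustration only (not ggml's actual q4 scheme, which groups weights into blocks with per-block scale factors), here's a hypothetical `quantize` helper:

```shell
# Toy quantization: snap a float in [0, 1] to one of 4 levels (2 bits).
# This is an illustration of the idea, not the real ggml format.
quantize() {
  awk -v x="$1" 'BEGIN {
    min = 0.0; max = 1.0; levels = 4
    step = (max - min) / (levels - 1)   # spacing between the 4 levels
    q = int((x - min) / step + 0.5)     # index of the nearest level
    printf "%.2f\n", min + q * step     # reconstructed (lossy) value
  }'
}
quantize 0.10   # snaps down to 0.00
quantize 0.45   # snaps to 0.33
quantize 0.80   # snaps to 0.67
```

Each stored value now needs only 2 bits instead of a full float, which is exactly the size-for-accuracy trade the cat photo makes.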
Speaker 1: And I'm going to go ahead and seed it with my Wi-Fi credentials. Then we're going to select write. Okay, so Ubuntu has been written to our microSD card. Now, we do need the model to be available to our Raspberry Pi. It's a little over 4GB. In theory, you could put it on the microSD card if there's enough space, but the read-write times on the microSD card are really slow. So what I'm going to do is throw it on an external drive, my Samsung SSD. Then I'll connect the SSD to the Raspberry Pi, and it'll have the model available with much faster read-write times. So I'm going to download the model to my Samsung 1TB SSD. Let's go ahead and eject the microSD card. Okay, so now I'm in my Warp terminal, and let's see if we can connect to the Raspberry Pi. All right, so this should be our device, so I'm going to go for the IP address. It looks like it's returning ICMP traffic, so let's try to connect. Let's do ssh data slayer. Okay, so I'm in the Raspberry Pi. The SSD drive is connected, but it's not available because I have to mount it. So I'm going to do that real quick. Basically, I'm in the /mnt directory, and I'm just going to make a directory called ssd. Then I'm going to mount the USB drive to this directory. To double-check how to do that, I'm going to go over to ChatGPT and ask, how do I show USB drives on Linux? Because you just need to get the name of the drive. So it's sda1, because we can see right here, Samsung T5. And I believe mounting it is as easy as mount, then the drive, so it's /dev/sda1, and then the destination is going to be what I just created, /mnt/ssd. Now if I go to /mnt/ssd, okay, so now I'm inside my Samsung terabyte drive, and you can see the models right here.
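The mounting steps just described can be sketched as shell commands. The device name `/dev/sda1` is what showed up in this video; confirm yours with `lsblk` before mounting:

```shell
# List block devices to find the SSD's partition name (e.g. sda1).
lsblk

# Create a mount point and attach the SSD's first partition there.
# /dev/sda1 is an assumption -- substitute whatever lsblk reported.
sudo mkdir -p /mnt/ssd
sudo mount /dev/sda1 /mnt/ssd

# The model file should now be visible here.
ls /mnt/ssd
```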
So we can actually work out of this drive, and it'll be faster because the read-write times beat the microSD card. So I'm going to do apt-get update, and then we're going to run this command. All right, and then I'm going to download some additional packages. The Git repo that we want is this guy here, so we are going to git clone it. You can do it in the same space here. All right, so now we have alpaca.cpp, and there's a bunch of files here. We need to compile it, so we just run make chat. Then all we have to do now is move the model into the Git repo folder. So I'm going to do mv, take this, and then go to alpaca. All right, it should be in there. So now we can kick off the LLM. It's going to be using the 7 billion parameter 4-bit quantized Alpaca model. Run the chat binary, and I have htop open here on the right, so you'll be able to see how the CPU gets utilized. All right, looks like it's ready to go. The cores are pinned. All right, let's give it a difficult one. What is quantization? You can see it's a little bit slow. I'll speed it up. This is the speed

Speaker 2: that it's actually coming out at. It's not that bad. Quantization is the process of converting continuous values into discrete or numerical values. This can be done by taking a set of data points and grouping them together based on similarity in order to reduce noise and overfitting when training machine learning models with limited dataset sizes. Quantizing also helps improve model performance as it reduces variance between different parameters used within the same

Speaker 1: model architecture. But see, if I try to get cute with it, if I say, like, explain quantization in pirate speak, this is where it starts to fail. You know, with GPT-3.5 or GPT-4, you can get these more creative responses from the model. You can say, put it in this language, or say it as if you're this person, or act like a Linux terminal or something, and it can actually handle all that. It looks like this model isn't quite sophisticated enough to do that. And just so you can see the difference, if I pose that question to GPT-4, it has fun with it. Ahoy, matey, gather round and lend me your ear while I spin ye a tale of quantization. And it will provide real analogies. Like, imagine, if you will, a vast stretch of water with waves of all heights, and say you want to describe those waves using only a handful of measurements: small, medium, big. This is beautiful. This is really a beautiful way of describing a rather complex topic. But it is pretty good at stuff where the answer is pretty concrete. If I said, what is the Linux command to delete a folder called tmp recursively? So I think that's right. I usually throw the -f flag in there, but maybe that's right. All right. And then just for comparison, let's take the same prompt and bring it to GPT-4. Still talking in pirate speak. That's okay. Yeah, that's what I would have used, honestly. So I think this is a little bit better. And then it gives these verbose explanations of each flag. You can see GPT-4 is just a lot stronger. So similarly, write a JavaScript function to calculate the average of an array of values. So I think this should work. I don't know why it's rounding anything, but okay. So now I have this average function. If I pass an array of 7s, the average should be 7. And it is. Yeah, it writes valid JavaScript code. All right. And then let's just bring this guy to GPT-4 and see what it comes up with.
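For reference, the recursive-delete command being tested here is:

```shell
# Delete a folder called tmp and everything inside it.
# -r recurses into subdirectories; -f skips confirmation prompts
# and ignores missing files (the flag the speaker usually adds).
rm -rf tmp
```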
Did it say ARR as, like, a pirate pun? I can't tell. That would have been amazing if it did. GPT-4 is good, I've got to say. It's so good. It gives an example, uses the reduce function, very elegant. Let's ask it one more broad question. What is the most significant innovation

Speaker 2: in history? The invention and widespread adoption of the internet has been one of the greatest technological advancements ever, revolutionizing communication, commerce, entertainment, and more. It has enabled global collaboration on an unprecedented scale, connecting people from all corners of the world with ease. Yeah, so I think that's a really solid

Speaker 1: answer. I think you'd probably be hard-pressed to come up with something much better. So the model is a little slow, but it's pretty good with more concrete questions, like how to write a Linux command or a JavaScript function. Once you get into more heady, abstract concepts, it begins to lose the plot a little bit. And since it's a little slower, it's best to pose questions that have short, concrete answers. So you might be wondering, why don't we accelerate our model inference using a specialized GPU or TPU compute stick, like the Google Coral AI USB accelerator? Unfortunately, these LLMs don't run on TPUs, and they're actually restricted by memory more than anything. The pace of innovation in this space is super exciting, and the possibility of leveraging LLMs at the edge to power sophisticated chat interfaces for robot waiters or other new use cases is super compelling.
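For reference, the end-to-end setup walked through in this video can be sketched as the following commands. The repo URL and model filename are assumptions based on the standard alpaca.cpp setup; adjust them to your own paths:

```shell
# On the Pi, with the SSD mounted at /mnt/ssd:
sudo apt-get update
sudo apt-get install -y git build-essential   # compiler toolchain for make

cd /mnt/ssd
git clone https://github.com/antimatter15/alpaca.cpp
cd alpaca.cpp
make chat                                     # compile the chat binary

# Move the 4-bit quantized 7B model into the repo folder.
# (Filename assumed -- match whatever your downloaded model is called.)
mv ../ggml-alpaca-7b-q4.bin .

./chat                                        # start the interactive prompt
```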
