Guide: Using DeepSeek R1 Distilled Models Locally
Learn to use DeepSeek R1 distilled models with LM Studio locally on Windows, Mac, or Linux. Discover settings and tips for seamless operation without a GPU.
How to use DeepSeek R1 Distill LLMs Locally
Added on 01/29/2025

Speaker 1: How to use DeepSeek R1 distilled models locally. Whether you've got Windows, Mac, or Linux, this guide will work for you to use DeepSeek R1, which is, at this point, the best open-source reasoning model. Rather than using DeepSeek R1 itself, which is a really, really big model, what the company has done is release a bunch of distilled models: they took other models, used the outputs of DeepSeek R1, and fine-tuned those models. That is exactly what we are going to use locally. So just a disclaimer: this is not the 100% OG DeepSeek R1. Rather, this is a DeepSeek R1 distilled model. And for this tutorial, we're going to use something called LM Studio. The reason why we are using LM Studio is that it's very easy for beginners to use. All you have to do is go to LM Studio and literally download it, and the rest you can follow from everything that I'm showing in this video. I will link the LM Studio website in the YouTube description, in case you have never used LM Studio before. I've also got a bunch of LM Studio tutorials, which I'll link for you so you can get started. Once you have successfully installed LM Studio, the first time you open it you would see this interface: you've got Chat, Developer, My Models, and Discover. The good thing with LM Studio is that you can chat with the model, but you can also use the model as, let's say, an endpoint. It even supports function calling, and you can expose the model as an OpenAI-compatible endpoint. If you have any questions about anything I say, especially if this is your first time here, please let me know in the comment section; I'll try my best to answer. So this is the chat window. But before we can chat, we have to download the model. The very first thing to make sure of is that you have the latest version of LM Studio, which at the time of recording is 0.3.7 (build 2). This matters because you can see in the release notes that it adds support for DeepSeek R1, so make sure you have the latest version. Once you have that, go to the Discover tab. There you would see DeepSeek R1 Distill: this one is the distilled version of the Qwen 7-billion-parameter model, and this one is the distilled version of the Llama 8-billion-parameter model. We're going to use the Qwen one, which is DeepSeek R1 Distill Qwen 7B. You can go see the model card on Hugging Face if you want more details, but the technical details, like what kind of information this model has got, you can also see right here. So I'm going to click download. This is a five-gigabyte download, so it will take some time. As you can see, my download has started. Once this 5 GB model has finished downloading, I can start using it. Another important thing is to keep both GGUF and MLX enabled. If you're on Mac, and only on Mac, the MLX models might provide an extra edge in terms of compute and speed. But if you're on Linux or Windows, then simply go ahead and use just the GGUF model. You can also see the tags and details here about what kind of model it is, and you can start using the model right away once it is downloaded. As you can see here, my download is completed.
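If you prefer to script the download instead of clicking through the Discover tab, here is a minimal sketch using the huggingface_hub Python package. The repo ID and quantization filename below are assumptions based on how these community GGUF conversions are typically published; check the model card on Hugging Face for the exact names.

```python
# Sketch: fetch the DeepSeek R1 Distill Qwen 7B GGUF directly from Hugging Face.
# The repo ID and filename are assumptions -- verify them on the model card first.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF",  # assumed repo
    filename="DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf",             # assumed quant
    local_dir="./models",
)
print(f"Model saved to {model_path}")
```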
So you can either click here and load the model, or you can go to the Chat tab and use the model there. Just click load the model; that is the easiest way to start, but I'll close it and open it again to show you completely from scratch. After we have loaded the model, you can just go ahead and ask a question. You can see this bar; it basically indicates whether the model has been loaded into your current session, so just wait until the bar fills up. If you want to make any changes, for example if you want to change the system prompt, or if you want to increase the number of tokens, i.e. the context window, then you can click this and it will show you the system prompt that you can give. And if you want structured output like JSON and all those kinds of things, you can do it here. But I'm going to just leave it as it is. At the bottom here, you can also see the system usage, the RAM and CPU usage. So at this point, the model has been successfully loaded. If you want to unload it, you can click eject, but I don't want to do that now; I want to go ask some questions. So I'm going to close this and simply ask a simple question: can you calculate three plus four plus five? There are two kinds of tokens. One is the thinking tokens, which are the process in which the model has an internal monologue, and then finally it gives you the output. I think LM Studio will at some point parse the thinking tokens and have a UI that shows them differently. So you can see the thinking tokens, where it discusses the problem with itself, and then finally we have the solution, and this is in LaTeX format. If you want to see it rendered, you can paste it into a LaTeX renderer. Now I'm going to close LM Studio completely and show you how you can load this again. I close LM Studio and quit it, and at this point, I'm simply going to launch LM Studio again. Once I've got LM Studio open — ideally what I should have done is eject the model, which I did not do — my existing chat is still stored. I'm going to go to Chat, click a new chat, and load the model. You can see the list of models that you have got in your LM Studio. I'm going to select DeepSeek R1 Distill Qwen 7B. It gives you all the information about what you might want to adjust. For example, if you want a longer context window, you can increase the slider depending upon how much memory you've got, plus some memory optimizations. I'm going to just go ahead and tell LM Studio to load the model, and as you can see here, the model is being loaded at this point. Once it is loaded, you can start chatting with the model. If you have got a vision language model, which in this case I don't think it is, then you can also upload an attachment. But right now it's just a simple model, and we can just go ask any question: can you do a probability check of how long humans would survive on Mars? I should have asked for a joke about Elon Musk, but maybe I shouldn't do it. So you can see that it is thinking, it's having an internal monologue: okay, so I need to figure out how long humans would survive on Mars. It goes through this internal monologue, and then ultimately, once it is done, it comes back and answers.
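To make the thinking-token idea concrete: the R1 distilled models typically wrap their internal monologue in <think>...</think> tags before the final answer. Here is a small sketch, assuming that tag format, for separating the two parts of a response string:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the model's <think>...</think> monologue from its final answer.

    Assumes the R1-style convention of a single leading <think> block;
    if no block is found, the whole text is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>3 + 4 is 7, plus 5 is 12.</think>The sum is 12."
)
print(answer)  # -> The sum is 12.
```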
Meanwhile, another important thing to notice is that if you go to the Developer tab, you can see DeepSeek R1 Distill Qwen 7B is running right now. You can also make these models available as an OpenAI-compatible endpoint. Why is that important? It is important if you want to develop something on your local computer as an MVP and then probably deploy it on a server later: you can keep the same code, just change the localhost endpoint, and then you can do everything that you want. So if you want to serve the model, it's very easy and straightforward to do that as well. These are the endpoints where the model is available, and the model is ready to be served. You can see the server is stopped, but I can start the server if I do this, and at this particular endpoint I can go ahead and hit the model and get a response back. You have also got My Models, where you can see all your models. If you want another model — for example, the one that I covered is the Qwen 7-billion-parameter model — you can probably get the Qwen 14-billion-parameter model. There are different versions of the DeepSeek R1 Distill models, so you can go ahead and use them. A huge shout-out to Bartowski, who has been active in converting these models to GGUF. Thanks to him, and also to the LM Studio team, who have made it possible for us to use this model. You can see the token stats here: 33 tokens per second, which is quite fast, 991 tokens in total, and the time to first token, i.e. how long the very first token took. After it does all the thinking, finally you get the final answer, like atmospheric pressure and all these things. Not required. Finally, one last thing: you can click a new chat, and if you want to delete the existing chat, click the three dots and delete the chat. And that is exactly how you can use the DeepSeek R1 Distill models. In this case, we used the distilled version of the Qwen 7-billion-parameter model, locally on our own computer. You don't need a GPU; you don't need a really, really powerful machine. I used the 7-billion model; if you want, you can even use the 1.5-billion-parameter model. There are different distilled versions of DeepSeek R1 available, so use whatever fits in your RAM and enjoy the model locally without having to worry about privacy. In this particular case, your data would not be sent anywhere, unless and until LM Studio is doing something shady, which I'm not sure they would be doing. Thank you so much for listening. See you in another video. Happy prompting.
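As a sketch of that OpenAI-compatible workflow: once the server is started from the Developer tab, you can point the standard openai Python client at the local address instead of OpenAI's servers. The port and the model identifier below are assumptions (LM Studio shows the actual values in the Developer tab), so adjust them to what your instance reports.

```python
# Sketch: chat with the locally served model through LM Studio's
# OpenAI-compatible endpoint. Base URL, port, and model name are
# assumptions -- copy the real values from the Developer tab.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",  # assumed local model identifier
    messages=[{"role": "user", "content": "Can you calculate 3 + 4 + 5?"}],
)
print(response.choices[0].message.content)
```

The nice part, as mentioned above, is that the same code can later target a hosted deployment; only the base_url (and model name) needs to change.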
