Unlocking DeepSeek Language Models for Local Use
Discover DeepSeek's open-source models, performance benchmarks, and training processes. Explore the benefits of these locally usable, distilled models.
DeepSeek R1 / DeepSeek-R1-Distill-Qwen-32B reasoning LMs explained
Added on 01/29/2025

Speaker 1: Hello community. We have some brand-new reasoning models, and I am going to focus here on the smaller language models that we can use locally, because they are open source. So let's start: hello, DeepSeek. We have some beautiful new models here, from a Qwen 32B, 14B and 7B down to 1.5 billion free trainable parameters, and as you can see we are here with DeepSeek R1: these are distilled versions utilizing the Qwen models.

But let's start at the beginning. We are here on the 21st of January 2025, and we are looking at DeepSeek. We know DeepSeek from GitHub, beautiful. We have DeepSeek version 3. This is a strong mixture-of-experts language model with 671 billion total parameters, where 37 billion free trainable parameters are activated for each token, given the intelligence of a specific router. And you might say: hey, is this the reason why in your last video you were talking about intelligent routers in mixture-of-experts systems? And I am going to tell you: maybe. DeepSeek tells us that they trained DeepSeek version 3 on close to 15 trillion diverse tokens, followed by supervised fine-tuning and reinforcement learning stages, and this only took about 2.8 million GPU hours. So you see, we have our classical process: pre-training, supervised fine-tuning, reinforcement learning stages.

But there is a lot of detail to this. We have two models: the V3 base model and the V3 model. And you might say, what is the difference? That is rather easy, because those roughly 2.8 million GPU hours are essentially only for the base model. DeepSeek analyzed the performance and said: hey, we can improve on the base model if we do a post-training, a knowledge distillation from our reasoning model, R1. They say: we distil reasoning capabilities from our long-chain-of-thought models, specifically the R1 series, into our standard model, particularly DeepSeek version 3 (the non-base one). So now you know the difference between the two. It is simply the post-training, in the easiest words possible.

As you can see, if you go to Hugging Face, there is an update there, the DeepSeek R1, so please work with the update, it is really worth it. And there is R1-Zero. So we are now at the reasoning models, and you might say: what is the difference between R1-Zero and R1? Just following the technical literature by DeepSeek, they tell us that DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability and some language mixing. Therefore they further optimized R1-Zero into the main R1 model. So DeepSeek tells us: hey, we achieve performance comparable to OpenAI o1 across mathematics, coding and reasoning tasks. And this is really amazing if I tell you that just hours ago we got a new license here which opens up a complete new universe for coding.

So let's have a look at the technical literature. Twenty hours ago they published the DeepSeek-R1 report, and the reasoning pipeline they implemented is really easy, have a look. There is a block about R1-Zero; I am going to ignore R1-Zero for the moment, because I want to focus with you on the new R1, and especially on reinforcement learning with a cold start. So let's have a look at DeepSeek-R1. We are looking at the first item in the training pipeline. The training pipeline has four elements, and we start with the first one, the cold start. DeepSeek tells us: we collect thousands of cold-start data to fine-tune the DeepSeek version 3 base.
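(A quick aside on the mixture-of-experts point from the start of this segment: the sketch below is purely my own toy illustration, not DeepSeek-V3's actual code, of how a learned router activates only a small subset of experts per token, which is why only about 37 billion of the 671 billion parameters do any work for a given token. The layer sizes, expert count and top-k value are arbitrary placeholders.)

```python
# Toy sketch of top-k expert routing in a mixture-of-experts layer.
# Not DeepSeek-V3's implementation; it only illustrates why a fraction
# of the total parameters is active for each token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is a small linear layer that scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]); each token used only 2 of the 8 experts
```

DeepSeek-V3 adds refinements on top of this basic idea, such as shared experts and its own load-balancing strategy, which this toy version deliberately leaves out.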
And now you know why I introduced you at the beginning of the video to the base model as the starting point for the reinforcement learning. So we have a massive fine-tuning exercise on V3 base for the R1 cold start. And it is really interesting, because they use a new readable pattern. They say: when creating the cold-start data for R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. They define the output format as special token, reasoning process, special token, summary, where the reasoning process is of course the long chain of thought for the query, and the summary is used to summarize the reasoning results. So they had some beautiful fine-tuning data exactly for the first point, their cold start. You see, the cold start is not part of R1-Zero. Great.

Then let's jump to section 2.3.3, Rejection Sampling and Supervised Fine-Tuning. And yes, here we have another supervised fine-tuning. They say: when the reasoning-oriented reinforcement learning (that is section 2.3.2) converges, we utilize the resulting checkpoint to collect the supervised fine-tuning data for the subsequent round. And: unlike the initial cold-start data (section 2.3.1), which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing and other general-purpose tasks. So now there is a very specific fine-tuning for non-reasoning tasks as well. And they tell us: we fine-tune the DeepSeek version 3 base model for two epochs using the above curated dataset of about 800,000 samples. Wow. These 800,000 samples, please remember them, because they will become important in about two minutes' time.

So you see, the training pipeline is really a dedicated pipeline, and they experimented quite a lot to find this particular configuration. And then, once they had DeepSeek R1, they did something beautiful. For me, recording this video on January 21st 2025, they give us these distilled small language models. As you can see, just updated four hours ago, really uploaded four hours ago, we have Qwen models at 1.5 billion, 7 billion, 14 billion and 32 billion free trainable parameters, plus a Llama 8B. Now this is interesting, because you know what? They even open-sourced the models. So you can use whichever size you like, explore it, and further build on it. This is an MIT license. And on Hugging Face you find DeepSeek-R1-Distill-Qwen-32B, distilled from the original R1; it is now a Qwen model, a 32B model. And you see, everything is already available: the inference API, the endpoints, you can go to Amazon, Azure, Google Cloud, and there are Gradio Spaces. It is already there for you.

Very interesting: in the new technical report on R1 they also tell us what did not work for them. I will just focus here on the process reward model; I also have a video on this. They conclude that while the process reward model demonstrates a good ability to re-rank the top-N responses, its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process. And if you have already spent almost three million GPU hours just on the pre-training phase, you understand why any additional computational overhead matters.
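(Going back to the cold-start output pattern for a moment: here is a minimal sketch of formatting and sanity-checking one training sample in the "special token, reasoning process, special token, summary" layout described above. The literal token string and the readability check are my own placeholders, not DeepSeek's actual tokens or filtering code.)

```python
# Minimal sketch of the cold-start output format from the R1 report:
# |special_token| <reasoning_process> |special_token| <summary>.
# The token string and the crude filter below are illustrative placeholders.
SPECIAL = "|special_token|"

def format_cold_start_sample(reasoning: str, summary: str) -> str:
    """Wrap a long chain of thought plus a short summary into one training target."""
    return f"{SPECIAL}{reasoning}{SPECIAL}{summary}"

def is_reader_friendly(response: str, max_chars: int = 20000) -> bool:
    """Crude stand-in for 'filter out responses that are not reader-friendly'."""
    well_formed = response.count(SPECIAL) == 2           # pattern is intact
    not_endless = len(response) < max_chars              # no endless repetition
    has_summary = len(response.split(SPECIAL)[-1].strip()) > 0
    return well_formed and not_endless and has_summary

sample = format_cold_start_sample(
    reasoning="First I factor the quadratic, then I check both roots ...",
    summary="The equation has the solutions x = 2 and x = 3.",
)
print(is_reader_friendly(sample))  # True
```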
As good as such a process reward model might theoretically be, you have an equilibrium between performance and how long you want to train, how long you want to fine-tune, how many million GPU hours you want to spend on the model.

This distillation is something beautiful. They say: hey, to equip even more efficient, smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models from Qwen and Llama, using this particular set of 800,000 samples (you remember, now is the time) curated with DeepSeek-R1, as detailed in section 2.3.3. So that second supervised fine-tuning, where we had our 800,000 samples: they take those 800,000 samples and fine-tune other open-source models like Qwen with them. This is beautiful. And they give us this for free in an open-source version.

Now I know you might say: okay, but let's talk about the performance. Not about R1; we know R1 is really a good model, I have a video on this already. But what about these distilled small language models, how good are they? And here you have the benchmark data. For me, the DeepSeek-R1-Distill-Qwen-32B model in particular sticks out, because if you compare it with GPT-4o, or the latest Claude 3.5 Sonnet, or OpenAI o1-mini, this is really impressive. And especially if you compare it with QwQ-32B-Preview. Let's go through the different tasks, and you immediately get an idea of where you stand in the benchmark data. Look at the last two, LiveCodeBench and Codeforces: there is quite some difference there. Because on the mathematical benchmarks you see 94.3 or 94.5 or 93.9, so you stay really close, even if you go down to the 32B model or even the 14B model. But with coding, oh, there seems to be quite a threshold in the performance jump. So interesting.

So whatever your job is, if you do coding, if you do mathematics, if you do abstract reasoning, here are your benchmark data. Choose the small language model that you like, that you prefer. All of this is already available for you to download on Hugging Face: DeepSeek-R1-Distill-Qwen-32B, or, if you like it smaller, the 14B model is also available.

Now, as I record this video, I am testing this model in parallel on my cloud on my particular tasks, because I want to know: do I have to go with the R1-Distill-Qwen-32B, or is a 14B already enough for my task? So, as you see, the difference of 69.7 to 72, 80 to 83, or 93.9 to 94.3: is this acceptable if I can go down with the size of the model for a local implementation? Maybe in some days I will have my first results, so I can tell you whether you can even go down from the 32B to a 14B and still have the DeepSeek-R1 reasoning capability, maybe also in an open-source 14 billion free trainable parameter model.

You know what is interesting? DeepSeek tells us: hey, the DeepSeek-R1-Distill-Qwen-32B version outperforms even OpenAI o1-mini across various benchmarks. There are of course some benchmarks where o1-mini is still better. But you know what? This here is an open-source model, and OpenAI's models are proprietary models. So however you decide: I think, me as a European, given the choice between these two models, I have a tendency to go with the open-source model for my tests and for my implementations. But of course you can check this out in real time before you decide anything. I would recommend this to you.
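(For anyone who wants to run the same kind of 14B-versus-32B comparison on their own prompts, a rough sketch with Hugging Face transformers could look like the following. The model IDs are the repository names as I understand them; you will need accelerate installed and substantial GPU memory, especially for the 32B checkpoint. Treat this as a starting point, not a finished evaluation harness.)

```python
# Rough sketch: run the same prompts through the 14B and 32B distilled
# checkpoints and compare the answers by eye. Adjust dtype/device_map to
# your hardware; the 32B model needs serious GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPTS = ["How many prime numbers are there between 10 and 50?"]

def run(model_id: str, prompts):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    answers = []
    for p in prompts:
        chat = tok.apply_chat_template(
            [{"role": "user", "content": p}], tokenize=False, add_generation_prompt=True
        )
        inputs = tok(chat, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=1024)
        # Keep only the newly generated tokens, not the prompt.
        answers.append(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
    return answers

for size in ("14B", "32B"):
    model_id = f"deepseek-ai/DeepSeek-R1-Distill-Qwen-{size}"
    for prompt, answer in zip(PROMPTS, run(model_id, PROMPTS)):
        print(f"--- {size} ---\n{prompt}\n{answer}\n")
```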
So I am also there at DeepSeek, and you see, if you go to chat.deepseek.com, you can try it out for free. You get either the latest DeepSeek model, or, if you click on DeepThink, you will use the high-level DeepSeek-R1 model to solve reasoning problems. And I am sure that, as I record these videos, a lot of Gradio Spaces will come up on Hugging Face for all the small models, so you can try them out yourself online for free before you decide which model to go with.

So you see, there is an interesting competition happening now, kind of a cooperation and a competition, and it is interestingly about open-source versus proprietary models. Because, you know, more or less you now have an option: either you go with ChatGPT o1, the proprietary model, or you go with an open-source DeepSeek, let's say the R1 model. Now you understand why OpenAI decided to already announce the o3 model. Because if an open-source model is at least at the same performance level as your proprietary o1... I like competition.

Now you might say: are you absolutely sure that R1 really is available for us? I just checked, because I asked myself the same thing. So here, January 21st: MIT license for DeepSeek-R1. Yep, updated yesterday. Beautiful. So interesting that it is not OpenAI, the company with the "open" in its name, that has an open model; it is other companies that provide us with open-source, MIT-licensed models.

So here we are on the very first day that we have these new small language models for reasoning tasks. They are distilled versions of the big R1 model, built by continuing the fine-tuning on those 800,000 very special samples from R1, now on the Qwen models. And if you can go down to a Qwen 1.5 billion free trainable parameter model, I think this is really something we have to test out. So in the next days, in the next weeks, maybe I will do a dedicated evaluation video for my specific tasks. I am more science-oriented in my tasks, so if you are more about, I don't know, finance, or more about writing or general knowledge, hey, it would be interesting to learn about your experiences in the comments to this video. If you like this kind of video, why not subscribe? And my next video is already in the making.
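(One small footnote on testing the 1.5 billion parameter distill locally, as mentioned above: assuming the Hugging Face repository name deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B and a reasonably recent transformers version with chat-message support in the text-generation pipeline, a first local try could look like this.)

```python
# Minimal local test of the smallest distilled model; the 1.5B Qwen distill
# should fit on a single consumer GPU, or run slowly on CPU.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user",
             "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}]
result = generator(messages, max_new_tokens=512)
# The pipeline returns the whole conversation; the last message is the model's reply.
print(result[0]["generated_text"][-1]["content"])
```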
