Speaker 1: This is Lance from LangChain. DeepSeek just released R1, a new, fully open-source reasoning model from the DeepSeek lab. And it comes with a paper that describes their training strategy, which is quite cool, because reasoning models represent a new scaling paradigm for LLMs. I have a separate video on this that's also coming out soon that you can check out. So the scaling paradigm over the past few years has been next-token prediction. We've seen many successful chat models trained with this. It's system one type thinking: fast, intuitive. We often tell the model how to think, with prompting tricks like "think step by step." And the interaction mode is often chat. Now, reasoning models are a bit different. They're trained with a different paradigm, RL on chains of thought, which we'll talk about a lot in a bit. It's system two reasoning. You often tell the model what you actually want, not how to think. And the interaction mode is a bit different: it's very good for research or planning-in-the-background style tasks that are less interactive.

So the really interesting thing here is that we now know how a state-of-the-art reasoning model is trained. Of course, the current state-of-the-art reasoning models from OpenAI, the O-series models, are closed source, and we don't have detailed information about how their training works. But this paper is a very clear illustration of how DeepSeek built a state-of-the-art reasoning model that is on par with O1, and you'll see those results here shortly. But let me actually talk through the training strategy; it's very interesting. So DeepSeek R1 uses a combination of fine-tuning and reinforcement learning to produce this reasoning model, and it has a few different stages. The first stage is just fine-tuning: they take DeepSeek V3, which is their very strong base chat model, and they fine-tune it on some thousands of chain-of-thought reasoning examples. Now, from my reading of the paper, they don't actually give the specific number of examples, but the point is they do a fine-tuning phase to build a good starting point for RL.

Now, the second stage here is reinforcement learning with an approach called GRPO. So what's going on there? Well, they have a separate paper on this, and I do want to talk about that a little bit. This RL stage 1 uses GRPO reinforcement learning, which comes from the DeepSeek Math paper. Here's what's going on. For every training example, and I do want to note how many training examples they use: they have 144,000 training examples of hard, verifiable problems in math and coding, problems for which you typically need to produce a reasoning trace to solve them, and for which there's some definitive solution that can be verified. Those are the two criteria that matter here. So they have all these samples. Now, for every training example, they actually produce 64 samples, 64 different attempts to solve the problem, and they score each one of those with some rule-based reward, like correct or incorrect for math or coding. That's pretty straightforward. Now, here's where it's kind of interesting: they basically compare every sample to the mean of all samples in that 64-sample group. For samples with high or low normalized reward relative to the group mean, they increase or decrease the probability of the model generating all the tokens in that sequence.
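To make that group-relative scoring concrete, here's a minimal sketch in Python. This is my own illustration of the idea, not DeepSeek's code; the function name and the toy reward pattern are assumptions for the example.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages for one group of sampled completions.

    `rewards` holds one rule-based score per sample (e.g. 1.0 if the math
    answer verified as correct, 0.0 otherwise). Each sample's advantage is
    its reward normalized against the group mean and standard deviation,
    so no separate learned value model (critic) is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Toy example: 64 attempts at one problem, a handful of them verified correct.
rewards = [1.0 if i % 16 == 0 else 0.0 for i in range(64)]
advantages = group_relative_advantages(rewards)
# Every token in a sample with positive advantage gets pushed up in
# probability; every token in a negative-advantage sample gets pushed down.
```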
So what happens is each token in that output gets a positive or negative gradient update, and here's the intuition: it's basically saying, let's make all the choices that led to this correct or incorrect answer more or less likely. That's what they're doing. And the punchline here, why do they do this? They're trying to discover good reasoning patterns, and this makes the model very strong at reasoning, but it loses some general capabilities. For example, they mention it had potential language-mixing issues. So this is where another interesting trick comes into play. If we look at our diagram here, we're right here. We've used reinforcement learning to get a very strong reasoning model, but it's actually a bit weaker in some other capabilities. So what they do is take the resulting reasoning traces from that model and filter them to keep only high-quality ones. The paper talks about this as rejection sampling. They're basically filtering the outputs of that first reinforcement learning phase on a bunch of different criteria, not just correctness, but the point is it results in 600,000 reasoning traces that they can then train on further. So that's really the interesting insight here: you can use outputs from the first stage of RL in subsequent stages of training, and that's exactly what they do. They do a second stage of fine-tuning on the results of that sampling, plus 200,000 additional non-reasoning samples in writing and factual QA, to, as they describe it, restore the general model capabilities while baking in high-quality reasoning. So if you look at the diagram here, what's happening is they filter the outputs of that first phase of RL. Remember, this model is a very strong reasoner but weak in some general capabilities, so they combine in some non-reasoning examples in writing and QA and fine-tune on all of that, and what they present is a model that retains very strong reasoning but also restores general capabilities. So that's the key point.

Then after that, they have a final, second round of reinforcement learning with two different kinds of rewards. Previously, they only used a rule-based reward for reasoning on math- and coding-style problems. Now they include rewards for helpfulness and harmlessness as well as reasoning, and they use a mix of data that includes both reasoning and general problems to really optimize for both reasoning and general capabilities. So that's the second stage of RL. Now, a final note, which is actually very exciting and which we're gonna be working with more directly here: they also take that dataset of 600,000 samples they get from that first phase of RL and do knowledge distillation. They take much smaller open-source models and fine-tune them on those high-quality reasoning traces, and what they get is a bunch of distilled, smaller R1 models. Pretty cool, and some of them you can actually run on your laptop, as we'll see right here.
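As a rough illustration of that rejection-sampling step, here's a small sketch. This is my own toy version of the idea, not the paper's pipeline: the helper functions and the exact criteria (a correctness check plus a readability check) are assumptions for the example.

```python
def filter_traces(candidates, verify_answer, is_readable):
    """Toy rejection-sampling filter: keep a generated reasoning trace only
    if its final answer verifies as correct and the trace passes basic
    readability checks (e.g. no language mixing). The real filtering uses
    more criteria than this."""
    kept = []
    for problem, trace, answer in candidates:
        if verify_answer(problem, answer) and is_readable(trace):
            kept.append({"problem": problem, "reasoning": trace, "answer": answer})
    return kept

# The surviving traces become supervised fine-tuning data, mixed with
# additional non-reasoning samples (writing, factual QA) for the next stage.
```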
So what are the results? They show a bunch of nice results here, DeepSeek R1 versus O1, and some smaller models like O1 Mini. Really, the punchline is that R1 is very close to O1 on a bunch of interesting challenges related to coding and math. Now, one in particular, pay attention to SWE-bench Verified. It's a very popular benchmark for general software engineering challenges, and R1 is indeed doing quite well, slightly better than O1, apparently. They also have a bunch of distilled models. So here's the thing that is really quite cool: if you look at their distilled Qwen 14B and look at the benchmark results, it is pretty close to O1 Mini. You can go across and look and convince yourself of this, but look, it's pretty strong, and a 14B model can actually run on a lot of people's laptops. For example, I have a 32-gig MacBook Pro, and I can run the 14B model, as we'll see in a bit.

So now let's try playing with it. I pulled the DeepSeek R1 14B model from Ollama. You can see they put a tweet out recently; they host all these models, which is pretty cool, and you can try running them on your own hardware. I'm in a notebook, so all I need to do is grab langchain-ollama. I'm gonna initialize my model, and what's nice is I'm gonna use JSON mode with Ollama to also produce JSON outputs and see how well structured outputs work with this model. So first, let's try a simple question: what's the capital of France? Cool. So you see the capital of France is Paris. But we also see something else that's interesting: these think tokens. If you hunt around the local Ollama community for this, there are a lot of people talking about these think tokens. They're hard to prompt away; I've tried a bunch. They seem to be kind of an annoying thing, but they are absolutely part of the training process. You can look at the paper and see that these think tokens are actually included in the training. Now let's try JSON mode. What's interesting is that when you use JSON mode, the think tokens are not present, so there is some post-processing happening on the Ollama side that strips them, and you get a JSON object out. So it looks like JSON mode at least is working, which is a good thing. Then I ask a more involved question: give me a summary of scaling laws for RL models. And again, you see, wow, this is quite verbose. You can see this think token emitted first, just like before, and then you get a much more detailed breakdown of its internal thought process. So let's go ahead and look at LangSmith just to get a better view of this output. I'm in my LangSmith project now, and I'm gonna open up this trace. We can see it took 64 seconds. Okay, that's actually pretty long, but again, I'm really pushing the limits of my hardware running the 14B model, so that's fine; it's a little bit on me. I wanted to test O1 Mini-level performance running locally, based on the benchmark, so I just wanna play with it. Okay, so here's the output. Again, it's quite verbose. You can see the think token emitted here, so it does a lot of thinking prior to responding, and it seems to provide a sane response. But again, this issue of a lot of pre-thinking being emitted is evidently an issue with these models. That may be a problem or not, depending on your application. You can also programmatically try to remove that, which is another thing to think about.
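If you want to reproduce that notebook setup, it looks roughly like this with the langchain-ollama package. Treat this as a sketch rather than the exact notebook code; the model tag and prompts are just illustrative.

```python
# Assumes `ollama pull deepseek-r1:14b` has been run and `pip install langchain-ollama`.
from langchain_ollama import ChatOllama

# Plain mode: the response text includes the model's <think>...</think> trace.
llm = ChatOllama(model="deepseek-r1:14b", temperature=0)
print(llm.invoke("What is the capital of France?").content)

# JSON mode: Ollama constrains the output to valid JSON, and the think
# tokens don't show up in what comes back.
llm_json = ChatOllama(model="deepseek-r1:14b", temperature=0, format="json")
print(llm_json.invoke(
    "Return a JSON object with a single key 'capital' giving the capital of France."
).content)
```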
So let's vibe test it a bit more. This is a repo called ollama-deep-researcher, and it's basically an evaluator-optimizer workflow for report writing. What I'm gonna do is have an open-source LLM, running locally via Ollama, take an input topic from a user and generate a search query for it, perform web search, get the results, produce a summary, then reflect on the summary, generate a new question, and go back. So this loop is gonna look kind of like this: you can see query generation, research, summary generation, reflection, new query, and so forth. This will continue for some set number of cycles that's configurable, and in the end, I'll get a nice summary with sources. And this can be run with any open-source LLM. Now, I have a separate video on this that talks about building it from scratch, so I'm not gonna build everything again, but I will test this out using R1. Just some specifics here: I have a MacBook Pro M2 Max, 32 gig, and I found that the 14-billion-parameter distilled DeepSeek R1 model is about at the edge of what I can run, but it's still fun to try, as discussed before.

All I need to do to run this is set my Tavily API key, which allows for web search, and kick off this command. When you do that, you're gonna see the LangGraph server spin up, and you can start interacting with it directly in your browser. So you're gonna see this in your browser. This is pretty nice; it's a little environment that I like to use to play with assistants that I create using LangGraph. You can see this shows the overall flow of our assistant here: it's gonna generate a query, do web search, summarize the results, reflect, and go back. This is a nice test bed for looking at different local models. What you can do is open up this configuration panel and paste in whatever local model you've downloaded from Ollama and wanna test. In my case, let's test 14B. I'll have it iterate twice, so two loops, and we can ask any question here. So let's say: give me a report on RL training approaches. Fine. All I have to do is submit. So it's generating our query, and it's using structured output. That's good; that part of the flow is working as expected. Nice. Now it's using Tavily to do web research. You can look at the repo to dig into that; I have a separate video on all this in detail. So now it's summarizing my sources. This is kind of nice: you see it stream as it goes, and you can see those think tokens again. Now, in the repo, I added a filter to remove them, because I found that they do affect some downstream processing. So I'm gonna filter this out when I save it to state in my graph, but you can see how much reasoning and thinking it's doing.
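That filter is conceptually very simple: drop everything between the think tags before the text gets saved to graph state. Here's a minimal version as a sketch; it's not necessarily the exact code in ollama-deep-researcher.

```python
import re

def strip_think_tokens(text: str) -> str:
    """Remove the <think>...</think> block that DeepSeek R1 emits before its
    answer, so downstream steps (reflection, the next summarization pass)
    only see the final response."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>The user wants an extended summary...</think>Here is the updated summary."
print(strip_think_tokens(raw))  # -> "Here is the updated summary."
```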
So now it's reflecting on my summary, and I've stripped out those thinking tokens during this reflection phase. This is pretty cool: it finds a knowledge gap, generates a follow-up question, and now it's done more web research and is summarizing based upon my initial summary and the new web resources it's retrieved. So now here's my updated summary. It's thinking that it needs to extend the summary. Okay, I mean, that may not be a bad thing that it's expressing its process for us to see. Look, it's listing the points properly; it reasons that it needs to seamlessly integrate these points. Good. It's highlighting the new resource that it found. Good. Okay, so it's being very expressive about everything it needs to do based on the instructions I give it. And it's done thinking. And cool, it is updating that summary, so you can see the summary now describing reinforcement learning as a field of AI. It's gonna reflect again and go back; we'll try that one more time. And again, it pulled a new paper, and it's thinking it needs to extend the summary further. In a way, I kind of like this thinking process, because it really explains what it's actually doing and how it's reasoning about it. And you can see it's updating its summary, and it actually looks pretty sane, pretty nice. It gives us information about the global market for RL. So fair enough. And it exits. So now we get the final summary. My system will add this little summary section to the top, and we have our nice written summary here, and it'll add the sources. We can go ahead and look at that in LangSmith as well; we can see everything it did, and we can look at the final summary, which is right here. So here's the summary, and there are our sources.

So basically, my take is that the think tag thing is kind of annoying. Actually, I kind of like to see it as a developer, but it's annoying to manage if you're trying to build an application with this, because the model emits it in the output and you have to process it out. The 14-billion-parameter model is at the edge of what I can run locally on my laptop; of course, it depends on your hardware. And the summary looks quite nice and comprehensive, and it all ran locally, so it's all for free. So listen, it's pretty cool that you can have these reasoning models now running locally on your laptop. They'll obviously get better, and this think token issue will be resolved, I'm sure, in the near future. I very much encourage you to play with it. I find Ollama to be a very nice, easy way to access these models, but there are also some other ways to do it. I think this is a really nice step forward, and it's really cool that we actually have visibility into how these models are trained and that this is all open source. So thank you to DeepSeek for releasing this and to Ollama for getting this up really quickly. Anyway, I hope this was informative, and thanks.