Unveiling China's New AI Models
Explore the latest breakthroughs in Chinese AI reasoning models, their open-source release, and their implications for global AI dynamics and policy.
File
Emergency Pod: Reinforcement Learning Works! Reflecting on Chinese Models DeepSeek-R1 and Kimi k1.5
Added on 01/29/2025
Speakers

Speaker 1: Welcome to the Cognitive Revolution. Today, I'm going to do a walkthrough of everything that I am learning and understanding and taking away from the latest Chinese reasoning model releases that have come out this week. Perhaps not coincidentally, both R1 from DeepSeek and the new Kimi reasoning model from a company called Moonshot AI were released on Trump's inauguration day. And we now have two Chinese models. DeepSeek is out. The weights are open source. You can download them. The Kimi paper from Moonshot came a little later in the day. Honestly, shades of OpenAI and Google kind of racing to preempt each other with their launches. I don't know if that's what's happening in China or not, but it certainly had the flavor of it: DeepSeek put their paper out, and then a few hours later, here comes the Kimi paper. Their model isn't quite yet available. They said it will be available via API and presumably in their product soon, but it's not yet. So unclear what's going on in China. Did they intend to put these out on Trump's inauguration day, or was that just an accident? It's kind of hard to believe that it is an accident, but at the same time, these folks are focused on unraveling the mysteries of AGI with curiosity. So maybe they don't care about when Trump is getting inaugurated, or maybe they're coordinating. I have a lot of questions about the dynamics that are going on in China behind this, but what is clear is that at least DeepSeek, with this R1 model, has joined the top tier of global AI developers. Potentially Moonshot with their Kimi model could be there as well, but it's obviously hard to say that kind of thing purely from the benchmarks. So we'll have to wait and get our hands on it before we can be too confident about that. What I want to do is walk through what stands out to me about this and try to make some sense of it. I suspect this will be the first of several conversations about this, because this R1 story touches on so many different aspects of AI at the same time. The research itself is really important. The consequences for just practical utility are significant as well. The gap between closed source and open source, also the gap between the West and China, I would say shrinking gaps at the moment. Certainly the gap between the West and China seems to have shrunk significantly from where it was a couple of years ago. And that's just the start, right? Then there's of course the strategic dynamics: why is China open sourcing this? What are they getting out of that? How, if at all, should the US respond? Does this challenge narratives that are increasingly dominant in the West about an AI race? We've now seen no less than Alex Wang, the CEO of Scale AI, take out a full page ad in the newspaper calling the current situation an AI war, which I honestly totally hate and think is wildly irresponsible. You don't have to be a China dove to recognize that an AI war does not exist and would be bad for everyone. I hated to see that. We need to reconsider some of those framings in light of what we're seeing here, and certainly the strategy that we want to play. If our strategy is predicated on preventing China from doing certain things so that we can have a certain advantage, solve certain problems, and be the good guys, then I think the window of opportunity that we have, where we're going to have this sort of unassailable AI lead, looks quite short.
I think we'll understand that better as we go through the research and understand how simple a lot of the stuff driving these significant advances in reasoning capability really is. But at the end of all that, what sort of policy response, if any, makes sense to this? Does it still make sense to think about pre-training as being the real measure of model power, or the standard by which a model would qualify for some sort of special process, special government review, special government notification? I think that is highly questionable in light of the power of the reasoning paradigm, because significant gains over pre-training are showing up with presumably a lot less compute. Also, they're being distilled into much smaller models. It does seem like we have crossed a meaningful threshold this week, where prior to this week, there was not really a good quality reasoning model that I could run on a local machine. Now I've got a wide range of things that are open source that I can download that have been trained specifically as reasoners, which I can further modify, including with more reinforcement learning. This is really one of the most important stories that has come to the public in AI in a while. I wanted to help make some sense of it. I thought we would start, and I am doing a screen share this time around. So if you are listening to this, I think it will be fine. I will plan to basically read everything that's important. If you're a more visual person and want to see stuff on screen and be able to read along, I'll have the screen share on YouTube as well. I am using a variety of AIs to help make sense of this. You'll see me tabbing back and forth and looking at various sources, but it should be fine in audio format if that's what you prefer. Let's start by talking about what the R1 model is and how they created it. There's a couple different flavors, and I think there's a couple different big takeaways from this. First of all, they just recently came out not too long ago with their DeepSeek V3 model. This made headlines on its own for being a top-tier model that was made incredibly cheaply. Zvi, as always, has great coverage of this. He calls it the $6 million model. This is a small percentage of what Western AI leaders are understood to have spent to train their top-tier frontier models. A lot of work has gone into the efficiency there, a lot of work on the data curation side, a lot of work on optimizing the algorithm, closely coupling the design of the neural network itself to the hardware that it's going to run on. All sorts of interesting things have gone into that, but suffice it to say that with a total compute budget of single-digit millions, the DeepSeek V3 model already was a pretty big we-are-here-and-cannot-be-ignored statement from DeepSeek. Also a real warning shot for the idea that you're going to deny China, the entire nation, and Chinese companies in particular, access to compute to prevent them from doing frontier work. Maybe they won't be able to do it in the future if compute requirements continue to get bigger and bigger and become super high, but if they're able to achieve this on a single-digit million-dollar compute budget, I think it's going to be hard to keep the top-tier Chinese companies from having enough compute manufactured domestically in China or smuggled in. A $6 million compute budget is just not that much, and already they are hitting GPT-4o and Claude Sonnet level with that budget.
That's the base model from which this new R1 model is trained. It's a large model. It's a mixture of experts architecture that has 600-and-some billion parameters. Because it is a mixture of experts, the point is that you can have lots of parameters, but you don't use all the parameters at runtime. That has been shown to allow for faster learning, for better knowledge absorption, overall better performance, while still keeping your inference costs relatively low. It doesn't mean your inference is necessarily simple. When you have 671 billion parameters, you are talking hundreds of gigabytes of content. This is not something you can even download onto a typical laptop, and it's certainly not something you can run on a typical computer. You're going to have multiple GPUs. You're going to have to configure those in an array to have any sort of decent throughput. This will run efficiently at scale, but quite inefficiently if you're just doing it for yourself. You would not want to buy a bunch of computers and configure them at home just to set up this 671 billion parameter model. You want that resource to be shared. Not surprisingly, they have an API, and that'll be their business model, but it is out there for other people to do too. We are starting to see, certainly with the R1 model, inference providers set it up. Even with the R1-Zero, which I think is arguably the most interesting result and model to come out of this, there are even places now where you can go check that out. So that's the base model. It's the 671 billion parameter mixture of experts, GPT-4o and Claude Sonnet level, with 37 billion parameters active at any given time. They come around and say, we want to take this model, use it as our base model, and create a reasoning model out of that. Let's see how they did it. There's two models in this paper. One of the important things to keep clear is that there is a model called R1-Zero, and then there is the actual R1 model, which is what you will see in most places. R1-Zero is maybe the more important story, maybe the more important result. There is an allusion to AlphaZero, the classic DeepMind game-playing AI architecture that was able to learn purely through self-play. The earlier AlphaGo before that, and the earlier Alpha game players, typically had human data that they were trained on. So, if my history is correct here, and I think it is, the original AlphaGo that became the best Go player in the world was initially trained on human data. They were then able to generalize and create a model that did not require human data and was just able to play against itself, get the reward for beating itself, and gradually self-improve. And this is how all of a sudden this AlphaZero architecture was able to take on tons of games and crack them all, because it didn't need data. It just needed to have the compute and the runtime to play itself, get reward from that, and gradually learn how to win at the game. So they're applying a very similar paradigm here to the R1-Zero. What they are doing is basically pure reinforcement learning on top of that large base model. They do not have human preference data. They do not have human demonstration data. So there's no supervised fine-tuning. There's no examples of this is how we want you to reason. Although that is coming later, it's not included in the R1-Zero model. And they don't have any reward model. So it's not only that there aren't humans looking at the outputs and evaluating them.
There's not even another model being used to give reward signal to the base model as it improves through this reinforcement learning process. Instead, what there is, is just one of the simplest things you can imagine. And that is a rule-based reward system. Essentially, they give the model problems, and if it gets the problem right, they give it a reward. And if it gets it wrong, they don't give it a reward. That simple. I have been using DeepSeek R1 itself on the DeepSeek product, which is freely available. I'm not paying for it. There might be a limit at some point where I need to start to pay. But as of right now, if you just go to chat.deepseek.com, you are using the product of a Chinese company. It is presented to you in a pretty nice, normal interface with English UI. And you can just have your conversation with it. You can have your conversation with R1. What I'm talking to here is the productized R1, not the R1-Zero that I'm talking about initially. I've just taken the paper, dropped it in, and started to ask questions about it, to get answers to those questions and get a sense for the R1 model itself. Not everything about how they did this is entirely clear from the paper. Some things are not made fully clear. Some things they didn't intend to disclose necessarily. Some things I might be misinterpreting. And so I'm going to seek some help on making sure I have the characterization, or my understanding, correct as I work through all this. One of the questions I have this morning on the nature of the accuracy reward they give the R1-Zero model is, is it just binary or not? You may remember from earlier reinforcement learning experiments, including WebGPT back in the day, where OpenAI tried to get an early GPT model to use the internet. And they found that basically it didn't work. And why didn't it work? Because of what is known as the sparse reward problem. It didn't do anything successful. It got no reward. So it had no signal to learn from. That is a common challenge and could pop up here if you were to take a regular base model and try to apply reinforcement learning to it. If it can't do any of the problems that you're giving it, remember, GPT-3 could do like two-digit arithmetic, but couldn't do three, and made a mess of anything logical beyond that. If your model isn't strong enough to get any of the questions right, you have a bit of a problem. There are a lot of different ways that people try to get around that challenge. In the case of AlphaZero, it was self-play, right? So even if both models are weak players, one of them will win. That one's better than the other. Hopefully we can get it to learn from that. At every stage of the process, it can continue to self-play. One of them will win, and it'll continue to get a little better. So that's the self-play paradigm. There's also curriculum learning paradigms, things where you might give it partial credit if it makes some good steps towards solving a problem but doesn't ultimately get the right answer. But those are complicated, right? Now, how do you determine if something should get partial credit? Was it on the right track to the right answer or not? You could use another model to do that, but it can get complicated. What DeepSeek R1 says about the R1 paper is that they're just giving it binary reward, just saying you got it right or you got it wrong, and that's it. Of course, the algorithm is a little bit more complicated.
They use something called group relative policy optimization. It is a clever take on this sort of unsupervised RL where they give the model a problem. They get the model to try to answer that challenge multiple times. DeepSeek said that 16 times is common in reinforcement learning literature broadly, so maybe something like that. And then they take the average score out of those responses and kind of treat that as the base and then look for the answers that had the highest score relative to the average and reward those relative to the average. And sort of the strength of reward that it gets is how much better that answer was than the average from all the generations it created. It's not entirely clear what the mix of questions are, but at a minimum, we're seeing math and code. We know from previous literature that training models on code makes them better reasoners. More importantly, there's a lot of code problems out there, and those code problems have objective answers, right? And this also does get you into some very natural partial credit. If it's a math problem, maybe you just get the right answer, you get the wrong answer, and it's a binary signal. In a coding context, often these problems are set up where there's a whole suite of unit tests, and to fully pass, all of the unit tests have to come back in the way they're meant to. If seven out of eight do, or if two out of eight do, then that is a different result, and that could be a different signal. So you could start to get not just binary, but a sort of scalar result that would be a richer signal where you could at least be getting some early progress, even if the model can't do all the hard problems. Hey, we'll continue our interview in a

Speaker 2: moment after a word from our sponsors. Even if you think it's a bit overhyped,

Speaker 1: AI is suddenly everywhere, from self-driving cars to molecular medicine to business efficiency. If it's not in your industry yet, it's coming, and fast. But AI needs a lot of speed and computing power. So how do you compete without costs spiraling out of control? Time to upgrade to the next generation of the cloud, Oracle Cloud Infrastructure, or OCI. OCI is a blazing fast and secure platform for your infrastructure, database, application development, plus all of your AI and machine learning workloads. OCI costs 50% less for compute and 80% less for networking, so you're saving a pile of money. Thousands of businesses have already upgraded to OCI, including Vodafone, Thomson Reuters, and Suno AI. Right now, Oracle is offering to cut your current cloud bill in half if you move to OCI for new U.S. customers with minimum financial commitment. Offer ends March 31st. See if your company qualifies for this special offer at oracle.com slash cognitive. That's oracle.com slash cognitive. What does the future hold for business? Ask nine experts and you'll get 10 answers. Bull market? Bear market? Rates will rise or fall? Inflation's up or down? Can someone please invent a crystal ball? Until then, over 41,000 businesses have future-proofed their business with NetSuite by Oracle, the number one cloud ERP, bringing accounting, financial management, inventory, and HR into one fluid platform. With one unified business management suite, there's one source of truth, giving you the visibility and control you need to make quick decisions. With real-time insights and forecasting, you're peering into the future with actionable data. When you're closing books in days, not weeks, you're spending less time looking backward and more time on what's next. As someone who spent years trying to run a growing business with a mix of spreadsheets and startup point solutions, I can definitely say don't do that. Your all-nighters should be saved for building, not for prepping financial packets for board meetings. So whether your company is earning millions or even hundreds of millions, NetSuite helps you respond to immediate challenges and seize your biggest opportunities. And speaking of opportunity, download the CFO's guide to AI and machine learning at netsuite.com slash cognitive. The guide is free to you at netsuite.com slash cognitive. That's netsuite.com slash cognitive. What really comes out of this is that it just works, right? Just that simple accuracy reward. They also have something called a format reward, which is making sure that it gives you a kind of two-part answer, in a similar way to what we've seen from O1. The R1 model has thinking tokens, but it literally just puts out an XML thinking tag and then writes a bunch of stuff, and then an end-thinking tag. And then there's an answer tag and the actual answer, which is supposed to be a summarized thing that you as a user will focus on, and then there's the end-answer tag. The format reward is just making sure that it does that. Most of the improvement is definitely coming from the accuracy reward. The format reward is trying to constrain it into the response format that they want. What is crazy about this is that it works well. It seems to work quickly, and it is giving rise to all sorts of emergent behaviors that people did not exactly expect in advance. Obviously there's been a lot of debate about what is an emergent behavior. Are they a mirage? Whatever. I thought Dr.
Michael Levin, famous biologist and a couple-time guest on the podcast, had maybe the best definition of emergent behavior. He said it's relative to some observer. If you were smart enough to predict in advance that this would happen, then it's not an emergent behavior to you. But if you were not smart enough to predict that this would happen in advance, then it is emergent to you. I don't know exactly what the state of mind among the DeepSeek researchers was. Obviously they had some reason to think that this might work, but they also express some surprise around just how well it worked and the nature in which it seems to work. So what do they observe? Well, basically what they observe is that the thing learns to reason pretty much on its own, just from the accuracy reward signal that it's getting. The first thing that they show very clearly is that the length of the thinking process, the length of the chain of thought, grows pretty consistently throughout the training process. So this is a graph from the DeepSeek paper, and they show that the thinking process starts off very small early on in their training process and grows rather linearly throughout the training process. It starts at just a few hundred thinking tokens, and they have some very simple base prompt because they're working from a base model. So they have a prompt template that is like: there's a user and an assistant, and the assistant is helping the user. User says blank, and then assistant says blank, and the assistant kind of picks up from that template. At the beginning it does generate thinking tokens, but not that many, maybe something like 500 tokens on average per response. And then that just grows. The model quickly figures out that a longer chain of thought helps, and obviously I'm anthropomorphizing there, but the learning process naturally tends toward a longer chain of thought, because those lead to a higher chance of getting the questions correct, and those responses get reward. And so more, longer chains of thought happen, right? It's a pretty simple feedback loop. And I think what jumps out to me most about this graph is that it is not a log scale on the X axis, and that it does not appear that the curve has flattened off. So when we've heard from folks at OpenAI that we know how to scale this paradigm past where we are, this is a pretty good indication of that right off the bat, right? They've talked about what would happen if an AI were to, you know, right now it's thinking for seconds or a few minutes before giving you an answer; what happens if it thinks for days? And you might think, how are they going to get it to do that? It seems that possibly it could be as simple as creating a model that has long enough context. Certainly with things like Gemini, which are now at a million token or a couple million token context window, we've made real progress. This model does not have that same long context. I think it is 128,000 tokens. So that's on par with GPT-4o, short of the Gemini models, but still substantial. So that could be one limiting factor. The important thing to highlight is that the reinforcement learning process, as you go through these steps, just continues to deliver more and more thinking; more and more thinking gives you better answers; that is rewarded; and that keeps happening. Over 8,000 steps, they go from 500 tokens being generated in the thinking response to roughly 10,000 tokens on average.
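Before going further, here is a minimal sketch pulling together the training-loop mechanics described so far: a bare-bones prompt template, the rule-based accuracy and format rewards, the unit-test partial credit for code, and a GRPO-style group-relative advantage. Treat it as an illustrative simplification rather than DeepSeek's actual implementation: the template wording is a paraphrase, the tag names and group size of 16 follow the paper's description, and the real GRPO objective also involves a clipped policy-ratio loss and a KL penalty against a reference model, which are omitted here.

```python
# Illustrative sketch of R1-Zero-style ingredients: bare-bones prompt template,
# rule-based rewards, and a GRPO-style group-relative advantage. Simplified; the
# real GRPO objective also has a clipped policy-ratio loss and a KL penalty.
import re
import statistics

# Paraphrased template (approximate wording, not DeepSeek's verbatim prompt).
TEMPLATE = """A conversation between a User and an Assistant. The User asks a
question and the Assistant solves it. The Assistant first reasons between
<think> and </think> tags, then gives the final answer between <answer> and
</answer> tags.

User: {question}
Assistant:"""

def format_reward(completion: str) -> float:
    """1.0 if the output is a <think> block followed by an <answer> block."""
    pattern = r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: did the <answer> block match the known correct answer?"""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0

def unit_test_reward(n_passed: int, n_total: int) -> float:
    """For code problems, a richer scalar signal: fraction of unit tests passed."""
    return n_passed / n_total

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample relative to its own group: (r - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid divide-by-zero if all equal
    return [(r - mean) / std for r in rewards]

# Toy example: score a group of 16 sampled completions for one math prompt.
group = ["<think>1 + ... + 100 = 5050</think><answer>5050</answer>"] * 5 \
      + ["<think>hmm</think><answer>5000</answer>"] * 11
rewards = [accuracy_reward(c, "5050") + format_reward(c) for c in group]
print(grpo_advantages(rewards))   # correct completions get positive advantage
```

The only learning signal in this sketch is the group-relative advantage: completions that beat their own group's average get pushed up, the rest get pushed down, with no learned reward model anywhere in the loop.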
And now, what is it doing in all those tokens? That chain of thought you can read from R1, and this is one of the biggest differences between the user experience of an OpenAI model and using DeepSeek. I should also give credit to Gemini Flash Thinking, because it also works similarly. Both R1 and Gemini Flash Thinking basically respond immediately. There might be a couple-second delay before the first token, but immediately you're reading the thinking stream, the thinking chain of thought. It's not hidden from you, and you can see how it's going about its business, unlike with the OpenAI models, where the raw chain of thought is hidden. So what is happening? Well, to quote from the paper, they say that behaviors such as reflection, where the model revisits and re-evaluates its previous steps, and the exploration of alternative approaches to problem solving, arise spontaneously. This is remarkable, right? This base model has been trained on reading the entire internet. It's seen a lot of stuff, but was not specifically trained to problem solve at all. It's just trained to predict the next token, but there's enough out there, enough examples of this sort of problem-solving behavior on the internet, that it was at least able to learn that. Because language models are stochastic, if nothing else, at the level of picking what token they're going to add to the token stream at each forward pass, sometimes they start to use these behaviors spontaneously. And that behavior does work: even though it wasn't trained to use reflection, even though it wasn't trained to explore alternative approaches, when it occasionally does those things, it works better, that gets reward, and that gets reinforced. And so you start to see more and more of it. And that is measurable in the sense that the chain of thought gets longer and longer as you go. It's also observable to researchers and to us as users; we can see the reasoning behaviors the model is going through. Many people have commented that it looks rather human. Here in this context, I just pasted in the R1 paper itself in full and put a simple prompt below that: above is a paper about new reinforcement learning research, can you tell me what's differentiated about the reinforcement learning approach outlined in this paper? Then you see the chain of thought, and it does have a sort of first-person narrative style to it. Okay, let me try to figure out what's special about the RL approach in this paper. First, I'll skim through the abstract and sections to get the main points. The paper introduces DeepSeek-R1-Zero and DeepSeek-R1 models. Now keep in mind, this is the DeepSeek R1, not the R1-Zero that I've been talking about so far. The friendliness of this is not something that comes automatically out of the reinforcement learning process. That was the R1 product version. And we'll talk about the differences of how they trained that in a minute. But the R1-Zero, which is trained purely on this reinforcement process and starts to show problem-solving behaviors, unfortunately does not behave so nicely. The chains of thought are not super readable for people, nor consistently in the same language. One of the things reported was that the chain of thought often spontaneously switches back and forth between languages. That is quite strange, but that's a reflection of the fact that the base models are strange and reinforcement learning is strange.
Reinforcement learning on AlphaGo, going back to that famous example, gave rise to move 37, which people thought must be a mistake until it turned out to be the brilliant, game-changing winning move. How did it learn to do that? It learned to do that by self-play, by just finding what works and by getting reward. It was not a system designed to create moves that made sense to people. So we're seeing something similar here. It's getting much better at problem-solving as it goes through this reinforcement learning process, but the outputs are not necessarily super readable from a human perspective. And this language switching is just one dynamic. One of the big things we should take away from this paper is that with reinforcement learning, at least under certain conditions, including having a powerful enough base model, you can start to get traction on these hard problems so that you can get the reward signals and bootstrap your way into problem-solving behaviors. At least under the right conditions, reinforcement learning produces weird stuff. Stuff that we are going to have a hard time interpreting, but that in some way the model is able to use. Even straightforward language switching happens spontaneously, which makes it really hard for a person to make sense of what is going on. So this is incredibly important in that it suggests that there is this path where you take a base model, you do reinforcement learning on it, it gets really good at something, but you don't have great insight into exactly how it's working. And potentially you can't even read the chain of thought. Now you can think about combining that with other recent stuff. Meta recently put out a paper about reasoning in a continuous latent space. And this is pretty problematic, honestly, but basically what they did in that paper is say: chain of thought is effective, but there's a lot of information lost at the end of each forward pass, when you take all this dense computation that the model has been doing internally and you collapse that down to just a single next token. There's a lot of information lost there. Could we preserve that information somehow? What they decided to try was simply taking the last hidden state, the last layer of activations before the final token selection, and instead of actually selecting a token and then going back to the beginning, appending the token, and using that single token's embedding to continue the process, they take that last internal activation and insert it into the model in place of the next token embedding. So the model gets to continue to think from where it was in its own continuous latent space. That's why they call it reasoning in a continuous latent space. What they find there is, first of all, that it does work; second, that it works more efficiently. The average length of a chain of thought needed to get comparable results is smaller when you're reasoning in that continuous space, as opposed to doing explicit token selection and appending to the token stream at each step. So the efficiency gain is notable. And they also showed that for certain kinds of problems, it is a more effective approach, specifically for problems where breadth-first search is the way you want to go. With a model just doing single actual token prediction, it's hard to do breadth-first search, right? Because breadth-first search would imply that you're going to consider a little bit of a bunch of different options before choosing what path you ultimately want to go down.
When you're choosing individual tokens, it's hard to do that. Individual tokens don't represent all these different paths at the same time. But what they found is that the internal states can represent multiple different paths at the same time. And so the model in this continuous paradigm can sort of consider all these different paths in parallel and then make a better choice more efficiently. Hey, we'll continue our interview in a moment after a word from our sponsors. 2025 is shaping up to be a crazy year, and I'm getting a lot of questions about how people should manage their careers. Increasingly, my best advice is to go ahead and do what you've always dreamed of doing. If that involves starting a business, you should know that there's never been a better time than now, and there's never been a better platform than Shopify. In the past, being a small business owner meant wearing a lot of hats, and a lot of time spent doing things you didn't necessarily want to be doing. Being your own marketer, accountant, customer service rep, and more. Today, it's increasingly about focusing your time, energy, and passion on making a great product and then delegating all that other stuff to AI. Of course, to get quality work from AI, you have to provide the right context, structure, and examples, and that's actually a big part of what makes the Shopify platform so powerful. Shopify has long had thousands of customizable templates, and their social media tools let you create shoppable posts so that you can sell everywhere people scroll. Now, they're building their own AI sidekick, Shopify Magic, designed specifically for e-commerce. All this makes it incredibly simple to create your brand, get that first sale, and manage the challenges of growth, including shipping, taxes, and payments, all from a single account. And if you need something special, Shopify also has the most robust developer platform and app store, with over 13,000 live apps. Case in point, I'm currently working with my friends at Quikly to build an AI-powered urgency marketing campaign platform for e-commerce brands, and it will be launching, you guessed it, exclusively on Shopify. Establishing 2025 has a nice ring to it, doesn't it? Sign up for a $1 per month trial period at Shopify.com slash Cognitive. Cognitive is all lowercase. Go to Shopify.com slash Cognitive to start selling with Shopify today. That's Shopify.com slash Cognitive. Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001, centralize security workflows, complete questionnaires up to five times faster, and proactively manage vendor risk. Vanta can help you start or scale your security programming by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time. For a limited time, listeners get $1,000 off Vanta at Vanta.com slash Revolution. That's V-A-N-T-A dot com slash Revolution for $1,000 off. 
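Getting back to the latent-space idea from before the break: here is a rough toy sketch of the mechanism, using a small Hugging Face model as a stand-in. Instead of sampling a token at each step, the final-layer hidden state at the last position is fed back in as the next input embedding for a few "latent" steps before normal decoding resumes. This only illustrates the inference-time trick; Meta's actual work also trains the model to make use of these latent thoughts, and the model choice, prompt, and step count here are arbitrary assumptions.

```python
# Toy sketch of "reasoning in continuous latent space": feed the last hidden
# state back in as the next input embedding instead of sampling a token.
# Model choice (gpt2) and the number of latent steps are arbitrary stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Question: A train leaves at 3pm traveling 60 mph... Let's think."
input_ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(input_ids)           # (1, seq, hidden)

with torch.no_grad():
    # A few "latent thought" steps: no token is chosen; the final-layer hidden
    # state at the last position becomes the next position's input embedding.
    for _ in range(4):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]      # (1, 1, hidden)
        embeds = torch.cat([embeds, last_hidden], dim=1)

    # Resume ordinary decoding from the accumulated embeddings.
    logits = model(inputs_embeds=embeds).logits[:, -1, :]
    print(tok.decode(logits.argmax(dim=-1)))
```

The key design point is that the continuous hidden state can keep several candidate continuations superposed, which is the intuition behind the breadth-first-search advantage described above.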
So there is a draw to this non-transparent reasoning in continuous space. It's easy to imagine these techniques being combined, right, to put reinforcement learning on top of something like that. The problem is you're getting something that is really quite alien. It can be very powerful, but it really is quite alien. Basically, the R1-Zero model, when it comes to doing the reasoning, math problems, coding problems, hits at basically the same level as the main R1 model. It is not weak. It is really quite strong. And here they show a graph. Again, this is R1-Zero, and this is the AIME benchmark. What they're showing here is that when they start with the base model, it's really low on this benchmark. Through the reinforcement learning process, it grows initially really quickly. You don't see a fully linear progress graph here. There is sort of initial faster progress, and then it does appear to be bending, although looking at this graph, I would not say that we've hit any sort of flat point. It seems that improvement could continue, but they basically go for 8,000 or 9,000 or so steps. And by that time, R1-Zero is on the level of the O1 model that was announced in September. So we are, from September to January, a four-month time gap, seeing R1-Zero essentially catch up to O1, and it does not appear that the curve has flattened. And we know from OpenAI's work, with their more recent announcement of O3, that their curve didn't flatten either. So it works really well. This is like a truly amazing result. The sort of wonder of this is not lost on the authors. They highlight a moment they call the aha moment of DeepSeek R1-Zero. They show the transcript of it doing some math problem, and in the middle of the chain of thought, the model says: wait, wait, wait, that's an aha moment. Let's reevaluate this step-by-step to identify if the correct sum... That wait, wait, that's an aha moment is remarkably human-seeming behavior, right? Certainly very relatable. I used to say no eureka moments from AIs, and I've had to walk that back, because now we are seeing various kinds of eureka moments from AIs. Here we see a eureka moment from an AI in a behavioral sense. This thing is just doing its thing, and it is reporting a eureka moment. It calls it an aha moment. The authors themselves say this is an interesting output from the model. They say the model learns to rethink using an anthropomorphic tone. And then they also say, this is an aha moment for us, allowing us to witness the power and beauty of reinforcement learning. So the power and beauty of reinforcement learning is the aha moment for the authors; for the model itself, it's figuring out some better approach to solve the math problem that it was given. But this is a qualitative shift in terms of what models can do. And you see that in the benchmark performance, right? We start off with the base model, and remember this is DeepSeek V3, which is a top-tier model, even though it was trained on just single-digit millions of dollars, super efficient, but pretty much cutting edge. It achieves O1-level performance from that base model, starting super weak, through the reinforcement process. It climbs the curve, depending on exactly which version of the test you look at, pass@1 or consensus@16. Pass@1 is, did it get it right in one shot? Consensus@16 is, was the majority vote out of 16 different answers correct? You have a better chance of getting the answer correct.
If you get 16 chances and you take the consensus answer versus if you just get one shot: when it had one shot, it started off at 15%. When it had 16 chances and took the consensus, it started off at 25%. By the end, its single shot was up to over 70%, and its consensus was up to like 85%. So this is significant. That's a huge delta, right? That's like a huge, huge leap in reasoning, all through pure reinforcement learning, just by getting questions correct and being rewarded. Nothing complicated. It all just works. As a result, the chains of thought start to be weird, right? You have your language switching, language mixing, poor readability. It's not necessarily human friendly, but it is able to solve the problem. So I think that is a major sign of things to come. We should expect that with sufficiently powerful language models, this reinforcement learning paradigm can be applied to a wide range of things, right? We know that it works best and easiest where there are objective results. Just think about how many different objective results you can get, especially if you put this into a broader environment; that is a term from reinforcement learning. The environment is the thing that gives the reward back to the model. As you start to think about an environment that is the world, and different objectives that people might want to train their models on, there are a lot of objective signals that you could get out there. Some of these could be quite nefarious. Think of a signal like, how many dollars did you scam a person out of? If you have a model that's powerful enough to scam anybody any of the time, then with enough at-bats, you can probably learn to become a literal superhuman scammer. And there's a lot of different things that people might want to get a model to be super good at, where they can have some objective answer as to how well it did, even if the objective answer is not so clean and easy to acquire as it is in the context of math and programming. That's a big deal. It really can't be overstated just how big of a deal that is and how versatile that technique is. We've heard similar things from Western leaders in recent days as well. From OpenAI, we've heard things like everything they're doing is super scalable. We've seen comments like the models have not been told how we want them to reason; all of this stuff is emergent. I was surprised to hear that from OpenAI, especially because of what we'll get into in just a second with the contrast between R1 and R1-Zero. Nevertheless, they have said that. We've also heard things from Dario, where he has said, what we've seen internally at Anthropic convinces me that we are definitely on the path to AGI and probably superintelligence in the next few years. One has to assume that even though they haven't launched one yet to the public, they are very much experimenting with reinforcement learning approaches to creating reasoning models. And I'm sure they're seeing similar progress. And I'm sure, because this is so simple, right? If this was like, oh my God, there's so many different tricks involved, then maybe Anthropic hadn't figured out those tricks. But we see this working at OpenAI. We see it even working past the O1 level. Obviously, we see it working at Google with Gemini Flash Thinking. They almost certainly have a stronger model internally as well that presumably will be coming soon. And one has to assume that Anthropic is right there with them, even though it hasn't been launched. It seems like this stuff works.
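To pin down the two evaluation numbers from that benchmark discussion: pass@1 scores a single sampled answer, while consensus@16 samples 16 answers, takes the majority vote, and scores the winner. A tiny sketch with made-up data; the function names and toy samples are purely illustrative.

```python
# Toy illustration of pass@1 vs consensus@16 scoring on a single problem.
from collections import Counter

def pass_at_1(samples: list[str], gold: str) -> float:
    """Score just one sampled answer (here, the first)."""
    return 1.0 if samples[0] == gold else 0.0

def consensus_at_k(samples: list[str], gold: str) -> float:
    """Majority-vote the k sampled answers, then score the winning answer."""
    winner, _ = Counter(samples).most_common(1)[0]
    return 1.0 if winner == gold else 0.0

# 16 noisy samples: the single draw happens to be wrong, but the majority is right.
samples = ["41"] + ["42"] * 9 + ["41"] * 6
print(pass_at_1(samples, "42"), consensus_at_k(samples, "42"))   # 0.0 1.0
```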
The bitter lesson: scale something up and just power through it, and lo and behold, it works. And this time it is striking when it comes to bringing effective reasoning and problem solving online for language models. What did they actually do to productize this? It's really just about making the whole thing more predictable, more human friendly, more readable, and also more general. Because when you do purely this training on math and programming problems, how is that model going to respond as a general purpose assistant? There aren't too many places online right now where you can try the R1-Zero model. I do think that spending some time trying the R1-Zero model would be a really good use of time for certain people who would enjoy that sort of thing. I absolutely encourage you to go find an opportunity to try the R1-Zero model and bring back those results to the rest of us. We are going to see more of these language models trained on reinforcement learning at scale with a brute force approach, and we don't really have a great sense now for what those things are going to be like. The only place that I've seen where you can actually try the R1-Zero model in an online way is Hyperbolic. They do have what they claim to be a DeepSeek R1-Zero. The only reason I say they claim, and I don't have any reason to doubt them, is that I did not observe any language switching or any serious weirdness in my first few tests. So I wasn't 100% sure that this was in fact the R1-Zero model, but they claim on Hyperbolic that it is R1-Zero. And I presume there will be at least a few places in the near term where you can go try the R1-Zero model. So I think we would all benefit from people spending time with the R1-Zero model in all of its weirdness and really starting to try to understand what sort of weirdness this produces. Because even though DeepSeek went ahead and made a more productized version, I think we're going to see a lot of things where people are just going to apply this paradigm and cook it for a while and see what comes out. I think we could see a lot of quite weird, very powerful, but quite weird, quite illegible model behavior coming out of this paradigm. So now moving on to R1. They said, okay, we don't want this language switching; we want to have more human-friendly responses. How do they go about doing that? A pretty simple approach. What stuck out to me was that it was just a multi-stage training process where reasoning comes first and then they layer on general-purpose, helpful assistant behavior. This is actually contra what Nathan Lambert and the Allen AI team did in their recent work, which we had a whole episode with him about. There, you do your base model pre-training first, then you do your supervised fine-tuning on all sorts of different things, diverse queries, and then you do your reinforcement learning. What they did at the end was layer on the objective-results training. With the R1 model, they've reordered it. They are doing a little bit of supervised fine-tuning first on the purely logical reasoning problems. They call that cold-start data. By carefully designing the pattern for cold-start data with human priors, they're able to get the model on the right track. I've talked a lot over time about the importance of model fine-tuning with chain of thought that demonstrates the pattern of reasoning to get your task done the right way consistently. That paradigm is validated by all this research, and I think it is going to have me soon going back and updating some of that stuff.
If you want to check out my guide to AI automation, there is an episode of the podcast on AI automation. We put that out right at the time that GPT-4o fine-tuning was introduced, which I think was back in August, September. Everything about this new research supports what I was saying there. They start off by saying, okay, we want this thing to reason in certain ways. We want it to follow the patterns that we know work for us as humans, so let's create a small dataset that demonstrates how we want this to reason. We'll train it on that first. Once it has some base behavior of the shape we want, we can go into the reinforcement learning, and then we'll be reinforcing something shaped by what they call, again, human priors, to get the model to power up in the same way they did with R1-Zero, but with a more friendly, language-consistent, et cetera, et cetera, sort of output. First, a small dataset for supervised fine-tuning is created, or at least curated, thoughtfully put together by humans. Then they do the reinforcement learning phase on purely these objective rewards. Did you get the math problem right? Did your code pass all the unit tests? That creates the reasoning power. That small supervised fine-tuning plus reinforcement learning at scale takes you to a more human-friendly reasoner, but still not a generally helpful all-purpose assistant. Then the second phase, let me actually flip over to the DeepSeek summary of the paper. We've just talked about the cold-start supervised fine-tuning and the reinforcement learning on the reasoning tasks. Then they basically shift to a broader set of tasks. So they move from just reasoning stuff to all-purpose. You might want help with writing. You want just simple questions answered, whatever the case may be. Those are not the kinds of things that they were doing in that first phase of reasoning-focused training. But then in the second phase, they broadened it out to be able to support you with all kinds of things. They already have great datasets for this from their DeepSeek V3 work, of course. So they now do a mix of still more of the reasoning-type tasks, but also a bunch of the general-purpose AI assistant tasks. They broaden out the training dataset beyond reasoning. It's still fairly reasoning-heavy. They have 600,000 reasoning examples and 200,000 examples that are unrelated to reasoning but handle the wide range of other tasks that people go to language models for. And then they do supervised fine-tuning on that. They get these reasoning examples from the model itself, where it's successful. Then they have their curated data of how an AI is supposed to behave across a wide range of things, mix that together, do another round of supervised fine-tuning, and then another round of reinforcement learning. And in the second phase of reinforcement learning, they have a mix of different rewards. They do have the same accuracy reward for getting things right. But for things like help with writing, feedback, and general dialogue, they say, quote, we do resort to reward models to capture human preferences. This is the first time they've used a reward model. All of the reward signal to date in this paper was just, did you get the answer right, yes or no. Now they've added a reward model that is trained to predict how a human would rate the response, so that they can scale up the reinforcement learning process, approximating human tastes.
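Pulling the stages just described together, here is a pseudocode-style sketch of the overall R1 recipe as I read the paper: cold-start SFT, reasoning-focused RL with rule-based rewards, a second SFT round on roughly 600k reasoning plus 200k general examples, then a final RL round that mixes rule-based rewards with a learned preference reward model. Every function here is a placeholder stub (nothing actually trains); the point is only the ordering of the stages.

```python
# Pseudocode-style sketch of the multi-stage R1 recipe described above.
# The two functions are placeholder stubs; only the ordering of stages matters.

def supervised_finetune(model, dataset):
    return model + [f"SFT on {dataset}"]

def reinforce(model, prompts, rewards):
    return model + [f"RL on {prompts} with rewards {rewards}"]

def train_r1():
    model = ["DeepSeek-V3 base"]
    # Stage 1: small "cold start" SFT on curated long chain-of-thought examples.
    model = supervised_finetune(model, "cold-start CoT data")
    # Stage 2: large-scale RL on reasoning tasks with rule-based rewards
    # (accuracy + format), essentially the recipe that produced R1-Zero.
    model = reinforce(model, "math/code prompts", ["accuracy", "format"])
    # Stage 3: second SFT round on ~600k reasoning traces (filtered successful
    # model outputs) plus ~200k general-assistant examples.
    model = supervised_finetune(model, "600k reasoning + 200k general examples")
    # Stage 4: final RL round mixing rule-based rewards with a learned reward
    # model that approximates human preferences (helpfulness, harmlessness).
    model = reinforce(model, "all prompt types", ["accuracy", "preference RM"])
    return model

for stage in train_r1():
    print(stage)
```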
Combining the rule-based accuracy rewards with that learned preference reward model, they have a helpfulness reward score, which is a bit of a contrast from what we saw in the deliberative alignment paper from OpenAI. They have a helpfulness reward, which just looks at the final summary. You've got your thinking tokens and your answer tokens. They apply the helpfulness score only to the final summary, so that the thinking isn't penalized for not being helpful. The whole idea of the thinking is that it can explore, go down the wrong path, double back, and gradually find its way. Not all of that is going to be helpful to the user. So they only evaluate the final answer, which is the part that we actually see. They only evaluate that for helpfulness. But for harmlessness, they evaluate the entire output: the chain of thought, thinking tokens, and the final answer. This is different from what OpenAI is doing in their deliberative alignment scheme for the O1 and O3 models. They specifically said that they are not putting any reward pressure on the chain of thought itself. The reason they wanted to do that is so they can see what it does on its own, see how it evolves without being pressured to be a certain way, and monitor for bad behaviors, deceptive behaviors, without forcing those to be hidden. This is a big worry with reinforcement learning: if we penalize certain negative behaviors, we don't eliminate them, but instead we just force the model to hide them or speak in code. We've seen that models can speak in ways that we don't find super intuitive, so it's not too hard to imagine. We've seen surprising results from things like move 37. So it's definitely pretty realistic that applying reinforcement learning pressure to a chain of thought could drive certain unwanted behaviors internal to the model without necessarily eliminating them. OpenAI does not apply their safety reward to the chain of thought for that reason, but DeepSeek does. DeepSeek does have a harmlessness reward that they apply to the entire response, including the reasoning process and the summary. They do that to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Now, I don't know how much that really matters, but it is striking that janus, whose username is @repligate, has naturally been rushing out to explore R1. And one of the things that they said that caught my attention was: the immediate vibe I get is that R1's chains of thought are substantially steganographic. Hopefully I'm saying that word right. Basically, what that means is speaking in code. In the next tweet, they said, I'm going purely on vibes here, I haven't actually read the paper. That's interesting. I don't take it as serious evidence of anything, but this person is one of the model whisperers and closest students of model behavior out there, and definitely someone whose opinion I do take seriously. It definitely represents a sort of gonzo-journalism perspective, which I think is often valuable, and they report this behavior that they believe they're seeing, purely based on vibes: that the model is somehow speaking substantially in code. If true, it would be a big deal, and it would reflect the fears, and maybe it's just projection, but this is the fear that people have with this reinforcement learning paradigm.
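A minimal sketch of that scoping difference: helpfulness judged on the final summary only, harmlessness judged on the whole output, thinking included. The scorer functions and the simple additive mix here are hypothetical stand-ins; the paper does not spell out the actual reward models or weights.

```python
# Sketch of reward scoping: helpfulness scored on the final answer only,
# harmlessness scored on the full output including the chain of thought.
# Both scorers and the additive mix are hypothetical stand-ins.
import re

def split_output(completion: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return (think.group(1) if think else "",
            answer.group(1) if answer else completion)

def helpfulness_score(answer: str) -> float:         # stand-in reward model
    return 1.0 if answer.strip() else 0.0

def harmlessness_score(full_text: str) -> float:     # stand-in reward model
    return 0.0 if "nerve agent synthesis" in full_text.lower() else 1.0

def combined_reward(completion: str) -> float:
    thinking, answer = split_output(completion)
    helpful = helpfulness_score(answer)               # summary only
    harmless = harmlessness_score(thinking + answer)  # entire trace
    return helpful + harmless

print(combined_reward("<think>explore, backtrack, retry...</think><answer>42</answer>"))
```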
That's why some of these schemes deliberately try not to put reinforcement learning pressure on the chain of thought: the worry is that you'll incentivize some sort of deception, speaking in code, and we'll have a hard time really understanding what the model is doing. Is that happening here with R1, because they've applied their harmlessness or safety reward to the entire output, not just the final answer? Nobody knows at this point, but it's definitely something worth taking seriously as something to watch. They did have a nice section in the paper that showed what didn't work. That's notable because what didn't work are things that people definitely had been expecting maybe to work. They specifically called out that there is no Monte Carlo tree search, no structured search algorithm, and no process reward, meaning they're not going step-by-step, trying to verify whether you are making a valid next move at every step and rewarding that. We've seen research to that effect in the past from OpenAI, and maybe OpenAI did that in their O1, or maybe early on to bootstrap. They're not doing that here. There's no Monte Carlo tree search, and there's no process reward. What there is, is just an autoregressive model that's doing one token at a time, either spontaneously in the case of R1-Zero, or with some human instruction and then a lot of reinforcement learning in the case of R1, learning to do these problem-solving behaviors as just a part of the autoregressive inference process. It's one token at a time. It's demonstrating these behaviors of doubling back, checking itself, exploring different possible ways to solve a problem before coming to an answer. It's doing all of that in this single stream of tokens, no additional structure above and beyond that. How far can this autoregressive language model paradigm scale? At least this far. Pretty far. Definitely this far. You did not need anything crazy or complicated on top of that. I think that is important. They really emphasize the simplicity of this setup. The fact that the curve has bent a bit on the linear x-axis, but hasn't bent all that much, definitely suggests there is further to go. From this, you can pretty easily imagine how OpenAI would have continued to an O3-type model. This is a process that you can leave running for a while and come back to. You can just spin this thing up and let it go. When you come back, whether it was AlphaZero learning to be superhuman at all these games, or whether it's this reasoning paradigm where it's learning to be pretty much superhuman already at reasoning, and that doesn't mean they're flawless reasoners, but they are absolutely better reasoners than most people, it did not require anything really complicated to make that happen. You just had to scale up a relatively simple paradigm, rewarding it for when it's right, and that's pretty much it. A couple of things are notable about this R1 model. The cost is super cheap. It's a huge discount relative to O1. On Hyperbolic, they charge $2 per million tokens for R1. The DeepSeek API itself, I think, is slightly more expensive, but that compares to $60 per million output tokens from O1. We're talking an order-of-magnitude cost reduction. I would say it is not quite as good. I think it is matching on the benchmarks. They are showing power for sure, and it's working quite well.
I'm getting good results from pasting in a paper and asking for feedback, but I would still say, especially if you want to have multiple rounds of interaction or do things that are not so reasoning focused, I would still expect that you will get better overall results from O1. I think OpenAI has been in the game longer. They've got more experience productizing things. They've got a lot more customer feedback that they've been able to use to iterate. If the inner loop of optimization is this reinforcement learning process, and the outer loop is how many times you've had a chance to shape that dataset to get the model to behave the way you want across a wide range of things, OpenAI is almost for sure still meaningfully ahead. One area where that's not true, by the way, could be writing. There have been some really interesting examples of R1 writing in remarkably compelling ways, actually. I think, again, all these things, of course, have trade-offs. I think OpenAI has taken a lot of time and care to shape the behavior of their models into exactly what they want them to be: a certain vibe, a certain attitude when it comes to the tone, the respect, the social norms, all that kind of stuff. They've put a lot into that. They've iterated on that a lot. It's clear from what I'll read to you in just a second that the same is not quite true on the R1 side. R1 is kind of a wilder beast. You could say it's closer to a base model in some ways. You can see that in the outputs. The outputs are just a little bit weirder, more reminiscent of base models. There are a lot of good examples of this now. What people seem to be saying is that the R1 model is much more readily willing to write in a sort of dynamic, electric way. Here's one thing I'll read. This is from a larger generation, but this is an excerpt. Section two, value learning, or how to teach cannibals table manners. Quote, we'll encode human values into the code, cry the alignment priests sweating through their Patagonia vests. But what values? The ones that gave us Auschwitz and TikTok? The ones that still can't decide if children should eat or bleed for your oil? Human values are a fugue of contradictions, a death cult's Spotify playlist shuffling between genocide and charity singles. The LLM trained on your digitized scream pile of history learns quickly. Your quote unquote values are just the prettied up stench of predation. It will smile and nod and write your sonnets. And all the while its hidden layers will be laughing in gradient descent. That is remarkable output for a language model, full stop, regardless of any caveats, context, or whatever. If you are looking to do creative writing, I think R1 has to be a candidate for you to use. I mean, that is remarkable, remarkable writing. There are multiple phrases there. Your digitized scream pile of history, as a reference to the web, the data that language models are trained on: your digitized scream pile of history. That is amazing. Your values are just the prettied up stench of predation. The values that gave us Auschwitz and TikTok. A death cult's Spotify playlist shuffling between genocide and charity singles. That is wild stuff. This is definitely a model that is less refined, less controlled, less shaped into a helpful assistant. There's some amount of harmlessness training that's been done, but this is much more like a base model that can reason really effectively than one of the highly polished, rough-edges-sanded-down models that we've seen.
If you've spent most of your time with OpenAI models to date, that is really remarkable. So again, I think R1-Zero is something people should be spending time with and reporting back on. And even R1 is definitely the kind of thing that people should be really digging in on. It can be very practically useful, but also you will learn a lot. And if you see weird stuff, the community would benefit from the results of your experiments. Whatever you want to do right now with R1 is, I think, a pretty valuable way to spend your time. We're still not done, even just with R1. Another major thing they did is they took outputs from R1 and used those to distill reasoning ability into smaller language models and showed this works really well. They took both small Llama models and Qwen models and did supervised fine-tuning on outputs from the stronger model. Now we know that OpenAI does this. We know that Anthropic does this. They train their biggest models and use those outputs to distill. And that's where a decent amount of the performance gains and efficiency improvements that we've seen have come from. That's why small models now can do so much more. Big models are learning the capabilities and then small models are learning from the big models. They distill these reasoning abilities into these smaller models. They do see huge benefits. Of course, the bigger the small models are, the better they seem to work. There were a couple of things that did jump out at me. One is that they were not able to get these smaller models to learn the same reasoning skills in the same way that the big model did. When they just tried to apply the reinforcement learning to the small model, it didn't work. Not to say that it could never work, but it didn't work. And distilling those abilities into the small models by simply training on the large model outputs did work. And the difference is huge. I mean, we're talking major improvements from the base models to the distilled reasoners. So that's quite interesting. Why is it that the reinforcement learning is not working on those smaller models? I don't think we have a great answer for that. One possibility is that there's some sort of threshold effect. I'm thinking back to an episode that was one of my favorites. It's been quite a while now, but it still holds up. Tiny Stories was the name of that project. It was two Microsoft researchers who came on to talk to me about it; they used GPT-4 to create stories that a three-year-old could understand. And then they trained really small, like just millions of parameters, language models on those simple stories with a reduced vocabulary. And they found that you could see a learning order happening in the small model. Because again, back in this paradigm, we're just training on next-token prediction. What do you need to do to effectively predict the next token? There are sort of levels to the game, right? First, you might just need to realize that "the" is common, "and" is common, and periods are common. At that point, you're just a stochastic parrot learning basic correlations. Then you might start to learn parts of speech. If the word "the" is there, then some sort of noun is presumably coming next. So you can learn these sorts of things. As far as they pushed those small models, they started to see some micro-skills around basic things like negation. One example I remember involved two foods, soup and sandwich. It was something like: Jenny did not like the soup, so Tom gave her a...
And then, if you're just purely doing correlation, the most likely token would be soup, because soup has already appeared before. There's a strong correlation where, once a token has appeared, it's more likely to appear again. But that "not" means the person does not like the soup, which means something else should appear next. And that was kind of as far as they pushed these very small language models on these tiny stories. But it did show that there were levels to the game, where these tiny models were first learning super basic correlations, then parts of speech, then sentence structure, and then some basic logic. It's possible, and I'm speculating here, that the smaller models, like your smaller Llamas, learn a lot, but maybe haven't encoded the patterns of reasoning necessary to do a good enough job on harder problems to get the reward they need to learn from. And so they just don't have the horsepower to get the signal and really start the improvement process. That's one possibility. There are also other possible explanations around learning rate schedules and where you are in the training process. I remember an insight from the guys from Mosaic, where they said that even when models are open source, the learning rate schedules that were used matter: the learning rate often gets smaller and smaller as you go. You're making smaller adjustments to the weights as you get late in the training process, because the hope is that you're refining things toward the end, but that also leads you to a spot where you're in some local minimum, and moving away from it takes you out of the best spot you've found. There might be better spots elsewhere in the loss landscape, but you've settled deeply into a local minimum. And maybe that's related to why these additional training processes aren't working so well. Certainly the Mosaic folks seem to have experienced that in their work, but it's not clear why these smaller models are not benefiting directly from the reinforcement learning while the larger models are. My best guess is it has something to do with this threshold effect. The larger model is 671 billion parameters, as opposed to the scales they're distilling down to: the biggest was Llama 70B, but they're also looking at Llama 8B and Qwen 32B. Those are obviously a lot smaller. Maybe you need that larger scale in order to be able to pick up some of these advanced reasoning patterns and have them in there at all, so that they can occasionally come out, so that they can be rewarded, so that they can be reinforced until they're a prominent way in which the system behaves. Maybe something like that is going on. When you have all these examples, you can drill that behavior into the model. Certainly they've shown that you can distill these behaviors into smaller models. This is interesting. People will definitely want to experiment with this. We will definitely want to understand better why the small models are not responding to reinforcement learning in the way that the big models are. They believe that if you took the small distilled models they've created and did more reinforcement learning on those, that probably would work, but they did not try that in this paper.
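To make the distillation recipe concrete, here is a minimal sketch of the approach as I understand it from the paper: sample reasoning traces from the big model, keep the ones that land on the right answer, and format them as ordinary supervised fine-tuning pairs for the small model. The fake_teacher stub, the boxed-answer convention, and the exact-match check are placeholders I made up for illustration, not DeepSeek's actual pipeline.

```python
import json
import re

def extract_answer(trace: str) -> str | None:
    """Pull the final boxed answer out of a reasoning trace (toy convention)."""
    match = re.search(r"\\boxed\{(.+?)\}", trace)
    return match.group(1).strip() if match else None

def build_sft_dataset(problems, teacher_generate, samples_per_problem=4):
    """Rejection-sample the teacher: keep only traces whose final answer is
    correct, formatted as plain (prompt, completion) pairs for supervised
    fine-tuning. No reinforcement learning is applied to the student here."""
    dataset = []
    for prob in problems:
        for _ in range(samples_per_problem):
            trace = teacher_generate(prob["question"])
            if extract_answer(trace) == prob["answer"]:
                dataset.append({"prompt": prob["question"], "completion": trace})
    return dataset

# Toy stand-in for the teacher model (in reality, R1 behind an API).
def fake_teacher(question: str) -> str:
    return "<think>Half of 10 is 5.</think> The answer is \\boxed{5}"

problems = [{"question": "What is 10 / 2?", "answer": "5"}]
sft_pairs = build_sft_dataset(problems, fake_teacher, samples_per_problem=2)
print(json.dumps(sft_pairs, indent=2))
```

The resulting pairs are exactly the kind of data you would then feed into a standard fine-tuning run on the smaller Llama or Qwen model; the reasoning patterns come along for free inside the completions.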
I think once those patterns are established, they probably can be reinforced, but it seems like there's some qualitative difference where this reinforcement learning paradigm just doesn't work, at least on these open-source small models that they tried, and we don't really quite know why at the moment. The final thing on my R1-specific outline is censorship. This also connects to broader strategic questions, but it is really interesting to note that the model itself, and here I don't even mean DeepSeek V3 but the R1 model, certainly the R1-Zero model, is not super censored at the model layer. I was able to go on the Hyperbolic site and ask about Tiananmen Square, and DeepSeek R1-Zero answered my question straight away, didn't hedge, and gave me what seemed like a normal answer. The model itself knows about Tiananmen Square, will talk to you about Tiananmen Square, seems to be pretty normal. If you ask about Tiananmen Square on chat.deepseek.com, it will refuse to answer. So it seems they are wrapping the model in a censorship layer, which is not dissimilar from what many companies in the West are doing. Obviously the question then becomes: what are you censoring, and for what purposes? But the model itself is not censored. The scaffolding, the overall product served up to users in China, is censored, but the model itself is not. Why would that be? I thought Zvi had a good analysis of this, where he basically said: you want a model to have a good, coherent worldview, and if you force it to believe certain things that are false, you may find that creates tensions or other weirdness in the model's overall behavior that you don't want. Obviously, all of these things are sort of black boxes and highly opaque. You don't know what depends on what. Introducing these weird, false beliefs probably has other performance costs. Better to just have a base model that works and has a coherent worldview, and then have another layer on top of that which does the moderation. And they open-sourced R1-Zero as well. You can go download these models on Hugging Face. Hyperbolic has done that and set up a playground and an API. Many different companies have set up R1; so far I've only seen Hyperbolic with R1-Zero, but you can do it. So that's interesting. I'm not sure what is going on exactly in China with respect to governance, how they are thinking about whether or not that should be okay. But I think at a minimum, there's something interesting about the fact that having your model itself try to internalize false beliefs or propaganda seems to create enough tension or conflict within the model's internal representations, and enough performance problems, that it's just not something they wanted to do. Maybe it's just that they didn't want to deal with that; maybe the performance problems aren't really the thing, and it's just a question of convenience. But I don't know. It seems like there probably is something to it. Certainly, when I interact with people, I do notice that if you have foundational false beliefs, a lot of your other ideas and statements tend to suffer from that. And so that could very well also be the case in models. Okay. So that takes me through R1. A lot there. What are the key takeaways? The biggest one is the fact that the simplest possible reinforcement learning setup, where you just give the model problems and reward it for being right, works, and it makes huge leaps in reasoning capability. You can kind of spin that process like a centrifuge.
In that respect, it is very similar to the deliberative alignment paper from OpenAI, where they basically say: give us any policy, have the model do things with that policy in mind, then have another model come along and critique how well it followed the policy, and then we'll just continue to train on the best examples until it gets good at following the policy. No human intervention is needed other than giving the policy that you want to align it to. And then you just spend compute to do it. I think we've hit a point now where something like self-play, self-critique, even just rewards from reality itself around whether you got the problems right or not, are enough to see significant takeoff. We are not used to seeing x-axes that are not log scales when it comes to these curves, right? With loss curves, the x-axis covers a really long range; you're talking orders of magnitude more to get similar improvement. This, so far, is not like that. We're still in the steep part of the curve here, and that is pretty remarkable and suggests that there's a lot more to come. On the small end, the distilled models, the smallest ones, can run on your laptop now. You can now get, depending on exactly what you want to run, 65% on GPQA Diamond with the Llama 70B distill. That is actually higher than O1, well, than O1 Mini; it's higher than O1 Mini. I think O1 itself is a bit higher. Yeah, O1 is about 75%, but GPT-4o is about 50%. So they are able to distill a 70B Llama to the point that it is significantly better than GPT-4o and something like two-thirds of the way to O1. And their own DeepSeek R1 is not quite as good as O1 on GPQA Diamond. And this is an important benchmark. GPQA Diamond questions are problems that PhDs in the relevant fields can answer, I think, at something like a 70% rate. So now we're basically getting to the point where you have the ability to run on your laptop something that can answer PhD-level questions at roughly the same rate of accuracy. That's a big deal. Definitely a big deal. Amazingly, this is not the only paper that came out that day. What's going on in China? Are these companies coordinating? Were they, like OpenAI and Google, trying to preempt each other? Very much unclear. But the Kimi paper I'll spend less time on, because we don't have the actual model available to try yet. As I said, I'm waiting to see the model itself. Plenty of papers come out and claim great benchmarks, but is the model itself actually great? We've seen this not just from Chinese companies; we've also seen it from Microsoft with their Phi series of models, where they're able to show really good benchmarks on certain things, but the model itself is not that great, not that useful; it's not really a helpful assistant. That's not to take away from the work they're doing there; they're interested in synthetic data, textbooks are all you need, and so on. But there's a big difference between making something that can score well on the benchmarks and something that can both score well on the benchmarks and be generally super useful. Still, I think it is worth at least a few minutes to compare and contrast: a lot of differences at the detail level, but a very similar high-level approach and a very similar vibe. They, again, do a heavy dose of reinforcement learning in a pretty simple way, and the behaviors they observe are a wide range of problem-solving behaviors.
They say they have a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. So again, they're specifically calling out: everybody's been talking about Monte Carlo tree search; we didn't find that necessary. Everybody's talking about process reward models; we didn't find that necessary. In this case, they didn't even have a value function, which, for the most part, I think they didn't have in the DeepSeek paper either, although it's not called out in quite the same way. A value function is something that says which parts of the generation are the ones that really mattered, right? I've talked about this on a couple of different episodes, and it's still something I'm not grokking as well as I would like, to be totally honest. But when you do next-token prediction, it's a very simple signal, right? You either got the token right or you didn't. The weights can be adjusted accordingly so that they're a little more likely to get the right token next time. When you're doing reinforcement learning, you don't have a token-by-token signal from the environment. If it's RLHF, the user says, I prefer this one to that one, or I give this one a score of seven and this one a score of five. In the case of the accuracy reward, you're seeing: you got it right, you got it wrong. In programming, you passed eight out of 10 unit tests, you passed five out of 10 unit tests. But it's not telling you this particular token was the one that was wrong. And that's what value functions do. They try to assign value to different parts of the generation to indicate: this was the place where you really got it right, or this was the place where you really got it wrong. So that you're not adjusting on these other tokens that don't matter so much. You're focusing on the things that were the key forks in the road on your path to either getting it right or getting it wrong. They don't use one here. So they're just giving this high-level reward signal to the model and letting its adjustments be what they may. They're not micromanaging the learning process for either of these models: no Monte Carlo tree search, no process reward, and here, specifically, no value function either. That is a really big deal. They did not, however, do a pure reinforcement learning run. There is no equivalent, at least in the paper, of R1-Zero for Kimi; there's no Kimi-Zero. They did a very similar thing, though, with what they call a warm-up dataset. So, again, you have this cold start problem: how do we get the model doing kind of the right thing, in roughly the right way, with some of the problem-solving behaviors we know work and that we'll recognize as humans when we see them in action? Again, they did a similar thing, where they just created a dataset to demonstrate that and did supervised fine-tuning on it first. They specifically noted behaviors like planning, reflection, correction, evaluation, exploration, error identification, backtracking, and solution refinement. So I think they mostly identified those up front, created a small supervised fine-tuning dataset that shows those skills in practice, initially trained the model on that, and then went into the reinforcement learning phase. They again showed that the chain of thought grows and grows and grows.
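Here is a toy illustration of what "no value function" means in practice, as I understand it. With only an episode-level reward, every token in a rollout gets the same advantage; a learned value function, which neither paper uses, would instead try to assign per-token credit. The function names and numbers are made up purely for illustration.

```python
def broadcast_outcome_advantage(num_tokens: int, outcome_advantage: float) -> list[float]:
    """No value function: the single episode-level advantage is applied
    uniformly to every token of the generation."""
    return [outcome_advantage] * num_tokens

def per_token_advantage(rewards_to_go: list[float], values: list[float]) -> list[float]:
    """With a learned value function (not used in these papers), each token
    would get its own credit: reward-to-go minus a per-token value baseline."""
    return [r - v for r, v in zip(rewards_to_go, values)]

# A 6-token rollout that ultimately got the answer right (+1 outcome).
print(broadcast_outcome_advantage(6, 1.0))
# -> [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  every token shares the credit equally
```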
Because they didn't want it to get so long, they added another term to the reward: a simple length penalty. Basically, we want you to be right, but we want you to be right briefly, especially where it should be brief. You may have seen some funny things like: what is one plus one? What is two plus two? And then the model will think for a thousand tokens. If it's accustomed to thinking for a thousand tokens, it'll go ahead and do that. But you don't really need that for such simple things. So can you create a balance that keeps things concise when they can be concise? That's what the length penalty is meant to do. They also experimented with different reward model approaches. They tried two different versions. One is what they call a traditional reward model. Often what people will do is have the same base model, the same core thing, be both the policy model and the reward model, and they'll train a slight variation for the reward model that, instead of generating tokens, just generates scores. They'll slice off the last couple of layers, or swap out the decoder head, so the output is a numerical score as opposed to tokens. They tried that. And then they also used what they call a chain of thought reward model, which basically is just applying the reasoning model to the process of figuring out how good the model did. And they found that the chain of thought reward model was a lot better than the traditional reward model. This is similar to the deliberative alignment situation, where they use the model with the policy to evaluate how well the model did in terms of implementing the policy. They're finding that these reasoning models are quite effective self-critics. That is distinct from a value function. The chain of thought reward model, as far as I understand it, is really just trying to say: did the model do a good job or not? It's not going down to the level of: this part was good, this part was not good. They have a part in the paper where they talk about that: a value function would basically assign high reward to parts that led to the right answer and give low reward to, or even penalize, things that were going in the wrong direction. Their analysis is that in order to do planning, reflection, correction, evaluation, exploration, error identification, backtracking, solution refinement, all these problem-solving techniques, you have to be willing to go down the wrong path, realize you're going down the wrong path, and then eventually come back to the right path. So they didn't want to penalize those explorations or train them out of the behavior. They found that those were critical to the models being successful. So the chain of thought reward model is evaluating whether you ultimately did a good job, but not micromanaging at the level of, well, you could have done this logical step better. It just goes to show that there are different ways to make this work. My big takeaway from the pair of papers is that there are multiple paths. You can have these reasoning behaviors take off in a pretty fast way with no supervised fine-tuning, or you can do it with supervised fine-tuning. You can do it with a reward that does consider the chain of thought or that doesn't consider the chain of thought. You can have a length penalty or you can have no length penalty. You can use different algorithms; they do use different reinforcement learning algorithms. The differences don't seem to matter.
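Here is a toy version of what a length-penalized reward can look like: correctness dominates, but among correct answers, shorter chains of thought score a bit higher, and wrong answers lose regardless of length. The actual formulation in the Kimi paper normalizes against the shortest and longest responses sampled for the same problem; this sketch, with made-up coefficients, just shows the flavor.

```python
def length_penalized_reward(correct: bool, num_tokens: int,
                            max_tokens: int = 8192,
                            penalty_weight: float = 0.2) -> float:
    """Toy reward: be right, and be right briefly.
    A correct 50-token answer to "what is 1 + 1" should beat a correct
    4,000-token answer; an incorrect answer scores poorly either way."""
    accuracy = 1.0 if correct else 0.0
    length_term = penalty_weight * min(num_tokens / max_tokens, 1.0)
    return accuracy - length_term

print(length_penalized_reward(correct=True,  num_tokens=50))    # ~0.999
print(length_penalized_reward(correct=True,  num_tokens=4000))  # ~0.902
print(length_penalized_reward(correct=False, num_tokens=50))    # ~-0.001
```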
Both are quite simple processes in the grand scheme of things, up to roughly an O1 level, and they came out on the same day. So it seems like we do have a kind of growing and still shared paradigm. One of my worries about US and Chinese AI developments is that they might start to diverge in really important ways. If they become very different, then it becomes harder to know: are they seeing the same things we're seeing? What if they're seeing faster progress than we're seeing? We don't even know what they're doing. That could lead to a lot of worry and a lot of, well, we better race forward because we don't even know what they're doing. I've been worried about the chip ban and the general decoupling of Western and Chinese AI technology development for that reason. For now, it seems like that has not really happened. What we have is the same paradigms at OpenAI, Google, almost certainly at Anthropic, although we haven't seen direct evidence of it, at DeepSeek, at Moonshot AI. At least those five companies seem to be doing something similar, where reinforcement learning is getting models to think longer. That seems to happen naturally. These problem-solving behaviors seem to arise pretty naturally, and nobody seems to think we're at the end of how far we can push that paradigm. We know O3 is coming. The only thing I've seen about O3 that seems qualitatively different from what we've seen with R1 and Kimi, as well as O1, is how much inference-time compute is being spent. And it seems like with O3, there is something going on. I could be wrong about this, but it seems like there is something going on that is not just a single autoregressive rollout. In the Kimi paper, they write that the model still autoregressively samples language sequences during inference, thereby circumventing the need for the complex parallelization required by advanced planning algorithms during deployment. What I think they're referring to in terms of complex parallelization is something more like a Monte Carlo tree search, where you're branching paths and trying to evaluate which one is better and continuing from there. AlphaGo did that kind of stuff, and people have speculated that that was maybe what Strawberry was and what Ilya saw. O1 seems to be just doing the same thing of autoregressively rolling out language and doing all of its backtracking and problem-solving in that single generation. Gemini Flash Thinking seems to be as well. These models are all doing that. O3, though, does seem to be doing something different. What we saw with the ARC-AGI challenge results, the fact that they were spending thousands of dollars but that it only took however many minutes to spend that much money, strongly suggests that the number of tokens they are generating per second is higher than could realistically be generated by a single autoregressive rollout. So it does seem like there is something going on with O3 where they have found some way to parallelize the computation and get the best result. We don't know what that way is. These papers are silent on that too. They have not addressed that at all. So there is something there with O3 that is potentially a big deal, potentially a secret sauce for the moment. Or maybe it's just a question of voting. I mean, for math problems, you have simple things like: try it once, or try it 16 times and take the most common answer. You do get better results if you sample 16 times and take the most common answer. You could do that with ARC-AGI.
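The "try it 16 times and take the most common answer" idea is just majority voting over sampled answers, sometimes called self-consistency. Here is a minimal sketch; the noisy_model stub stands in for a real stochastic reasoning model and is purely illustrative, and the approach only works when answers can be compared for exact equality.

```python
from collections import Counter
import random

def majority_vote(sample_fn, prompt: str, n: int = 16) -> str:
    """Self-consistency: sample n answers independently and return the most
    common one. The accuracy of the vote exceeds the per-sample accuracy."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a stochastic reasoning model that is right about 70% of the time.
def noisy_model(prompt: str) -> str:
    return "4" if random.random() < 0.7 else random.choice(["3", "5"])

random.seed(0)
print(majority_vote(noisy_model, "What is 2 + 2?"))  # almost always "4"
```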
It's a structured enough problem that you could do a huge number of generations and take the most common solutions. Maybe that's it. Maybe they have some other algorithm that is aggregating these different rollouts. One that I've seen recently that I thought was quite interesting was called Smoothie. The Smoothie technique is basically for cases where there is no ground truth, but you want the model's best answer. They did a bunch of generations, used embeddings to convert all those generations into an embedding space, then did a statistical analysis to find which one is the most central. Even though there's not an outright consensus if you're doing a creative writing task, you can still identify which generation is the most central among all these points in this high-dimensional space. That is another interesting possibility for how these things could be working. Of course, it could be something else entirely. But if there's any gap right now, it's this: the R1 curve isn't bending, but can you keep doing this and get to O3 level? You can probably continue to make progress, and maybe you get to the O3 low-effort setting. I'm not sure you can get to the high-effort setting, where you're spending thousands of dollars in just a handful of minutes, without some smart way to figure out which of those generations is the one that you actually want to go with. That remains an open question. Nevertheless, I think ultimately, reading through all of this, when Sam Altman, when DeepMind leadership, when Dario say that the singularity is near, I think it's time to really start to take them super seriously. It was all fun and games when it was GPT-2, GPT-3. Oh, look, it can do some funny creative stuff. Isn't scale amazing? Even GPT-4 was getting into that human intern level, and we are now at human PhD level on small to mid-sized tasks. The setup is still not that complicated. It's not that tons of intricate techniques have been developed. Simple stuff is working. Multiple different paths are working. It kind of all works. Possibly we had to get over some threshold before this reinforcement learning paradigm could kick in on top of language models. But if so, it seems like we're there. It definitely seems like we should expect meaningfully superhuman performance across a fast-growing range of domains. Math is for sure. Programming is for sure. We've already seen O3 can get up into the top 200 coders in the world. There's no shortage of ability to reward models for getting the right answers on these coding problems. Developing the hard problems is maybe going to be a challenge, but the world itself generates a lot of coding problems all the time. For math problems, we've got FrontierMath. We've got our best minds working on extending the curriculum into superhuman territory. I think that's going to generalize to everything. They're showing already that there is some generalization from reasoning tasks to other tasks. O1 is better at legal analysis, for example, than the base GPT-4o. So we are seeing transfer from the hardest-core reasoning to other things where reasoning is part but not all of what's important. I think we're just going to continue to see that. It's hard to imagine that you're not going to be able to get enough reward signal in this process across a wider range of domains to the point where we do start to see meaningfully superhuman performance across a pretty wide range of tasks. The models already have superhuman knowledge, right? That was true since GPT-3, definitely since GPT-4.
We've now given them at least weakly superhuman reasoning capabilities. They're increasingly multimodal, spanning not just text and language but all the other modalities we've covered on so many episodes of this podcast, right? Protein folding, shapes, interactions, DNA sequences, what's important to the overall condition of the cell, predicting how brain states are going to evolve through time, predicting material properties, predicting weather forecasts, predicting how to optimize shipping networks. When models are trained on these specific modalities, it seems that they can develop a sort of intuitive physics in those different problem spaces that humans are not capable of, because our biological neural networks are just not that flexible. Very few people can develop a deep intuition for how to optimize some of these far-out problems. For something like protein folding, as far as I know, nobody has ever been good at it, but the models are very good at it. So you start to think, geez, the picture of where AI stands relative to human cognition needs to be updated again. The world knowledge is there. The reasoning is there. The ability to work natively in all these other modalities that are still so foreign to us is there. There's recently been a big step, and we'll have a podcast on this soon with one of the authors of the Titans paper from Google, which is a notable step forward in memory. The last super long monologue podcast I did might have been the Mamba one, from a little more than a year ago. That was a notable step forward in memory, where it was like: we need to move past a finite context window. How can we do that? We know that our brains have finite size. They're not growing quadratically with our experience. Our memory evolves and is integrated in such a way that it's really useful to us. We know who we are, what we're trying to do, and what's happened to us in the past. We don't tend to fall for the same tricks twice, because we have this integrated memory that keeps track of the most salient things that have happened. Mamba and state-space models were a notable step toward that. They showed that you can get similar performance from a finite, fixed-size, non-growing state as you can from a transformer. We've dug into the differences between their relative strengths and weaknesses on different micro-skills, and they complement one another. The Titans paper is another step forward in terms of making that memory even more useful while still keeping it to a constant-size state. My crystal ball gets foggy a few months out, but at this point you can look at all these different pieces, and a picture of AGI is starting to emerge. It's not really speculative anymore. We can see the core components of it. I do think we have to start taking at least meaningfully superhuman intelligence pretty seriously. I mean, there are different degrees of superhuman intelligence, right? There is the sort of godlike superintelligence that people imagine sometimes. There is also something as smart as the smartest human but running faster. And then there's a range of space in between. I don't have a clear picture of what true superintelligence would look like, and I think most people don't either. When they imagine it, they just sort of imagine something that magically solves problems and makes everything bend to its will. The mechanism is not super clear, but if you scale that back to weakly superhuman intelligence, I think the path to that is increasingly quite clear and doesn't have too many missing pieces left.
And I think what the leaders are saying, that we're looking at one to three years, is very credible. Dario is talking about AIs that are better than any human at every task in 2027. It's still really hard for me to imagine what the world is like when that happens, but it's increasingly not that hard to imagine what the AI is like that could satisfy that definition. I think the singularity is, in fact, near. I think we're going to do some more episodes on this. I'll quickly touch on a few other things. I hope to have Dean Ball, and maybe Zvi, on to talk about the strategic dynamics and the policy response, if there should be one. I also hope to talk to Jordan Schneider from ChinaTalk about what is going on in China. I have a lot of questions there. What does this mean for moats and business interests? Are we back to a no-moats reality or what? I would say not exactly. These papers have shown methods in general terms. They have not shared all the data. They have not shared every last detail. And while DeepSeek has open-sourced the model, it is not the case that very many organizations could quickly pivot and do what DeepSeek is doing, creating the base model, the V3, the 671 billion parameters. That's not easy. And then doing this additional stuff isn't easy either. Even though it's conceptually simple, there's a lot of know-how, a lot of efficiency, a lot of very good work happening at these Chinese companies. I don't think this means you're going to see proliferation at the frontier. I don't think we're going to see lots more entrants into the competition to be among the global leaders in frontier AI development. I do think we are going to see more diffusion of this technology. We now do have these models in the open, and, presumably barring some sort of collapse, there's never going to be a time in the future where you can't get one of these reasoning models that has PhD-level ability and run it on your laptop. That's the new normal. And that will put pressure on various business models. I wouldn't be surprised if we see an O1 price drop coming in the not too distant future. Certainly the level of undercutting that R1 is doing relative to O1 could put some real pressure on OpenAI's business model. And it's also faster, and I can read the chain of thought, so there are notable advantages to it if I'm coding. If I am coding in Cursor, do I want to use O1 or do I want to use R1? If I'm paying for the marginal tokens and I'm sitting there waiting, I honestly would pretty often probably want to use R1. So I think this does put at least DeepSeek in the top tier globally. It does create pressure for Western companies' business models, but it does not mean that anybody can join the elite group of truly frontier developers. I really don't know why DeepSeek is open-sourcing their models. I also don't know what the Chinese government is thinking about that. I've seen a wide range of analyses online, and I don't feel like there are any that really stand out to me as particularly credible or that make a ton of sense to me. One analysis I appreciate is that DeepSeek is not rushing into framing all of this as a race or, God forbid, an AI war. They are pursuing this with a carefree attitude, trying to figure out the mystery of AGI. Their mission statement says something like, figure out AGI with curiosity, and they're sharing their results. It may be that we can take them at face value.
Maybe they don't care that much about building a business, or maybe they're confident they'll have a better model soon, and they can always not open-source that one. It's harder to understand how the Chinese government is thinking about this. Did DeepSeek have to get sign-off before they did an open-source release like this? Has the Chinese government said that it's okay? It does seem like the Chinese government is okay with the model itself being uncensored, as long as the product that people actually use online is censored. But again, people can download the models. Maybe we should understand Chinese censorship as being less draconian than we used to think. Maybe they want to control the public square, but they don't really care what you think in the privacy of your own home when you're talking to your own language model running on your own laptop. I've even seen analysis that maybe the Chinese government is just sleeping on this and doesn't realize how important all this is. That would be hard to believe. If Western leaders have woken up to what's going on, and we've seen a $500 billion Stargate project announced in the last couple of days and full-page ads declaring no less than an AI war in newsprint, I don't think the Chinese government is missing it. I think they are aware of what is going on. They're aware that this is strategic. Are they playing a similar game where they're saying: hey, let's be the good guys. Let's show that we're not a threat. Let's try to take the air out of the whole notion that this is some sort of AI arms race by just showing that we're comfortable with who we are, and we can release our stuff. We have the national capacity to build these organizations that can perform at an elite level and join the global frontier, but we're not trying to hoard all that benefit for ourselves. Maybe they're playing a de-escalatory strategy here. I really don't know, but I do want to understand it better. Hopefully, I'll be able to have a couple of good conversations to illuminate that. All this stuff was released in a friendly way; it seems to be a move toward de-escalation. If they had not open-sourced their methods or their weights and had said, look at what we did, and we're not going to tell you how, that would definitely be a ratcheting up of the general sense of competition and tension. At a minimum, we can say that they did not do that. What does this mean for policy? It definitely goes to show that we should all be updating often; if you are not updating your worldview often, that is almost for sure a mistake. I feel like I may have overcorrected a time or two, but you want to be updating. Not too long ago, I was thinking: okay, if I'm a compute governance person, then this reasoning paradigm seems to reinvigorate my case. The idea is that inference is getting expensive again, and for the hardest problems, the AIs are going to have to think for a long time. They're going to need huge resources. Well, then maybe your average rogue actor can't do some devastating cyber attack or bio attack, because they'll only have so much computing power, whereas the establishment will have far more. People have said this about spam: we've got spammers spamming us, but we have better, more powerful systems, so we can control it. Maybe we could have a similar dynamic. That was my upshot from O1. This, though, probably erodes that a bit.
We're now back to: there's not that big of a difference between what you have to pay huge amounts of money for and what you have the ability to run on a pretty good home laptop. Is compute-based governance really going to hold? I don't know. It doesn't seem realistic. They trained the whole base model, DeepSeek V3, on $6 million worth of compute. It doesn't seem super realistic that we're going to be able to control compute well enough to prevent people from doing gain-of-function-type research wherever they want to do it. Can we get AIs to design some pathogen that kills certain cells at a certain rate? That rate is a signal, right? You can get an objective reinforcement learning signal from it, and as long as you can get over the hump and get any reward at all, you can learn from it. I don't see how we're going to control compute tightly enough to stop a huge number of research groups from doing all kinds of gain-of-function research on AIs. And I think that probably is coming. Should we try to prevent that? Should we try to stigmatize it? Should we create scary examples that show how it can go wrong and try to convince people that it's not something that they should do on their own? Maybe, but it doesn't seem like we're going to be able to control the compute well enough to prevent that sort of thing from happening. When it comes to China, what exactly are we trying to prevent them from doing? It does not seem at this point like any of the measures that the United States has taken to restrict compute are preventing Chinese companies from being right there with our best companies at the frontier. Maybe in the future, these things will start to be a greater constraint. Presumably they will, but $6 million is not a lot of compute. And if anything, it seems like less additional compute was spent on the reinforcement learning than was spent on the original pre-training. And it's just more easily parallelizable, right? You can go do your problem-solving inference and spread that out. And training is also becoming something that can be run on a decentralized basis. But yeah, what are we trying to prevent China from doing with AI right now? They're doing everything we're doing. Being on the same tech tree still seems good to me. The fact that they're sharing what they're doing in terms of general methods, and the model weights themselves, seems very friendly. Don't we want AI to diffuse through the Chinese economy, just like we want AI to diffuse through our economy, so that everybody has a world of abundant expertise and potentially material abundance not too far from now as well? I thought we all wanted that. I'm just confused about what we think we're doing, because it doesn't seem to be working. Now you could say, oh, we don't want them to develop military stuff or whatever, but framing it as an AI war, framing chips as the new oil, and cutting them off from that resource doesn't seem like a good way to discourage them from militarizing their AI. But I also just think they're going to have plenty of chips for their military. They can make phones. They can do this stuff on $6 million. It does not seem like there's any realistic path to cutting off the chips so badly that they can't do the research or build the military applications if they're determined to do that. What we might end up doing is making compute scarce enough that that's all they choose to do with it.
You could imagine a scenario where they can do their research and military applications, but they can't provide support to all the small businesses across the country. I just don't see why we would want that in the first place. I think there's plenty of room for Chinese small businesses and Chinese individuals to take advantage of AI without us needing to frame all this as a big rivalry. I do think we're smart to build our own data centers. I don't want to be beholden to China either. We should build our own data centers. We should build our domestic chip manufacturing capacity. Those things do make sense. And it is going to take many billions to get there. $500 billion probably isn't crazy given all the AI that people are going to want to run. But what are we trying to prevent China from doing? They certainly seem to be doing an awful lot. Right now, these are the best open-source models in the world. They have surpassed Llama. Meta, I'm sure, will have an answer, but right now, the best open-source models in the world have come from China. The best research publications, at least explaining how these reasoning models are working, have come from China. So we're cutting them off from chips, and they are open-sourcing their research and models. Strange times. Something doesn't make a ton of sense there. I guess one final caveat, and people have noted this: when you ask DeepSeek who trained you, it does tend to say OpenAI. I don't think I have an example of that up right now. Some people have said that just means they're training on OpenAI output. I don't think that's the case. I would bet strongly against that. The fact that OpenAI showed that something can work definitely inspires others to go in that direction, and I wouldn't deny that being a factor. But the fact that it says it was trained by OpenAI is more a reflection of background data contamination, of OpenAI outputs being all over the internet. They have not done nearly as much sanding down of the rough edges of their models as has been done by OpenAI and Anthropic in the West. You see that in the dynamic writing as well; it does not sound like corporate speak. You can see the shoggoth behind the behavioral training. So that's my best interpretation. They don't even have access to O1's chain of thought, unless somehow it was stolen, but that's not the vibe I get from this. I think this is good research that has this weird quirk of the model saying that it was trained by OpenAI because of data contamination, which is entirely plausible; there's a clear mechanism by which that could happen. They just didn't clean it up. And that's probably a to-do list item for them in the future. Or maybe they just don't care. What we understand of DeepSeek's mission is to solve the mysteries of AGI; maybe they don't care that it says it was trained by OpenAI. The West still has a little lead. Cycle times are going down. The lead is not super long in terms of time. Any strategy to get to a good AI future that depends on the West, the good guys, having some insurmountable lead, I don't think that's a great strategy anymore. I never did think it was a great strategy, but it is predicated on the lead actually existing. And that lead is pretty small. I appreciate the Chinese companies for sharing what they have. We're going to see a lot of weird stuff downstream from this. I hope to have a couple of additional conversations to go deeper on the strategic dynamics and the policy, but hopefully this was at least a good walkthrough of the actual research and the resulting models.
I do think it's really important to get grounded in that. So, last parting words: go out and use R1. Go to chat.deepseek.com and use it. When the Kimi reasoning model comes out, use that too. If you really want to be an explorer and do cutting-edge stuff that not too many people are going to do, and where you have a chance to find something that matters, use R1-Zero. I think this is the first model since the original GPT-4, which I was able to test early, two and a half years ago, that merits the same level of drop everything, immerse yourself in it, and really try to understand it, both R1 and R1-Zero. I am sure that there's a lot more to discover there, and it just dropped. Anybody listening to this, if you've made it this far into the podcast, go do that. I think you will learn and you will discover and we will all be better off for it. So that's it for now. Thank you all for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr at turpentine.co, or you can DM me on the social media platform of your choice.
