Speaker 1: In recent years, the field of Artificial Intelligence has experienced rapid advancements, with large language models paving the way towards Artificial General Intelligence. One remarkable recent model is OpenAI's o1, which introduced innovative inference-time scaling techniques to enhance reasoning capabilities, achieving groundbreaking results on complex reasoning tasks. However, the o1 model is closed source. Today, we dive into a groundbreaking research paper by DeepSeek, titled "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". This paper introduces a state-of-the-art, open-source reasoning model and provides a detailed recipe for training such models using large-scale reinforcement learning.

Before we dive in, let's do a quick recap of the training process for large language models. Today, large language models undergo three main stages of training. In the first stage, LLMs are pre-trained on a vast amount of text and code to learn general-purpose knowledge. This step helps the model become proficient at predicting the next token in a sequence. For instance, given an input like "write a bedtime", the model would be able to complete it with a reasonable word, such as "story". However, after the pre-training stage, the model still struggles to follow human instructions. To address this, we have the supervised fine-tuning stage, where the model is fine-tuned on an instruction dataset. Each sample in the dataset consists of an instruction-response pair, where the response is used as the label. After this step, the model becomes good at following instructions. In practice, large language models continue to be improved in a third stage, using feedback. A powerful method for this is Reinforcement Learning from Human Feedback, or RLHF for short, where the model is trained on human feedback. Gathering large-scale, high-quality human feedback, especially for complex tasks, is challenging. Therefore, another common approach is Reinforcement Learning from AI Feedback, or RLAIF for short, where an AI model provides the feedback. For reinforcement learning from AI feedback to work well, a highly capable model is needed to provide accurate feedback.

The paper we are reviewing today eliminates, or partially eliminates, the supervised fine-tuning stage. Specifically, to train DeepSeek-R1-Zero, the first of the two models presented in the paper, we start with a pre-trained model called DeepSeek-V3-Base, which has 671 billion parameters. The supervised fine-tuning stage is completely omitted. Additionally, to be able to run reinforcement learning at large scale, the paper does not use the standard reinforcement learning from human feedback or AI feedback, but rather rule-based reinforcement learning.

Let's expand on how the rule-based reinforcement learning works. The method is called Group Relative Policy Optimization, or GRPO for short. It was developed in-house at DeepSeek. Given a model we want to train and an input problem, we feed the input into the model and sample a group of outputs. Each output consists of a reasoning process and an answer. The GRPO method then compares the sampled outputs within the group and trains the model to favor the better ones. It does so by calculating a reward for each output using rules. One set of rules is used to calculate an accuracy reward. For example, in the case of math problems with deterministic results, we can reliably check whether the final answer provided by the model is correct.
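To make the group-relative idea concrete, here is a minimal sketch in Python of how a group of per-output rewards can be turned into the relative advantages GRPO trains on: each output's advantage is its reward minus the group mean, divided by the group standard deviation. The 0/1 reward values below are an illustrative assumption (a simple correct/incorrect accuracy reward), not the paper's actual implementation.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each output's reward against the group it was sampled with.

    GRPO scores every sampled output, then uses
    (reward - group mean) / group standard deviation as the advantage,
    so outputs that beat their group are reinforced and the rest are discouraged.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all outputs scored the same; no preference within this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical group of four sampled outputs for one math problem:
# reward 1.0 if the final answer matched the known result, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # [1.0, -1.0, -1.0, 1.0]
```

In the full training loop these advantages weight the policy update for each sampled output, but the reward-to-advantage normalization above is the core of the group-relative mechanism.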
For code problems that come with predefined test cases, a compiler is used to generate feedback based on the test cases. Another type of rule creates format rewards. In the following table from the paper, we can see how the model is instructed to provide an answer, putting its thinking process within think tags and the answer within answer tags. The format reward encourages the model to follow this formatting. With this rule-based mechanism, no neural model is used to generate rewards, which makes the training process both simpler and cheaper and supports running it at large scale. Additionally, the researchers note that neural reward models may suffer from reward hacking, where the model discovers a loophole or unintended way to maximize the reward that does not align with the desired goal.

In a moment we'll proceed with performance insights and the second model, DeepSeek-R1. But before we continue, if you're finding this content valuable, please don't forget to subscribe and hit the like button to support the channel. We also send one-minute-read summaries by email about the papers we review here. You can find the link to join in the description of this video.

Let's now discuss a few performance insights of the DeepSeek-R1-Zero model. In the following table from the paper, we can see a comparison with OpenAI's o1 model on reasoning-related benchmarks. Impressively, DeepSeek-R1-Zero is comparable to o1 and even surpasses it in some cases. Another interesting figure from the paper shows the improvement progress during training, as measured on the AIME benchmark. Notably, the average pass@1 score on AIME shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI's o1. Another key insight is the self-evolution process of the model, illustrated in the following figure. The x-axis shows the number of training steps and the y-axis shows the average response length: as training progresses, the model's responses grow longer. During reinforcement learning, the model naturally learns to allocate more thinking time when solving reasoning tasks. Amazingly, this happens naturally, without any external adjustments. If that's not enough, there's another phenomenon which the researchers refer to as the "aha moment" of DeepSeek-R1-Zero. We can see it in the following example from the paper. Given a math question, the model starts its reasoning process. However, at a certain point, the model begins to re-evaluate its solution. The model learns to re-evaluate its initial approach and correct itself if needed. This is remarkable, and again, it happens naturally during the reinforcement learning training.

Let's now move on to discuss the training process of the second model, called DeepSeek-R1. But first, why do we need a second model given the remarkable capabilities we've just seen? There are two main reasons. First, DeepSeek-R1-Zero's outputs suffer from poor readability. Second, it often mixes languages within a single response. Both issues make DeepSeek-R1-Zero less user-friendly. An ablation study actually shows that guiding the model to stay consistent with one language slightly damages its performance. It is quite fascinating that the model learns to express itself better by using more than one language, unlike humans, who usually stick to a single language. To address these issues, DeepSeek-R1 is trained in four phases, again starting from the pre-trained DeepSeek-V3-Base model.
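Before going through the phases one by one, here is a minimal sketch of how the rule-based format and accuracy rewards described earlier could be checked. The <think>/<answer> tags follow the template shown in the paper's table; the regular expression, the 0/1 reward values, and the exact-match comparison are simplifying assumptions, not the authors' code.

```python
import re

# The model is instructed to wrap its reasoning in <think> tags and its
# final answer in <answer> tags, as in the paper's prompt template.
TEMPLATE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(output: str) -> float:
    """Reward an output for following the think/answer format (values assumed)."""
    return 1.0 if TEMPLATE.search(output) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """For a math problem with a deterministic result, reward an exact match
    between the extracted final answer and the known ground truth."""
    match = TEMPLATE.search(output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Hypothetical usage on one sampled output
sample = "<think>17 + 25 = 42, so the answer is 42.</think> <answer>42</answer>"
total_reward = format_reward(sample) + accuracy_reward(sample, "42")  # 2.0
```

For code problems, the accuracy rule would instead compile the generated program and run it against the predefined test cases, but the overall idea is the same: rewards come from deterministic checks rather than from a neural reward model.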
In the first phase, which is referred to as Cold Start, the model is trained using supervised fine-tuning on a small dataset of results collected from DeepSeek-R1-Zero, which were validated as high-quality and readable. This dataset contains thousands of samples, making it relatively small. Incorporating a supervised fine-tuning phase on this small, high-quality dataset helps DeepSeek-R1 mitigate the readability issues observed in the initial model.

The second phase is reasoning-oriented reinforcement learning. This phase applies the same large-scale reinforcement learning we reviewed for the previous model to enhance the model's reasoning capabilities, specifically on tasks such as coding, math, science, and logical reasoning, where clear solutions can be used to define reward rules for the reinforcement learning process.

The third phase is rejection sampling and supervised fine-tuning. In this phase, the model checkpoint from phase two is used to generate many samples. With rejection sampling, only correct and readable samples are retained. Additionally, a generative reward model, DeepSeek-V3, is used to decide which samples should be kept. Some of DeepSeek-V3's training data is also included in this phase. The model is then trained on this dataset using supervised fine-tuning. The dataset includes more than reasoning-oriented questions, enhancing the model's capabilities across more domains.

The fourth and final phase is another reinforcement learning phase, which includes diverse tasks. Rule-based rewards are used for tasks that allow them, such as math. For other tasks, a large language model provides the feedback to align the model with human preferences. It's also worth mentioning that various smaller open-source models were distilled using the dataset constructed in the third phase, offering smaller alternatives with high reasoning capabilities.

We conclude this video with the following figure from the paper, showing the remarkable results of the freely available DeepSeek-R1 compared to OpenAI's o1. We can also see impressive results for the 32-billion-parameter distilled model. Thank you for watching, and stay tuned for more reviews of AI papers.