Speaker 1: In recent years, the field of artificial intelligence has seen rapid advancements, with large language models paving the way toward artificial general intelligence. One remarkable recent model is OpenAI's o1, which introduced innovative inference-time scaling techniques to enhance reasoning capabilities, achieving groundbreaking results on complex reasoning tasks. However, the o1 model is closed source. Today, we dive into a groundbreaking research paper by DeepSeek, titled "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". This paper introduces a state-of-the-art, open-source reasoning model and provides a detailed recipe for training such models using large-scale reinforcement learning.

Before we dive in, let's do a quick recap of the training process for large language models. Today, large language models undergo three main stages of training. In the first stage, LLMs are pre-trained on a vast amount of text and code to learn general-purpose knowledge. This step helps the model become proficient at predicting the next token in a sequence. For instance, given an input like "write a bedtime", the model can complete it with a reasonable word, such as "story". However, after the pre-training stage, the model still struggles to follow human instructions. To address this, we have the supervised fine-tuning stage, where the model is fine-tuned on an instruction dataset. Each sample in the dataset consists of an instruction-response pair, where the response is used as the label. After this step, the model becomes good at following instructions. In practice, large language models are improved further in a third stage using feedback. A powerful method for this is Reinforcement Learning from Human Feedback, or RLHF for short, where the model is trained on human feedback. Gathering large-scale, high-quality human feedback, especially for complex tasks, is challenging, so another common approach is Reinforcement Learning from AI Feedback, or RLAIF for short, where an AI model provides the feedback. For RLAIF to work well, a highly capable model is needed to provide accurate feedback.

The paper we are reviewing today eliminates, or partially eliminates, the supervised fine-tuning stage. Specifically, to train DeepSeek R1-Zero, the first of the two models presented in the paper, we start with a pre-trained model called DeepSeek V3 Base, which has 671 billion parameters, and the supervised fine-tuning stage is completely omitted. Additionally, to be able to run reinforcement learning at large scale, the paper does not use standard reinforcement learning from human or AI feedback, but rather rule-based reinforcement learning.

Let's expand on how the rule-based reinforcement learning works. The method is called Group Relative Policy Optimization, or GRPO for short, and it was developed in-house at DeepSeek. Given a model we want to train and an input problem, we feed the input into the model and sample a group of outputs, where each output consists of a reasoning process and an answer. The GRPO method then observes the sampled outputs and trains the model to yield the preferred ones. It does so by calculating a reward for each output using rules. One set of rules is used to calculate an accuracy reward. For example, for math problems with deterministic results, we can reliably check whether the final answer provided by the model is correct.
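To make the group sampling and rule-based accuracy reward described above a bit more concrete, here is a minimal Python sketch. It assumes a toy representation in which each sampled output is just a reasoning string plus a final answer, scores each output with a simple exact-match rule, and then normalizes the rewards within the group, which is the "group relative" part of GRPO. The data structures, reward values, and helper names are illustrative assumptions, not the paper's implementation.

```python
import statistics
from dataclasses import dataclass

@dataclass
class SampledOutput:
    reasoning: str  # the model's reasoning process
    answer: str     # the final answer extracted from the output

def accuracy_reward(output: SampledOutput, reference_answer: str) -> float:
    """Rule-based accuracy reward for problems with deterministic results:
    1.0 if the final answer matches the known reference, else 0.0.
    No neural reward model is involved."""
    return 1.0 if output.answer.strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Core GRPO idea: score each output relative to its own group by
    normalizing rewards with the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# A hypothetical group of G = 4 outputs sampled for the same math problem.
group = [
    SampledOutput("2 + 2 * 3 = 2 + 6 = 8", "8"),
    SampledOutput("2 + 2 * 3 = 4 * 3 = 12", "12"),
    SampledOutput("multiply first, then add: 8", "8"),
    SampledOutput("not sure", "7"),
]
rewards = [accuracy_reward(o, reference_answer="8") for o in group]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # correct outputs sit above the group mean and get positive advantages
```

Roughly speaking, in the full GRPO objective these group-normalized scores act as advantages that weight the policy update, but the rule-based reward and the within-group normalization are the parts highlighted in the transcript.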
For code problems that come with predefined test cases, a compiler is used to generate feedback based on the test cases. Another type of rule produces format rewards. In the following table from the paper, we can see how the model is instructed to provide an answer, putting its thinking process within think tags and the answer within answer tags. The format reward pushes the model to follow this formatting (a minimal code sketch of such a check appears a bit later, after this part of the transcript). With this rule-based mechanism, no neural model is used to generate rewards, which makes the training process simpler and cheaper and supports running it at large scale. In fact, the researchers avoided neural reward models partly because such models may suffer from reward hacking, where the model discovers a loophole or unintended way to maximize the reward that does not align with the desired goal.

In a moment we'll proceed with performance insights and the second model, DeepSeek R1. But before we continue, if you're finding this content valuable, please don't forget to subscribe and hit the like button to support the channel. We also send one-minute-read summaries by email about the papers we review here. You can find the link to join in the description of this video.

Let's now discuss a few performance insights for the DeepSeek R1-Zero model. In the following table from the paper, we can see a comparison with OpenAI's o1 on reasoning-related benchmarks. Impressively, DeepSeek R1-Zero is comparable to o1 and even surpasses it in some cases. Another interesting figure from the paper shows the improvement during training, as measured on the AIME dataset. Notably, the average pass@1 score on AIME shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI's o1. Another key insight is the self-evolution process of the model, illustrated in the following figure. The x-axis shows the number of training steps, and the y-axis shows the length of the model's responses: as training progresses, the responses become longer. During reinforcement learning, the model naturally learns to allocate more thinking time when solving reasoning tasks. Amazingly, this happens without any external adjustment.

If that's not enough, there is another phenomenon, which the researchers refer to as the "aha moment" of DeepSeek R1-Zero. We can see it in the following example from the paper. Given a math question, the model starts its reasoning process, but at a certain point it begins to re-evaluate its solution. The model learns to re-evaluate its initial approach and correct itself if needed. Remarkably, this too emerges naturally during the reinforcement learning training.

Let's now move on to the training process of the second model, DeepSeek R1. But first, why do we need a second model, given the remarkable capabilities we've just seen? There are two main reasons. First, DeepSeek R1-Zero's outputs suffer from poor readability. Second, it often mixes languages within a single response. Both issues make DeepSeek R1-Zero less user-friendly. Interestingly, an ablation study shows that guiding the model to be consistent with one language slightly damages its performance. It is quite fascinating that the model learns to express itself better by using more than one language, unlike humans, who usually stick to a single language.
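Here is the format-reward check referenced earlier: a minimal, hypothetical sketch in which a regular expression verifies that an output wraps its reasoning in think tags followed by the final result in answer tags. The regex and the reward value of 1.0 are illustrative assumptions rather than the paper's exact implementation.

```python
import re

# Assumed prompt template: reasoning inside <think>...</think> followed by the
# final result inside <answer>...</answer>, matching the format described above.
THINK_ANSWER = re.compile(r"\A\s*<think>.+?</think>\s*<answer>.+?</answer>\s*\Z", re.DOTALL)

def format_reward(output: str) -> float:
    """Rule-based format reward: grant a fixed bonus only when the output
    follows the required tag structure; the value 1.0 is an arbitrary choice."""
    return 1.0 if THINK_ANSWER.match(output) else 0.0

print(format_reward("<think>work through the steps</think><answer>42</answer>"))  # 1.0
print(format_reward("The answer is 42."))                                          # 0.0
```

In practice, a format rule like this would be combined with an accuracy rule of the kind sketched above to form the total reward for each sampled output.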
To address these issues, DeepSeek R1 is trained in four phases, starting again from the pre-trained DeepSeek V3 Base model.

In the first phase, referred to as cold start, the model is trained with supervised fine-tuning on a small dataset of outputs collected from DeepSeek R1-Zero that were validated as high-quality and readable. This dataset contains thousands of samples, making it relatively small. Incorporating a supervised fine-tuning phase on this small, high-quality dataset helps DeepSeek R1 mitigate the readability issues observed in the initial model.

The second phase is reasoning-oriented reinforcement learning. This phase applies the same large-scale reinforcement learning we reviewed for the previous model to enhance the model's reasoning capabilities, specifically on tasks such as coding, math, science, and logical reasoning, where clear solutions can be used to define reward rules for the reinforcement learning process.

The third phase is rejection sampling and supervised fine-tuning. In this phase, the model checkpoint from phase two is used to generate many samples, and with rejection sampling we only retain samples that are correct and readable (a minimal code sketch of this filtering step appears below). Additionally, a generative reward model, DeepSeek V3, is used to decide which samples should be kept. Some of DeepSeek V3's training data is also included in this phase. The model is then trained on this dataset with supervised fine-tuning. The dataset includes more than reasoning-oriented questions, enhancing the model's capabilities across additional domains.

The fourth and final phase is another reinforcement learning phase, this time over diverse tasks. Rule-based rewards are used for tasks that allow it, such as math. For other tasks, a large language model provides the feedback, to align the model with human preferences. It is also worth mentioning that several smaller open-source models were distilled using the dataset constructed in the third phase, offering smaller alternatives with strong reasoning capabilities.

We conclude with the following figure from the paper, showing the remarkable results of the freely available DeepSeek R1 compared to OpenAI's o1. We can also see impressive results for the 32-billion-parameter distilled model. Thank you for watching, and stay tuned for more reviews of AI papers.
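As referenced in the phase-three description above, here is a minimal, hypothetical sketch of the rejection-sampling loop: draw several candidates per prompt from a checkpoint and keep only those that pass correctness and readability filters, with the retained pairs becoming supervised fine-tuning data. The function names, signatures, and stubbed filters are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SftSample:
    prompt: str
    output: str

def rejection_sample(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],  # stand-in for the phase-2 checkpoint
    is_correct: Callable[[str, str], bool],     # rule-based check or reward-model judgment
    is_readable: Callable[[str], bool],         # e.g. single language, well-formed tags
    samples_per_prompt: int = 16,
) -> list[SftSample]:
    """Keep only candidate outputs that pass both filters; the retained
    prompt/output pairs become supervised fine-tuning data."""
    kept: list[SftSample] = []
    for prompt in prompts:
        for output in generate(prompt, samples_per_prompt):
            if is_correct(prompt, output) and is_readable(output):
                kept.append(SftSample(prompt, output))
    return kept

# Toy usage with stubbed components, just to show the data flow.
demo = rejection_sample(
    prompts=["What is 3 * 7?"],
    generate=lambda prompt, n: [
        "<think>3 * 7 = 21</think><answer>21</answer>",
        "la réponse est <answer>21</answer>",          # correct but mixes languages
        "<think>3 + 7 = 10</think><answer>10</answer>",  # wrong answer
    ],
    is_correct=lambda prompt, output: "<answer>21</answer>" in output,
    is_readable=lambda output: output.startswith("<think>") and output.isascii(),
)
print(demo)  # only the correct, readable first candidate is retained
```

In the pipeline described in the transcript, the correctness filter would be a rule-based check or the generative reward model (DeepSeek V3), and the readability filter would reject mixed-language or poorly formatted outputs.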