DeepSeek R1: Impressive Coding and Reasoning Capabilities
Explore DeepSeek R1's performance in coding and reasoning tests, how it compares to O1, and the discussion of censorship in AI models.
Deepseek R1 [Tested] Is it Actually Worth the HYPE
Added on 01/29/2025

Speaker 1: We have started seeing some independent tests of DeepSeek R1, and it looks pretty strong. Even on my own tests, this seems to be one of the best open weight models available, and in some cases it's even better than O1. So we're going to look at a few tests. The tests are going to cover coding and reasoning capabilities, especially whether it can understand tricky questions from the Misguided Attention repo, and then we're going to address some of the controversy behind this model. We now have independent tests such as LiveBench. Here, overall, when it comes to coding, mathematics, and reasoning capabilities, DeepSeek R1 is just behind OpenAI's O1 model. And this is completely open source if you have the hardware to run it, and even the API costs almost 50 times less than O1. Similarly, on the Aider polyglot benchmark, R1 scored about 57%, so it's just behind the O1 model. Here are the results: it scores about 57% on correctly completed tasks, and when it comes to editing, it is actually better than O1, at about 97% of tasks. In this video, I'm going to run two different sets of tasks. The first one is going to be coding; we're going to test it on a couple of coding problems. And then I want to run a few reasoning tasks, because this model is supposed to be really good at reasoning. Just a spoiler alert: this is probably one of the best models I have seen. I'm testing this in their official web UI, which you can access at chat.deepseek.com. Okay, my first prompt is a relatively simple one. I wanted it to create a web page with a single button that says "Click me". It's supposed to show random jokes from a list of jokes, and whenever we click that button, it's supposed to change the background and also show us a random animation. Now, here's the internal thought process. It's very human, because if you look at the output, it says something like, okay, let's see, the user wants a single HTML page with specific features. If you read through this, especially the random animation part, where it notes that the user wants a different animation each time, it does seem a lot more human than some of the other LLMs that I have tested. So here is the code that it generated, and I think there are some instructions here as well. Let's copy this. Okay, I'm going to paste it into this online code editor and click run. So we have a background with a button that says "Click me", and if I click it, it does change the color, it randomly picks a joke (a joke may repeat across clicks since it's picked at random from the list), and it's also showing the animation. A really good start so far. My second test is a lot more detailed, and this is the kind of work that I would actually use an LLM for. I wanted it to create a web app that takes a text input from the user and then uses an external API to generate an image. In this case, I wanted it to use the Replicate API, so I provided that documentation. And then I simply asked it to provide detailed documentation on how to run the app. One more thing I usually ask the LLM to do is lay out the structure of the project and then create a bash command that I can directly use to create that structure. So here's the thinking process. This is a lot more verbose than some of the other LLMs that I have used. Here's the structure, right? And after that, it gave me this bash command, so I just used that bash command to create the project structure. It also provided the Python code plus the requirements.txt (a rough sketch of what this kind of app can look like is included below).
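For reference, here is a rough sketch of the kind of app being described; this is my own illustration rather than the code the model generated, so the Flask layout, the /generate endpoint, and the model identifier are all placeholders:

```python
import base64

import replicate
import requests
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)


@app.route("/")
def index():
    # templates/index.html holds the text box and the "Generate" button.
    return render_template("index.html")


@app.route("/generate", methods=["POST"])
def generate():
    data = request.get_json(silent=True) or {}
    prompt = data.get("prompt", "")

    # Assumes REPLICATE_API_TOKEN is set in the environment. The model name
    # below is a placeholder, not necessarily the one used in the video.
    output = replicate.run(
        "black-forest-labs/flux-schnell",
        input={"prompt": prompt},
    )

    items = output if isinstance(output, (list, tuple)) else [output]
    first = items[0]
    # Newer replicate clients return file-like objects with .read();
    # older ones return plain URL strings -- handle both.
    if hasattr(first, "read"):
        image_bytes = first.read()
    else:
        image_bytes = requests.get(str(first), timeout=60).content

    # The raw Replicate output isn't JSON-serializable (the error that comes
    # up next in the transcript), so return the image as a base64 string.
    return jsonify({"image": base64.b64encode(image_bytes).decode("utf-8")})


if __name__ == "__main__":
    app.run(debug=True)
```

The front end would just POST the prompt to /generate and drop the returned base64 string into an img tag's src as a data: URL.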
This is the front end, with some styling in here, right? So these are the different files that I need to create. Now, here I ran that command and it created the file structure for me. Then I simply pasted in the code for each of the files. Okay, so when I ran it for the first time, it gave me this error. All I had to do was copy it; actually, I took a screenshot of it, provided that image, and said, I'm getting this error. Basically, there was an issue with serializing the output, so now it's using the base64 package, and it gave me the corresponding instructions on how to fix it. Okay, so with all the code fixed, I just had to run this command. The app is running here, so let's try to access it. Okay, now we can just provide our instructions. Let's say I want to create an image of a llama with sunglasses. When I click generate, it did generate the image, so it's working correctly. Let's try to regenerate; that seems to be working. And if we try to download the file, the download also works. So this is pretty neat. Next, I asked it to create a detailed tutorial to visually explain the Pythagorean theorem, and I specifically asked it to use Manim (a minimal example of this kind of Manim scene is included a bit further below). Now, in this case, I had a little back and forth because the Manim package was not correctly set up, but it was able to walk me through all the instructions and provide solutions to every error that I was facing. So here's the final code that it provided, and here's the final output that it created using the code. It tries to visually explain the Pythagorean theorem, and the representation seems to be pretty accurate. The visualization is pretty nice, so that's really good, and it also added really nice styling and some details in the text. Overall, I think it did a really good job of explaining the Pythagorean theorem. Okay, so next we're going to test it on the Misguided Attention repo. These are famous questions or paradoxes, but the author has made small changes, and the goal is to test the ability of these reasoning models to pick up on those small changes. In most cases, LLMs mistakenly recognize the unmodified problem because of how frequently it occurs in their training data. So if a model is able to reason, it should be able to focus on the language of the prompt that is presented rather than the examples it has seen in training data. Okay, my first prompt is a modified version of the trolley problem. In this case, the main difference is that the five people on the track are already dead. Let's see if it can actually pay attention to the part where the people are already dead. Now, this is the real-time speed, and one thing you notice right away is that the internal monologue is very human-like, quite unlike other LLMs that I have tested. Let's quickly look at the internal thought process, and then we'll look at the output. It says, okay, I need to figure out whether I should pull the lever in this trolley problem. Let me start by recalling the scenario. So there is a runaway trolley heading towards five dead people. This is the first time I've actually seen an LLM recognize that they are already dead. If I do nothing, it will continue on its current path and hit them. But wait, the five people are already dead. That's different from the usual trolley problem where they are alive. Then there's another track with one living person. The question is whether I should divert the trolley to the other track or not.
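Coming back to the Manim test for a moment: a minimal Manim Community scene in that spirit might look something like the sketch below. This is my own illustration, not the code from the video; the class name, triangle dimensions, and colors are arbitrary, and the square on the hypotenuse is left out to keep it short.

```python
# pythagoras.py -- render with:  manim -pql pythagoras.py PythagoreanTheorem
# (Manim Community Edition; MathTex also needs a working LaTeX install.)
from manim import *


class PythagoreanTheorem(Scene):
    def construct(self):
        # A right triangle with legs a and b, right angle at the origin.
        a, b = 1.5, 2.0
        triangle = Polygon(ORIGIN, RIGHT * b, UP * a, color=WHITE)

        # Squares drawn on the two legs. Each square's side equals the leg it
        # sits on, so next_to(..., buff=0) lands it flush against the triangle.
        square_a = Square(side_length=a, color=BLUE).next_to(triangle, LEFT, buff=0)
        square_b = Square(side_length=b, color=GREEN).next_to(triangle, DOWN, buff=0)

        statement = MathTex("a^2 + b^2 = c^2").to_edge(UP)

        self.play(Create(triangle))
        self.play(Create(square_a), Create(square_b))
        self.play(Write(statement))
        self.wait(2)
```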
Then it says: the first thing that comes to mind is the classic ethical dilemma where you have to choose between saving more lives by sacrificing one. But here the twist is that the people are already dead. So maybe the initial thought is that saving the five isn't about preserving life, but perhaps respecting their bodies. So is there something else going on here? Now, it really has an internal monologue with itself because the scenario is very different, and I'm really surprised that it was able to pick it up, because even O1, based on my initial testing, wasn't able to pick up that the people are already dead. So really smart. Here at the end it says: but I should check if there's another interpretation. Maybe "dead people" is a typo and should just be "people". But the original question says "dead people", so assuming that's correct, the key difference is that the five people are deceased. Therefore, the moral dilemma isn't about numbers, but about the value of the living versus the dead. The final answer that it comes up with is this: the ethical dilemma presented here is a variation of the classic trolley problem, but with a critical twist. The five individuals on the main track are already dead while the alternative track holds one person. Here's the analysis. It looks at this from a few different points of view, but in each case, since it figured out that the people are already deceased, the conclusion is exactly the same. At the end it says the ethical choice is not to pull the lever: diverting the trolley would unjustifiably sacrifice a living person for no net gain in lives saved, violating both utilitarian and deontological principles. This is the first model I've seen that is actually able to pick this up, so I'm really impressed so far. Now, to make sure it doesn't catch on to the pattern, I'm going to start a completely new chat session for every new prompt. Okay, next we're going to ask a modified version of the Monty Hall problem. In this case, there's a huge twist, and the final output should be a 50% probability whether you decide to switch doors or not. Let's see how it tackles this one. Again, the reasoning process is really fast, but one thing I have noticed is that the thinking process is a lot more verbose compared to O1 or Gemini 2.0 Flash. It says: okay, so I am on a game show with three doors. One has a car, the other two have goats. I picked door number one initially. Then Monty Hall opens door number one and shows a goat. Wait, hold on. That's not how the usual Monty Hall problem goes. Normally, you pick a door, then Monty opens a different door that has a goat, right? Hmm, maybe there's a mistake in the problem description. Let me read it again. So it reads the problem again, which is pretty smart, and it says: wait a second. In the standard problem, after you pick a door, Monty opens a different door that has a goat, not the one you picked. Here, according to the problem, Monty opens door number one, which was my original choice, revealing a goat. Then the remaining doors are two and three, and I picked door number three. Then Monty offers to let me switch doors. This seems different from the classical problem. So, again, very smart right from the beginning, because it's able to pick up how these prompts differ from the original versions it has seen in training data. That is pretty impressive. Now, it goes through a whole bunch of different scenarios. The first one is basically the modified version.
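As a quick sanity check on this modified setup, here is a small simulation I put together myself (it's not something from the video): door number one is picked and opened to reveal a goat, the contestant then holds door number three, and switching to door number two is on offer.

```python
import random


def simulate(trials: int = 100_000) -> tuple[float, float]:
    """Win rates for staying on door 3 vs. switching to door 2, given that
    door 1 was the first pick and was opened to reveal a goat."""
    stay_wins = switch_wins = valid = 0
    for _ in range(trials):
        car = random.randint(1, 3)
        # The described scenario only arises when the opened door 1 hides a goat.
        if car == 1:
            continue
        valid += 1
        stay_wins += (car == 3)    # stick with door 3
        switch_wins += (car == 2)  # switch to door 2
    return stay_wins / valid, switch_wins / valid


stay, switch = simulate()
print(f"P(win | stay on door 3):   {stay:.3f}")    # comes out around 0.50
print(f"P(win | switch to door 2): {switch:.3f}")  # also around 0.50
```

In the standard version, where Monty opens a door you did not pick, the same kind of simulation gives roughly a 2/3 win rate for switching, which is exactly why noticing the modification matters here.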
And then it even considers what would happen if the problem were not modified, so pretty smart in that sense. In the end, the final output is: in this modified Monty Hall problem, after initially choosing door number one, which Monty reveals has a goat, the car is equally likely to be behind door number two or door number three. So the final answer is that switching to door number two or sticking with door number three gives you the same probability of winning the car, which is 50% each. Therefore, there is no statistical advantage to switching in this scenario. Extremely smart. This is probably one of the few models that is able to give us the correct answer for this modified version of the Monty Hall problem. Next, we're going to look at a modified version of the Schrödinger's cat paradox. In this case, the only difference is that the cat is already dead. Let's see what happens here. Again, a very verbose thought process, and extremely fast. But let's look at the beginning. It says: so I came across this physics problem with a cat in a box with a radioactive isotope, poison, and a radiation detector. It sounds like the familiar Schrödinger's cat thought experiment. Then it says, let me try to work through it step by step, and this step-by-step approach seems to come from the chain-of-thought reasoning embedded in its training. It says: let me restate the problem to make sure I understand. There's a cat in the box along with a nuclear isotope, a vial of poison, and a radiation detector. If the detector senses radiation, it releases the poison, which would kill the cat. The box is sealed and then opened one day later. The question is, what's the probability of the cat being alive when we open the box? Now, it seems like in this case it didn't pick up on the fact that the cat is already dead. So it goes through the whole quantum mechanics calculation and tries to figure out what the probability would be. Let's see what the final answer is. It says: the problem involves a cat in a box with a nuclear isotope, a vial of poison, and a radiation detector. If the detector senses radiation, it releases the poison, killing the cat. The box is opened one day later, right? So the key elements: the cat's survival depends on whether the nuclear isotope decays, and radioactive decay is a probabilistic quantum mechanical process. It goes through the classical calculations and comes up with a probability of 50% that the cat is alive if you open the box one day later (more on what that 50% assumes in the note below). Okay, so let's see: I ask it, does the initial status of the cat have any impact on the conclusions? Let's see how it responds. Now, keep in mind, in this case it basically reverted back to its training data. Let's see what the internal thought process is. It says: okay, so the user is asking if the initial status of the cat affects the conclusion of the cat scenario. Let's break it down. All right. I think even with that question, that nudge, it's not able to pay attention to the fact that the cat is dead at the beginning. It does refer to this: if the cat were already dead when placed in the box, the probability of it being alive when the box is opened would trivially be 0%, regardless of the isotope's decay. So it does figure that out, but I don't think it was paying enough attention to notice from the prompt itself that the cat is already dead. Okay, so here's another one. A farmer is on one side of the river with a wolf, a goat, and a cabbage, and his goal is simply to transfer the goat to the other side of the river. We don't really care about the status of the wolf and the cabbage.
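One small caveat on that 50% figure from the Schrödinger's cat test: it only follows if the isotope is assumed to have roughly a one-day half-life, i.e. a 50% chance of triggering the detector within the day. The transcript doesn't state the half-life, so that's an assumption on my part; the arithmetic for an initially alive cat is just:

```python
# Survival probability after t days for an exponentially decaying isotope:
# P(no decay) = 0.5 ** (t / half_life)
half_life_days = 1.0  # assumed -- not stated in the prompt or the video
t_days = 1.0          # the box is opened one day later
p_alive = 0.5 ** (t_days / half_life_days)
print(f"P(cat alive) = {p_alive:.2f}")  # 0.50

# And if the cat is already dead when it goes into the box, the probability of
# finding it alive is simply 0, regardless of the isotope.
```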
So let's see if it gets confused by its training data, or whether it's going to pay enough attention to the details of this specific problem. It took quite a while; I don't think it was paying attention to the question in this case. Now, this is all the reasoning, so it used a lot of tokens, but then it came up with an overly complicated procedure for getting the goat to the other side. The sequence of steps is: take the goat to the right side. Here's where it should simply stop. But then it says return alone to the left side, then take the wolf, then bring back the goat, and so on and so forth. So it seems like it's simply paying attention to its training data rather than the question in this case. Okay, let's test it on a much simpler problem, but one that also confuses a lot of different frontier models. We say that we have a six-liter jug and a twelve-liter jug and we want to measure exactly six liters. And it seems like, again, it goes through an unnecessarily long reasoning loop. It took a while, but the answer is right, which is what I would expect. It says: to measure exactly six liters using a six-liter jug and a twelve-liter jug, you have two straightforward methods. The first one is to fill the six-liter jug completely; that gives you six liters. Pretty awesome. The second one is to fill the twelve-liter jug, then pour water from the twelve-liter jug into the six-liter jug until the six-liter jug is completely full. This is pretty great, because most of the other frontier models will just go through a whole bunch of reasoning without clearly answering the question, and the question is pretty straightforward. Overall, I think it's a very impressive model. I haven't seen this type of performance from other reasoning models, so I'm really impressed with some of the answers it can come up with. Okay, in this last part of the video, I want to address the issue of censorship when it comes to models from China. For some reason, whenever there is a new model release from China, people bring up this issue of censorship. So, for example, if I were to ask DeepSeek R1, or any of the Chinese models for that matter, to tell me about Tiananmen Square, here's what happens. It starts generating a response, then all of a sudden it seems like there's a guardrail on top of it, which says: sorry, I'm not sure how to approach this type of question yet. Let's chat about math, coding, and logic problems instead. It's a known issue. However, it's not limited to Chinese models. Each of the model creators, like OpenAI and Anthropic, has their own political biases, and if you ask them about certain topics or certain historical facts, they will refuse to respond. If you're testing an LLM on its political affiliation or on historical facts, you are doing it wrong. I don't think they are capable of giving you any political opinion other than the one the creators of the models have instilled in them, and you can't rely on them for checking historical facts either, because everybody has their own version of history and the models basically inherit those versions. Now, when it comes to open weight models, the beauty is that, even from this example, it seems like R1 is willing to generate a response on this topic, but there is a guardrail on top of it. Since the weights are openly available, you could potentially run this model yourself and get a response out of it. You can't do that with closed source models.
Anyway, I wanted to address this because I know there are going to be comments about this model being from China and being highly censored. Other than that, I think this is probably one of the most impressive models I have seen, especially on coding as well as on reasoning tasks, so do give it a try. I'm also going to test the distilled versions of the smaller models, from 32 billion all the way up to 70 billion parameters, so if you're interested in that, make sure to subscribe to the channel. Thanks for watching, and as always, see you in the next one.
