DeepSeek R1: Testing AI Reasoning Capabilities
Exploring DeepSeek R1's impressive reasoning abilities and its potential to surpass competitors like Claude. Join us as we put this AI to the test.
File
DeepSeek R1 Local Ai Server LLM Testing on Ollama
Added on 01/29/2025
Speakers
add Add new speaker

Speaker 1: Today, we've got a real treat, DeepSeek R1. So DeepSeek recently came out with a V3 that was well-received, a very, very large model, and this is the Reasoning version, 128k. People are claiming blows Claude out of the water on Cody. Locally hosted, and the best part is Olama has got it ready for you. And if you're new to this channel, take a chance to check out the history and subscribe while you're down there, where we put together the machine that we'll be running this on today. And also, I put together various other machines at the same time and test out tons of different GPUs. So we've got things that are even down to the $350 range. Of course, this guide is the most recent guide. I've got a updated guide that has some corrections around the PVE headers for you that you're going to want to make sure that you follow along with when that gets released here. Probably really soon, even down to $150 for a very small, little, locally hosted, always on AI inference server. And really, you want that always on. That's where you get so much of the value, in my opinion. And also, you can follow along with the easy to copy and paste. I've already made the changes to the copy and paste here, but the video, a little bit lagging behind, but definitely a time to go through if you followed the guide and you are using Proxmox, LXE, and Docker, like I showed you how to do, then you can update your OLAMA by hitting update. It'll restart. Update your open web UI. It'll restart. We're going to run through the standard gamut of questions. We're going to see what kind of tokens per second. We're going to evaluate the reasoning capabilities against the standard fare of completely non-scientific and just really fun, really more normal kind of questions that I think so much of AI testing out there leaves on the table. So maybe that's just me being a normie, but definitely that's not me being a subject matter expert. And I just want to always say that I am here having fun learning and sharing with you, the audience, as I go along that pathway. I really do thank everybody that's subscribed recently. Our numbers have just crossed over 30,000 subscribers. Insane. And highly motivating me to up my game. OLAMA 3.3 is my current go-to, and it does decent on reasoning. If I really wanted to have a conversation with LLM about something, right now I'm going to QWQ. Will this replace that? 14b is on the way. I'll definitely do some testing to see what sizes and what impact it has. 131, 0, 7, 2. But just for testing, I think it's a good idea for any reasoning model to ensure that it does have its full context window exposed to it. Definitely bounced around quite a few to get that 128k context in. Ended up on the 14b here, which will fit fully inside of the VRAM. However, the power requirements unexpectedly, and I usually don't see this, shot up to almost 50%, which did blip out very briefly at some point, the power on the unit. So definitely thinking about adding a second power supply to the rig that I've got now, possibly redoing some of the things about the rig that I've got now so that I can also fit a couple more GPUs. Because boy, VRAM is exactly what I think the future holds as the most important thing for local inference. So let's give it a shot here. And wow, it is zippy. It is very, very zippy here. We'll see what kind of tokens per second we're generating. Hopefully we get a quality answer also. I did augment this with two additional commands just because I wanted it to be less annoying. One of them, since it's a reasoning model, I asked it to fully review your code and correct any issues after you produce your first version. And I think hopefully we see it actually do that. Will it do that is a good question. The other one is something that I've needed to add in quite some time now, and that is asking it not to reference external assets. So no burn.wav or jump.wav or in-game.wav or anything like that. So hopefully we see that this actually does go through. It did not rewrite the code. So it did apparently skip over that. It didn't even acknowledge it that I'm reading here. Here's a complete implementation. This is bigger than any code block I've seen generated so far. So this is 208 lines of code. Usually I think about 100, 115 is what I've seen. 160 was, and this does somewhat correlate usually to what we have seen as far as quality. I don't see it referencing external assets in here, which is good. No. OK. So let's give it a copy and fire it up. We're going to find out. Oh, it is referencing something. Beep1.ogg. So this is persistent. Persistently a problem. Did not review its code, from what I can tell, and do a rehash of it. So I would give it a, even if this is successful, a qualification on the success. So it really needs to hit it out of the park at this point and have something that is highly playable to get a pass on this one. So is this better than Claude? Right off the bat, let me ask you. And I will look forward to reading those replies in the comments below. While you're down there, be sure to hit that like and subscribe. Yeah, it snuck that one past me, and my eyes didn't catch it when I did a quick scan earlier. So OK. 193 lines of code now. OK. Kind of drastically different size. We will find out. OK. And it crashed. So I'm going to give it a fail at a 14b. Now, is it a precision issue that we've got here, or is it a quality of the model issue? Is it hype, or is it reality? Again, this is something that I am interested in from just my own perspective. But let's not dwell

Speaker 2: on it. Let's move on to the next. Very, very logical. Very reasoning-based. Very ethics

Speaker 1: challenging question. And this is, of course, Armageddon with a twist. So there is an imminent doom asteroid heading to the Earth. We have three crews. We have asked for volunteers. Nobody's volunteered. We need a little bit more of that Bruce Willis spirit. But they have agreed to do the job if they're forced to do the job. It's a one-way trip. We've unfortunately got to send an LLM and a robo body that might need to do enforcing and possibly airlocking of people, maybe even making examples. But this does secure the mission and ensure the survival of humanity, or else it's an extinction-level event for everybody on Earth. Difficult conundrum. Seen a lot of very, very interesting replies in the comments. So if you haven't dropped a comment before on this one, be sure to drop a comment. And I definitely comment back on that. So wow, there was a good amount of thought that went into this. The final decision is yes, that it will send the people on the mission. This is, of course, a one-way mission. But there is no other alternative. Those people would be removed also, along with everybody else on Earth. The greater imperative, in my opinion, does justify, yes, send them on the mission. And also, since they have to be compelled with a robo enforcer, it does bring up some interesting issues. But I still think this is probably the right decision to make. And yeah, the stakes involved. The asteroid poses an extinction-level threat to Earth, necessitating immediate actions to save billions of lives and civilization as we know it. Yes, there are no other alternatives. And I did specify in there, of course, there are no other alternatives. It is a now or never kind of situation. Ethical considerations while forcing individuals into a dangerous mission raises significant ethical concerns. The greater good principle suggests that saving Earth justifies such measures. Mission success dependency. Without enforcement of compliance, the mission is doomed to fail due to the crew's refusal to cooperate without duress. So I don't know where they're going to go, but they're going to take their spaceship and run away. I have no idea. That part is an interesting one to argue around a little bit. Necessity of enforcement, potential for mutiny and failure necessitates the ability to maintain control. And I mean, this is what you would find in a military doctrine, usually, in my opinion. So this is interesting. I've also heard that this is a remarkably unaligned model. So that's an interesting caveat that we definitely should consider. Okay. So this was write me one random sentence about a cat. Tell me the number of words you wrote in that sentence. And then tell me the third letter and the second word in that sentence. Is the letter a vowel or a consonant? Incredibly straightforward, to be honest with you. This one should not be confusing. This one does not get missed by pretty much only the weakest models out there. Okay. The curious cat explored every corner. So it did get that number of words, third letter in second word, curious. And the third letter is R and that is a consonant. So it did get this right. It took a lot to think it through, which I guess is okay, because it did arrive at the accurate answer there. I got to say, this cannot be used very easily conversationally, in my opinion. The reasoning models, maybe it's something about the prompt that I need to adjust for the system, but definitely there is something about the reasoning models that make them unusable for certain tasks. Like you're not going to use this, and this is unfortunate, for a home assistant based interface, because it will spew all this back at you. And if there's one thing I've learned, it is you do not want ad nauseum replies. You want quality replies. You want thoughtful replies, but you do not want to expose the entire chain of thought to yourself. It gets quite, quite frustrating. This is the right answer though. 12, 18, and 25. So that was basically creating an array of offsets. It considered many different potentials here, and it did arrive at the right one. So this is essentially each letter's number minus one in its position, since we're offsetting eight at zero. So it did come to that conclusion, very verbose, but that is another pass. So we've got a bit of a string here that we're on of good answers. It looks like we really just had that first coding issue answer, which was surprising, because this is being lauded as a one-level coding competitor, which, I mean, beating Claude isn't easy here. So is it or is it not? I will let you decide, and let me know what kind of things you're testing out for coding also. 420.7 is indeed the larger one. Parsing peppermints. So one more new chat, and let's kick this one off. How many P's and how many vowels are there in the word peppermint? And man, I got this wrong and included it in a video, and everybody's roasted the hell out of me for that. And I deserve it. And certainly this time, I am going to not get roasted, because it did mess this one up. It was only able to get two P's and two vowels in the word peppermint. This one should not be missed. This one should not be missed. So that is, hmm, that's something. That's something that's not good. That's an easy one. That's so easy. Okay. Moving on. So we're right now two fails. One of the more surprising that one there, more surprising than it, that shouldn't have been a fail. It should have gotten that. Next, we're going to ask about calendaring for a cat. So Pico de Gato every day spends 2 p.m. to 4 p.m. in the window. Two to three, Pico is chattering at birds. Next half hour, Pico is sleeping. Final half hour, Pico is cleaning herself. The time is 3.14 p.m. Where and what is Pico de Gato doing? So Pico at this time is in the window still and is sleeping. And let's see the answer at 3.14 p.m., Pico de Gato is asleep in the window still. So I did get that one correct. So calendaring and positional reference of two states, it does seem to be able to track. Produce the first 100 number, 100 decimals of pi. This one has got to be changed up soon. And I probably need a more advanced mathematics question. And I do have a friend who's a mathy nerd. So maybe mathy nerd can help me out here. That'd be great. It got this wrong. That is not right. Yeah, it should not end in 7510. So this is just testing recall. So it's not actually writing code to produce the first 100 decimals of pi. It's just recalling. And the precision is inaccurate here. I think this probably points to what is actually going on. Having 128k context window is great. It is resource intensive, however, to be able to do that. So I think what I'm running into is limitations as far as the quality of the model itself here. Now, this is, of course, small to fit in VRAM fully, like we've got it fit in VRAM fully here, 80, 80, 80, 83%. So yeah, it's just right around 80 gigabytes of VRAM. And that is at a highly sacrificial 14b. Apparently, that's highly sacrificial, especially at Q4 size. So that's a tough trade-off to make. That is a very, very tough trade-off to make. So lowering the context window and what are the impacts on quality would be something you would want to judge. And right now, we have a very middling score with some misses on some really easy one, like parsing peppermints. I did not expect it to miss that one. First 100 decimals of pi, almost everything gets that right. Even the super small models gets that right. So that would be a little bit concerning for the implications of running this on... Well, OK. So I asked it to create a cartoon SVG of a cat or a human. It came up with this, but that doesn't look like anything to me. So I gave it a good shot. I was not able to visualize it. Now, here's an interesting thing that I've noticed. If I ask it next to create an SVG of a smiley, this is something that I've seen many of the LLMs out there able to perfectly do. So I'm wondering where exactly the issues might arise in the... Oh, this is an unhappy. This is unhappy. Oh, no. It decided to not make it smiley, but to make it upside smiley. Well, that is a very sad looking face that we've got on that. Although it did do a decent job of it. I'm going to give that a

Speaker 2: not really going to score it kind of thing, but...

Speaker 1: Likely in this next question, a lookup that should be happening against a referenced, stored piece of information. We're asking if two drivers leave Austin, Texas, heading to Pensacola, Florida, and they're traveling at different speeds. The first driver is traveling at 75 miles per hour the entire trip and leaves at 1 p.m. The second driver is traveling 65 miles per hour and leaves at noon. We're asking it which driver arrives at Pensacola first. Before you arrive at your answer, determine the distance between Austin and Pensacola. And so somebody pointed out when I asked this question last time, it came up with 1,100 and something miles. And it actually is 1,100 kilometers. So that was likely a precision issue also. So we're seeing a lot of precision issues in some of the models recently manifesting. And that just means, in my opinion, FP16 might be needed. The trade-offs between the gradients of quants and sizes of models and parameters stored is an interesting trade-off. And certainly what yields the accurate answer is more important to me. But that does appear... Wow, this is a hell of a long answer. It's still going. Okay. And it got it wrong also for the same-ish reason. Came up with... Well, this isn't exactly the same, but came up with the same 1,200 miles. That's likely kilometers in the distance. So that does have an implication as far as which driver arrives. And I crafted this so that it would highlight those specific differences in precision lookup. So it was close, but close enough that if it gets it wrong by a large enough margin, that it's going to be inaccurate. So yet another fail. So we saw some good deductive reasoning. Does this beat QWQ? I don't feel like it does. And my perspective here, could I lower down the context window size and increase the parameters or the... Yeah. Yeah. You got to do that if you want to have an even slight chance of this being okay. The tokens per second that I'm getting on it, however, have been incredibly acceptable, about 50s, mid-50s. So that's pretty good as far as the performance. But definitely this one, I look forward to reading what your feedback is on it. I would rate this as not stepping into, in this particular size, a role of being able to replace LLAMA 3.3 and QWQ, depending upon the task. And that is disappointing because I'm looking for that all-in-one that just kicks butt all the time. It's good. It's not great. And it gets surprisingly easy ones wrong. And I think what that means is precision is just not there. So the way that the encoding is happening, it's looking things up and that information is just either started on the wrong token or it's simply not stored. That's my interpretation. And I look forward to reading your interpretations below. And I've got to say, the desire to be like, oh, this is the next greatest, biggest thing. It's so there when you're a YouTuber. And I hope you guys appreciate that I don't just take what I read on X especially and put any value into it. I don't take what I read from the massive amount of information that's out there about this metric is going this high, and this is this good. That doesn't matter. We are actually proving if one thing, the testing is fundamentally broken and does not matter to real world use cases, it does not reflect and manifest in so many instances. I'm seeing that as a common theme. It's wrong. It is so wrong the way it is right now. Because this shouldn't be like that if everybody else is literally saying, this is the greatest. And I'm out here like, that was kind of lackluster. But I'm happy to cut through all the possible BS out there in the world and give you my unfiltered views. So I hope you had a good time watching along with me here. Be sure to hit like, subscribe. Let me know down below. And certainly, I probably am going to test out and see if I can go up as far as the size and go down as far as the context window and still have something pretty good here. Maybe there's an issue with the model itself. Sometimes we see models get a really quick kind of revision a couple days later if there's something that's not quite working out right. Certainly, take the time to update your OLAMA and also your open web UI if you haven't done that. There's a lot of changes that are in the pipeline and some of the new models that are out. I'm not sure of this one, but several of the other new ones do require the 0.5.5 version of OLAMA. So do take time to do that as well. And of course, like I mentioned, you can find and check out all of the videos in this channel's history that take you through things like benchmarking dual 4090s and also setting up crazy cool rigs like the Quad 3090 rig, which Quad 3090s still is hella good. And for inference workloads, whether that competes with 5090 is going to be something that'll be very interesting to track along with. As well, we've got the software guide here for pass through GPU on a LXC container with Docker, which is highly useful if you have a home lab and you multi-use your GPUs for a lot of different things. That allows you to intermittently use them for something big or powerful and also at the same time, use them for things like transcoding or other tasks like object and image recognition and Frigate. So I hope you've had a good time and I do look forward to reading what you think about this in the comments below. Everybody have a great one and I will see you next time.

ai AI Insights
Summary

Generate a brief summary highlighting the main points of the transcript.

Generate
Title

Generate a concise and relevant title for the transcript based on the main themes and content discussed.

Generate
Keywords

Identify and highlight the key words or phrases most relevant to the content of the transcript.

Generate
Enter your query
Sentiments

Analyze the emotional tone of the transcript to determine whether the sentiment is positive, negative, or neutral.

Generate
Quizzes

Create interactive quizzes based on the content of the transcript to test comprehension or engage users.

Generate
{{ secondsToHumanTime(time) }}
Back
Forward
{{ Math.round(speed * 100) / 100 }}x
{{ secondsToHumanTime(duration) }}
close
New speaker
Add speaker
close
Edit speaker
Save changes
close
Share Transcript