Benchmarking Local LLMs for 2025: M4 vs M2 Performance
IndyDevDan tests local language models on M4 Max and M2 Max MacBook Pros, focusing on tokens per second and accuracy. Will local models meet future demands?
M4 MAX MacBook Pro BENCHMARKED Deepseek v3 vs Qwen, Phi-4 and Llama on Ollama
Added on 01/29/2025

Speaker 1: What's up, engineers? Welcome back. IndyDevDan here. Powerful local models like Llama 4 are coming. Now the only question we have to answer is: are we ready for them? And more importantly, how will we know? This is where benchmarking becomes ultra important. Today we're going to look at the two most important metrics for local language models: tokens per second and accuracy. We're going to do this on my brand new M4 Max MacBook Pro. This is a fully specced out, top-of-the-line M4. As you'll see in this video, on a top-end device like the M4 Max, GPT-4o-level models may already be here. Let's understand local LLMs, and really SLMs, at a fundamental level with benchmarks, to prepare us for 2025's local models. So how are we going to benchmark the M4? Let's open up everyone's favorite: we're going to be using Ollama. We're also going to be using a tool I'm building out called Benchy. Benchy is all about building benchmarks you can feel. If we scroll down here, you can see we're running the Llama 3.2 1-billion-parameter model. We have some stats and a nice huge list of prompts that we're going to test against. We're also going to be looking at other local LLMs: Llama 3.2, Falcon 3, and Phi-4, an insane 14-billion-parameter model. We'll dig into these stats in just a moment. And then, of course, we also have Qwen 2.5 Coder at 14 billion and 32 billion parameters. Then we're going to compare it all to a top-end, multi-hundred-billion-parameter model: DeepSeek V3. Whenever I'm running these benchmarks, I like to have a powerful cloud model as a control to give me a good relative mark of where local models really stand. You can imagine these local models will always lag behind powerful cloud models that can run on massive NVIDIA GPU rigs. If I hit play benchmark here, you can see this kicking off automatically. But before we look at all these stats and really dig in, let's just run a simple test. On the right here, I have my M2 Max MacBook Pro. This was my previous primary developer device. I thought it would be cool to first take a step back, start simple, and run the M2 and the M4 side by side to see if I really got my money's worth. I know a lot of people thinking about upgrading to these M-series devices are always wondering and comparing: is it really worth it? So in this video, you're going to find out if upgrading to the M4 is really worth it when you have an M2 and you're just looking for local model performance. Let's open up the terminal on both sides. All we're going to do is run this prompt. Paste it in here. We have the exact same prompt on both sides. So let's break down this command piece by piece. We're running ollama run llama3.2:1b in verbose mode so that we can see our tokens per second. Then we have this simple AI coding prompt: we want Python code, and we want the model to output code exclusively. And then we're passing in this information-dense function definition: def csvs_to_sqlite_tables. Then we're specifying two parameters for this function: csv_paths and sqlite_path.
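To make that prompt concrete, here is roughly the kind of function a correct completion would produce. This is a minimal sketch, assuming the spoken names csvs_to_sqlite_tables, csv_paths, and sqlite_path; the exact prompt text and reference answer aren't shown in the transcript.

```python
# Minimal sketch of the kind of function the AI coding prompt asks for.
# Names (csvs_to_sqlite_tables, csv_paths, sqlite_path) are assumed from the
# spoken function definition, not copied from the video's source.
import csv
import sqlite3
from pathlib import Path


def csvs_to_sqlite_tables(csv_paths: list[str], sqlite_path: str) -> None:
    """Load each CSV file into its own table in a SQLite database."""
    conn = sqlite3.connect(sqlite_path)
    try:
        for csv_path in csv_paths:
            table = Path(csv_path).stem  # table named after the file
            with open(csv_path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                cols = ", ".join(f'"{c}" TEXT' for c in header)
                conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
                placeholders = ", ".join("?" for _ in header)
                conn.executemany(
                    f'INSERT INTO "{table}" VALUES ({placeholders})', reader
                )
        conn.commit()
    finally:
        conn.close()
```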
The plural here is really important. We're giving the model a lot of information just by passing in this function definition. If you've taken Principled AI Coding, you know exactly what this is and how it works. If you haven't, you'll see exactly what this prompt does in just a moment. We have this running on both sides, and all we're going to do is hit enter at the exact same time. So here we're looking at one of the two most important metrics for local language models: tokens per second. We'll get to the second metric when we look at our benchmark suite. But let's fire this off and see the performance between the M4 and the M2. Here we go. You can see both devices are extremely fast. Let's see what the final output was. On the M4 Max, we have 200 tokens per second. On the M2 Max, we have 160 tokens per second. If we hit up and give them another shot, we should see a similar output. Nice. So around the same thing: 200 here, 160 here. What we're seeing with this first comparison between the M2 and the M4 is that the M4 definitely has a significant advantage. Forty tokens per second is quite a lot; it means the M4 so far is about 20 to 25 percent faster. So let's scale this up and see if the trend continues when we bump up the model size. We're going to bump this up to Falcon 3, a 10-billion-parameter model. Same thing for the M2. And we're running that exact same AI coding prompt: we want a function written out for us that converts CSVs into SQLite tables. The big difference is that we're now running a 10-billion-parameter model. Shout out Falcon 3, and a big shout out to the LocalLLaMA subreddit. I was completely unaware of Falcon 3 10B, and as you'll see in our benchmark, it's quite a powerful small model. Let's see how the M4 Max compares to the M2 Max on a 10-billion-parameter model. Okay, so they're both kicking off. They're a lot more precise: we're getting a lot less text and a lot more actual function definition. If we look at the stats, you can see the difference has shrunk quite a bit. We're now getting 55 tokens per second on the M4 and 42 tokens per second on the M2. Let's hit up and run this again. I want to note that we're still getting great performance out of both of these devices: 55 tokens per second on the M4, 41 on the M2. You can also see that the model performance has improved quite a bit. We're getting a more concise response, very similar on both devices, as you would expect, and it's outputting code more exclusively. We are getting a description here, but that happens. Let's move on to an even larger model. What I want to do here is give you a good sense of what performance you can expect when you're running local models on device as you scale up the parameter count, because that's the biggest correlation you can draw: the tokens per second of your language model, and the total duration it takes to run, correlate strongly with the parameter count. Let's scale it up once again with an even more powerful model. We're going to paste this in. We're now using Qwen 2.5 Coder, 32 billion parameters. So we're tripling the size of the model.
Let's see how our M4 and M2 Max devices perform while operating on a 32-billion-parameter model. Here we go. We're going to hit enter at the same time. We're running that exact same AI coding prompt on Ollama in verbose mode. All right. The M2 is now taking some time; it may just need to load the model. I've noticed that Qwen 2.5 Coder 32B is a bit more chatty, so it adds a bit more textual information. We're getting about 20 tokens per second on the M4, and on the M2 we're getting 14 tokens per second. Rounding up, that's 20 and 15. Let's run this again. Now they're kicking off at the same time. That's good to see. All right, the M4 has completed, and about the same stats: roughly 20 tokens per second on the M4 and 15 tokens per second on the M2. Really interesting to see this. We're continuing that trend of about a 15 to 25 percent improvement on the M4 Max over the M2 Max. So let's push this one step further. We're nearing what I like to call the dead zone. Right now we're still above 10 tokens per second. When you dip below 10 tokens per second, the model truly becomes unusable in my opinion. Comment down below and let me know the minimum tokens per second you're willing to accept from a local model. For me, it's 10. Once we get below 10, it's just not usable. It's too slow. So when I'm looking at both these devices, the M2 Max is going to top out at this 32-billion-parameter level; anything below that is fine. The M4, on the other hand, is handling this at 20 tokens per second, which is still a pretty solid speed. Let's see if the M4 and the M2 can handle 72-billion-parameter models. Final test here before we jump into our benchmarks. Let's see how both these devices handle 72 billion parameters. Okay, this is a big model; for an on-device model, these are huge. Let's kick these off and look at the performance on the M4 Max versus the M2 Max. Here we go. All right, so now they're both just spinning. The M4 Max has kicked off. The M2 is still thinking, still spending some time. We can see we're almost crawling here. Yeah, we're below that 10 tokens per second on the M4 Max. It has completed. And now we're kicking off on the M2. This is also crawling, but it doesn't look as slow as I would expect. Let's give it some time. Okay, we're coming in at 7 tokens per second on the M2 and 9 tokens per second on the M4. So the 72-billion-parameter model is definitely the hard drop-off point. It looks like we can run this on the M4, but as I mentioned, this is my cutoff: anything below 10 tokens per second is not really bearable, not really functional. I don't have the patience to wait for it. Maybe you do. I'm going to run this one more time. Now they're both kicking off at the same time; there is some initial model loading that happens. Fantastic. Okay, so about the same thing: 9 tokens per second and 7. Again, we're seeing that 15 to 25 percent advantage for the fully specced-out M4 Max over the fully specced-out M2. Really, really interesting to see. Now let's scale this up. We're just looking at one prompt, one example, M2 versus M4. You can kind of see the difference here.
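If you'd rather capture these tokens-per-second numbers programmatically instead of reading them off the verbose output, a minimal sketch like the following works against a local Ollama server. It assumes Ollama's standard /api/generate endpoint and its eval_count / eval_duration response fields (durations reported in nanoseconds); the model tag and prompt are placeholders, not the exact ones used in the video.

```python
# Minimal sketch: measure tokens/sec for one prompt against a local Ollama server.
# Assumes Ollama's /api/generate endpoint with eval_count / eval_duration
# (nanoseconds) in the response; model tag and prompt are placeholders.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(model: str, prompt: str) -> float:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in ns
    return data["eval_count"] / (data["eval_duration"] / 1e9)


if __name__ == "__main__":
    tps = tokens_per_second("llama3.2:1b", "Write a haiku about benchmarks.")
    print(f"{tps:.1f} tokens/sec")
```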
If you're thinking about upgrading to the M4, my advice is this: if you have an M2 and you're not using local models every single day, I would just hold off for the next generation. If you're running an M1 or anything older, moving to the M4 will be a massive upgrade, especially if you're looking to position yourself ahead of the curve for local models in 2025. You can see that on a 72-billion-parameter model, I'm getting about 10 tokens per second out of my M4 Max with 128 gigabytes of unified RAM. To make it absolutely clear: the M4 Max with max specs, as many performance and efficiency cores as you can get and the maxed-out 128 gigabytes of unified RAM, is the best single-device purchase you can make if you want a ton of language model performance out of the box without doing any configuration. I know there are ways to improve these numbers. I know there are tools like MLX and llamafile that I'm going to be spending time with to see if I can crank these numbers up. Like, subscribe, and comment if you want to see additional videos on how to crank up local model performance on your device. Now let's scale this up. We're getting a good idea of the relative tokens-per-second performance between the M4 and the M2, but in order to really understand how models perform for your specific use case, we need to look at large suites of benchmarks. So let's do that now. Focusing in on the M4, let's look at Benchy. Benchy is a benchmarking tool I'm working on to build benchmarks that I can feel, that I can understand, and to get a deeper sense of how models perform relative to each other. So let's break down this simple function coder benchmark. First off, let's configure this a little so we can see all of our models. I'm going to hit show settings, go into bench mode so we can collapse the headers, and move our model stats to simple mode so we get a reduced view; I just want to see accuracy and tokens per second. We can see Llama 3.2 1B is blazingly fast. If we reduce the block size, we can see all of our models; if we shrink down to about this size, we can see their performance side by side. Just looking here, top to bottom, we can get some valuable information. We can hide the settings and read off some good information. I'm using DeepSeek V3 as our control group. Whenever I'm creating benchmarks, I like to have a powerful cloud model as a control group to really see how powerful these local models are. We can see here DeepSeek V3 aced this benchmark. Qwen 2.5 Coder 32 billion aced it. The 14 billion also aced it. And Phi-4 aced it as well. We pick up a couple of errors when we drop down to Falcon 3 10B, and of course Llama 3.2 latest, which is the 3-billion-parameter model. The tokens per second here is really important to pay attention to. DeepSeek is a cloud model, so there's no tokens-per-second information for us there. But by running this benchmark, and we'll dig into exactly what it's testing in a second, we can see that we don't need Qwen 2.5 Coder 32B. We can settle on a 14-billion-parameter model running in Ollama. We can use Phi-4.
We can use Qwen 2.5 Coder 14B. And if you're okay with accepting some errors and willing to sacrifice accuracy for speed, you can do that here too: we can push even further and get four times the tokens per second by using a 3-billion-parameter model. Of course, our accuracy drops by 17 percent. So this is the idea behind benchmarking, benchmarking tools, and Benchy specifically. I'm building this benchmarking tool; the link is in the description if you want to check it out. It's a work in progress, but it's something I'm working on to really understand the performance of local models against each other. I'm also setting up a bunch of benchmarks for some really exciting upcoming releases around Principled AI Coding. Stay tuned for that. Benchmarking is ultra, ultra important if you want to understand what capabilities are available to you today and tomorrow, because once you have the benchmark up, it's so easy to just plug in another model and get results out of it. All right, let's scale this up a little so we can see details on a per-prompt basis. Let's double-click into this prompt and see what's going on. This is our Llama 3.2; it made a mistake on the first prompt. If we open this up, we can take a look at the prompt. In fact, we can just copy the input prompt out, open up VS Code, and, before we dig into the actual prompt configuration file, look at the prompt formatted as XML. We have this very simple prompt: generate a function for the given function request. You can see we have a dynamic variable here, and we're passing in an AI coding prompt for a function definition. This is a powerful coding technique: just by specifying the definition of the function and a small detail, you can generate the entire function. So here's a super simple example. In this prompt we're saying: generate the add function. And then we're also passing in the function arguments, saying: use the provided arguments one and two. And of course it's going to generate this. We also want it to print the result. This is the prompt we're building, and we're building it in a very unique way. How is it unique? It's a self-contained prompt that we can immediately evaluate in a pass-fail way. Why is this so important? If we look at this prompt again, we can see our model's response. Llama 3.2 1B just completely fudged this: it misread the instructions and got the answer completely wrong. Now look at a successful model. Let's look at the 3-billion-parameter response, and if we click into it, we can see a perfect model response. Take a look at this output: it's a type of output that is evaluable. That means we can build up benchmarks in a pass-fail way, which is really important because it makes the benchmarking process much simpler. We take this code, run it through a Python executor evaluator, and it gives us a full-on execution result. So inside our benchmark suite, we can just set up the expected result, and if the output isn't that, it's wrong. Obviously there are some parsing-related things that go around this.
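To make the pass-fail idea concrete, here is a minimal sketch of an execute-and-compare evaluator in the spirit of what's being described. This is not Benchy's actual implementation; the function names and the exec-and-capture-stdout approach are assumptions for illustration.

```python
# Minimal sketch of a pass/fail "execute Python code" evaluator, in the spirit
# of what's described above. Not Benchy's actual code; names are illustrative.
import contextlib
import io


def execute_python_code_with_string_output(code: str) -> str:
    """Run model-generated Python and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # trusted-benchmark use only; never exec untrusted code
    return buffer.getvalue().strip()


def evaluate(model_code: str, expected: str) -> bool:
    """Pass/fail: does the printed output match the expected result exactly?"""
    try:
        return execute_python_code_with_string_output(model_code) == expected
    except Exception:
        return False  # any runtime error counts as a failure


# Example: a "perfect" model response for the add-function prompt.
response = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
print(evaluate(response, "3"))  # True
```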
But this pass-fail mechanism, this full-on evaluation, makes for really simple benchmarking and really simple testing with your language models. So let's take a look at the exact benchmark format. You can see a couple of things here; let's collapse everything and break it down one by one. We have our benchmark name, we have the purpose, and then we have our base prompt. The base prompt is, of course, what we were looking at before. Here we have an XML-ish, concise, level-four prompt. I'll link the four levels of prompts video in the description if you want to check that out. It's a powerful prompt because it has a purpose, clear instructions, and dynamic variables. We could have added examples to improve it as well, but it's not necessary. This is the prompt we're working with. Now, the cool part about this benchmarking framework I'm building out is that you can, of course, list several models; these are all the models you want to run this prompt against. But most importantly, we have the prompts. We have a list of prompts; let's collapse to this level here. So we have a whole list of dynamic variables and expectations. What happens, as you can see in our benchmark test, is that for each model, we loop over every set of dynamic variables, every prompt. Each prompt has dynamic variables and an expectation. So if we look at DeepSeek Chat, of course it gets every answer correct; that's what we pay for. If we scroll down, you can see this test where we're passing in the function definition count_vowels, and we're getting a nice, clean, simple Python response out of DeepSeek. In the instructions we say: generate the function, call the function, and print the result. So you can see we're passing in that function definition and then replacing the dynamic variables. For each run, if we open this up, we're replacing the function variable, and for each prompt we replace every one of the dynamic variables; that becomes a brand new execution. This is a great way to run your tests, your LLMs, and your functionality at scale in a quick way. You can see we're generating an add method, multiply list, reverse string, and so on; I have around 30 unique prompts here that test all of these models in a specific way. And a really important piece of this, like I mentioned, is that you need to specify an evaluator: a way to say, in code, whether your model's output was correct or not. You can take a look at the codebase if you're interested. We have this execute-Python-code-with-string-output evaluator, and of course you can build different evaluators to test different outputs from your LLMs. There are popular benchmarking tools out there; I also like Promptfoo. This is my take on building a really opinionated benchmarking tool. Again, the link is in the description if you want to check out the benchmarking tool I'm building for my products and projects. So this is what it looks like. It's simple, it's intuitive, it scales well, and it's all about defining the right evaluators and then building up a nice suite of tests. So this is cool.
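Since the benchmark definition is only described verbally, here is a rough sketch of what one entry might look like, written as a Python literal in the same shape. The real tool stores these as YAML files, and the field names, model tags, and prompt wording here are assumptions based on the walkthrough, not the tool's actual schema.

```python
# Rough sketch of the benchmark definition described above, written as a
# Python literal. The real configs are YAML files; field names, model tags,
# and prompt wording here are assumptions based on the walkthrough.
function_coder_benchmark = {
    "benchmark_name": "simple-function-coder",
    "purpose": "Check that a model can generate, call, and print a function.",
    "base_prompt": (
        "<purpose>Generate a function for the given function-request.</purpose>\n"
        "<instructions>Output code exclusively. Call the function with the "
        "provided arguments and print the result.</instructions>\n"
        "<function-request>{{function_request}}</function-request>\n"
        "<function-arguments>{{function_arguments}}</function-arguments>"
    ),
    "evaluator": "execute_python_code_with_string_output",
    "models": [
        "llama3.2:1b", "llama3.2:latest", "falcon3:10b", "phi4:latest",
        "qwen2.5-coder:14b", "qwen2.5-coder:32b", "deepseek-chat",
    ],
    "prompts": [
        {
            "dynamic_variables": {
                "function_request": "def add(a, b) -> int",
                "function_arguments": "1, 2",
            },
            "expectation": "3",
        },
        {
            "dynamic_variables": {
                "function_request": "def count_vowels(text: str) -> int",
                "function_arguments": "'benchmark'",
            },
            "expectation": "2",
        },
    ],
}
```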
Let's go ahead and look at some additional tests. This is just one set of tests that we can run. If we hit reset here, we can drag and drop a configuration file, or a completed execution of a bunch of tests. Let's run the simple math benchmark. If I drag and drop the simple math YAML file, it kicks off the benchmarks. If we open up the server, you can see we're kicking these off: Llama 3.2 is running at a nice speed, then it moves on to the next model. You can see the list of models we're going to run: Llama 3.2 1 billion, 3.2 3 billion, Falcon 3, Phi-4, two Qwen models, the 14B and the 32B, and then, of course, DeepSeek V3 as the control cloud model. While this is kicking off, let's look at the base prompt; if you understand what the base prompt is doing, you can really understand what's going on. In this benchmark we're evaluating the ability of a language model to perform simple mathematical operations in Python. So here's another simple way to benchmark local language models: we're asking it to do simple math by writing the math operation in Python, and then it just prints the result. A key element of benchmarking, of testing your prompts and your language models, is making sure the output is evaluatable. And the simpler the evaluation, the less noise there is in the evaluation step, and the more confident you can be that the changes you make to your prompt, and the strength of the model, are actually coming through in your benchmarks. We're simply saying: evaluate the statement in Python and print the result, and then we're passing in the statement. Super simple. We can take a look at the dynamic variables here. Add five and five: our expectation is, of course, 10. Add five and five, then split in half, then triple: fifteen. Multiply three by four and add seven: nineteen. So on and so forth. You can add entire sets of these. This is why using dynamic variables in your prompts is so important: it allows you to scale to tens, hundreds, thousands, millions of executions. Anyone building products with language models knows this. Dynamic variables and level-four prompts are essential for scaling your prompts into products and tools. Our execution should be about finished. We're running Qwen 2.5 Coder 32 billion. Nice, ran through that. And now, of course, we're running DeepSeek; this is just hitting their API, so it should finish relatively soon. I've been really impressed with DeepSeek V3. It does appear to be near the level of Claude 3.5 Sonnet, it's a fraction of the price, and it's giving effectively the same performance. It's allowing me to scale up a lot of my AI coding work using spec prompts to an insane level. More on that in upcoming videos. But you can see our benchmark has completed. If we scale this down a little bit, we get a nice view of it. Let's look at the stats. Right away, let me open up verbose mode: you can see we're getting 18 right, 12 wrong from Llama 3.2 1 billion. The 3 billion model is doing really, really well. Look at that. For simple math operations, it seems you really only need a 3-billion-parameter model. That's a good thing to see. And again, this is why you make benchmarks.
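As a sketch of how those dynamic variables scale one base prompt into a whole test set, here is a minimal rendering loop. The template wording and the {{statement}} placeholder syntax are assumptions modeled on the simple-math benchmark described above, not the exact strings from the benchmark file.

```python
# Minimal sketch: scale one base prompt into many test cases via dynamic
# variables. Template wording and the {{statement}} placeholder are assumed,
# modeled on the simple-math benchmark described above.
BASE_PROMPT = (
    "Evaluate the statement by writing Python and printing the result.\n"
    "<statement>{{statement}}</statement>"
)

TEST_CASES = [
    {"statement": "add 5 and 5", "expectation": "10"},
    {"statement": "add 5 and 5, then split in half, then triple", "expectation": "15"},
    {"statement": "multiply 3 by 4 and add 7", "expectation": "19"},
]


def render(base_prompt: str, variables: dict[str, str]) -> str:
    """Replace every {{name}} placeholder with its dynamic-variable value."""
    prompt = base_prompt
    for name, value in variables.items():
        prompt = prompt.replace("{{" + name + "}}", value)
    return prompt


for case in TEST_CASES:
    prompt = render(BASE_PROMPT, {"statement": case["statement"]})
    # Each rendered prompt would be sent to every model under test, and the
    # printed output compared against case["expectation"].
    print(prompt, "->", case["expectation"])
```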
You really want to know: what parameter count do I actually need for my use case? How powerful does the local model, or even the cloud model if you're benchmarking cloud models, really need to be to do this specific thing for me? Falcon 3: same level of performance. Phi-4 gets everything perfect. Love to see that; shout out to the Microsoft code bros. Qwen 2.5 Coder 14B: perfect. The Qwen 32 billion is picking up quite a few mistakes here, which is very interesting. We can dig into why that is, but let's kick this off. So to be clear, this benchmark runs and executes all the prompts and then saves all the results. If you look at the codebase, inside the server directory we have our benchmark data: these are the YAML files. I'm going to commit this codebase with all these simple benchmarks for you, if you're interested. And then we have these reports. This is really cool: every benchmark that you run generates a full report. What we've just done is generate this new math report, and you can see it has all the results. So when we click play benchmark, we're just replaying it as if it were live; we've already collected all the data. So here we go, we're kicking this off. It's really just incredible to see what 200 tokens per second looks like. Llama is done. Let me scale this a little bit. There you go; that's 30 prompts through. Falcon's still running, coming in at 57 tokens per second. So this is what that looks like. Phi-4: 39 tokens per second, finished. Falcon finished. Qwen 14B. And you can see, as the parameter size grows, tokens per second goes down. If I just highlight this, you can see that as the model grows larger, tokens per second drops. This is a simple trend, really important to keep in mind: a simple mental model you can build on as you're understanding local models, and models in general. Interestingly, though, accuracy isn't always going to scale with parameter size. It correlates very strongly, but you can see here that Coder 32B has made several mistakes that smaller models have not. If we click into one of these, we can see it's just outputting the answer. It's being too smart for its own good: instead of actually writing the Python code we're looking for, it just outputs the answer. That's not what we're looking for; we can't execute that. Let's look at another one. Yeah, it also outputs the answer here, and this one is just wrong: it's printing the wrong answer. Again, printing the answer. What we're really looking for from this test is for the model to generate the code in Python so that we can execute it. So you can see this is a correct answer. And of course, if we look at DeepSeek, we can go to problem 30 here, and you can see this was correct. The problem was: convert the fraction three-eighths to decimal form. So you can see now that it's 100 percent for DeepSeek, 100 percent for Qwen 2.5 Coder 14B, 100 percent for Phi-4 14 billion, and 96 percent for Falcon 10B. I am very impressed with Falcon. I had no idea this model even existed: 10 billion parameters, performing really well, really quickly. Take note of this: 67 tokens per second. Very, very powerful model.
It is pretty crazy to see the difference, though, if we just rerun this. Watch the gap in speed between these top three. Let's scale these down a little bit. Watch the difference between roughly 210 tokens per second and 150: first we go up by about 50, then by about 100. So let's just replay this and look at how quick this is. 200 tokens per second is insane. The 1B is done. The 3 billion is done. And you can see Falcon had a slow start, some boot-up time, but this is what 60 tokens per second looks like running 30 prompts. Not as fast as you would think. But again, it is running 30 prompts, and how large is this prompt, anyway? Let's let that finish. Nice. If we look at this prompt, we can see the size: if we just highlight the base prompt, we're running a 154-token prompt. Very tiny. So this all looks great. Just by looking at our AI coding function generator and our simple math benchmark, it kind of looks like local models are great already. Not so fast. Let me show you a much more difficult problem that these language models trip up on pretty rapidly. I'll hit reset here, and instead of dropping in a full configuration file for it to execute, I'm going to open up an existing completion and drag and drop it over. This is a completed benchmark report, so it automatically opens with the prompts completed; it doesn't need to run them. If we scale everything down and go to simple mode, you can see we're only running five models and we have 15 prompts. And look at the scores: 0 percent, 0 percent, 0 percent, 0 percent, 26 percent. There are many problems out there, many prompts, that local models cannot tackle yet, and I want to bring this to your attention. I am very balanced in my portrayal of these tools and this technology. Great engineering is balanced. You need to always consider trade-offs in order to build the best tools, features, and products, and to really use generative technology properly, you need to be honest about where it falls apart. So I have this much harder problem. We can kick it off and talk about it. I'll hit play benchmark and go into verbose mode. You can see the horrendous stats from basically every one of these models. While this is running, let's quickly look at what this prompt is and why it's so challenging. Let's look at something that was correct. We have this first prompt here, answered by DeepSeek V3. So what's going on? We want an LLM to parse a given natural language request, which is going to be a speech-to-text request, and then produce the right CLI command. We want a command output to our terminal: it's going to produce Typer commands to run against a Python script, main.py. So we're looking for something very precise. How does this work? Let me pull up the full prompt, copy it all out, and open it in VS Code. We have a few instructions here, four of them. What we want to happen is this: our speech-to-text request, basically our natural language query, says something like "ping this server quickly," and we want the model to call the proper command.
We want it to call the proper Typer command automatically for us, with the right params, based on the function definitions. So this prompt needs to read this code, and there are quite a few commands here: over 20, maybe 30 commands in this command set. Our language model needs to read all of these, read the natural language request, and then give us the right command so we can immediately call that method. You can think of it like function calling. It's not exactly function calling, because you need to position everything perfectly, but it's very similar. You could definitely use JSON parsing to get this moving. But this is another great test for language models. We can see here our expected result was python main.py ping-server, and the execution result nailed it. So this is a correct response, and the evaluation gets marked as correct. If we go over to our Typer-commands configuration file and open everything up, and close the base prompt, you can see we're using a raw-string evaluator. Again, I just want to emphasize evaluators: they're really important, because they let you pass or fail the response of your language model. When we look at these prompts, we have a whole list of them, and you can see exactly that. We're passing in this variable that gets replaced in our base prompt; if we collapse the XML here, we have one dynamic variable, the speech-to-text request. That gets replaced, and we're expecting this result to come out. You can imagine we're talking to our personal AI assistant and we say, "ping the server quickly," and we want it to automatically execute this method for us. And this continues all the way down the line. You can see we have a ton of failures. Basically every one of our small models bombed this. We can click in and inspect why. Falcon 3 10B just responded with "check health." That's just wrong. Phi-4 latest: we pass in the natural language query, "delete user with ID user123, skip confirmation," and the expected result is the delete-user command with user123 and the confirm flag. And you can see Phi-4 just talking, giving us a lot of information that isn't relevant. We just want this one command. So this is an example of a prompt that completely falls apart, even for the more powerful local models. Now, 100 percent, I can improve this prompt, and you've caught me literally in the act of improving it. I wanted to create this video to share my process, share what I'm working on, and share why benchmarks are so important. So I am literally in the process of improving this prompt. When I'm writing prompts at scale, what I always do is make the prompt work for a powerful cloud model, and then I try to improve it so it scales down to smaller models. Oftentimes this means you need to add additional information, additional instructions, and examples. You may need to clean up the purpose, or add additional static or dynamic variables. Whatever the case, you'll often need to add more and more instruction to your prompt for it to work at the 10-to-30-billion-parameter model size. Once you get below 10 billion, adding more doesn't really help, because the model just can't handle the complexity.
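The transcript doesn't show main.py itself, but assuming it's a Typer-style CLI as described, the command set would look roughly like the sketch below, where the model's job is to map a spoken request onto one of these commands and its parameters. The command names and parameters here are illustrative guesses based on the examples mentioned, not the benchmark's real script.

```python
# Illustrative sketch of the kind of Typer CLI (main.py) this benchmark targets.
# The real script isn't shown in the video; command names and params are
# guesses based on the examples mentioned ("ping the server", "delete user").
import typer

app = typer.Typer()


@app.command()
def ping_server(timeout: int = 5):
    """Ping the server and report whether it responds."""
    print(f"pinging server (timeout={timeout}s)...")


@app.command()
def delete_user(user_id: str, confirm: bool = False):
    """Delete a user by ID; requires --confirm to actually delete."""
    if not confirm:
        print(f"dry run: would delete {user_id} (pass --confirm to apply)")
        return
    print(f"deleted {user_id}")


if __name__ == "__main__":
    app()

# A correct model response to "ping this server quickly" would then be the
# literal CLI string:   python main.py ping-server
# and to "delete user user123, skip confirmation":
#   python main.py delete-user user123 --confirm
```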
If we click into even Llama 3.2 3 billion, it's just completely wrong. It's giving us a whole block of Python code, which is completely irrelevant. If we go to number 13 here, you can see it's writing code for us. We don't need code; we need a single answer. You can see Falcon 3 also having some trouble: "remove user with ID user123, skipping confirmation," and I need the actual CLI command, I don't need you to say this back to me. That's the execution result; that's what Falcon said back after we passed in this prompt. So local models are improving, and it's important to have a way to understand their capabilities. By benchmarking, we can do that, and we can position ourselves ahead of the curve. You can see here I have the brand new M4. I don't just make predictions; I make predictions and I bet on them. I am predicting we're going to get powerful local models, and I want to be ready for them. That's why I have this device. I'm betting on my predictions. I'm going to link last week's 2025 predictions video; that was a really important one, where we talked about, discussed, and really broke down everything we're going to be working on on the channel. Big shout out to everyone over the past couple of weeks who's been digging into Principled AI Coding. I am constantly working on upcoming improvements, stabilizing the product so that we can expand it and do new, cool things with it. Hint, hint. That's why these benchmarks are going to be increasingly important over time. If you're interested in benchmarks, you can check out the tool Promptfoo; I'll also link our previous Promptfoo videos. That's a great benchmarking framework if you just want something out of the box. If you want to experiment with Benchy, the benchmarking framework I'm building right now, you can go ahead and do that. It's a work in progress, but it's there if you want to check it out. Thanks for watching. If you liked the video, you know what to do: drop a like, drop a comment, drop a sub, and I will see you next week. Stay focused and keep building.
