Speaker 1: Progress in AI is increasingly hidden behind closed doors, but not all of those doors are locked, so let's piece together what we do know. We know, for example, that OpenAI are targeting particular AI agent benchmarks, and I'll give you the highlights of two papers to showcase what that might mean. And no, this is not a video on that new ChatGPT tasks feature, which I tried to find interesting, but I just couldn't. Meanwhile though, Sam Altman markedly changes gear on takeoff speeds, in other words how fast superintelligence is coming, while telling the hype bros to chill. But DeepSeek, based in China, proves that open source models aren't that far behind with their new R1 model, so whatever Western labs cook up could well be served to all in short order. Whether any of this honestly affects your work directly this year will depend more on how digital your work is and how quantifiable or benchmarkable it is. That, to be honest, will give you the best gauge of what 2025 will mean for you with AI. But first, I want to give you some numbers, and I don't just mean the cost of O3 when it is released, which apparently will still be $200 on the pro tier. Given that they are already losing money with O1 Pro, it does kind of make you wonder about the economics of serving out O3 Pro for $200 a month, but let's see what happens. No, I more mean the numbers behind the operator system that OpenAI looks set to be releasing quite soon. We can already glimpse options to toggle on the computer use agent, or operator, or force it to quit and stop. I'll get to the two relevant papers in a moment, but at face value, if the O series from OpenAI has proven anything, it's that models can rapidly improve in any domain that can be benchmarked. So is that why yesterday we got this headline in Axios? Coming soon, PhD level super agents. It's a decently long article, but I'm just going to give you the two or three highlights.
A top company, possibly OpenAI, in the coming weeks will announce a breakthrough that unleashes PhD level super agents to do complex human tasks. That's all their words, not mine. That designation of PhD level is highly disputed, of course. OpenAI CEO Sam Altman has scheduled a closed door briefing for US government officials on the 30th of January. And there's not much other information in the article other than this sentence: several OpenAI staff have been telling friends that they are both jazzed and spooked by recent progress. Now, while that's vague, we already know publicly that OpenAI are hiring aggressively for a multi-agent research team, which also specializes in equipping models to do more with tools. Think teams of agents, each one of which is specialized in the apps and tools you use on your computer. OpenAI want you to be able to delegate tasks that would take a long time to complete and involve complex environments with multiple agents. This is something that they are marching towards this year. Of course, if fulfilled, that could mean the massive disruption and dislocation of jobs in the medium term, according to one White House national security advisor. This was again an exclusive in Axios. And that advisor, by the way, spoke with an urgency and directness that was rarely heard during his decade plus in public life. Suffice to say, though, the first version of this computer use operator agent from OpenAI, according to leaks, won't be capable of much, if any, of that. It can't yet reliably generate profits or issue meme coins, although I doubt OpenAI would release a model that could. As we enter this year of AI agents doing our work for us, what can we expect from this first version of OpenAI's computer use agent? What kind of tasks are involved in WebVoyager and OSWorld? How about this one, where you type: search Apple for the accessory Smart Folio for iPad and check the closest pickup availability next to this zip code?
That is pretty cool that the agent could do that. My only question is that it would take me quite a while to type out. I mean, I guess I could speak that to the agent, but if I was typing it out, by the time I typed that out, I probably could have got the answer just by browsing the web. This one's kind of cool: find this particular recipe that takes less than 30 minutes to prepare and has at least a four star rating based on user reviews. I think stuff like this is going to work because you could just immediately verify if it's giving you something that meets your criteria. Likewise for Amazon searches, I could well imagine listing a bunch of criteria for something that I want to buy and it just popping up with the item that matched those criteria. Definitely not a long horizon task in a complex environment, but it's a start. The tasks in the OSWorld benchmark seem to be somewhat harder. The prompt was: I illegally downloaded an episode of Friends to practice listening, but I don't know how to remove the subtitles. Please help me remove the subtitles. Now, this is the kind of thing that I am looking forward to. Honestly, it takes me at least an hour, sometimes two, to edit these videos in Descript, and I'm looking for an agent that can kind of mimic my style of editing and just immediately edit these videos. Why can't existing agents already crush the simpler tasks? Well, apparently more than 75% of their clicks are inaccurate. Must be pretty frustrating to be an AI agent that's repeatedly clicking the screen and not being able to click the right thing. Oh, and also they were attracted by advertisement content, which affects their judgment. Just imagine you in the future, having given your credit card to an AI agent, watching helplessly as it clicks on an ad and buys a random product. Now, I know the flaws of agents can seem silly sometimes, like we're years and years away from usable agents, but let me give you a little anecdote.
Just almost for fun one time years ago, I created over 200 pages worth of mathematics puzzles and quizzes with explainers. Now, as it happened, those quizzes proved really quite useful for benchmarking early AI models like the original ChatGPT, and as you probably experienced yourself, those early models, like again the original ChatGPT, flopped hard on pretty much all of the questions except the most simple calculation ones. Fast forward two years after the initial release of ChatGPT, and O1, when I got access, crushed pretty much every single question. This is O1 in pro mode. Obviously, there had been incremental progress before that, but even tougher challenges like this one O1 Pro aced. So, I guess I'm saying that I feel like we will go from laughing at AI agents to being super impressed with them in actually less than two years this time, possibly within this calendar year. And I echo what Noam Brown said, who is a lead researcher on the O series of models, when he said it can be hard to, quote, feel the AGI until you see an AI surpass top humans in a domain you care deeply about. Competitive coders will feel it within a couple of years, he said. When he refers to Paul, he's talking about the writer behind Taxi Driver, who said the AI came up with better script ideas than he could. Brown said, Paul is early, but I think writers will feel it too. Everyone will have their Lee Sedol moment at a different time. Lee Sedol, of course, being the legendary Go player who was beaten by AlphaGo. And I don't think that's necessarily contradictory with this post of his from earlier: lots of vague AI hype on social media these days. There are good reasons to be optimistic about further progress, but plenty of unsolved research problems remain. Now, speaking of vague hype though, that issue is not helped by none other than the CEO of OpenAI, who has reversed his position on fast takeoff timelines. First, let me give you his current opinion as of a week ago.
Speaker 2: What's something you've rethought recently on AI or changed your mind about?
Speaker 3: I think a fast takeoff is more possible than I thought a couple of years ago.
Speaker 2: How fast?
Speaker 3: Feels hard to reason about, but something that's in like a small number of years rather than a decade.
Speaker 2: Wow. What do you think is the worst advice people are given on adapting to AI?
Speaker 3: AI is hitting a wall, which I think is the laziest way to try to not think about it and just, you know, put it out of sight, out of mind.
Speaker 1: Now, let me play you a brief extract from a video I just published on my Patreon about what he thought just 18 months ago or so. Short timelines and slow takeoff will be a pretty good call: that was the prediction he would make, but the way people define the start of the takeoff, reaching the human baseline, may make it seem otherwise. Of course, in an ideal world, we would have clearer communication from these companies about just what the frontier is, but we don't live in that world. And honestly, it is hard to keep up sometimes with the changing opinions of the CEOs of these AI labs. When OpenAI was founded, Sam Altman said, and obviously this was to Elon Musk, that they'd comply with and aggressively support all AI regulation. 18 months ago, he personally implored Congress to regulate AI, and I covered that at the time. But then this week, we got this very corporate economic blueprint from OpenAI, which was not fun to read in full. In short though, it implores the US government not to stunt AI through regulation. Later in the document, OpenAI promises that they would never facilitate their tools being used to threaten or coerce other states. Meanwhile, that principle doesn't always seem to be top of mind for OpenAI's CEO. The Anthropic CEO, who chose not to make such a donation, did say this about the stakes for 2025 and his sense of urgency on regulating AI.
Speaker 4: And I feel, I feel urgency. I really think we need to do something in 2025. If we get to the end of 2025 and we've still done nothing about this, then I'm going to be worried.
Speaker 1: I don't know if you guys remember the days when companies used to take six to eight months to safety test their models before release, and open source was claimed to be at least a year behind the frontier. These days, speaking to official safety testers and others, and correct me if you feel differently, but it feels like: get the model out as soon as you possibly can. And no, open source does not feel like a year behind, as proven by DeepSeek R1. It was announced literally an hour and a half ago as I'm filming this video, so no, I haven't read the paper in full, but I have digested some of the benchmark results and noticed that the pricing, by the way, is like 95% cheaper than, for example, O1 when it comes to output tokens. Now, you might agree with me at this stage that official benchmarks tell us less than they used to, and that each of us really should come up with our own benchmark and see which model performs best. I will say it didn't do particularly well on my benchmark, SimpleBench. This is just on the public set of questions. We are going to do a full run very soon. Let me know if you experience the same, but it repeatedly says, wait, no, wait, I'm going to do this. Wait, no, I'm going to do something else. But more seriously, when the OpenAI operator or computer use agent comes out, it will be very interesting to see how quickly Chinese labs can catch up with that. The fact, by the way, that OpenAI's O-series sometimes thinks in its chain of thought in Chinese is perhaps a story for another video. Of course, 2025 won't only be about agents. We're also set to see the merger of the GPT series and the O-series. That would be really interesting. And I will be honest here. You know the model that I'm actually looking forward to the most? That would be Claude 4 Sonnet. I spent about 50 hours over the last 10 days or so working on a coding project with a colleague. And there's one critical task that we needed an LLM to do.
And O1 Pro simply couldn't get the hang of it. But Claude 3.5 did almost instantly. I know that's super anecdotal, and I'll be telling you much more about what we're working on soon, but that was quite a powerful moment for me. And speaking of powerful moments, I honestly think you might have just a few while listening to the 80,000 Hours podcast. Yes, they are the sponsors of this video, but I genuinely listen to them and really learn a lot. For example, episode 209, which I was listening to while on a long walk in London. Really interesting, covering, of course, all the shenanigans that are going on with the non-profit oversight of OpenAI. Yes, by the way, they also have a YouTube channel that I know some of you have already checked out and like. So thank you for checking it out. Thank you also to everyone who has participated in the SimpleBench competition, which runs for another 11 days. Lots more to say on that front in another video. Honestly, let me know what you think. Will this be the year of super agents, or is Twitter hype out of control again? For me, as ever, the truth lies somewhere in between. Thank you so much for watching and have a wonderful day.