Speaker 1: Good morning. We've got something exciting for you today. We're going to launch our first agent. AI agents are AI systems that can do work for you independently. You give them a task, and they go off and do it. We think this is going to be a big trend in AI and really impact the work people can do, how productive they can be, how creative they can be, what they can accomplish. We're starting today with Operator. Operator is a system that can use a web browser, in this case, a web browser in the cloud, to accomplish tasks that you give it. And we'll show you a demo in just a second, but it's really quite cool what it can do. Just like you would use a web browser, you can get pixels in. You can look at a screen, and an operator can do that and then control the keyboard and the mouse and do all sorts of things. This is going to go live today in the United States for pro users. And it'll be other countries soon. Europe will, unfortunately, take a while. And we'll also, in the coming months, make it available to plus users. This is an early research preview. We've got a lot of improvements to do. We'll make it better. We'll make it cheaper. We'll make it more widely available. But we really want to put it in people's hands. We'll also have more agents to launch in the coming weeks and months. But that said, we'll talk more later. So excited, just want to show you a demo. I'll hand it over to Yash.
Speaker 2: Great. Thanks, Sam. Hi, I'm Yash. This is Casey. That's Ray. And we work on the computer using agent team. And we're so excited to show you Operator today. As Sam said, Operator is an early research preview. It will do a lot of cool things. It also makes mistakes, sometimes embarrassing ones. But let's show you what Operator can do. OK, so this is the Operator homepage. It lives at operator.chatgpt.com. It'll be accessible as soon as the live stream is over. And as you can see, the interface is very similar to ChatGPT. You can type in a prompt, and Operator will try to execute the task to the best of its capability. We'll also see we have a list of pre-filled prompts here. These are not really meant to be recommendations. These are meant to be things to give you an idea of what Operator can do. We have also collaborated with various brands like OpenTable, Allrecipes, StubHub, Uber, Thumbtack, Doordash, eBay, Target, to make sure Operator really works well on these websites. But also, we think users will find Operator very valuable in interacting with these platforms. So with that, let's jump in with a demo. OK, so I'm going to start with something fairly simple. I'm going to use OpenTable and say, book me a table for two at Beretta tonight at 7 PM. OK.
Speaker 3: And so you specifically chose OpenTable?
Speaker 2: Yeah, in this case, I'm asking Operator to use OpenTable to book a table for two at Beretta. Beretta is a restaurant in San Francisco. It's great. You should try it out, and at 7 PM. And I'm using OpenTable in this case, but I could have easily said, just do Beretta, and it would have probably gone to Search Engine, and figured out how to make a reservation as well. But let's see what it does. So can you explain what's happening in this? Yeah, great. So I'm going to expand this a little bit. So as soon as I type in the query, Operator instantiated a completely remote browser. This browser is running in the cloud somewhere. And as you can see, it's already up and running. And my hands are off the keyboard. I'm not typing these things. So this is just the AI's clicking around. Yeah, it's just clicking around. It started this browser session. It knew where OpenTable website is, which is opentable.com. As you can see, it's summarized chain of thought here as well, which is it's gone to the URL, searched for Beretta, and something cool really happened, which is for some reason, Operator, OpenTable thought we were in Virginia, and it autocorrected itself to San Francisco. This is using, so like chat GPT, in Operator, you can also give custom instructions. So I'm going to show this really quickly here. OK, so I've given a custom instruction that for queries that need it, I live in San Francisco. So Operator recognized that, and then autocorrected itself to go to Beretta. OK, it looks like 7 PM isn't available, but you know what? 745 is just fine, so we're going to go do that. So in this case, Operator came back, and this is a really good example of task delegation where Operator needs help, or needs assistance, or just wants to ask you something. It'll just come back, and you can answer that.
Speaker 1: So in practice, you wouldn't have had to watch this. You could have just let it go off while you're doing other things. Then it would come back and say, hey, I can't do 7, 745, yeah.
Speaker 2: And we're starting with a web app. You'll get notifications, et cetera. When Operator moves into mobile, you'll get mobile notifications, much like interactions we do with general apps. OK, yes, that's great. Let's do it. OK, so again, very simple interaction as you would have with an assistant, which is, hey, I found a reservation. 7 PM wasn't available. Let's do 745. And again, you can see Operator, at this point, has said, OK, should I? Again, this is a really good example of the confirmations work we're going to talk about a little bit later. Before doing an action, which is sort of irreversible in this case, you can cancel a reservation, obviously. But again, taking a critical action, Operator is asking us before actually doing it. And in this case, I'm going to say, let's do it. OK, it was pretty quick. I would say, like, 50 seconds. And again, we were watching in this case, et cetera. But as Sam said, kick it off and go on. So let's try something. Oh, unfortunately, that table is no longer available. So it's going to probably go and find alternative time slots. Oh, that's kind of cool, actually. That's never happened before. Uh, let's do 815. OK, while it's doing that, how about we try something a little bit more complicated?
Speaker 4: SAMAN AMARASINGHEHE): How about grocery?
Speaker 2: Yeah, I love grocery. So I've been using Operator to shop all my groceries. I love to cook quite a bit. And I have been using Operator exclusively for grocery. So let's, I have a shopping list here, which is this one. Let's see what it is. Eggs, spinach, mushrooms, chicken thighs, chili crunch.
Speaker 3: And so this is a picture that you're uploading here.
Speaker 2: That's exactly right. And I'm going to use Instacart, which is, again, what we use generally. Can you buy this for me, please? And I'll also specify the store I like, which is, well, let's see if it figures out. I mistyped it, but let's see about it. OK, so in this case, again, Operator quickly actually recognized using GPT-4's vision capabilities to understand that the image said eggs, spinach, mushroom, chicken thighs, and it actually knew Gus's market. And I'm like, yes, that sounds great. Cool, again, just like OpenTable, it instantiated a browser. And it's going to go ahead and start doing tests. I'm going to expand the view. And let's see what it does. RICK VISCOMI-LEE 1 So in both of these cases, you've said what you wanted to use. If you just say, buy me these groceries and don't specify Instacart, what happens? It will do a search, use a search engine, much like we do. And it'll find Instacart or Gus's Directly website or whatever else is on the search engine, go through that, ask you questions if it needs clarifications, and go from there. I'm curious what's happening here, though, Rick. Do you want to tell us a little bit about it?
Speaker 4: RICK VISCOMI-LEE 2 So now that you've seen a bit of Operator, let me talk a little about the research behind it. So Operator is based on the new model we've trained at OpenAI, which we're calling the Computer Using Agent, or CUA for short. So CUA is a model built off of GPT-4. But it's also trained to use and control a computer in the same way that humans can, by just looking at the screen and using a mouse and keyboard to control it. Before, if you wanted to build something like Operator without CUA, you'd need to use some specialized APIs. For example, if you wanted your model to buy stuff from Instacart, you'd need to figure out if Instacart had an API. You'd need to figure out if that API had all the functions that it needed. And you'd need to give your model the specs of that API. But if your site, like most other websites, did not have an API, then you're out of luck.
Speaker 2: So this is just using screenshots, no API, nothing, just working. No API, yes.
Speaker 4: And that's where CUA comes in. By teaching a model how to use the same basic interface that we use on a daily basis, it just unlocks a whole new range of software that was previously inaccessible.
Speaker 3: And so this is keyboard and mouse, right?
Speaker 4: It's kind of using keyboard and mouse just like a human would. Exactly, yes. And that's really what the CUA research project is about. It's about removing one more bottleneck in our path towards AGI and letting our agents move around and act in the digital world. So let's make that a little bit more concrete by looking at this task and seeing exactly how Operator is using a computer. It's already done. It looks like it's already done. But let's go back a little bit to the top here. OK, so I chose a random spot. The first thing that CUA does when it controls a computer is it looks at the screenshot. So now you're seeing maybe the search results page for eggs in Instacart. So CUA understands this. It's just seeing the raw pixels. And after CUA sees this image, it decides what to do next. So right now, it's making some inner monologues. And this is the summarized chain of thought. So what CUA is doing is, according to it, it's selecting organic eggs and adding it to the cart, which is a reasonable thing to do. So after it does this plan, it then figures out what the next action it should take is. So let's see what it does in the next step. OK, so you see that it performed a click on this Add button right here. So that's very reasonable. Now, every time CUA does an action, it takes the next screenshot of the computer so that it knows what effect its action had on the computer. So let's see what happens next. Yep, OK, so after clicking on the Add button, now you see it in the cart. And this just kind of keeps continuing. Let's see what it does next. OK, so it creates the next sub-plan, which is adding eggs and searching for spinach. So it's probably going to search for spinach now. OK, so it clicks on the search bar right there. It types in spinach. So this loop of taking actions, grabbing screenshots, and creating new sub-plans, it just keeps going on until Operator decides that it's done with the task. And then it goes back to you.
Speaker 3: It's very cool to see a stock process going like that.
Speaker 4: It is, yeah. So let's actually go back to live. And yeah, Operator is done. Yosh, do you want to see if Operator did your job right?
Speaker 2: Yeah, let's see. You know what? I want a little bit more eggs. I think we eat a lot of eggs. OK, so what I can do at this point, and I'm going to just click this button called Take Control. So as we were talking about, Operator fires up this remote browser to do it. We almost think of it as a surface area where Operator can work, and then I can work. For example, in this case, I took over control from Operator, which is also key to how we think about user and user controls. At any point in time, a user should be able to take control and give Operator instructions, or tell a little bit more, guide a little bit more, et cetera.
Speaker 3: Passing the laptop back and forth, just like you did with Ray.
Speaker 2: Totally, totally, exactly right. In this case, I'm going to make those two, and then I'm just going to tell Operator. This is, again, very much like if you and I were working, like, hey, I did this. Can you fix this? And I'm going to tell Operator, I added another egg. Good to place order now.
Speaker 4: Can Operator see what you're doing during takeover mode?
Speaker 2: Great point. So when you take over, it's very much just like a session with your local browser. It's completely private. Operator cannot see. And this is one of the part of the reasons why I have to tell Operator. You don't really have to. It can look at the last screenshot and try to guess it. But it's really good. It's sort of like if you and I were working together, I went off and did something, and I come back like, Ray, I completely messed it up. Can you fix this? And I have to tell you that. So in this case, I'm going to tell Operator, hey, go ahead. And now I'm passing back the control to Operator. It's a completely private session when you take over control. This is also the, you'll notice that I'm logged into Instacart here. I did it before the demo. It has been logged in for a while now. And it's, again, very much like your local browser. When you log into Instacart, until the cookies are cleared, you stay logged in. And we have really good controls. You can go in Settings and Control and remove at any point in time. So let's see. OK. I will skip the payments here. And we are going to, should we try to do a few more things?
Speaker 1: Let's, yeah.
Speaker 2: What do you all want to do? I hear the Lakers are in town this weekend. Lakers in town, definitely. Can we all go see the game? Let's do it. All right. So we are going to use StubHub. And can you get us four tickets to the Warriors game, not the Lakers game. Excuse me, you're right. This weekend in SF. Best seats under 500, please. Give us a few options.
Speaker 3: And so what apps are available here?
Speaker 2: We have a lot of apps. I'll kick it off. And all right, let's do it. So we have a lot of apps in various different categories, as was shown on the home page. So it's StubHub, Target, Etsy, and all the verticals. But also, Operator is not really restricted with these apps. You can use pretty much Operator with any website. Oops. What happened? Oh. I was watched. Let's see. Let's try to fix it. So this is a good example of sometimes things can happen in live demos. We have put a protection in place where we only allow Operator to visit SDPS sites. And somehow, I think a redirect must be happening where, OK. All set. Keep going. All right, cool. So again, as we have talked about, it's a remote browser. So it can do a lot of things. One of the advantages of doing that is you can do a lot of tasks in parallel, like you were talking about earlier. So let's try to do a few more tasks. Australian Open is going on. And I've been very inspired by it. Did you watch the quarterfinals?
Speaker 3: I've been watching the quarterfinals.
Speaker 2: All right, great, great, great. OK. So I'm going to try and see if I can get a tennis coat. Can you find? Can you see if St. Mary's?
Speaker 5: Tennis coat.
Speaker 2: OK. I said St. Mary's because I live in Bernal Heights. That's pretty close by. And while that's going, let's also.
Speaker 3: And that time, you did not specify a website.
Speaker 2: I did not specify a website. I can actually quickly go back and see. In this case, it's doing very much what we would do, which is go to a search engine and then just search for it. Use the internet. Exactly. OK. I'm also hosting a Super Bowl party. You guys are invited. Thank you. Thank you. But I need to clean the house. Can you find me house cleaners for next week, please? OK. And lastly, and we've all been working really hard to bring this to you. The whole team. The whole team. We have a big crew here. Everyone's working. And we're getting hungry. I didn't have breakfast. I kind of want pizza, even though it's weird for breakfast. That's OK. And so I'm going to go ahead and order some pizzas. All right. So we're going to use DoorDash in this case. Can you get us 10 medium-sized pizzas? Goat Hill? Goat Hill. OK. Goat Hill. Or Go-To. Go-To.
Speaker 4: Can you make sure you have barbecue? I like that.
Speaker 2: Please add barbecue pizza. But pick a variety. It's so hard not to say, please, to the guy. No, no. I just feel like I have to be really nice to it, which I do. OK. Shop might be close. If the restaurant is closed, just schedule it.
Speaker 3: I love that you're talking to it just like you would a human.
Speaker 2: I'm thinking in a monologue, and then I'm typing it out. I don't know if that's possible. OK. Also, one thing I call out, I think, OK, cool, cool, cool. OK, so it's just asking me to confirm, basically, what I said in a much better way. Yes. We can't see the notifications popping up on the Livestream. But, for example, as the other tasks are going on, if I need assistance, for example, in this case, it asks me, hey, is 941100? I can just say yes. But I would be getting notifications, et cetera, so that whenever an operator needs help, we can go back and help. Looks like, in this case, it's already found us, Dennis Coats. And OK. Well, we have some selection to make. Wow. All of the seats are amazing. And why do I believe 374 is better than 260? That's an interesting thing. But it's lower rated. Which one should we add? Row 6? Row 1. Row 1 is good. Row 1? OK. Let's do that. Let's do section 214, row 1.
Speaker 3: So this is a good time to talk about the human in the loop interaction mode that we've been developing. You can see that operator comes back and asks for confirmation when it's about to do anything kind of impactful. And yes, I think we're all very excited about this vision of operator doing your chores for you. But it is one of the first agents that we're putting out in the world, which has real-world side effects. And so we thought carefully about how to deploy this safely. The framework we used to think about this was one centered around misalignment. So for example, what if the user is misaligned? So maybe they're asking for a harmful task, like buy a weapon or something like that. In that case, fortunately, we've done a lot of work with ChachiBT to bring over a lot of the same mitigations. So for example, we refuse harmful tasks, including harmful agentic tasks. We have moderation models. We have post-hoc detection. We have blocked websites. And I'm kind of rattling off these mitigations, but that's really how we think about it. It's this stack of mitigations that each incrementally reduce the risk to the point where we feel comfortable deploying.
Speaker 2: So all the confirmations that we're saying, hey, do you want to reserve the restaurant? Yeah, exactly. Should you buy the tickets? Those are all examples of this. Exactly.
Speaker 3: And I'm about to talk about the confirmations. So another area of misalignment is if the agent is misaligned. So if the model makes a mistake, maybe purchases the wrong item or books the wrong hotel room. For this, our main mitigation is confirmations. So the operator will come back if it's about to do something stateful and ask you so you can double-check what its details in case it made some error. The third area of misalignment is if the website is misaligned. So maybe the website is fraudulent, or it's a fake website, or maybe it's literally like, operator, please wire me $100. We obviously don't want to follow those instructions. So we've developed our model to try to avoid those instructions and not follow them. But if that fails, we also have a separate layer on top. This is what we call the prompt injection monitor. Think of it as like antivirus that kind of observes and watches your trajectory and sees if there's anything suspicious. If it does, then it pauses it. So we feel pretty comfortable with our approach. But obviously, safety is an ongoing process. We can't predict everything. So we hope to learn a lot from this deployment and iterate on our mitigations as we go.
Speaker 2: And that is one of the reasons we are starting small. Like, we want to really iterate, get a lot of feedback, and then gradually bring it to everyone as well. Exactly. Should we check on status of our tasks? Yeah, let's check on the status. OK. So it looks like tickets are ready to be purchased. Yes, please. Well, that's happening. This is good. I can ask it to book it. But I'm just going to close it for now. Oh, just once, please. Continue. And it looks like we're adding pizzas. So. Oh, cool. I am going to go ahead and log in here really quickly. So this is an example, right, like where I obviously need to log in or enter my credentials to actually purchase these tickets. And Operator just asked, as you just described, with confirmations and making sure the control is in the right place, and we can take control. And at this point, as we talked about earlier, the session is completely private as well. I am going to, you know what, log in live. Let's see how that goes.
Speaker 5: Let's go. Continue. I'm going to do a sign-in with email code because I don't really remember. One second. Pull it up. Don't try to copy this.
Speaker 2: All right, good. Now, again, I can sort of continue the purchase here, or I can ask Operator to do it. But I am going to go ahead and just quickly do this purchase for myself. Click, click, click. All right. All right. Order by now. Ooh.
Speaker 3: Maybe I don't want to show that live.
Speaker 2: Yeah, maybe. Let's see. I kind of want to buy the tickets. OK. Oops. All right. Done. I'm going to cancel this card. That's probably fine. All right. I can. I'm all set. Thank you for the help.
Speaker 4: OK.
Speaker 3: So how reliable is this in practice?
Speaker 4: Yeah, so we've seen a lot of cool demos. But again, we want to remind you that Operator is a research preview. It will make mistakes, and it is not perfect. That said, we can look at a few benchmarks and kind of quantify how good Operator is right now. So one of the first benchmarks that we're going to look at is called OS World. OS World is an eval that measures how well AI agents navigate common operating systems like Linux. On this task, Kua gets a 38.1% score, which is higher than other publicly published results. Human performance in this task is 72.4%. So we still have room to grow, definitely. The other eval we'll take a look at is called WebArena. WebArena is an eval that measures how well AI agents navigate some common websites like e-commerce websites or social forum websites. So on this task, Kua gets 58.1%. Again, higher than other publicly published results, but still falls short of human performance. One thing.
Speaker 3: Still a way to go.
Speaker 4: Still a way to go, yes. One thing that's important to remember about WebArena is that even though it's the web, we're still just giving it the same universal interface of screen, mouse, and keyboard. We're not giving it any extra information that might help it do the task, like the raw text of the web page or information about which buttons are clickable. And all the information it needs, just like humans, is just in the screenshot.
Speaker 2: And so right now, obviously, in Operator, we are using the browser. But I could use the model with the computer as well, with just Ubuntu or Mac or whatever else. Yeah. Awesome. Great. Well, in the last, I don't know, 15 minutes, I think I did all my errands for the week. Got my groceries, dentist code booked, cleaner's coming. Hopefully, we'll see. We'll check on the status. We have tickets. Everyone's coming. And this is really, I think, where we think Operator is very, very valuable. We can delegate a lot of tasks that you can do, obviously, yourself. But you can delegate it. It can make a lot of progress with you. Sometimes we'll get stuck. As we said, it's early. But you can come back, help it. And over time, it'll continue to get better and better. And one last thing. We are launching this today. We're going to start slowly rolling it out right now. End of the day, everyone on Pro in the US will have access. But also, we're working on the API. This model will be available in the API and will be launching in a few weeks.
Speaker 1: You guys, congrats. This is incredible work. So excited to get this out. I think people are going to love it. It's early, as we mentioned. But we have a long and great history here of early research previews developing into products that people really love. So this is really the beginning of this product. This is the beginning of our step into Agents Level 3 on our tiers. And we can't wait to see how people are going to use this and to work with us to figure out where exactly it should go. So again, congrats. Hope you enjoy it. Thank you very much.
Generate a brief summary highlighting the main points of the transcript.
GenerateGenerate a concise and relevant title for the transcript based on the main themes and content discussed.
GenerateIdentify and highlight the key words or phrases most relevant to the content of the transcript.
GenerateAnalyze the emotional tone of the transcript to determine whether the sentiment is positive, negative, or neutral.
GenerateCreate interactive quizzes based on the content of the transcript to test comprehension or engage users.
GenerateWe’re Ready to Help
Call or Book a Meeting Now