AI Operator Launch: Transforming Productivity
Introducing Operator, an AI agent set to revolutionize productivity by autonomously handling tasks via cloud browsers. Stay tuned for global rollout!
File
El primer AGENTE de ChatGPT Operator OpenAI
Added on 01/29/2025

Speaker 1: We've got something exciting for you today. We're going to launch our first agent. AI agents are AI systems that can do work for you independently. You give them a task, and they go off on it.

Speaker 2: I think it's going to be a big trend, and it's going to really impact the work people can do,

Speaker 1: how productive they can be, how creative they can be, what they can accomplish. We're starting today with Operator. Operator is a system that can use a web browser, in this case a web browser in the cloud, to accomplish tasks that you give it, and we'll show you a demo in just a second, but it's really quite cool what it can do. Just like you would use a web browser, Operator can look at a screen, then control the keyboard and the mouse and do all sorts of things. This is going live today in the United States for Pro users, and it will come to other countries soon. Europe will unfortunately take a while, and we're also going to, in the coming months, make it available to Plus users. This is an early research preview. We've got a lot of improvements to do. We'll make it better. We'll make it cheaper. We'll make it more widely available. But we really want to put it in people's hands. We'll also have more agents to launch in the coming weeks and months. But that said, we'll talk more later. So excited, just want to show you a demo. I'll hand it over to Yash.

Speaker 3: Great. Thanks, Sam. Hi, I'm Yash. This is Casey. That's Ray. And we work on the computer-using agent team. And we're so excited to show you Operator today. As Sam said, Operator is an early research preview. It will do a lot of cool things. It also makes amazing things. And that's what I'm going to say once. But let's show you what Operator can do.

Speaker 2: OK.

Speaker 3: So this is the Operator homepage. It lives at operator.chatgpt.com. It'll be accessible as soon as the live stream is over. And as you can see, the interface is very similar to ChatGPT. You can type in a prompt, and Operator will try to execute the task to the best of its capabilities. You'll also see we have a list of pre-filled prompts here. These are not really meant to be recommendations. These are meant to give you an idea of what Operator can do. We have also collaborated with various brands like OpenTable, Allrecipes, StubHub, Uber, Thumbtack, DoorDash, GBA, Target, to make sure Operator works well on these websites. But also, you will find Operator very valuable in interacting with these platforms. So with that, let's jump in with a demo. Okay. So I'm going to start with something fairly simple. I'm going to use OpenTable.

Speaker 2: Okay. Book me a table at Beretta at 7 p.m. Okay. So you specifically chose OpenTable.

Speaker 3: Yeah, in this case, I'm asking Operator to use OpenTable to book a table for two at Beretta at 7 p.m. Beretta is a restaurant in San Francisco. It's great. You should try it out. And I'm using OpenTable in this case, but I could have easily said, just do Beretta, and it would have probably gone to a search engine and figured out how to make a reservation as well. But let's see what it does. Can you explain what's happening in this? Yeah, right. So I'm going to expand this a little bit. So as soon as I typed in the query, Operator instantiated a completely remote browser. It's introducing all the data. Oh my God. It's brutal.

Speaker 2: and it auto-corrected itself to San Francisco.

Speaker 3: So, as in ChatGPT, in Operator you can also give custom instructions. I'm going to show this really quickly here. I'm going to put it in dark mode since you're here. For queries that need it, I live in San Francisco. So Operator recognized that and then auto-corrected itself to go to Beretta. So it looks like something is available,

Speaker 2: we know what's going on but it's just fine. So we're going to do that.

Speaker 3: In this case, Operator came back and this is an example of task delegation where Operator needs help or needs assistance or just wants to ask you something, it'll just come back and answer that way.

Speaker 1: So in practice, you wouldn't have had to watch this. You could have just let it go off while you're doing other things, then it would come back and say, hey, I can't do 7, 7.25.

Speaker 3: And we're starting with a web app, you'll get notifications, etc. When Operator moves into mobile, you'll get mobile. Much like interactions we do.

Speaker 2: Yes, great. Let's do it.

Speaker 3: Very simple interaction, as you would have with an assistant, which is, hey, I found a reservation, 7 p.m. wasn't available, let's do 7:45. And again, you can see, Operator at this point has said, okay, this is a really good example of the confirmations work we're going to talk about a little bit later. But before doing an action, which is sort of irreversible in this case, you can cancel the reservation, obviously. Again, it takes a critical action only after asking us first.
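
Editor's note: the confirmation behaviour described here (asking the user before a critical action) can be sketched roughly as follows. This is a minimal illustration, not Operator's implementation. The action names, the `CRITICAL_ACTIONS` set, and the `run_action` helper are all hypothetical; the transcript does not say which actions are treated as critical or how the check is built.

```python
from typing import Callable

# Hypothetical action names; the transcript does not list which actions count as critical.
CRITICAL_ACTIONS = {"book_table", "place_order", "submit_payment"}


def run_action(
    action: str,
    perform: Callable[[str], None],
    confirm: Callable[[str], bool],
) -> str:
    """Perform `action`, pausing for user confirmation first if it is critical."""
    if action in CRITICAL_ACTIONS and not confirm(f"About to {action}. Proceed?"):
        return "cancelled"
    perform(action)
    return "done"


performed = []
# A non-critical action runs without asking, even though this user would say no.
status_scroll = run_action("scroll", performed.append, lambda prompt: False)
# A critical action declined by the user is never performed.
status_declined = run_action("book_table", performed.append, lambda prompt: False)
# A critical action approved by the user goes ahead.
status_approved = run_action("book_table", performed.append, lambda prompt: True)
```

Passing `confirm` in as a callback keeps the sketch testable; in a real product that callback would be whatever surfaces the question to the user, such as the notification mentioned earlier in the demo.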

Speaker 2: Oh my god, how can they make a mistake at some point and start reserving weird things? It's not available anymore. For taking so long.

Speaker 3: For taking so long.

Speaker 2: Well, let's do something else. I don't like this.

Speaker 3: Let's see. It's really cool. It's great. The truth is that it's really, really cool.

Speaker 2: Oh, yes, please. Don't tell me that this makes you buy. Let's go.

Speaker 3: Oh, my God. And I'll also specify the store I like, which is... Oh, let's see if it figures out I mistyped it, but let's see. Okay, so in this case, again, Operator quickly recognized, using GPT-4's vision capability, that the image said egg, spinach, mushroom, chicken thighs, and it actually knew Gus's Market, and I'm like, yes, that sounds great.

Speaker 2: Of course, I imagine you have to have pretty good integrations with the products. I mean, you have to be prepared for Instacart.

Speaker 3: Again, just like OpenTable, it instantiated a browser and it's going to go ahead and...

Speaker 2: Of course, you can give it a take control to go to the website and follow the steps if you want.

Speaker 1: So in both of these cases, you've said what you wanted to use.

Speaker 3: If you just say, buy me these groceries and don't specify Instacart, what happens? It will do a search, use a search engine, much like we do, and it'll find, you know, Instacart or Gus's website directly or whatever else is on the search engine. It'll do that, ask you questions if it needs clarifications, and go from there. I'm curious what's happening here, though, right? Do you want to tell us a little bit about it?

Speaker 2: Well, now that you've seen a bit of Operator,

Speaker 4: let me talk a little about the research behind it. So Operator is based on the new model we've trained at OpenAI, which we're calling the Computer-Using Agent model, or CUA for short. So CUA is a model built off of GPT-4o, but it's also trained to use and control a computer in the same way that humans can, right? So it gets a screen, and a mouse and keyboard to control it. Before, if you wanted to build something like Operator without CUA, you'd need to use some specialized APIs. For example, if you wanted your model to buy stuff from Instacart, you'd need to figure out if Instacart had an API. You'd need to figure out if that API had all the functions that it needed. And you'd need to give your model the specs of that API. But if your site, like most other websites, doesn't have an API, then you're out of luck. This is just using screenshots. No API, nothing. It's just quality. Yes.

Speaker 2: And that's where CUA comes in.

Speaker 4: By teaching a model how to use the same basic interface that we use on a daily basis, it just unlocks a whole new range of software that was previously inaccessible. So this is keyboard and mouse, right? It's kind of using keyboard and mouse just the way it would. Exactly, yes. And that's really what the cool research project is about. It's about removing one more bottleneck

Speaker 2: from the entire page that we have.

Speaker 4: So, let's make that a little bit more concrete by looking at this task and seeing exactly how Operator is using a computer. So, it looks like it's already done, but let's go back to the operator. Okay, so, the first thing that CUA does when it controls the computer is it looks at the screenshot. So, now you're seeing the search results page for eggs in Instacart. So CUA understands this. It's just seeing the raw pixels. And after CUA sees this image, it decides what to do next. So right now, it's producing some inner monologue. And this is the summarized chain of thought. So what CUA is doing is, according to it, it's selecting organic eggs and adding them to the cart, which is a reasonable thing to do. So after it makes this plan, it then figures out what the next action it should take is. So let's see what it does in the next step. OK, so you see that it performed a click on this Add button right here. So that's very reasonable. Now, every time it takes an action, it takes the next screenshot of the computer so that it knows what effect its action had on the computer. So let's see what happens next. Yep, OK. So after clicking on the Add button, now you see it in the cart. And this just kind of keeps continuing. Let's see what it does next.
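
Editor's note: the cycle walked through here (look at a screenshot, decide on a plan, take one action, then take a fresh screenshot) can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the actual model or its API. `FakeBrowser`, `choose_action`, and `run_agent` are hypothetical names, the "screenshot" is a small dictionary rather than raw pixels, and the hand-written policy stands in for the model's reasoning.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Action:
    kind: str          # "click" or "done" in this sketch
    target: str = ""   # the item whose Add button is clicked


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for the remote browser; its only state is a cart."""
    cart: List[str] = field(default_factory=list)

    def screenshot(self) -> Dict[str, List[str]]:
        # A real agent would receive raw pixels; a dict keeps the sketch runnable.
        return {"cart": list(self.cart)}

    def apply(self, action: Action) -> None:
        if action.kind == "click":
            self.cart.append(action.target)


def choose_action(observation: Dict[str, List[str]], shopping_list: List[str]) -> Action:
    """Hypothetical policy: add the first listed item that is not yet in the cart."""
    for item in shopping_list:
        if item not in observation["cart"]:
            return Action("click", item)
    return Action("done")


def run_agent(browser: FakeBrowser, shopping_list: List[str], max_steps: int = 20) -> List[str]:
    """Observe, decide, act, and repeat until the policy reports it is done."""
    trace = []
    for _ in range(max_steps):
        observation = browser.screenshot()           # look at the current screen
        action = choose_action(observation, shopping_list)  # decide the next step
        trace.append(action.kind)
        if action.kind == "done":
            break
        browser.apply(action)                        # act; the next loop re-observes
    return trace


browser = FakeBrowser()
trace = run_agent(browser, ["eggs", "spinach"])
```

The point of the structure is that the agent never assumes its action worked: each pass re-reads the screen before choosing the next step, which matches the description of taking a new screenshot after every action.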

Speaker 2: OK, so it creates the next subplan, which is adding content at a certain percentage. You can see where the click has gone. You can see how it's been moving. That's pretty terrible.

Speaker 4: Pretty surprising.

Speaker 2: Every time it changes, it takes a screenshot and that's it. In fact, this reminds me of a project I'm going to show you now.

Speaker 3: take this button called take control. So, as we were talking about like operator fires up this remote browser to do it, we almost think of it as surface area where operator can work and I can work. For example, in this case, I took over control from operator, which is also key to sort of how we think about user and user controls. Like at any point in time. Like passing the laptop back and forth, just like you did with Ray.

Speaker 2: Totally, totally. Exactly like that. So, to the last step, yeah.

Speaker 3: But, it's really good. It's sort of like if you and I were working together,

Speaker 2: And now you can tell Operator what changed. You have to say, ah, this has changed. Okay, okay. It was cool, can you imagine? Okay, so you add a new one and you say, okay, now you can do the place order. Let's see, I like it, but on the other hand, it's a little weird because you say, it doesn't seem like you could... I mean, it's saving a lot of time, but obviously if you leave it in an autonomous way that searches for you in more than one source, then it can be interesting for it to do it for you. But it doesn't end up... How does this work? How does having the session started work?

Speaker 3: And we are going to... Should we try to do a few more things? Yeah. The Lakers are in town this weekend.

Speaker 2: The Lakers are in town, definitely. I don't know if you would feel safe putting there the data that makes screen capture and everything, I don't know. Three tickets for the Warriors game, not the Lakers. But of course, they are all very specific websites, right? I get the feeling. But I don't know, I don't know if ... not even them, it's just that it has been very strange, because not even them have trusted to put it, right? We have a lot of apps, but we have a lot of apps in a different category. This will improve a little bit, so it's stubborn targeting, but also operating with a lot of locals. But also operating with a lot of locals, you can use it pretty much, you know,

Speaker 3: operating with any website.

Speaker 2: Oops.

Speaker 3: Oh, what happened? Oh, no. Oh, I got squashed. Let's see. Let's try to fix it. This is a good example of, you know, sometimes things happen.

Speaker 2: Your organization doesn't allow you to see this site.

Speaker 3: We put a protection in place where we only allow Operator to visit HTTPS sites. And somehow, I think a redirect must be happening where, OK.
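
Editor's note: assuming the protection mentioned here restricts navigation to HTTPS URLs, a redirect to a plain HTTP address would trip it, which is consistent with the speaker's guess. The sketch below shows one way such a check could look. The function names and the example URLs are hypothetical, and nothing here is taken from Operator's actual code.

```python
from typing import Optional, Sequence
from urllib.parse import urlsplit


def is_https(url: str) -> bool:
    """Return True only when the URL's scheme is https."""
    return urlsplit(url).scheme == "https"


def first_blocked(redirect_chain: Sequence[str]) -> Optional[str]:
    """Return the first URL in a redirect chain that would be blocked, if any."""
    for url in redirect_chain:
        if not is_https(url):
            return url
    return None


# Hypothetical chain: the first hop is fine, the redirect target is plain HTTP.
chain = [
    "https://tickets.example.com/game",
    "http://tickets.example.com/checkout",
]
blocked = first_blocked(chain)
```

Checking every hop of the chain, rather than only the URL the agent started from, is what makes a mid-task redirect surface as a block.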

Speaker 2: Let's keep going. Let's see. Someone is going to be fired. Someone is going to be fired. Again, as we have talked about, it's Look at that bad boy's face. Well, well... Someone is going to be fired later.

Speaker 3: Oh my God.

Speaker 2: Damn. That's weird, isn't it? The demonstration is so unprepared, isn't it? The demonstration is so unprepared. I mean, it looks like a very straightforward demo. Don't you think the demonstration is very weird? It doesn't seem like a product that is very, very, very finished yet, right?

Speaker 3: And, lastly, we've all been working really hard to bring this to you. The whole team. The whole team. We have a big crew here. We're getting hungry. I didn't have breakfast. I kind of want pizza, even though it's weird for breakfast, but that's okay. And so I'm going to go ahead and order some pizzas.

Speaker 2: But he's not doing the demos. He's not doing them. That's weird. Yeah, 10 pizzas. Okay, let's see the pizzas. It's like... Do it again. Sam is not happy at all. Yeah, yeah, yeah. Sam is saying...

Speaker 3: Okay, let's see.

Speaker 2: DoorDash. I love that you're talking to it, just like you would a human.

Speaker 3: I'm thinking in a monologue, and then I'm typing it out. Also, one thing I call out, I think, OK, so it's just asking me to confirm, basically, what I said in a much better way. Yes.

Speaker 2: If that's what I just asked you, of course I'm sure. Let's see.

Speaker 3: I see the notifications popping up on the live stream. But, for example, I know that tasks are going wrong. I need assistance. For example, in this case, it asks me, hey, is 941100. And it's saying that I want to get notifications, et cetera, so that whenever an operator needs help, we can go back and help. Looks like, in this case, it's all ready for me.

Speaker 2: So it's like it wants to show that you can do more than one thing at a time, right?

Speaker 3: OK, well, we have some selection to make.

Speaker 2: But it's like it's jumping between different ones, it's a little bit like... It's very weird, man. It's like... Now, in addition to tabs, browser tabs, we're going to have operator tabs, man. We're going to have a lot of places operating at the same time and telling us which places it has to do.

Speaker 5: That's what it loads.

Speaker 2: At this point, it should have already done it. I ordered it manually, totally. It reminds me of my presentations on prototypes.

Speaker 5: Yeah, so I think we're all very excited about this vision of Operator doing your chores for you, but I don't know, what do you think of the Operator? Putting out in the world and which has real world side effects. And so we thought carefully about how to deploy this safely

Speaker 2: The framework we use to think about this is a product that doesn't work. So what if the user is not aligned? What if the user is not aligned? For example, what if the user is asking the Operator to do something illegal?

Speaker 5: In that case, fortunately, we've done a lot of work with ChatGPT, and we reuse a lot of the same mitigations. So, for example, if you're using the wrong one, including harmful watch mode, we have moderation model, so that the web isn't blocked websites.

Speaker 2: And, no, I'm not bragging about mitigation,

Speaker 5: but that's really how we think about it. It's a stack of mitigations that each, incrementally,

Speaker 2: reduce the risk to the point where it's all over. Can you imagine? It's true, if he's doing a bot... Perfect example of over-engineering. The truth is that one thing that doesn't convince me is that I haven't seen a successful case yet, you know, from beginning to end, which would have been amazing. They have left every one of them halfway. It's kind of weird, you know, because you say, damn, finish one, at least, right? Do one until the end, because it's a bit strange, everything is left halfway. Yes, you can search for this and that, but there has not been a successful case where you say, wow, Operator has done it perfectly, as I wanted, and it has taken all my work. This is a trash can wherever you look at it. Although it will work, I'm not sure this is useful. You have to wait for what Elon Musk says about this. I don't know how you log in. I imagine you would have done it previously, so you can enter that browser, log in, and that's it, right? I still see it as a bit immature.

Speaker 3: We are starting small, right? Yeah, yeah, yeah. So it looks like tickets are ready to be purchased. Yes, please.

Speaker 2: OK. And now the purchase. This is good.

Speaker 3: And it changes to another one. But I'm going to close it for now. Oh, just once, please.

Speaker 2: So it's changing between different ones. And it looks like we're adding new users. With voice commands, eventually, well, the voice command is already there, right?

Speaker 3: Oh, cool. I am going to go ahead and log in here really quickly. So this is an example, right, like where I obviously need to log in or enter my credentials to actually purchase these tickets. And the operator just asks, you just describe with confirmations and making sure the control is in the right place. And we take control. And at this point, as we talked about earlier,

Speaker 2: What's amazing is that you actually have the browser that is virtual. Oops, I'm counting backwards here with seven minutes. Okay. Come on, how do I ask him? He has to enter his email.

Speaker 3: Now, again, I can sort of continue the purchase here, or I can ask the operator to do it, but I am going to go ahead and just quickly do the purchase.

Speaker 2: I saw, I can click.

Speaker 6: Maybe I don't want to show that live.

Speaker 3: Yeah, maybe. Well, let's see. I kind of want to buy the tickets. Okay, whoops, all right, done. I'm gonna cancel this card.

Speaker 2: Well, at least he just spent $1,000. There you go, $1,000. I'm all set. That's it, he wanted $1,000.

Speaker 3: Thank you for the help.

Speaker 2: What are you going to thank him for?

Speaker 4: How reliable is this guy? Well, at least he made the purchase. But again, we want to remind you that Operator is a research preview. It will make mistakes, it's not perfect. That means we can look at a research mark,

Speaker 2: a kind of quantified version of all the benchmarks that we're going to look at, OSWorld. OSWorld

Speaker 4: is an eval that measures how well AI agents navigate on common operating systems like Linux. On this task, CUA gets a 38.1% score, which is higher than other publicly published results. Human performance in this task is 72.4%, so we still have a lot of work to do. The other eval that we look at is called WebArena. WebArena is an eval that measures how well AI agents navigate some common websites, like e-commerce websites or social

Speaker 2: blog websites. So on this task, what we're seeing is that we're still just giving it the same But I don't know to what extent it was the moment.

Speaker 4: It's like an interesting idea, but I don't know.

Speaker 2: It really hasn't improved either. Of course, when we do it in parallel and you can do more than one thing at a time, or maybe people who find it difficult, who can't navigate, who find it difficult, like your parents, maybe it can be interesting for them to use it. But I would have preferred to use the website and buy the tickets directly, I think it's more interesting to give 4 or 5 clicks than to write, that's me, obviously. Here everyone can have their preferences. But I would have preferred to make all the necessary clicks instead of having to write what I want, I don't know what, I don't know how much, in my opinion. And with audio, it may even be that he also prefers clicks. I'm not saying that tomorrow if it's really well integrated and you can tell Alexa or an entity to do this, I don't know, to make it really totally autonomous, but of course,

Speaker 1: as it is not autonomous and you have to review everything, it is speaking very low. This is really the beginning of this product, this is the beginning of our step into agents, level 3, on our tiers, and we can't wait to see how people are going to use this and to kind of work with us to figure out where it should go, so I can't wait for you to enjoy it. Thank you very much.

Speaker 2: They are testing and that's fine. After all, ChatGPT was also a test. ChatGPT, which we thought was a star product and such, in the end was research that OpenAI did, like, hey, what if we try to do this? It was also a test like this one, and it went super well. Not everything is always going to go well, but it may be that they learn from how people use it, from people's feedback, and that seems very interesting to me. But it's true that somehow, maybe because we are spoiled, which could also be the case, it does not end up giving me the feeling that it was well prepared, well aligned, that it was really what everyone expects. I don't know. That's my opinion. Maybe it's what you say, that they have done it just to try to make people forget about DeepSeek. I have no idea. I expected it to be better, honestly. I mean, I expected much, much more and the hype has dropped suddenly, honestly. I think they rushed because of DeepSeek, maybe. Now they are shouting at someone. You can hear the shouts from here.
