Explore New AI Tools: StageHand, Gemini & More

Discover StageHand's agentic web browsing, Google Gemini's coding prowess, and Exponent's no-code AI tools in this week's AI roundup.

Automate Browsing and Coding with NEW AI Tools

Added on 05/08/2025

[00:00:00] Speaker 1: Hello, I'm Mike with this week's new and trending AI tools you cannot afford to miss. Coming up, we'll look at a browser operator with a difference, we'll explore the equivalent of Claude Code for Gemini 2.5 Pro, and we'll take a quick look at an AI coding tool that is changing the way things are done from the norms of Cursor, Windsurf, and VS Code. Also, a quick shout out to all of those of you making Studio Ghibli-style photos using ChatGPT 4o image generation, and I also had a little play with Runway Gen-4 to animate those images. It's pretty cool. Okay, let's hop straight into it.
StageHand released version 2. What on earth is StageHand? Well, it's something that allows you to agentically browse the web with multi-step workflows. So, we've heard a lot about OpenAI's Operator and Claude's computer control and things like that. Those have all been available for a while, but StageHand v2 aims to make this agentic, and what's more, you can use it and test it in the browser without writing a single line of code. Let me show you how. All you need to do is navigate to this website to take StageHand, which is behind the scenes of Open Operator, for a spin. You'll see, there you go, it's powered by Browserbase, and it gives you some sample queries here. These are all simple web browser-based activities. I'll link this tool along with everything else down below, of course, and I'll throw in an easy one to start. Find the best restaurants in Las Vegas for someone on the keto diet. Let's run. You can see exactly what happens here with our Open Operator powered by StageHand v2. Now, note here, I didn't need to log in or do anything. This is all happening in front of my eyes. It's completely free at the moment. I can't believe this. Now, you can see here straight away, it's already making a Google search. It also gives me the reasoning behind that search. It's a comprehensive search engine that can provide a wide range.
And now we can see it's screenshotted the Google search results. It's made it past the cookies, and now it's actually browsing all of the search results from Google there, in a very natural and human-like way. Now, you can see this is a multi-step process, and it's saying that it's actually going to click the first restaurant link to get more details. And of course, it will keep doing this until it has a full comprehensive report for me. That is the power of StageHand v2 with its agentic process.
Now, while we wait for this to complete, we can actually go ahead and build our own web browsing agent using code. And it's really easy to get started. This is why I like StageHand and wanted to introduce you to it, because really, it's actually just one command to get started, this command right here. So let's crack open Cursor and give StageHand v2 agents a quick go. We'll come back to the keto investigation later on. Okay, let's run that command. Here, I'm going through the setup, and of course, I can choose which AI model I'd like to use to do my web browsing. I think I'm going to stick with GPT-4o mini for this example. Enter my OpenAI API key, and I'm using Cursor. Now, I've also got the ability to choose to run locally on my own Chrome browser or on Browserbase, which is what you saw me demonstrate earlier, with 60 minutes free included. I'll run this locally for these examples. And there we go, it's all installed, and it's even given me the three commands to get this up and running. Okay, so I've opened this up in Cursor now, and the coolest thing you'll see about StageHand is it actually creates Cursor rules here, which is brilliant because that means when I prompt Cursor to make something, it will have the context of exactly how to write the code specifically for StageHand. So let's open the index file. As it said, this is where all the code happens. And with that, there's context here in my agent chat.
I'll now prompt Cursor to go to tradingview.com chart and find all the ratings for popular cryptos, report back on the status of each as to whether I should buy, hold, or sell, please. Let's send that to our agent and see what it comes back with. Now, it's installing all the browsers for me here. Boom, we've got some Chromium going on up here. Okay, and it's browsing the crypto information here on TradingView. Look at this, I'm not scrolling, it's scrolling for me. And it's come back immediately with a crypto analysis. And look, right here in Cursor, it's giving me the information on Bitcoin, whether to sell. I should sell, I should sell, I should sell everything, it seems. Sell all your crypto. That's what my latest agent has told me to do using StageHand v2. Pretty cool stuff, right? Meanwhile, I can see over here, my agent is browsing the menu and ingredients of a local place in Las Vegas with keto information.
Okay, now you've seen how easy it is: you basically create the StageHand project in Cursor, and then you prompt in natural language what you want the browsing agent to do. Let's try again. Make a new StageHand browser agent to research popular national parks and scenic areas within a two-hour drive of Las Vegas. Find the popular spots to visit and best parking areas for incredible views. Okay, here we are, immediately it's visited the National Park Service Death Valley website. It's actually looking through the website. This is cool, this is not me scrolling. Now, it's found a website for the Valley of Fire, and you can see here, it's outputted everything. It's returning nice JSON, and Cursor's actually summarizing this for me: Red Rock Canyon is just 30 minutes from Vegas, an hour from Vegas is the Valley of Fire State Park, and Death Valley National Park is two hours from Las Vegas, and it even gives some information there. Pretty cool stuff: summaries, park fees, everything I need to explore the surroundings of Las Vegas while I'm there.
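
As a rough illustration of working with JSON like the agent returned, here is a minimal Python sketch that filters the parks by drive time. The JSON shape and the field names (`name`, `drive_minutes`) are assumptions made for this example, not StageHand's actual output format; the three drive times are the ones quoted in the summary above.

```python
import json

# Hypothetical agent output. The field names are assumptions; the drive
# times are the ones quoted in the summary above.
AGENT_OUTPUT = """
[
  {"name": "Death Valley National Park", "drive_minutes": 120},
  {"name": "Red Rock Canyon", "drive_minutes": 30},
  {"name": "Valley of Fire State Park", "drive_minutes": 60}
]
"""


def parks_within(raw_json: str, max_minutes: int) -> list:
    """Return the names of parks reachable within max_minutes, nearest first."""
    parks = json.loads(raw_json)
    nearby = [p for p in parks if p["drive_minutes"] <= max_minutes]
    nearby.sort(key=lambda p: p["drive_minutes"])
    return [p["name"] for p in nearby]


# Print every park inside the two-hour limit from the prompt.
for park_name in parks_within(AGENT_OUTPUT, max_minutes=120):
    print(park_name)
```

Keeping the agent's output as structured JSON, rather than only the prose summary, is what makes this kind of follow-up filtering straightforward.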
Oh, and by the way, you're probably wondering how Browserbase Open Operator did looking for my keto restaurants. Well, despite doing a lot of comprehensive research, it got stuck after step 28, where it found Starburst Parlor Keto Bakery and Coffee Parlor. It just disconnected. I couldn't get any further. I did try and run it a few extra times, but no joy.
Next up, the product manager of Google's NotebookLM, an epic product, Raiza Martin, has actually created the equivalent of Claude Code for Gemini 2.5 Pro. Let's give it a test. Now, note that Gemini 2.5 Pro is pretty much the best coding model available right now, and it's free, which is incredible. So, let's give it a try. So, here I am at the instructions for installing Gemini Code, and I'm just going to follow step by step so that you can do the same thing. And to install the package, it really is as simple as pip install gemini-code. Next up, you're going to need an API key. Really simple to get in Google's AI Studio. Just go to get an API key and create one, paste it in, and boom. Then we simply launch it by typing in gemini, and look at that. It really is as simple as that. Let's start coding from the command line for free using Google's Gemini 2.5 Pro.
Now, I've heard Gemini 2.5 Pro is good at making games. We tried this last week, but let's ask it to make a game in Three.js that is a first-person shooter like the classic Doom game and see what happens. Okay, the assistant has been thinking things through, and it's proposing to make the index for the Doom clone. I'll approve it. And along the way, it's making CSS styles and all that good stuff. Okay, Gemini 2.5 Pro via Gemini Code seems to have completed. Just change directory to the Doom clone, run the Python server, and visit your web browser. And here is our Doom clone. We can move with WASD, we can jump with space, and look around with the mouse. This is nothing like Doom with monsters, but I am going amongst blocks in 3D space, and I can indeed jump.
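
The "run the Python server" step refers to serving the generated files over HTTP so the browser can load them. A minimal sketch of that step using only Python's standard library follows; the helper name `serve_directory` and the folder name in the usage comment are hypothetical, and the exact command Gemini Code suggested is not shown in the video.

```python
import functools
import http.server
import threading


def serve_directory(directory: str, port: int = 8000) -> http.server.ThreadingHTTPServer:
    """Serve static files from `directory` on localhost in a background thread."""
    # SimpleHTTPRequestHandler accepts a `directory` keyword (Python 3.7+).
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Example use ("doom-clone" is a hypothetical folder name):
#   server = serve_directory("doom-clone")
#   ...open http://127.0.0.1:8000 in a browser, then call server.shutdown()
```

Serving over HTTP rather than opening the HTML file directly matters for browser games because script modules and assets are often blocked when loaded from a `file://` URL.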
I mean, it's a start, isn't it? It's definitely a start. Let's just say in natural language here, add some monsters. And here we are. Well, the monsters seem to be static red cylinders. It's shooting when I click on the mouse. I mean, this is a basic system. It's a quick start. And the more that we prompt with Gemini Code, the more we can make. Make the monsters scary, voxel style. Let's restart our Doom clone. Now it's gone a bit darker. It's like it's nighttime, and all these crazy voxel-style monsters are coming towards me. I can, whoa, I can move really fast now. Like, this is super fast. Everything is sped up. Gemini Code, it's a command line tool. It's easy to install. The product manager of NotebookLM from Google created it for you and me to try. Gemini 2.5 Pro, not only being a very powerful coding model, is also free to use right now. Remember, if you're enjoying this, throw a like and subscribe for more videos just like this one.
OK, finally, a company called Exponent are looking to change the way we code using no-code AI tools. So we're used to IDEs such as Cursor, Windsurf, VS Code. But they want to make an assistant that works wherever you are and acts more as a collaborator rather than you sitting there and letting it work on autopilot. You'll see what I mean. This is new. I just got an invite to Early Access, so let's test it out. So here is Exponent, and I can create my account. This is in exclusive Early Access right now, and I'm going to give it a test. So Exponent claims to be my new AI programming agent. Let's continue. I've got some free credits. That's great. And you can see all of the features, such as running locally, confirming actions, running up in the cloud, autoconfirm. So that's a bit like YOLO mode from Cursor. OK, it wants me to go all command line, so I've popped up my terminal, and I'm going to run these commands. That was quick. Exponent is installed successfully. Let's connect with my secret key here.
We've got a success message, and now we need to navigate to our project and run Exponent. All seems good so far, so I'm going to navigate to my project. Let's make a project here and run Exponent. Boom, immediately it fires up my browser, and I'm connected to my local project. This is great. I'm calling it YT Audio, as I want to do some cool stuff here. So it's connected. It's got a working directory, which is local. But note, I am coding inside my browser. I'm using Exponent, which really means I can work with anything, whether it's in the cloud, whether it's on my computer, whether it's on a server I'm running. This is cool. So I'm just going to make it a bit bigger so we can see what's going on. This is the prompt box here. I've got thinking on, so let's just use it exactly as it comes and see how it works to create an app.
Make a web-based transcription tool that uses GPT-4o Transcribe, the speech-to-text model from OpenAI. The basic app should allow you to upload an audio file and get a plain text transcript back. And just to make sure Exponent has a fighting chance here, I'm going to paste in the docs from OpenAI's page and hit enter. So Exponent is thinking and working away inside my browser now, and I can already see inside the thoughts it wants to create files. Now it's asking if I want to run these commands. I can confirm with Command-Y on my keyboard. Now it's writing the code for me. So as it's going through here, all I'm really doing is hitting Command-Y to confirm along the way. Okay, it seems to be wrapping up here, telling me the files it created. To use the application, it's telling me to put in an OpenAI API key, then start the application, and then visit the web browser. Okay, let's try that.
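
For a sense of what the core of such a tool does, here is a minimal Python sketch of the request described in the prompt: upload an audio file, get plain text back. It is not the code Exponent generated (that app is started with npm and is not shown). The helper names are hypothetical, the extension list is an assumption based on OpenAI's speech-to-text guide and should be checked against the current docs, and the call assumes the official `openai` Python package with `OPENAI_API_KEY` set in the environment.

```python
from pathlib import Path
from typing import Optional

# Assumed list of accepted upload formats; verify against OpenAI's current docs.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension is in the assumed supported list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def transcribe(path: str, prompt: Optional[str] = None) -> str:
    """Send an audio file to gpt-4o-transcribe and return the transcript text."""
    if not is_supported_audio(path):
        raise ValueError(f"Unsupported audio format: {path}")

    # Imported lazily so the extension check works without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    options = {"model": "gpt-4o-transcribe"}
    if prompt:
        # Optional hint text, e.g. the spelling of product names in the audio.
        options["prompt"] = prompt
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(file=audio_file, **options)
    return result.text
```

Checking the extension before uploading gives a clear local error instead of a failed API call, and the optional `prompt` is a place to pass context such as product names the model might otherwise misspell.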
And boom, we seem to have a fully operating audio transcription tool using OpenAI's GPT-4o Transcribe, the latest model, which actually came out on top in a video I did just a couple of days ago, benchmarking all the latest transcription AI models. I've chosen an audio file, which is actually the audio from the video I did earlier this week. Let's click Transcribe, and we hit our first error. So let's go back to Exponent and try and fix it. Error when trying to upload compatible files like MP3, please fix. Just pasting in the error here and seeing how Exponent handles this. Okay, great. It didn't take too long to make that change. Let's see if we can get it working by hitting Control-C and then npm start again. Okay, not only has it fixed things, but I can see it's actually added some more options here, which is really cool. So with my audio file attached, I'll select any language and the model GPT-4o Transcribe, high quality. It's also got the Mini and the Whisper versions included. And I can also add an optional prompt to improve transcription accuracy. This is cool. Let's process the audio. And within moments, look at this. We've got our transcript of the video back. It's even getting things like Gemini 2.5 Pro Experimental correct. Google AI Studio nicely capitalized.
I'm noticing that here it says "ChatGPT 4.0". That's where it could be good to give a prompt: this is a video about transcription accuracy using GPT-4o Transcribe and other models. Now, if we reprocess the audio with that prompt, not only do we get a nice transcript back, but this time it's written GPT-4o Transcribe correctly all the way through the script, which is amazing. And I think it's one of the great features of GPT-4o Transcribe itself, a really new speech-to-text model that I'm loving. It's so accurate on my videos. If we go into my account up here and look at the settings, you'll see I'm on the trial. I've used 18 of 100 credits.
The price for full access (of course, I'm paying nothing right now) is $50 a month, and it's pretty slick. It runs in the web browser. It corrects things. It thinks for you. I don't know what AI model it's using under the hood, but I really like this idea of running something locally on my machine, but having it work away inside a web browser that I could run remotely if I like on another computer. So there you go. Pretty cool stuff for this week. Thank you so much for watching. I really appreciate you being here. Join my community. It's linked down below. And YouTube is showing a video on your screen right now that you should watch next. Thanks.