Unveiling the Latest AI Breakthroughs and Innovations of the Week
Explore groundbreaking AI tools including top 3D model generators, cutting-edge video editing AIs, and new AI agents surpassing industry giants.
New top AI image model, DeepSeek R1, full AI agents, top 3D generator, precise video control
Added on 01/29/2025

Speaker 1: AI never sleeps and this week has been absolutely insane. We have a new AI where you can drag and move anything in a video. We have another AI that directly edits 3D videos. We have not one but two AI agents that actually work. You can use them to reply to emails, reserve a table, book flights, and handle other workflows. We have a new image generator that tops the charts. We have a new 3D model generator which is also ranked number one. We have two new models that even beat OpenAI's o1, and a lot more. So let's jump right in.

First up we have a new top 3D model generator called Hunyuan3D 2. Wait a minute, this name sounds familiar. Isn't Hunyuan the top open-source video generator by Tencent? Well yes, yes it is, and it turns out that they've also released a 3D model generator which is insanely good. So they've just released Hunyuan3D 2, which is an upgrade from their previous version, and this allows you to create 3D models from just a text prompt. Or you can also upload an image and get it to generate a 3D model from that. And here's how it works. First you feed it a text description or an image, and it uses a diffusion transformer to generate a 3D shape from your input. Then in the next step it creates the texture for this 3D shape. And finally it combines both the shape and the texture together to create a complete 3D model. Now because it generates the shape and texture separately, you can then use this model to apply different textures to the same base shape. So for example, here is one texture. Let's try this texture. Notice how it overlays a different texture on the same teapot shape. Here's another one. So a very flexible tool. Here's another example of this boot. Let's try this brown leather texture, and this is what we get. Let's try this one, and this is the result. Super cool feature. Now there's actually a leaderboard for AI 3D model generators where people can blind test different models. And from all of these blind tests, note that Hunyuan3D 2 is currently ranked at number one, even better than Microsoft's Trellis, which is already really good. I featured that tool in a previous video. So really impressive stuff.

Now they have a free Hugging Face Space for you to test this out. It's pretty simple to use. Here is where you can select to either upload an image and generate a model, or enter a text prompt and then generate a model from that. So let's just try a simple prompt, a lovely rabbit eating carrots, and then let's generate both the shape and texture. Let's see what that gives us. All right, so here's the output. Note that it creates a mesh, or 3D shape, for you, and it creates the texture as well. So first of all, here's the 3D shape, and look how detailed and consistent this looks. This indeed looks like a lovely rabbit eating carrots. So that's just the shape. If you click here, here's the shape with the texture. And again, look how detailed and consistent this is. It seems from an initial test that this is slightly better than Microsoft's Trellis. Very impressive. Now instead of a text prompt, I also tried uploading an image. So I just uploaded this test image from the selection and generated a model from that. Here's the resulting shape. Again, very detailed, and it is aligned with this input image. And even though we don't see the back of this character, it's able to estimate what that would look like and generate it really well. And here is the complete 3D model. Very, very impressive.
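Before moving on to a trickier test, here's roughly what that two-stage shape-then-texture flow looks like in code for the image route. This is only a minimal sketch based on my reading of the Hunyuan3D 2 repo: the module and class names below (hy3dgen.shapegen, Hunyuan3DDiTFlowMatchingPipeline, Hunyuan3DPaintPipeline) and the input file name are assumptions that may not match the current release exactly, so treat it as illustrative rather than copy-paste ready.

```python
# Minimal sketch of Hunyuan3D 2's two-stage pipeline: generate a bare mesh first,
# then paint a texture onto it. Class and module names are assumptions taken from
# the repo's README and may differ in the current release.
from hy3dgen.shapegen import Hunyuan3DDiTFlowMatchingPipeline  # shape (mesh) stage
from hy3dgen.texgen import Hunyuan3DPaintPipeline              # texture stage

# Stage 1: image -> untextured 3D mesh via the diffusion transformer.
shape_pipe = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained("tencent/Hunyuan3D-2")
mesh = shape_pipe(image="rabbit_eating_carrots.png")[0]  # placeholder input image

# Stage 2: paint a texture onto that mesh, conditioned on the same reference image.
# Because this is a separate pass, you can rerun it with a different reference
# to re-texture the same base shape (the teapot and boot demos above).
paint_pipe = Hunyuan3DPaintPipeline.from_pretrained("tencent/Hunyuan3D-2")
textured_mesh = paint_pipe(mesh, image="rabbit_eating_carrots.png")

textured_mesh.export("rabbit.glb")  # assuming the returned mesh object exposes an export method
```

The separation in stage two is what makes the texture-swapping demos possible: the base geometry stays fixed while only the paint pass changes.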
Now let's try something even trickier. So next I've uploaded this image of a Gundam, and this is a really tricky and detailed image. But as you can see, it kind of pulled it off. So here is the 3D shape. Very impressive how it's able to get everything roughly correct from just one flat image. And it even extrapolates what the back of this Gundam would look like. Very impressive. And the awesome thing is this is open source and they've already released it. So here's their GitHub repo, and if you scroll down a bit, it contains all the instructions on how to download this. Plus, there is a plan to add a ComfyUI integration. Note that the models range from about 1.3 billion to 2.6 billion parameters, so this is relatively small compared to large language models, and you can easily run it on a mid-tier GPU. Anyways, I will link to this page for you to read further.

Next up, Netflix. Yes, you heard it, Netflix, which isn't too prominent in the AI space compared to other tech giants. But they've actually released a really cool AI this week. It's called Go-with-the-Flow, and this AI allows you to do a lot of things with a video. For example, you can do what is called cut and drag. You can basically select whatever you want. Here on the left, let's say you select the two faces of the cat and then you drag them somewhere else. What this AI does is it then creates a new video incorporating this motion. So here you can see it makes the left cat yawn because we're dragging its face upwards. And then it makes the right cat turn its head because we dragged its face to the side. Or here's another example. You can also select this character and then make her smaller as the video zooms in, and again this AI would incorporate this new motion and give you this really cool zoom-in effect. So this gives you really granular control over how you want things to move in a video. Here are some more examples. You can mask these two sunflowers and then drag them around, and the AI would generate something like this. Or here's another example. You can mask these two stuffed animals and drag them across the table, and this is what you get. Or here's another creative example. You can select this person's hand holding a pen and then drag it to the lower right, and it would generate a video of this person drawing something on this piece of paper. Or here's another example. You can take this rubber duck and drag it around, and here's the final result. And if you compare its cut-and-drag animations with other competitors like MotionClone or DragAnything, which I featured on my channel before, note that this new tool by Netflix is a lot more consistent and accurate. But that's not all it can do. You can also transfer the motion from one video onto a new video like this. Or you can take a 3D object and then, with a prompt, transform it into whatever you want. So here we are generating a squirrel, but note that it's moving according to how we moved the 3D model. We can also just take the entire image and drag it around, and again it would generate a video that matches the motion of however you dragged it. So this is basically like controlling the camera of the video. How insane is that? It can also do something called first frame editing. This is where you have an original video, and you take the first frame of that video and turn it into something else.
For example, you can edit the stuff on top of this cake into some flowers, and then if you plug that one frame back into this AI, it would generate the entire video with this new frame while copying the original video's motion. Here's another example. If this is the original video, well, you can take the first frame of that video and Photoshop a lighthouse into it, and once you plug it into this AI, it would generate a full video of this lighthouse while copying the motions of the original video. Or here's another example. Let's say the original video is this laptop. Well, you can take the first frame, Photoshop a book onto this laptop, plug it into this AI, and it would generate the full video but with this book on the laptop. A very powerful tool. Now this AI actually uses a pretty interesting technique called warped noise to control the movements of objects in the video. It basically takes a special kind of noise and warps it to match the movements of objects in the video, and this allows it to generate videos that are smoother and more consistent. Anyways, if you scroll up to the top, they've already released a GitHub repo, and if you scroll to the bottom, it already contains all the instructions on how to install this and run it locally on your computer. Plus there are also plans for them to release a Google Colab option for people without GPUs, as well as a ComfyUI integration. Anyways, the link to the GitHub and some additional examples as well as the technical paper are all up here, so I will link to this main page in the description below for you to read further.

Now similar to Go-with-the-Flow, this next tool can also edit videos, but not just any video. This tool is called DreamCatalyst, and it allows you to edit or replace any object in a NeRF video. Now these aren't regular videos. NeRF stands for Neural Radiance Field, and it's basically a 3D scene reconstructed from video. So how this works is usually you have multiple cameras taking a video of a scene at multiple angles, and these videos are glued together to create a 3D video. By the way, I've already featured another AI which can take all these videos at multiple angles and glue them together to create this 3D video. See this video to learn more about that. But anyways, back to DreamCatalyst: this is basically an AI which allows you to edit these 3D videos just with a prompt, as you can see in these examples. Here's another example where it's able to turn him into Einstein, or turn him into an elf, or turn him into a skull, or give him a mustache. As you can see, you can get really creative with this. Here are some more examples. The video on the left is the original 3D video, and this is just a statue of a bear. Now if you prompt it with turn the bear statue into a polar bear, that's exactly what it does, or if you turn it into a grizzly bear, this is what you get. Here's another example. The source video is on the left, and if you prompt it with make it autumn, it actually turns the scene into autumn. Really powerful tool. Here's another example. The original video again is on the left, and if you prompt it with make it look like it just snowed, it indeed makes the scene look like it just snowed, or if you prompt it with make it sunset, it indeed turns the scene into sunset. And of course this technology is perfect for video games, for example customizing characters, objects, or environments, or also for product design and animation and virtual reality. There are so many use cases for this.
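One quick aside before we wrap up this tool: going back to Go-with-the-Flow's warped-noise trick for a moment, the core idea of dragging the diffusion noise along with the motion is simple enough to sketch. The snippet below is only an illustration of that idea under my own assumptions, not the paper's actual algorithm (their method takes care to keep the warped noise properly Gaussian), and the optical-flow field is assumed to come from some external estimator.

```python
import torch
import torch.nn.functional as F

def warp_noise(noise: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Resample a noise map so it follows the scene's motion.

    noise: (1, C, H, W) Gaussian noise attached to the previous frame
    flow:  (1, 2, H, W) backward flow in pixels: for each pixel in the current
           frame, the offset to where it came from in the previous frame
    """
    _, _, h, w = noise.shape
    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2), x then y
    # Convert the pixel-space flow into the same normalized coordinates.
    flow_norm = torch.stack(
        (flow[:, 0] / ((w - 1) / 2), flow[:, 1] / ((h - 1) / 2)), dim=-1
    )
    # Pull the previous frame's noise from where each current pixel came from.
    return F.grid_sample(
        noise, base_grid + flow_norm, align_corners=True, padding_mode="reflection"
    )

# Toy usage: carry one noise map across 8 frames using (here, all-zero) flow fields.
noise = torch.randn(1, 4, 64, 64)                      # latent-sized noise, made-up shape
flows = [torch.zeros(1, 2, 64, 64) for _ in range(8)]  # stand-ins for real optical flow
frames_noise = [noise]
for flow in flows:
    frames_noise.append(warp_noise(frames_noise[-1], flow))
```

Because the per-frame noise now moves with the pixels instead of being redrawn independently, a video diffusion model conditioned on it tends to follow your drags rather than ignore them, which is roughly why the cut-and-drag results look so much steadier than the competitors shown earlier.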
Now back to DreamCatalyst: if you scroll up to the top of its project page, they've already released a GitHub repo, and if you click into it, it contains all the instructions on how to download and run this locally. Anyways, I will link to this main page in the description below for you to read further.

Next up, this is really exciting. We have a new top image generator. This is Google's Imagen 3, version 2, and this is LMArena, or Chatbot Arena, which I featured many times on this channel before, where users can blind test different image generation models. So for example, the user would enter a prompt here, and then there are two image generators side by side. You don't know which one is which, and after you generate the image, you pick the winner. So after a ton of blind tests from different users on all of these different image generation models, note that Google's newest Imagen 3 is now ranked at number one, with a really high arena score of 1099. In second, third, and fourth place we have Recraft, Ideogram, and Flux 1.1 Pro. Note how close their arena scores are, less than a 10-point difference, but the score of Imagen far surpasses all these other models. This is over a 60-point difference, which is pretty insane. And you can try out this newest version of Imagen right now. You just need to go to this Labs platform, which I'll link to in the description below. It's pretty easy to use and it's blazing fast, so let's try a few examples. The first one is two hands making a heart symbol. Now note that here I'm testing it on pretty complicated examples. I'm testing it on anatomy and prompt understanding. All the other image generators out there can already generate basic scenes like portrait shots or things like that, but here I'm really trying to test it out on trickier scenes. And you can see it generates four images in one go, and all of these look fantastic. Here's shot number one, this looks perfect. Here's shot number two, again really beautiful, the hands look flawless. Here is generation number three, again a super detailed photo of the hands and fingers. And here is photo number four, and all four of them look pretty flawless to me. I can't really point out any noticeable issues with this. Let's try another trickier example, this time to test out its prompt understanding. So the prompt is: photo of a red sphere on top of a blue cube. Behind them is a green triangle. On the right is a dog, on the left is a cat. Let's see if it can pull this off. So here are the four images. Let's look at the first one. We do have a red sphere on top of a blue cube. Behind them is a green triangle. On the right is a dog, on the left is a cat. Perfect. Here is another example, and it also nails the positioning of all the objects I specified in the prompt. And both the dog and the cat look super realistic. Here's another example, although it added an extra cat here, but it did get the rest of the pieces correct. You could argue that this green triangle is more to the left rather than directly behind these two objects. And here's the fourth example, and again it nails everything. Very nice, very nice. Let's try another tricky example with a lot of different objects in the prompt. The prompt is: an astronaut riding a giant snail with an iridescent shell through a desert landscape. The astronaut is waving a flag that says I love AI. So not only is this testing its understanding of all these tricky elements, but I'm also trying to see if it can generate coherent text in the image.
And here's what I got. Again this looks super realistic. The astronaut and the snail look very nice and detailed. Plus it does have an iridescent shell. Plus the flag does say I love AI. Here's another example. Again everything just looks really nice and detailed and flawless to me. Here's yet another example, and as you can see, it does have an extra A in the flag. So do expect one or two images out of the four to not be completely correct. And here's the fourth image. The flaw here is that the flag should be on the other side of the shell, plus the shell isn't really iridescent. Nevertheless, it did get two out of the four images correct. And note that this is a really tricky prompt that I'm testing it on, so I don't expect it to nail everything. Now from my experience using all of these top models, and in fact I've made a review video on most of these on my channel, just from these preliminary tests of this newer version of Imagen 3, it does seem to be noticeably better than the rest of the image models out there. Everything just looks more detailed and realistic and consistent. Plus it's less error prone. Anyways, I will link to this site in the description below where you can try out Google's Imagen 3 for free.

Next up, this AI is also really powerful. It's called DiffuEraser, and it can basically erase parts of a video or fill in missing parts. So here are a few examples. On the left is the original video, and if you want to erase the dog, well, there's a previous method called ProPainter, and then here's this new method, DiffuEraser. First of all, note how magical this is. Even though this is a really tricky scene, this AI is able to erase the dog pretty well from this video. And it's a lot more consistent compared to the previous method, ProPainter. Here's another easier example where again, if the original video is on the left and you want to mask out this person, well, notice that ProPainter doesn't really do a good job. There are some artifacts where the person used to be. But this new method, DiffuEraser, is able to get AI to generate this background very seamlessly where the person used to be. So a really powerful and useful tool. Here's another example. Let's say you want to erase the dancer on the right. Well, as you can see, this new tool handles it pretty well, although I do see a very noticeable flaw here, which is that her reflection is still showing. It would be nice if they were able to remove the reflection as well. Note that this doesn't just work for erasing one character or object. You can select multiple objects and erase them simultaneously. So here, if we want to erase all four of these kids, you can see that again this AI can handle it pretty well. Here's another example where we can select to erase both the kid and the soccer ball. Again, this AI is able to erase this plus fill in the background pretty seamlessly. Here's another cool example. You can just choose to erase the boy from the video. Note that the previous competitor, ProPainter, kind of messes up the edges of her arm, but DiffuEraser is able to fix this issue and erase the boy pretty seamlessly from the video. Here's another example showing that this new method is just better at filling in the background where the person used to be. And by the way, here's how it works. It takes your input video and it basically masks out the part of the video that you want to erase, frame by frame.
And then it uses a diffusion model to basically fill in the blanks based on the background. And it also uses something called temporal attention to keep track of how the video changes over time and make the final video more consistent. Now of course this tool would be really useful for animation or creating special effects. Or it could also be used to remove a logo or a watermark or other elements in a video. So there are many cool use cases for this. Now if you scroll up to the top, they've also already released a GitHub repo, and if you click into it, it includes all the instructions on how to download and run this locally for free on your computer. Currently this only works with code, but they are also planning to release a Gradio demo, which is a more user-friendly graphical interface you can use to run this. Anyways, all the links are up here, so I will link to this main page in the description below for you to read further.

Next up we have not one, but two really cool AI agents that came out this week, and both of them are really impressive. First up is this free and open-source AI agent called UI-TARS. Now they have both a browser agent that works in your internet browser, and a full desktop agent that works on your entire computer, not limited to just the browser. So here are some examples. Let's prompt it with: get the current weather in San Francisco using the web browser. And you can see here, it's automatically opening Google Chrome and typing in weather in San Francisco. Then it's analyzing the screen, and it has outputted the answer for you in the chat interface. Now this is just a really simple example, but here's another one where you can get it to tweet something. So if the prompt is: send a tweet with the content hello world, note that it's now opening Google Chrome and then typing in twitter.com to open up Twitter. And then it's automatically typing out the tweet and posting it. Again, a really useful tool for automating a lot of things. Here are some even cooler examples. You can get it to find round-trip flights, for example from Seattle to New York with a set departure and return date. So note how it's now searching for the departure and the destination airport. And it's also doing this step by step, so you can actually see its reasoning in the chat interface. And then the next step is to open up the date picker and select the departure and return dates, and that's what it does. And then you can see the next step is to click on search. And the really cool thing, and this addresses an error that I've hit with previous agents I've tested, is that those agents sometimes get stuck if the page doesn't load fast enough. Here it has actually detected that since the page has not fully loaded, it's necessary to wait for it to load first before proceeding further. So here it's deciding to wait for the page to load completely before taking any further actions. And then after that has loaded, it's now going to click on the sort and filter dropdown and then sort by price. Or here's another example, where the desktop app is more useful. So not only does it work in your web browser, but you can use other desktop apps like Word or PowerPoint or VS Code. For example, here the user is getting this AI agent to help edit this PowerPoint presentation. So they prompted it with: make the background color of slide two the same as the color of the title from slide one.
Okay, so right now it's selecting slide two from the sidebar. And then it has detected that it needs to access the background color settings. And then it has decided to select the red color from the color palette. And that's pretty much it. It has successfully completed the task. Here's another really useful demonstration. Here the prompt is: please help me install the autoDocstring extension in VS Code. So it first needs to open up VS Code. And then again, I love this feature: it has not fully loaded yet, so it's actually waiting for it to load first before proceeding. And then it has detected that it needs to access the extensions view in the VS Code sidebar, so now it's clicking on that. And then next, it needs to type autoDocstring into the search bar in order to search for the extension, and that's exactly what it does. And then next, it has decided to click the install button to install the extension. And that's pretty much it. Again, you can see that it has successfully completed the task. So a really useful tool for automating a lot of different things. And again, this is completely free and open source. This is way better than Claude's Computer Use, which was released a few months ago and was pretty awful. It was prone to a lot of errors, it got stuck in a lot of loops, plus it's expensive as hell and it's closed source. So it's awesome that we now have an open-source model which is even better. Plus it doesn't just work in your web browser, it can also interact with your entire desktop.

Thanks to our sponsor, Upix. Upix is a realistic AI selfie generator. They've made it dead easy for you to generate high quality, realistic images of yourself or anyone else in just a few clicks. It works on your desktop and on your phone. You don't need to install any additional apps or anything. It just works straight from your internet browser. And it's really easy to use. Just choose a template, upload anyone's photo and click create. It's as easy as that. And look how realistic this is. There are many templates to choose from and more to come. So check it out at upix.app.

Now back to UI-TARS: they've released several models. One is 72 billion parameters, which you might be able to run on a higher-end GPU. And then the other one is a tinier model with only 7 billion parameters, so this one can run on a lower-grade GPU. And a cool feature of this is that it can learn iteratively. It uses something called reflection tuning to learn from its mistakes and adapt to new situations. And both of them are state of the art. If you look at various benchmarks, here is basically the previous top performer. Now note how UI-TARS just outperforms everything. It beats the previous top model on all of these benchmarks, and not just by a bit. For some of these benchmarks, like GUI Odyssey, it beats the competitor by over 40%, this one by over 30%, this one by over 20%. I mean, this is just an absolutely insane improvement. And here's another comparison of UI-TARS with GPT-4o and Claude on various benchmark metrics for AI agents. And you can see across the board, UI-TARS just beats both GPT-4o and Claude. This is pretty crazy. Anyways, like I said, this is completely open source. All the models are already out on Hugging Face. And here it contains all the instructions on how to download and use this offline on your computer. And also this is under the Apache 2.0 license, so you can pretty much do anything you want with it. You can edit this. You can tweak it.
You can even use this for commercial purposes. It has very minimal restrictions. Anyways, I will link to this page in the description below, which contains all the information you need to get started. And let me know in the comments below if you want me to do a full tutorial on how to install and run this.

Anyways, in addition to UI-TARS, OpenAI has finally released their long-awaited AI agent, which they call Operator. Now this is a web-based agent, so it only works in an internet browser. That being said, people have shown that it can successfully do a wide range of tasks such as booking flights or ordering groceries or making restaurant reservations. Now Operator uses its own browser to navigate websites, and it can interact with the website through typing and clicking and scrolling, which is basically how we humans interact with websites as well. And it actually uses a new agent model, which is based on GPT-4o. So here are some examples of Operator in action. This person is getting Operator to book a one-way flight with Turkish Airlines on a certain date. And again, note that Operator uses its own browser to navigate to the website. And at any step, there's also a button for you to take control. This basically pauses what the AI is doing, and you can click or do something before getting the AI to resume. Anyways, here it has selected the dates, and then it asks, should I proceed with searching for available flights? So unfortunately, it does ask you a lot, and you do need to respond to it. It can't really just automatically do things by itself. And then here's another example where, after it has searched for flights, it's asking, should I proceed with booking this flight? So again, the user has to respond yes before it proceeds. And then here again, it asks, should I proceed with booking this option? And again, the user has to respond yes. So this is a very cumbersome process, to be honest, especially if you just want to sit back, relax, and let it do its own thing. Well, you can't really do that with Operator at this stage. It can't really just do things on its own, and oftentimes it has to ask for your approval to do something. And then here it finally proceeds to the form page where you need to fill in the passenger information. And again, it doesn't really fill it in for you, so it's asking the user to please provide these details so I can proceed with the booking. Now, this could be a pro or a con, depending on how you look at it. The pro is, of course, this is a lot safer to use. It's not just going to enter your credit card and buy some random stuff. But the disadvantage is, again, it just wastes a lot of time. You need to enter the information yourself. So anyways, that's an example of how you can get Operator to book a flight. Here's another example of a user trying to use it to book a dinner reservation. The prompt is: can you book me a table in Los Altos at a certain restaurant? So again, it's opening up its native browser and then searching for the restaurant on Bing, for some reason, instead of Google. And then it has found some available time slots. But again, it needs to ask the user, should I proceed with reserving this time? So the user has to respond, yes, please. And then it proceeds with the reservation. And again, for the next step, it gets stuck and asks, should I proceed with completing the reservation? So again, the user has to respond yes. Finally, it has confirmed the reservation at the specified time.
Now the user is testing to see if it can also cancel this. So it has detected that it'll need to click on the Cancel option on the page. Should I proceed? And the user says yes. So again, a very cumbersome process. It has to get the user's approval at each step, which isn't really efficient in my opinion. Here's another example from another user. He got Operator to find the latest papers on AI agents and then summarize them. So it's now opening its native browser and going to arxiv.org. And then it knows to click on the dropdown and select the Computer Science category. And then it's searching AI agents. And note that this video, as well as the previous examples, is sped up, so it actually takes quite a long time to think through every step and perform each action. It's quite slow at this stage. And here it keeps getting stuck. It just could not find any results. So finally, the user clicks on Take Control, and then the user has to manually enter the correct search terms in order to get this to work. And then he hits Return to Operator for the AI agent to proceed. So finally, after using those search filters, the AI agent was able to find some papers. And then after clicking through each one of these papers and reading the PDFs, again, it kind of gets stuck. And so the user has to take control and then ask it to please finish up the task, just summarize what you have already. So that's what it does. And after a long time, it outputs the summary of each paper in the chat interface. So not too impressive of an example, but I hope that gives you a sense of what Operator is and isn't capable of. So that shows you a few use cases for this. Note again that while Operator is quite promising at automating some tasks, there are some limitations. It can't really handle complex or specialized tasks, and it prompts you a lot. Almost at every step, it asks if it's okay to proceed. Plus, again, note that this is limited to their native browser. So you can't use this in your own Chrome browser, and you can't use this on your desktop. It can't really interact with anything that's outside of this native browser interface. Plus, because this is OpenAI, this is quite closed source, and to play it safe they have implemented a ton of safeguards. So for example, things like entering credit card details or passwords, you can't really get it to automate that. Not that you would want to, of course; this could be a privacy issue. But anyways, just note that there are guardrails in place. And here's the other thing. You do need to be a ChatGPT Pro subscriber to use this. So not even the Plus plan, but the Pro plan, which costs $200 per month. You can see at the bottom here, it says access to research preview of Operator. Plus, you do need to be located in the US in order to get access to this.

Next up, here is a really cool tool by Google. It's called TokenVerse, and it basically allows you to take any object or element from multiple images and merge them together to create a new image. So it's a really useful tool for you to mix and match different visual elements to create new and interesting images. Here are some examples. Let's say you have these four input images, and you have descriptions corresponding to each image. So here is a doll wearing a jacket. Here is a cat wearing glasses and a shirt. Here we have a dog wearing a hat and necklace. Here is a forest with light.
Now let's say I want to generate a new image with this doll and this shirt from the cat and this hat from the dog and this light in the forest. Well, I can plug it into this TokenVerse tool, and here is the result. How cool is that? It indeed matches this rabbit doll, this red hat from the dog, the shirt from the cat, and this light from the forest photo. Here's another example. This time, let's say we have this doll sitting on a bench. We have the same cat picture wearing the shirt. We have this woman holding an umbrella, and we have the same forest photo. Well, if you want to get the doll to wear this shirt while holding this umbrella in the sky under this light, this is your result. How cool is that? Here's another example. Let's say we have this sheep doll inside a bucket. We have this boat floating on the water. Again, we have this photo of the woman holding an umbrella and this forest photo. And let's say you want to get this doll sailing in this boat and then using this umbrella as a sail under this light. Note that it generates this very accurately. The sheep doll and the boat and the red umbrella and the forest light are very consistent with the input images. Here's another example. Let's say we have an image of this dude. We have this doll sitting on the bench, and we have the same umbrella photo, and we have this woman doing yoga near the sea. And let's say for the prompt, you want this man on this bench holding this umbrella by the sea, which is the background of this photo. Well, again, you can plug it into this TokenVerse tool and it would generate this photo. Note that the face of the man plus the bench plus the umbrella plus this sea background indeed match the input photos. And further down this project page, they also allow you to try this out yourself. So here are the input photos and here are the output photos. Right now we have the doll wearing these glasses, wearing this shirt with this necklace, as you can see here. You can also click on this to change the doll, for example. So let's change the doll into this bear, and this is what we get. And you can also change the shirt to something like this, and now the shirt looks like this. We can also change the glasses to these pink heart glasses, and here's the result. And finally, let's change the necklace to something like this, and here's what you get. Now, let's change the doll back to the rabbit, and this is what you get. So a really useful tool for you to mix and match different elements from different images. And this doesn't just transfer objects. You can also transfer lighting. So for example, if this is your input image, you can transfer the same lighting style across different prompts. You can also transfer the pose of an input image. So let's say this is your original image. Note how it's able to apply the same pose across all these new images with different prompts. You can also apply texture. Let's say this is the input image. It's this dog made of colorful plastic beads. Well, you can apply the same texture to all these new images with different prompts, as you can see here. Here's another example of this mosaic vase with this really unique pink and white design. And now if you generate new images with different objects but with the same mosaic design, you can see that it indeed applies the same design from the original vase. So this unleashes a lot of creativity. You can do so many cool things with this. Now if you scroll up to the top, it does say the code is coming soon.
So it looks like they are planning to open source this, which is fantastic. Anyways, I will link to this page in the description below where you can check out more examples.

Next up, we have another really cool tool called Video Depth Anything. This is an AI that can take a long video and figure out how far away things are from the camera, so it produces a depth video. Now there are previous tools that can do this, but this one is specialized for handling longer videos, and it's a lot more accurate. It's actually based on an existing tool called Depth Anything V2, but they've fine-tuned it further to be even better. So here are some examples. And note that the video is really long, so they're speeding this up by three times. But note how accurate it is at determining the depth of all the objects in this video. Even with a very high-action scene like this, it's able to estimate the depth of everything very accurately. Here's another example where again we have quite a high-action scene with various people jumping around and the camera moving everywhere. This is quite a complex scene, but again, it's able to estimate the depth of everything very accurately across the entire video. Very impressive. Here is yet another example, again quite a chaotic scene with a very shaky camera, but it's able to capture the depth of everything in the video very accurately. And if you compare this with a competitor called DepthCrafter, note that this new one is a lot more detailed. So DepthCrafter is on the left, and if you look at the grass, note how much more detailed and sharper the grass is from this new tool compared to the previous competitor, which is a lot blurrier. Here's another example. Again, on the left is DepthCrafter, on the right is this new tool, Video Depth Anything. And notice especially the details on the metal fence. The old model is really blurry, but this one is able to generate the depth of the fence very accurately. And if you scroll up to the top, not only did they release a GitHub repo, but they also have a free Hugging Face Space for you to try out. So for example, if we input this video and click Generate, this is the depth video that we get. Here's another tricky example of this Ferris wheel, and note that this is quite long, 28 seconds, but again, it's able to process this and generate a very consistent and accurate depth video from it. Really impressive. And if you want to run this locally, they've also released a GitHub repo, which contains all the instructions on how to download and run this on your computer. And note that the model sizes are quite small. The small version is only 28 million parameters, so not even a billion, and the larger model is 381 million, so this is definitely usable even on a low-grade GPU. Anyways, all the links are up here, plus there are more demos on this project page, so I will link to this page in the description below for you to read further.

Also, this week we have not one, but two new AI models that even beat OpenAI's o1, their flagship model, which is PhD-level. The fact that this week we have two models that beat o1, or at least match it on various benchmarks, is just absolutely insane. Now the first one is called DeepSeek R1, and this is completely open source and free to use. You can already download this and run it locally and offline. In fact, some users are even able to run a distilled version of this on their iPhone or Android phone.
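If you want to try that locally-and-offline part yourself, here's a minimal sketch of one common route: serving a distilled R1 checkpoint through Ollama and calling its local HTTP API from Python. This assumes you've already installed Ollama and pulled a distilled model; the exact model tag (deepseek-r1:7b here) is an assumption and may differ on your setup.

```python
# Minimal sketch: query a locally served, distilled DeepSeek R1 model through
# Ollama's local HTTP API. Assumes Ollama is running and the model was pulled
# beforehand (e.g. with `ollama pull deepseek-r1:7b`; the tag is an assumption).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",  # distilled 7B variant (assumed tag name)
        "prompt": "Prove that the sum of two even integers is even. Reason step by step.",
        "stream": False,            # ask for a single JSON response instead of a stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])      # the model's answer, including its visible reasoning
```

Everything here runs on your own machine, which is the whole appeal of the open-source release.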
Now really quickly, they actually used a new training technique to create this. The model is trained mostly with reinforcement learning instead of the largely supervised learning approach that we've seen with previous generations of AI models like GPT, Claude, and Llama. So basically what this means is it can learn things by itself with minimal human guidance. It kind of had to figure things out by itself and also verify its own answers, and because of this, it's a lot more performant in terms of problem solving and step-by-step reasoning. In fact, if you compare DeepSeek with OpenAI's o1, it basically beats o1 on all of these math benchmarks, which is pretty insane. Anyways, I already did a full review and deep dive on DeepSeek R1, so check out this video if you haven't already. But in addition to DeepSeek, which has gotten a lot of attention, this other company is relatively under the radar, but they've also released an AI model this week that beats OpenAI's o1. The model is called Kimi K1.5, and this is a top multimodal model developed by a startup called Moonshot AI. And by the way, DeepSeek R1 is not multimodal. It currently can only process text, but this one, Kimi, also has vision capabilities, so it can potentially analyze images and video. And similar to DeepSeek, Kimi was also trained using reinforcement learning, without relying on more complex techniques that we've seen in the past like Monte Carlo tree search, value functions, or process reward models. Now I don't think it's a coincidence that we have both these companies releasing AI models in the same week, and both of them use reinforcement learning to train their models. So this might be the next big thing in AI: training even better models using reinforcement learning instead of previous methods. Anyways, if you look at the performance of Kimi K1.5 against OpenAI's o1 across all these benchmarks, again note that for most of them, Kimi actually beats o1 or is at least on par with its performance, which is absolutely insane to think about. There are some additional benchmarks comparing Kimi K1.5 against OpenAI's GPT-4o and Claude 3.5 Sonnet, and you can see again, for most of these benchmarks, it beats the rest of the models. Really impressive results. Now unlike DeepSeek R1, this model Kimi is not open source, so currently you can only access it through their own platform, and you'll need to fill out a test application form in order to get access. Anyways, I will link to this GitHub repo for you to read further.

And that sums up all the highlights in AI this week. You know, I feel like this week especially, so much has happened that it's kind of overwhelming to stay on top of everything. And we're not even past the first month of 2025. This is going to be an absolutely wild year. Anyways, let me know what you think of all of this. Which tool are you most excited about? Which one are you most looking forward to trying out? As always, I will be on the lookout for the top AI news and tools to share with you. So if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content. Also, there's just so much happening in the world of AI every week, I can't possibly cover everything on my YouTube channel. So to really stay up to date with all that's going on in AI, be sure to subscribe to my free weekly newsletter. The link to that will be in the description below. Thanks for watching and I'll see you in the next one.
