Gemini App Gets Spark Agents and an Anything-to-Anything Model (Full Transcript)

Google revamps Gemini with a new UI, Spark autonomous agents, OmniModel multimodal generation, and desktop voice controls—plus a push toward wearable assistants.

[00:00:00] Speaker 1: Google just dropped a massive wave of AI updates for the Gemini app, completely overhauling the interface and introducing actual autonomous agents that can run tasks in the background. I sat down for an exclusive interview with Jeff Wong, the head of engineering for the Gemini app, to find out exactly how these new features work, from anything-to-anything multimodal models, to desktop voice hacks that clean up your speech in real time. Let's go inside the engineering team to see what's new. I started by asking Jeff about the massive visual change you see the second you launch the Gemini app, which is powered by a new design language Google calls Neural Expressive. Here's what he had to say about why they completely rebuilt the UI from the ground up.

[00:00:46] Speaker 2: First, there's a huge redesign of that. As soon as you launch the app, you'll see it. It's more modern. We have sort of a new design language that came out with it. On iOS, it's Liquid Glass, and it's just, I think, a lot cleaner. We were really aiming to sort of modernize and clean up a lot of the UI. So we redesigned what we call the zero state, like the first screen that you see. The side nav that kind of comes out when you tap the three lines, that's all redone. Brand new set of icons, new fonts across the board. And then, as you get a response back, you type in something to Gemini and it gives you a response. The response is all redesigned. So that was just kind of top to bottom. The UI is very different. The reception has been pretty good.

[00:01:27] Speaker 1: But the absolute biggest announcement from the engineering team is a brand new feature called Spark. This fundamentally shifts Gemini from a simple chatbot into a true autonomous assistant that can execute multi-step workflows for you. Here's how Jeff explained it to me.

[00:01:44] Speaker 2: The big one that is leading in Sundar's announcement is called Spark. And so this is really bringing agents into our consumer AI app. Yeah, it's super exciting. So I've been playing with it. If you think about how eventually you will want something that can act and be a true assistant, it's something that should be able to do work and kind of spin off multiple tasks to accomplish a bigger goal. Versus right now, most of these chat apps, you kind of give it one thing and it gives you a response. So you can kind of give it some pretty big, open-ended, hard tasks and it'll know to spin off little sub-agents to go accomplish that and then kind of come back.

[00:02:24] Speaker 1: If you were to talk to someone who has no experience with agents, because I feel like you hear of agents and I think some people are a little apprehensive to test it out. What would you say is that first scenario that someone could try where they see value from using agents?

[00:02:39] Speaker 2: Just tell me what I need to know to prepare for the day. And actually, we built this as a product that I can talk about to you called Daily Brief, but Spark can do that as well. And so basically, if you kind of think about what you need to do, let's say tomorrow, you might wake up and kind of look at your calendar, see what you have going on, kind of think about what you need to prepare for. Do I need to go look up Kevin's YouTube channel? Same for your email. What is it that's lingering on your to-do list? Do I have bills I have to pay? And things like that, where it can go and do all that. It can kind of spin up these agents or you can call them little minions to kind of go and do work for you. And then once it does that, you can ask it to do that repeatedly. So that's the other part of Spark is you can have what we call them heartbeats internally, but they can repeat. So they can be sort of these scheduled actions.

[00:03:30] Speaker 1: When I hear of repeating things or setting up automations, that sounds complicated. But I take it it's not complicated. So is that something where you just go in and you say, hey, I want this to run every morning and that's as far as it goes?

[00:03:40] Speaker 2: Yeah. As soon as you've done something or even before you've done it, it's exactly that. It's every day, do this at 8 a.m.

Speaker 1: So it almost feels like with Spark, it's almost bringing kind of these agent capabilities to the masses.

Speaker 2: Exactly.

[00:03:54] Speaker 1: Next, we shifted gears to the brand new underlying engine driving the app's advanced multimodal capabilities, the OmniModel. I asked Jeff how this completely changes the way Gemini processes different types of media at the same time.

[00:04:08] Speaker 2: So we call that the OmniModel and that one's super cool. It's kind of combining a few of the different capabilities that we had in the app. So we had Nano Banana, which was image generation. We had Veo, which was video generation. And for those, it was always text to images or text to video. And you could feed it like an image. You could do image to image. And now we're basically going from anything to anything. So it's actually very cool. I've been playing with it a bunch, but you can feed it a video and a couple images and then also give it a prompt at the same time. And then it'll output and combine all of those things together into a really cool output. So yeah, it's very powerful. It kind of combines a few different tools that we had.

[00:04:49] Speaker 1: So as an example, you could take, say, a video of yourself and then maybe have a photo with a certain style and then it'll merge the two of those?

[00:04:56] Speaker 2: Exactly. Okay.

[00:04:58] Speaker 1: From there, Jeff surprised me with one more major announcement, this time specifically built for the desktop experience. It's a feature designed to completely change how you talk to your computer by using background context and real-time speech filtering.

[00:05:13] Speaker 2: It's using the desktop app and using speech to basically do some pretty complex things. So one of the benefits of having the desktop app is it has context from your files on your hard drive, what's on your screen. And then voice is really cool because more and more, that's a very understandable communication method. And the models are getting so good where you can kind of hold down this thing and just speak naturally. And it knows what to filter out in terms of, I say ums and uhs a lot, or you might make a mistake and be like, oh, actually, scratch that. And it's good about kind of knowing and understanding how to filter that and output like a clean sort of stream of text. Once you have that coupled with, you're kind of using the voice now to orchestrate or kind of command and do things on the Mac, which is very cool.

[00:06:06] Speaker 1: To wrap up our conversation, I wanted to look further down the road. I asked Jeff about the ultimate North Star for the Gemini app and how tomorrow's hardware will completely change the way assistants interact with the physical world around us.

[00:06:22] Speaker 1: As you see the Gemini app continue to evolve and develop, what is your North Star for the experience and the app?

Speaker 2: Yeah, it kind of depends how far into the future. Eventually, we will have wearables, which can be kind of the primary interface. You saw some demos actually with the glasses, which I find actually that's a very natural interface where it has the audio, sort of bone conducting audio, so you don't have to pull out your phone, you don't have to type to it. Like you said, speech is just so natural. And then also, you can take photos and videos of the context. So it's actually super cool. Sometimes we'll be gardening and I have a pair of glasses and I'll turn it on and be like, hey, do I need to fertilize this? What plant is this and what kind of fertilizer do I need? Versus like going and googling, searching and trying to figure out exactly what fertilizer. So it's kind of nice to have that in context.

[00:07:23] Speaker 1: That actually, I think, resolves one of my big pain points because I feel oftentimes like if I'm gardening or something, I take out my phone, you take the screenshot, and then you add it to the chat, then you ask your question. It's like all these steps. It's a lot of work. And so just having glasses that are seeing what you're seeing and can answer your question. And it's hands-free. You can be actually gardening.

[00:07:43] Speaker 1: So it's almost like that assistant that's seeing what you're seeing and can answer questions that you have.

Speaker 2: Yeah, exactly. And then Josh mentioned something in one of his demos where you're sort of like throwing context to the assistant and it's catching it and sort of like saving it for later. I think that's also really important where the more context it has, the better it gets at, you know, helping you with whatever.

[00:08:06] Speaker 1: These new features are rolling out right now. And if you have access to Google AI Ultra, keep an eye out for the new Spark beta tab hitting your account to set up your very first autonomous agent. Please consider subscribing and I'll see you in the next video.

AI Insights
Summary
Google announced major updates to the Gemini app, including a full UI redesign using a new design language (Neural Expressive; “Liquid Glass” on iOS), a new autonomous agent system called Spark that can break big goals into sub-tasks and run scheduled “heartbeats,” an upgraded multimodal “OmniModel” enabling anything-to-anything generation across text, images, and video, and new desktop voice capabilities that leverage on-screen and local-file context while filtering filler words for cleaner dictation/commands. The long-term vision points toward wearable interfaces (e.g., glasses) that provide hands-free, contextual assistance in the physical world.
Title
Google Gemini App Overhaul: Spark Agents, OmniModel, and Desktop Voice
Keywords
Google Gemini
Gemini app
UI redesign
Neural Expressive
Liquid Glass
Spark
autonomous agents
agent workflows
Daily Brief
scheduled heartbeats
OmniModel
multimodal AI
anything-to-anything
image generation
video generation
desktop app
voice commands
speech filtering
context awareness
wearables
AI glasses
Key Takeaways
  • Gemini’s interface was rebuilt end-to-end with a cleaner, modern design (new icons, fonts, redesigned navigation and responses).
  • Spark introduces consumer-friendly autonomous agents that decompose big tasks into sub-agents and can run on a schedule via repeatable “heartbeats.”
  • A practical starter agent use case is a daily preparation workflow that checks calendar, email, and to-dos (similar to “Daily Brief”).
  • The OmniModel unifies prior media tools to enable “anything-to-anything” multimodal input/output, combining text, images, and video in one generation flow.
  • Desktop Gemini adds voice-driven workflows using local context (files and screen) and real-time speech cleanup to remove ums/false starts.
  • Google’s North Star includes wearable assistants (glasses) that see what you see and answer questions hands-free, improving contextual help in real-world tasks.
  • Features are rolling out now, with Spark appearing as a beta tab for Google AI Ultra users.
Sentiments
Positive: The discussion is enthusiastic and forward-looking, emphasizing excitement about new capabilities (agents, multimodal generation, voice) and improved usability, with no notable negative framing.