The oddest part of Gemini Omni is not that Google has another video model. Everyone has another video model now, or a roadmap slide shaped like one. The shift is in the grammar: video is no longer only the thing the model produces. It becomes something you hand back to the model, with a correction, a reference image, a line of audio, and a vague annoyance about the camera angle.
Google introduced Gemini Omni as a new model family that can create from mixed inputs, starting with video. The first release, Gemini Omni Flash, takes text, images, audio, and video as input and generates clips. Image and audio output are supposed to come later. That matters less as a feature checklist than as a change in where the edit lives. The old workflow had a file, a timeline, a tool, then another tool because the first one did not understand the thing you meant. Omni wants the edit to happen in the conversation.
I wrote on Tuesday about Google turning I/O into a Gemini argument, and this is the same argument in miniature. Gemini is not being sold as one app. It is becoming a layer that passes through the Gemini app, Flow, YouTube Shorts, Search, Chrome, and whatever else can bear the weight of a prompt box. A video model inside a specialist studio is interesting. A video model inside YouTube is a different animal, because the place where people watch, remix, imitate, and monetise video is also the place where the generated clip arrives.
The DeepMind model page frames Omni as "create anything from any input", which is grand enough to become meaningless if you stare at it too long. The useful part is narrower. You can ask for a scene, then ask for changes across multiple turns while the system tries to keep the character, action, and physical continuity intact. It is not just text-to-video with a nicer box around it. It is closer to a memory-bearing edit session, or at least the promise of one.
That promise is why the demos are both impressive and faintly claustrophobic. Editing by language sounds freeing until you remember how much of editing is not language. It is frame sense, boredom, irritation, the tiny lurch when a cut lands a beat late. Google can make the instruction conversational, but the person still has to know what they are trying to make. Otherwise the model supplies taste as a default setting, and default taste is where platforms go to get smooth.
The rollout is not hidden in a lab. Google says Omni Flash is going to Google AI Plus, Pro, and Ultra subscribers through the Gemini app and Flow, with free access through YouTube Shorts and YouTube Create starting this week. The Verge describes it as a new generative model family, while CineD reads it through the more practical lens of clips, references, conversational edits, and digital avatars. Those are not competing interpretations. They are the consumer and production versions of the same bet.
There is also the watermarking story, because there has to be. Google says videos created with Omni include SynthID, its imperceptible digital watermark, and that verification will sit in the Gemini app, Chrome, and Search. I am glad it exists. I also don't think a watermark settles the harder problem, which is social rather than technical: people learn the texture of generated media faster than institutions learn how to label it. The label arrives after the feeling.
What I keep coming back to is the way Omni turns video from evidence into material. A clip used to arrive with a stubbornness to it. Even a bad clip had the authority of something that had happened in front of a lens. Now the clip is more like a draft paragraph, editable by mood, reference, and revision. Google is not alone in pushing that change, but Google is better placed than most to make it ordinary. The expensive part is no longer making the impossible image. It is keeping enough friction in the process that people still notice what they asked for.
Sources:
- Introducing Gemini Omni, Google Blog
- Gemini Omni, Google DeepMind
- Google I/O 2026: All the News and Announcements, The Verge