Is the Era of Creating Videos with Only Text Over? Google’s Ace Up Its Sleeve, 'Gemini Omni'

AI Summary

Google has announced 'Gemini Omni,' a next-generation multimodal AI model that allows users to create and edit videos by freely mixing text, images, sound, and existing footage through conversation.

Close your eyes for a moment and imagine a very interesting scene. You pick up your smartphone and take a casual photo of a toy car rolling around on your floor. Then, you speak into the microphone, recording the sounds of an engine revving and tires screeching: “Vroom~ Screech!” Finally, you type this into a chat box: “Create a cinematic scene where this toy car races through a dust storm in the middle of a vast desert.”

Surprisingly, you are not sitting in a Hollywood CG studio that costs billions of dollars. You are simply lying comfortably on your bed at home. In the past, combining photos, sounds, and ideas into a polished video took dozens of hours of grueling work and high-level expertise. But now, you just have to toss all these ingredients to the AI. In just a few minutes, it produces a high-definition video that looks like a scene from a blockbuster movie.

This magical story is not a fantasy of the distant future. It is a new reality opened up by ‘Gemini Omni,’ the next-generation generative media AI model officially announced by Google during the keynote at ‘Google I/O 2026’ just a few days ago [1]. Through this technological leap, Google is shifting the power of video creation, once reserved for a few experts, into the hands of all of us.

Why It Matters

In recent years, we have witnessed the eye-popping progress of AI in real time. Asking an AI to draft a report or generate an image has become a fairly familiar part of daily life. However, video has been regarded as a massive barrier, the most difficult field to conquer even in the AI industry.

Most AI video tools that have appeared so far focused only on turning ‘text to video.’ ‘Veo 3,’ which Google showcased last year, also worked by analyzing sentences entered by the user to create videos [2]. The problem is that it is incredibly difficult to perfectly describe a person’s complex imagination using only ‘text.’ Because people tried to explain camera angles or subtle atmospheres with only words, the results were often quite different from what they actually wanted.

To use a cooking analogy, existing AI video production was like having to write down a ‘strict and demanding recipe’ perfectly. You had to write a perfect prompt (command) like, “Add 3.5g of salt and 5g of sugar, and bake at 180 degrees Celsius for exactly 15 minutes,” just to get a barely edible dish. If you used even one wrong word, an odd dish with way too much salt would pop out.

However, Gemini Omni is different. This AI is closer to a ‘genius chef with a sharp eye’ than to a recipe. You just drop leftover ingredients from the fridge (existing video), a doodle from a sketchbook (image), and a humming tune (sound) on the kitchen table and say, “Mix these and make something delicious.” This is because Gemini Omni can simultaneously accept any type of input, whether text, sound, photos, or actual video, and create an amazing video based on them [3].

This change goes beyond simply adding a cool new tool. It means that ordinary people can create professional-level media without complex editing programs. It is also a strong declaration of war by Google to seize leadership in media creation against powerful rivals like OpenAI, the maker of ChatGPT, and Anthropic [1].

The Explainer

How exactly did Google perform this amazing magic? The ‘Gemini Omni’ announced this time is not a single feature, but the name of a ‘Family’ of massive AI media models that Google will introduce in the future. And the first model to step out as the leader of this series is ‘OmniFlash’ [4].

OmniFlash is the latest evolution of what the industry calls multimodal technology. Simply put, it is ‘technology that understands and processes various types of data (text, sound, images, etc.) simultaneously without being picky.’ While inheriting the solid visual capabilities of Veo, Google’s existing video model, OmniFlash goes well beyond it by freely mixing these different ingredients [3].

The most spine-tingling capability is ‘Conversational editing.’ Moving beyond just creating videos, it has brought the process of modifying already completed videos into our daily conversations [5].

Think back to video editing in the past. To change a color or erase an object in the background, you had to turn on a heavy professional program and struggle while manipulating a complex timeline (the time axis of the video). However, working with Gemini Omni is like having a cup of coffee and chatting with a ‘kind professional editor sitting next to you holding the mouse.’

Suppose you say this while looking at the screen:

User: “Hmm, the weather looks too gloomy. Can you change the background to a red evening sky with a sunset?”
OmniFlash: (Turns the sky red in just a few seconds)
User: “Oh, great! But that blue car passing by in the left corner is breaking the mood. Just erase that.”

You only need to speak in everyday language. Gemini Omni understands the context perfectly and magically modifies that part of the video [5]. Since the AI handles the complex mathematical calculations and pixel adjustments, the user only needs to open their mouth as if asking a friend for a favor.

Experts attribute this change to a massive internal reorganization within Google. In the past, departments were divided and technology was fragmented: ‘Veo’ for video, ‘Nano Banana’ for images, and ‘Gemini’ for text. It was as if experts who didn’t even talk to each other were stuck in their own rooms within the same company. However, Google made a strategic decision to integrate all these technologies into one giant system [6]. It connected the eyes, ears, and mouth, which used to act separately, into one genius brain.

Where We Stand

While Google was ready to surprise the world, there is a rather embarrassing story behind this grand announcement. Google tried to keep this technology strictly confidential for a ‘surprise show’ on the day of the event, but the information leaked a week before the event started [7].

It wasn’t because someone hacked them or a spy stole secrets. Traces of the Omni model were accidentally left in the UI (User Interface) code within the update file of the ‘Gemini’ app installed on smartphones worldwide [8]. Quick-witted developers dug through the app’s internal code and found the name ‘Omni’ and its operating method even before the official announcement [9]. It was like a magician getting caught with the script before even getting on stage.

Despite this mishap, people’s expectations only grew, and the reaction on-site was enthusiastic. On this stage, Google also poured out a series of updates that went well beyond Gemini Omni.

First, they introduced ‘Gemini 3.5 Flash,’ which significantly boosted the speed of search engines and Workspace overall [10]. They also closely integrated evolved AI features into core services such as Google Docs and YouTube [11].

What particularly caught people’s eyes was the appearance of the personalized AI assistant ‘Gemini Spark’ [1]. If the AI of the past was a vending machine that only answered questions, it is now evolving into an ‘always-awake, proactive assistant’ that understands my schedule, handles tasks, and advises on daily plans before I even give instructions [12, 13].

What’s Next

The appearance of Gemini Omni signals more than the invention of a convenient tool: it marks a tectonic shift in the entire media content market. The imagination of ordinary people, long held back by high barriers such as expensive equipment and long training periods, has finally been set free. Soon, we will witness an era in which creative videos that we could not even imagine until now pour out.

Google is not stopping here. In this keynote, Google confidently stated that it would introduce ‘Gemini 3.5 Pro,’ a top-tier model that will boast much more sophisticated performance than what has been released so far, as early as next month [2].

Just as cameras, phones, and the internet merged into a single smartphone and changed our daily lives, this phenomenon where text, sound, photos, and videos converge in one vessel called ‘Gemini Omni’ will forever change the way we consume and create media.

Now, the only talent a creator needs is not complex program manipulation skills, but pure imagination: “How will I express the world in my head through conversation?” In this new era opened by Gemini Omni, what will be your first conversation with AI?

AI’s Take

The true value of Gemini Omni lies in its ability to hide complex technical computations and elevate ordinary human conversation into a creative tool. In the past, one had to learn the ‘language’ of technology to move imagination into reality, but we are now in an era where the ‘words’ we are most familiar with are enough. The barrier to bringing one’s imagination into reality has finally been completely dismantled.

References

Test Your Understanding

Q1. Which of the following is the official name of the generative media AI model family newly announced by Google?

Gemini Spark
Gemini Omni
Gemini 3.5 Flash

Google announced 'Gemini Omni,' a next-generation AI media model family capable of accepting various inputs to generate and naturally edit videos.

Q2. Before Gemini Omni, what was the name of Google's existing AI video model that created videos based on text?

Veo
Nano Banana
OmniFlash

Gemini Omni was created by further expanding and advancing the capabilities of 'Veo,' Google's existing text-based video generation model.

Q3. What event first made the existence of Gemini Omni known to the world before the official announcement at Google I/O 2026?

A hacking attack by a competitor
An exposé interview by a Google employee
A leak of UI strings within the Gemini smartphone app

A week before the event, traces of the Omni model were leaked from user interface (UI) strings within Google's Gemini app installed on smartphones, revealing its features in advance.