TechCrunch AI • 70 days ago

Google's 'Gemini Omni': Turning images, audio, and text into video

IMP

8/10

Key Summary

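Google has announced Gemini Omni, a model that combines text, images, and audio to generate high-quality video reflecting an understanding of physical laws and context. The first model, Omni Flash, can produce videos of up to 10 seconds and ships with personalized digital avatars, deepfake safeguards, and SynthID watermarking, signaling Google's intent to lead the consumer-friendly multimodal AI market.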

Translated Text

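When Google launched Gemini three years ago, the goal was to build a multimodal large language model: a single neural network trained on text, images, audio, and video that could generate content in any of those formats. Today, at its Google I/O developer conference, Google took a concrete step toward that goal with Gemini Omni, a new family of multimodal models that Google CEO Sundar Pichai says will be able to "create anything from any input."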

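Omni starts with video. Users can now combine images, audio, video, and text, and rather than simply stitching those inputs together, Omni reasons across all of them to produce a consistent output. The result is high-quality video that reflects an understanding of physics, culture, history, and science. Like Google's Nano Banana, Omni also lets users edit photos with plain text commands instead of complex editing software.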

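Google already has a dedicated video model, Veo, that lets users turn text and images into video and direct and customize avatars. But Nicole Brichtova, director of product management at Google DeepMind, says today's release is more than a Veo update: "It's the next step towards the progression of combining the intelligence of Gemini with the rendering capabilities of our media models."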

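Koray Kavukcuoglu, DeepMind's chief technologist, gave reporters this example at a media briefing on Monday. Given a simple prompt, "a claymation explainer of protein folding," Omni quickly rendered a stop-motion explainer video with a voice-over that said: "Proteins start as chains of amino acids. They fold into patterns like the alpha helix and flat sections called beta sheets, forming a perfect three-dimensional shape."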

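The long-term vision for Omni is broader and includes using the model to generate images from audio or audio from video. "When we first announced Gemini, it was our first AI model to be natively multimodal," Pichai said during the briefing. "We knew that training it on a combination of text, code, audio, images, and video would give it a deeper understanding of the world. With world models, AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction."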

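As part of the release, users can also create videos with their own digital avatars, something OpenAI popularized with the Cameos feature on its now-defunct Sora app. To prevent deepfakes, users must go through a dedicated product onboarding that involves recording themselves and speaking a series of numbers, according to Brichtova. The avatar is then stored for future use. In addition, every video created with Omni will carry Google's SynthID digital watermark, which lets users verify whether a video was generated with Gemini products.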

The first model in the family is Gemini Omni Flash, which begins rolling out today to the Gemini app, YouTube Shorts, and the AI creative studio Flow. Flash can render 10 seconds of video. Brichtova says this is not a model limitation but a decision based on a desire to get the model into more hands and an expectation that most users will not want much longer videos yet. Longer video durations are planned for the near future.

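Google appears to be positioning Omni Flash as more of a consumer tool. The examples of digital avatar use that Brichtova and Gabe Barth-Maron, a research engineer at DeepMind, gave on a call with TechCrunch were all personal: making a video of yourself winning an award or going to the moon, or removing a passerby from the background of a vacation video. Barth-Maron put it more simply: "They're like personalized memes."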

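"We definitely did focus on making this easy to use for consumers," Brichtova said. "Not many video models have breached that chasm with consumers, so this is our play to do that." That ease of use comes with a caveat: Brichtova and Barth-Maron noted that editing prompts will need to be highly specific, otherwise Omni risks over-editing or unintentionally altering elements the user wanted to keep.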

Original Text (English)

When Google launched Gemini three years ago, the goal was to build a multimodal large language model: a single neural network that was trained on text, image, audio, and video and could generate content in any of those formats. Today, at its Google I/O developer conference, the company took a concrete step toward that goal with Gemini Omni, a new family of multimodal models that Google CEO Sundar Pichai says will be able to “create anything from any input.”

Omni will start with video. Users can now combine images, audio, video, and text, and rather than simply stitching those inputs together, Omni reasons across all of them to produce a consistent output. The result is high-quality videos that reflect an understanding of physics, culture, history, and science. Omni also lets users edit photos with plain text commands rather than complex editing software, similar to Google’s Nano Banana.

Google already has a dedicated video model, Veo, that lets users turn text and images into videos, and even direct and customize avatars. But Google DeepMind director of product management Nicole Brichtova says that today’s release is more than a Veo update: “It’s the next step towards the progression of combining the intelligence of Gemini with the rendering capabilities of our media models.”

One example that Koray Kavukcuoglu, DeepMind’s chief technologist, gave reporters during a media briefing on Monday: When Omni was given a simple prompt like “a claymation explainer of protein folding,” it quickly rendered a video of a stop-motion explainer with a voice-over that said, “Proteins start as chains of amino acids. They fold into patterns like the alpha helix and flat sections called beta sheets, forming a perfect three-dimensional shape.”

The long-term vision for Omni is broader, involving the model being used to do things like generate images from audio, or audio from video.

“When we first announced Gemini, it was our first AI model to be natively multimodal,” Pichai said during the briefing. “We knew that training it on a combination of text, code, audio, images, and video would give it a deeper understanding of the world. With world models, AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction.”

As part of the release, users will also be able to create videos with their own digital avatars, something OpenAI popularized on its now-defunct Sora app with Cameos. To prevent deepfakes, users will have to go through a dedicated product onboarding, which involves recording themselves and speaking out a series of numbers, per Brichtova. The avatar then gets stored for future use. Additionally, all videos created with Omni will include Google’s SynthID digital watermark, which allows users to verify if videos were generated via the Gemini products.

The first model in the family is Gemini Omni Flash, which will roll out today to the Gemini app, YouTube Shorts, and AI creative studio Flow. Flash will be capable of rendering 10 seconds of video, which Brichtova says isn’t a model limitation, but rather a decision based both on a desire to get it into more hands and an anticipation that most users won’t want to make much longer videos yet. Longer video durations are in the pipeline for the near future, though.

Google seems to be pitching Omni Flash as more of a consumer tool. The examples Brichtova and Gabe Barth-Maron, a research engineer at DeepMind, gave on a call with TechCrunch of uses for digital avatars were all personal: Making a video of yourself winning an award or going to the moon, or removing a passerby from the background of a video you took on vacation. Barth-Maron put it more simply: “They’re like personalized memes.”

“We definitely did focus on making this easy to use for consumers,” Brichtova said. “Not many video models have breached that chasm with consumers, so this is our play to do that.”

The ease of use comes with a caveat: Brichtova and Barth-Maron noted that editing prompts will need to be highly specific, otherwise Omni risks over-editing or unintentionally altering elements the user wanted to keep, a problem Nano Banana users would have run into.

Despite the near-term consumer focus, Omni’s enterprise and creative implications are obvious, and Google will make Omni available via API in the coming weeks. The avatar-generating tool, a capability that is available today on Shorts, is something Google expects content creators to pick up. But more broadly, an end-to-end multimodal workflow could be transformative for advertisers and filmmakers. Startup Luma AI is building something similar, an agentic tool that can generate an entire ad campaign based on a short brief and a product image, powered by its own “unified” model.

“We’re actually pretty proud of the model’s text-rendering capabilities, which is really useful for things like advertising,” Brichtova said. “If you want a product somewhere, or even just a slogan, it needs to be accurate … We definitely anticipate filmmakers and other kinds of creators are going to be using this model as well.”

The more professional use cases might be better served by the Omni Pro model, which should perform better across all Omni tasks. Google hasn’t said when it will release Pro yet, but Brichtova said that will happen when “we feel like we’re at a point where we have a step change above Flash.”

Topics: AI, gemini omni flash, Google, google gemini omni, google io 2026, Media & Entertainment, Veo

Rebecca Bellan, Senior Reporter

Google, Gemini, multimodal, video generation, DeepMind