Invention Title:

VIDEO SYNTHESIS VIA MULTIMODAL CONDITIONING

Publication number:

US20250330679

Publication date:

Section:

Electricity

Class:

H04N21/47205

Inventors:

Applicant:

Smart overview of the Invention

The disclosed technology is a multimodal video generation framework (MMVID) that accepts text and images as inputs, either jointly or independently, to create video content. The approach builds on quantized video representations and a bidirectional transformer that consumes the multiple modalities and predicts discrete video tokens. The framework introduces a special video token trained with self-learning and an improved mask-prediction algorithm, which enhance the quality and temporal consistency of the generated video. In addition, text augmentation makes the model more robust to variation in the textual input and increases the diversity of the generated videos.

Technical Field

The invention pertains to the field of image and video processing, specifically video synthesis. Traditional methods in this area generate content from noise, with progress aimed at higher resolution, better rendering, and greater variability in the image content. The MMVID framework addresses a more complex challenge: integrating multiple modalities, both textual and visual, to control the video generation process.

Background

Conventional conditional video generation techniques often rely on a single input modality, which limits both flexibility and output quality. The MMVID framework instead incorporates multiple control signals, such as text prompts and visual inputs, within a unified system. A single model can therefore generate diverse videos from varying combinations of textual instructions and visual inputs, as sketched below.
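The sketch below illustrates, under assumed tokenizers, vocabulary offsets, and separator ids, how optional text and image controls could be packed into one conditioning sequence so that text-only, image-only, and joint inputs all feed the same model. It is a simplified illustration, not the patent's specification.

import torch

TEXT_VOCAB_OFFSET = 0       # assumed: text ids occupy the low id range
IMAGE_VOCAB_OFFSET = 10000  # assumed: quantized image ids are shifted above text ids
SEP_ID = 9999               # assumed separator token between modalities

def build_condition(text_ids=None, image_ids=None):
    # Concatenate whichever control signals are present into one id sequence.
    parts = []
    if text_ids is not None:
        parts.append(text_ids + TEXT_VOCAB_OFFSET)
    if image_ids is not None:
        parts.append(image_ids + IMAGE_VOCAB_OFFSET)
    if not parts:
        raise ValueError("at least one control signal is required")
    sep = torch.tensor([SEP_ID])
    out = []
    for p in parts:
        out.extend([p, sep])
    return torch.cat(out)

# Text-only, image-only, and joint conditioning all produce a single sequence.
text = torch.tensor([5, 17, 42])    # stand-in for a tokenized prompt
image = torch.tensor([3, 3, 8, 1])  # stand-in for quantized image codes
print(build_condition(text_ids=text))
print(build_condition(image_ids=image))
print(build_condition(text_ids=text, image_ids=image))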

Framework Details

The MMVID framework builds on a two-stage image generation process that uses discrete feature representations. First, an autoencoder quantizes the visual data to form a foundational discrete representation. Then, model training employs a BERT module to establish correlations between the multimodal controls (textual and visual) and the learned video representations. Training involves tasks such as Masked Sequence Modeling (MSM), relevance estimation (REL), and video consistency estimation (VID) to ensure the generated videos are coherent with the input signals and temporally consistent, as illustrated below.
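As a rough illustration of how the three training tasks named above could be combined, the following sketch adds MSM, REL, and VID heads on top of a generic BERT-style encoder output and sums their losses. The head designs, pooling choice, label conventions, and equal loss weights are assumptions for illustration only, not the patent's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    def __init__(self, dim=256, codebook=1024):
        super().__init__()
        self.msm_head = nn.Linear(dim, codebook)  # predicts masked video tokens
        self.rel_head = nn.Linear(dim, 1)         # text/video relevance (binary)
        self.vid_head = nn.Linear(dim, 1)         # temporal consistency (binary)

    def forward(self, hidden, masked_pos, target_tokens, rel_label, vid_label):
        # hidden: (B, L, dim) encoder output; masked_pos: boolean mask over the L positions
        msm_logits = self.msm_head(hidden)
        msm_loss = F.cross_entropy(msm_logits[masked_pos], target_tokens[masked_pos])
        # REL and VID estimates are read off a pooled (first-token) feature
        pooled = hidden[:, 0]
        rel_loss = F.binary_cross_entropy_with_logits(
            self.rel_head(pooled).squeeze(-1), rel_label)
        vid_loss = F.binary_cross_entropy_with_logits(
            self.vid_head(pooled).squeeze(-1), vid_label)
        return msm_loss + rel_loss + vid_loss      # equal weights assumed

# Toy forward pass with random encoder features and labels.
B, L, D, V = 2, 16, 256, 1024
heads = MultiTaskHeads(D, V)
hidden = torch.randn(B, L, D)
masked_pos = torch.rand(B, L) < 0.3
targets = torch.randint(0, V, (B, L))
loss = heads(hidden, masked_pos, targets,
             rel_label=torch.ones(B), vid_label=torch.ones(B))
loss.backward()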

Applications and Benefits

The MMVID framework's ability to integrate diverse input modalities offers significant advantages in generating varied and contextually appropriate videos. It supports applications ranging from creative media production to automated content creation, where different visual elements can be synthesized dynamically in response to textual prompts. Experiments on multiple datasets, including the newly introduced Multimodal VoxCeleb dataset with extensive facial attribute annotations, demonstrate the framework's efficacy.