US20250119624
2025-04-10
Electricity
H04N21/816
The patent application describes a method and system for generating synthetic videos using frame-wise token embeddings. The process begins with obtaining an input prompt that describes a video scene. Based on this prompt, a series of frame-wise token embeddings is generated, each corresponding to a respective frame in the video sequence. A video generation model then uses these embeddings to create a synthesized video composed of multiple images, one per frame of the sequence.
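As an illustration of this flow, the sketch below conditions each frame on the shared prompt embedding plus one learned per-frame token. It assumes a PyTorch implementation; FrameTokenGenerator, build_frame_conditioning, and all tensor shapes are hypothetical placeholders, not names or details taken from the application.

```python
import torch
import torch.nn as nn

class FrameTokenGenerator(nn.Module):
    """One learned token embedding per frame of the output sequence (hypothetical)."""
    def __init__(self, num_frames: int, embed_dim: int):
        super().__init__()
        self.frame_tokens = nn.Embedding(num_frames, embed_dim)

    def forward(self) -> torch.Tensor:
        idx = torch.arange(self.frame_tokens.num_embeddings)
        return self.frame_tokens(idx)            # (num_frames, embed_dim)

def build_frame_conditioning(text_embeds: torch.Tensor,
                             frame_tokens: torch.Tensor) -> torch.Tensor:
    # text_embeds: (seq_len, dim), shared across all frames.
    # frame_tokens: (num_frames, dim), one distinct token per frame.
    # Appending a distinct token gives each frame slightly different
    # conditioning, while the shared text embedding keeps frames coherent.
    return torch.stack([
        torch.cat([text_embeds, tok.unsqueeze(0)], dim=0)
        for tok in frame_tokens
    ])                                           # (num_frames, seq_len + 1, dim)
```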
The system includes a video generation model equipped with temporal layers that maintain coherence between frames. A frame-wise token generator produces tokens that are appended to the text embeddings, introducing variation across frames while preserving coherence; this yields more diverse motion than generation without frame-wise tokens. The model adapts existing image generation techniques by incorporating regularization losses and mapping layers to preserve temporal coherence.
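One common way to realize such temporal layers is self-attention across the frame axis at each spatial location, which is the pattern sketched below under the assumption of a convolutional per-frame backbone. TemporalAttention is a hypothetical module, not the application's exact design.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attends across frames at each spatial location to keep frames coherent."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # channels must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, channels, height, width) per-frame features.
        f, c, h, w = x.shape
        # Treat each spatial location as a batch item and the frames as
        # the sequence the attention mixes over.
        seq = x.permute(2, 3, 0, 1).reshape(h * w, f, c)
        normed = self.norm(seq)
        attended, _ = self.attn(normed, normed, normed)
        seq = seq + attended                     # residual keeps per-frame image features
        return seq.reshape(h, w, f, c).permute(2, 3, 0, 1)
```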
The training method involves obtaining a training set that includes a video and a descriptive training prompt. A temporal consistency loss is computed from this data and used to train the video generation model, so that the trained model can generate synthesized videos from input prompts while maintaining temporal consistency across frames.
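The application summary does not spell out the loss, but a plausible reading is a penalty on differences between adjacent frames. The sketch below assumes an MSE penalty on consecutive frame features combined with a reconstruction term; temporal_consistency_loss, training_step, and the model's return signature are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (num_frames, dim) features of consecutive frames.
    # Penalizing large jumps between neighbors pushes the model toward
    # smooth, temporally coherent motion.
    return F.mse_loss(frame_feats[1:], frame_feats[:-1])

def training_step(model, video, prompt_embeds, optimizer, weight=0.1):
    # Hypothetical model API: returns predicted frames and per-frame features.
    pred_frames, frame_feats = model(prompt_embeds)
    recon = F.mse_loss(pred_frames, video)       # match the training video
    loss = recon + weight * temporal_consistency_loss(frame_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```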
The system comprises at least one processor and memory containing executable instructions for video generation. The parameters of the video generation model are stored in memory, and the model is designed to produce synthesized videos from input prompts. A mapping network within the model generates regularized noise inputs, ensuring a temporally regularized noise distribution across frames.
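A minimal sketch of such a mapping network follows, assuming each frame's noise is formed from one shared mapped latent plus a small independent per-frame component, so the noise stays correlated ("temporally regularized") across frames. NoiseMappingNetwork and the mixing scheme are assumptions, not the claimed architecture.

```python
import torch
import torch.nn as nn

class NoiseMappingNetwork(nn.Module):
    """Maps one base latent to correlated per-frame noise inputs (hypothetical)."""
    def __init__(self, num_frames: int, noise_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(noise_dim, noise_dim), nn.SiLU(),
            nn.Linear(noise_dim, noise_dim),
        )
        self.num_frames = num_frames

    def forward(self, base_noise: torch.Tensor, mix: float = 0.2) -> torch.Tensor:
        # base_noise: (noise_dim,) one sample shared by all frames.
        shared = self.mlp(base_noise)            # mapped shared component
        per_frame = torch.randn(self.num_frames, base_noise.shape[0],
                                device=base_noise.device)
        # A mostly shared component keeps the distribution correlated
        # across frames; the small per-frame part adds variation.
        return (1 - mix) * shared.unsqueeze(0) + mix * per_frame
        # -> (num_frames, noise_dim)
```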
The described system can be implemented in various computing environments, including servers that process user inputs received over a network. Users provide text prompts through an interface; each prompt is encoded into text embeddings augmented with frame-wise tokens that capture unique per-frame features. The system then generates and outputs a synthetic video, with potential applications in creative workflows where automated video generation is desired.
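Tying the pieces together, a hypothetical server-side entry point might look like the following; text_encoder, token_gen, and video_model refer to the placeholder modules sketched above, not to components named in the application.

```python
import torch

def generate_from_prompt(prompt: str) -> torch.Tensor:
    # Encode the user's prompt, append one frame-wise token per frame,
    # and run the video generation model.
    text_embeds = text_encoder(prompt)                    # (seq_len, dim)
    cond = build_frame_conditioning(text_embeds, token_gen())
    frames = video_model(cond)                            # (num_frames, C, H, W)
    return frames   # encoded and returned to the user as a video downstream
```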