Invention Title:

GENERATING VIDEOS USING SEQUENCES OF GENERATIVE NEURAL NETWORKS

Publication number:

US20260187995

Publication date:

2026-07-02

Section:

Physics

Class:

G06V10/82

Inventors:

William Chan 🇨🇦 Toronto, Canada

Tim Salimans 🇳🇱 Utrecht, Netherlands

Chitwan Saharia 🇨🇦 Toronto, Canada

Jonathan Ho 🇺🇸 New York, NY, United States

Jay Ha Whang 🇺🇸 Mountain View, CA, United States

Applicant:

Google LLC 🇺🇸 Mountain View, CA, United States

Smart overview of the Invention

The patent application describes a system for generating videos from text prompts using sequences of generative neural networks. Initially, a text encoder neural network processes the text prompt to create a contextual embedding. This embedding is then used by a series of generative neural networks to produce a final video that depicts the scene described in the text.

Process Description

The process begins by receiving a text prompt that describes a scene. The text encoder neural network converts this prompt into a contextual embedding, which is then processed by an initial generative neural network. This initial network generates an output video at an initial spatial resolution and an initial temporal resolution. Subsequent generative neural networks in the sequence receive this video as input and enhance it by increasing either its spatial resolution or its temporal resolution.
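The flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the applicant's implementation: the names Video, encode_text, base_generator, spatial_sr, temporal_sr and generate are hypothetical, the networks are replaced by stubs that track only the video shape, and the starting size of 8 frames at 16 x 16 pixels is an arbitrary choice. Whether later stages also receive the contextual embedding is not stated in this summary, so the stubs take only the previous video.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Video:
    """Stand-in for a video tensor; only its shape is tracked (assumption)."""
    frames: int
    height: int
    width: int


# Hypothetical interfaces, introduced purely for illustration.
Embedding = List[float]
Stage = Callable[[Video], Video]


def encode_text(prompt: str) -> Embedding:
    """Placeholder for the text encoder: one dummy number per token."""
    return [float(len(token)) for token in prompt.split()]


def base_generator(embedding: Embedding) -> Video:
    """Placeholder for the initial network; the output size is arbitrary."""
    return Video(frames=8, height=16, width=16)


def spatial_sr(scale: int) -> Stage:
    """Build a stage that multiplies frame height and width by `scale`."""
    def stage(video: Video) -> Video:
        return Video(video.frames, video.height * scale, video.width * scale)
    return stage


def temporal_sr(scale: int) -> Stage:
    """Build a stage that multiplies the number of frames by `scale`."""
    def stage(video: Video) -> Video:
        return Video(video.frames * scale, video.height, video.width)
    return stage


def generate(prompt: str, stages: Sequence[Stage]) -> Video:
    """Encode the prompt, run the initial network, then refine stage by stage."""
    embedding = encode_text(prompt)
    video = base_generator(embedding)
    for stage in stages:
        video = stage(video)  # each stage consumes the previous stage's output
    return video


final = generate("a cat surfing a wave", [spatial_sr(2), temporal_sr(2)])
print(final)
```

Keeping every stage behind the same call signature means the order and number of spatial and temporal stages can be changed without touching the driver loop.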

Generative Neural Networks Sequence

The sequence of generative neural networks is designed to iteratively refine the video. Each subsequent network in the sequence receives the output of the previous network and enhances it further. The enhancement is either a higher spatial resolution, meaning more pixels in each video frame, or a higher temporal resolution, meaning more frames covering the same span of time (a higher frame rate).
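The difference between the two kinds of enhancement can be made concrete with a toy example. The sketch below uses naive nearest-neighbour and linear interpolation as stand-ins for the learned networks, which would synthesize new detail rather than copy or average existing values. The helper names upsample_spatial, upsample_temporal and shape are hypothetical, and a video is assumed to be a plain nested list of frames, rows and pixel intensities.

```python
from typing import List, Tuple

Frame = List[List[float]]  # rows of pixel intensities
Video = List[Frame]        # frames in time order


def upsample_spatial(video: Video, scale: int) -> Video:
    """Nearest-neighbour upsampling: each pixel becomes a scale x scale block."""
    out: Video = []
    for frame in video:
        new_frame: Frame = []
        for row in frame:
            wide_row = [px for px in row for _ in range(scale)]
            new_frame.extend(list(wide_row) for _ in range(scale))
        out.append(new_frame)
    return out


def upsample_temporal(video: Video) -> Video:
    """Insert one linearly interpolated frame between each consecutive pair."""
    if not video:
        return []
    out: Video = [video[0]]
    for prev, nxt in zip(video, video[1:]):
        mid = [[(a + b) / 2 for a, b in zip(r0, r1)]
               for r0, r1 in zip(prev, nxt)]
        out.extend([mid, nxt])
    return out


def shape(video: Video) -> Tuple[int, int, int]:
    """Return (frames, height, width) of a non-empty video."""
    return (len(video), len(video[0]), len(video[0][0]))


clip: Video = [[[0.0, 2.0]], [[4.0, 6.0]]]  # 2 frames, 1 row, 2 pixels
print(shape(upsample_spatial(clip, 2)))     # frame count unchanged, frames larger
print(shape(upsample_temporal(clip)))       # frame size unchanged, more frames
```

Spatial upsampling leaves the frame count alone and enlarges every frame, while temporal upsampling leaves the frame size alone and adds frames between the existing ones.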

Training and Implementation

The networks are trained on examples that each pair a text prompt with a corresponding video. The text encoder is pre-trained and is held fixed during this training. The generative neural networks can employ techniques such as spatial and temporal self-attention, convolution, and diffusion-based methods to generate the output video. These networks may also use classifier-free guidance and noise conditioning augmentation to improve video quality.
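Two of the techniques named above can be illustrated with short formulas. The sketch below is an assumption-laden illustration rather than the method claimed in the application. It uses one common formulation of classifier-free guidance, in which the guided prediction is the unconditional prediction plus a weight times the difference between the conditional and unconditional predictions. For noise conditioning augmentation it assumes a simple variance-preserving mix of the low-resolution input with Gaussian noise. The function names and the use of flat lists of floats in place of tensors are hypothetical.

```python
import math
import random
from typing import List, Sequence


def classifier_free_guidance(cond: Sequence[float],
                             uncond: Sequence[float],
                             weight: float) -> List[float]:
    """Combine conditional and unconditional predictions.

    weight = 0 returns the unconditional prediction, weight = 1 the
    conditional one, and weight > 1 pushes further toward the prompt.
    """
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


def noise_conditioning_augmentation(low_res: Sequence[float],
                                    level: float,
                                    rng: random.Random) -> List[float]:
    """Corrupt a low-resolution conditioning input with Gaussian noise.

    `level` in [0, 1] is the noise variance fraction (an assumed
    parameterization); a super-resolution stage would typically also be
    told which level was used.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must lie in [0, 1]")
    keep = math.sqrt(1.0 - level)
    mix = math.sqrt(level)
    return [keep * x + mix * rng.gauss(0.0, 1.0) for x in low_res]


guided = classifier_free_guidance([1.0, 2.0], [0.0, 1.0], weight=3.0)
print(guided)

rng = random.Random(0)
print(noise_conditioning_augmentation([0.5, -0.5], level=0.1, rng=rng))
```

Adding noise to the conditioning input during training is generally intended to make a later stage less sensitive to artifacts in the output of the stage before it.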

Advantages

The system is intended to generate high-resolution videos that accurately reflect the scenes described by text prompts. By using a sequence of generative neural networks, the system can enhance video quality progressively, so no single network has to produce the final resolution on its own. This approach allows the final video to have both high spatial resolution and high temporal resolution, giving detailed frames and smooth motion.