US20250119624
2025-04-10
Electricity
H04N21/816
The patent application describes a method and system for generating synthetic videos using frame-wise token embeddings. The process begins with obtaining an input prompt that describes a video scene. Based on this prompt, a series of frame-wise token embeddings is generated, each corresponding to a respective frame in the video sequence. A video generation model then uses these embeddings to create a synthesized video composed of multiple images, one per frame of the sequence.
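As an illustration of this flow, the sketch below conditions each frame on the shared prompt embedding plus one learned per-frame token. It assumes a PyTorch implementation; FrameTokenGenerator, build_frame_conditioning, and all tensor shapes are hypothetical placeholders, not names or details taken from the application.

```python
import torch
import torch.nn as nn

class FrameTokenGenerator(nn.Module):
    """One learned token embedding per frame of the output sequence (hypothetical)."""
    def __init__(self, num_frames: int, embed_dim: int):
        super().__init__()
        self.frame_tokens = nn.Embedding(num_frames, embed_dim)

    def forward(self) -> torch.Tensor:
        idx = torch.arange(self.frame_tokens.num_embeddings)
        return self.frame_tokens(idx)            # (num_frames, embed_dim)

def build_frame_conditioning(text_embeds: torch.Tensor,
                             frame_tokens: torch.Tensor) -> torch.Tensor:
    # text_embeds: (seq_len, dim), shared across all frames.
    # frame_tokens: (num_frames, dim), one distinct token per frame.
    # Appending a distinct token gives each frame slightly different
    # conditioning, while the shared text embedding keeps frames coherent.
    return torch.stack([
        torch.cat([text_embeds, tok.unsqueeze(0)], dim=0)
        for tok in frame_tokens
    ])                                           # (num_frames, seq_len + 1, dim)
```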
The system includes a video generation model equipped with temporal layers that maintain coherence between frames. A frame-wise token generator produces tokens that are appended to the text embeddings, introducing variation across frames while preserving coherence; this yields more diverse motion than generation without frame-wise tokens. The model adapts existing image generation techniques by incorporating regularization losses and mapping layers to preserve temporal coherence.
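One common way to realize such temporal layers is self-attention across the frame axis at each spatial location, which is the pattern sketched below under the assumption of a convolutional per-frame backbone. TemporalAttention is a hypothetical module, not the application's exact design.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attends across frames at each spatial location to keep frames coherent."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # channels must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, channels, height, width) per-frame features.
        f, c, h, w = x.shape
        # Treat each spatial location as a batch item and the frames as
        # the sequence the attention mixes over.
        seq = x.permute(2, 3, 0, 1).reshape(h * w, f, c)
        normed = self.norm(seq)
        attended, _ = self.attn(normed, normed, normed)
        seq = seq + attended                     # residual keeps per-frame image features
        return seq.reshape(h, w, f, c).permute(2, 3, 0, 1)
```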
The training method involves obtaining a training set that includes a video and a descriptive training prompt. A temporal consistency loss is computed from this data and used to train the video generation model, so that the trained model can generate synthesized videos from input prompts while maintaining temporal consistency across frames.
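The application summary does not spell out the loss, but a plausible reading is a penalty on differences between adjacent frames. The sketch below assumes an MSE penalty on consecutive frame features combined with a reconstruction term; temporal_consistency_loss, training_step, and the model's return signature are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (num_frames, dim) features of consecutive frames.
    # Penalizing large jumps between neighbors pushes the model toward
    # smooth, temporally coherent motion.
    return F.mse_loss(frame_feats[1:], frame_feats[:-1])

def training_step(model, video, prompt_embeds, optimizer, weight=0.1):
    # Hypothetical model API: returns predicted frames and per-frame features.
    pred_frames, frame_feats = model(prompt_embeds)
    recon = F.mse_loss(pred_frames, video)       # match the training video
    loss = recon + weight * temporal_consistency_loss(frame_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```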
The system comprises at least one processor and memory containing executable instructions for video generation. The parameters of the video generation model are stored in memory, and the model is designed to produce synthesized videos from input prompts. A mapping network within the model generates regularized noise inputs, ensuring a temporally regularized noise distribution across frames.
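A minimal sketch of such a mapping network follows, assuming each frame's noise is formed from one shared mapped latent plus a small independent per-frame component, so the noise stays correlated ("temporally regularized") across frames. NoiseMappingNetwork and the mixing scheme are assumptions, not the claimed architecture.

```python
import torch
import torch.nn as nn

class NoiseMappingNetwork(nn.Module):
    """Maps one base latent to correlated per-frame noise inputs (hypothetical)."""
    def __init__(self, num_frames: int, noise_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(noise_dim, noise_dim), nn.SiLU(),
            nn.Linear(noise_dim, noise_dim),
        )
        self.num_frames = num_frames

    def forward(self, base_noise: torch.Tensor, mix: float = 0.2) -> torch.Tensor:
        # base_noise: (noise_dim,) one sample shared by all frames.
        shared = self.mlp(base_noise)            # mapped shared component
        per_frame = torch.randn(self.num_frames, base_noise.shape[0],
                                device=base_noise.device)
        # A mostly shared component keeps the distribution correlated
        # across frames; the small per-frame part adds variation.
        return (1 - mix) * shared.unsqueeze(0) + mix * per_frame
        # -> (num_frames, noise_dim)
```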
The described system can be implemented in various computing environments, including servers that process user inputs received over a network. Users provide text prompts through an interface; each prompt is encoded into text embeddings augmented with frame-wise tokens that capture unique per-frame features. The system then generates and outputs a synthetic video, with potential applications in creative workflows where automated video generation is desired.
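Tying the pieces together, a hypothetical server-side entry point might look like the following; text_encoder, token_gen, and video_model refer to the placeholder modules sketched above, not to components named in the application.

```python
import torch

def generate_from_prompt(prompt: str) -> torch.Tensor:
    # Encode the user's prompt, append one frame-wise token per frame,
    # and run the video generation model.
    text_embeds = text_encoder(prompt)                    # (seq_len, dim)
    cond = build_frame_conditioning(text_embeds, token_gen())
    frames = video_model(cond)                            # (num_frames, C, H, W)
    return frames   # encoded and returned to the user as a video downstream
```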