Invention Title:

SYSTEMS AND TECHNIQUES TO PERFORM 4D-GUIDED VIDEO GENERATION WITH DIFFUSION MODELS

Publication number:

US20250378674

Publication date:

Section:

Physics

Class:

G06V10/44

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

Embodiments introduce systems and techniques for generating four-dimensional (4D) videos using diffusion models. The system receives a prompt and an input mesh, renders depth and UV coordinate maps from the mesh, and uses them to generate keyframes. The keyframes undergo feature extraction, and the resulting features are injected into a diffusion model that produces the 4D video. UV-guided noise initialization and feature injection improve temporal consistency and quality.
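
Below is a minimal structural sketch of that pipeline in Python. The helper names (render_guides, extract_features, diffusion_denoise), the tensor shapes, and the toy arithmetic are placeholders invented for illustration, not the patent's actual implementation; a real system would rasterize the maps from the animated mesh and run a trained, guided diffusion model.

    import numpy as np

    def render_guides(mesh, num_frames, size=(64, 64)):
        """Stand-in for rasterizing per-frame depth and UV coordinate maps from the mesh."""
        h, w = size
        rng = np.random.default_rng(0)
        depth = rng.random((num_frames, h, w, 1))
        uv = rng.random((num_frames, h, w, 2))
        return depth, uv

    def extract_features(keyframes):
        """Stand-in for feature extraction on the generated keyframes."""
        return keyframes.mean(axis=(1, 2))          # toy per-keyframe descriptor

    def diffusion_denoise(latents, prompt, depth, uv, features=None):
        """Stand-in for the guided diffusion model; a real system would run a denoising
        loop conditioned on the prompt, depth, UV maps, and injected keyframe features
        (features are ignored in this toy stand-in)."""
        return 0.5 * latents + depth + 0.1 * uv.mean(axis=-1, keepdims=True)

    def generate_4d_video(prompt, mesh, num_frames=8):
        depth, uv = render_guides(mesh, num_frames)                  # 1. guide maps from the input mesh
        latents = np.random.default_rng(1).standard_normal(depth.shape)  # 2. noise init (UV-guided variant sketched later)
        keyframes = diffusion_denoise(latents[:2], prompt, depth[:2], uv[:2])  # 3. generate keyframes
        feats = extract_features(keyframes)                          # 4. features to inject
        return diffusion_denoise(latents, prompt, depth, uv, feats)  # 5. full 4D video

    video = generate_4d_video("a walking robot", mesh=None)
    print(video.shape)                                               # (8, 64, 64, 1)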

Background

Diffusion models are generative frameworks that iteratively transform a simple initial distribution, typically Gaussian noise, into a target data distribution. Widely used in computer vision and machine learning, they enable realistic data generation and are particularly effective for tasks such as image and video synthesis.
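
As a concrete illustration of that iterative transformation, the sketch below runs a DDPM-style reverse (denoising) loop with a toy stand-in for the trained noise-prediction network. The schedule values and the toy_denoiser function are assumptions made for the example, not part of the described system.

    import numpy as np

    T = 50                                    # number of diffusion steps
    betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    def toy_denoiser(x_t, t):
        """Stand-in for a trained network that predicts the noise present in x_t."""
        return x_t * np.sqrt(1.0 - alpha_bars[t])   # crude heuristic, not learned

    def reverse_diffusion(shape, rng):
        """Iteratively transform Gaussian noise into a sample of the target distribution."""
        x = rng.standard_normal(shape)              # start from pure noise
        for t in reversed(range(T)):
            eps_hat = toy_denoiser(x, t)            # predicted noise at step t
            # DDPM posterior mean update
            x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
            if t > 0:                               # add fresh noise except at the final step
                x += np.sqrt(betas[t]) * rng.standard_normal(shape)
        return x

    sample = reverse_diffusion((8, 8), np.random.default_rng(0))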

Challenges and Solutions

Traditional 3D content creation is labor-intensive and difficult to automate. Emerging text-to-video diffusion models offer automation but provide little control over scene layout and motion. The described systems combine dynamic 3D meshes with the generative capabilities of diffusion models, enabling high-quality, temporally consistent frame generation. The approach exploits the ground-truth surface correspondence that dynamic meshes provide across frames.
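
The sketch below illustrates, under simplifying assumptions, why a dynamic mesh yields such ground-truth correspondence: pixels in different frames that rasterize to the same UV coordinate depict the same surface point. The uv_correspondence helper, the quantization grid, and the synthetic UV maps are hypothetical details chosen for the example.

    import numpy as np

    def uv_correspondence(uv_a, uv_b, grid=256):
        """For each pixel of frame B, find the frame-A pixel that sees the same
        (quantized) UV location; (-1, -1) marks pixels with no match (disocclusion)."""
        qa = (uv_a * (grid - 1)).astype(int)        # quantized UVs, frame A
        qb = (uv_b * (grid - 1)).astype(int)        # quantized UVs, frame B
        h, w, _ = qa.shape
        lookup = {(int(qa[y, x, 0]), int(qa[y, x, 1])): (y, x)
                  for y in range(h) for x in range(w)}
        corr = np.full((h, w, 2), -1)
        for y in range(h):
            for x in range(w):
                corr[y, x] = lookup.get((int(qb[y, x, 0]), int(qb[y, x, 1])), (-1, -1))
        return corr

    # Synthetic UV maps standing in for rasterized ones: frame B is frame A shifted.
    h, w = 32, 32
    yy, xx = np.meshgrid(np.linspace(0, 0.9, h), np.linspace(0, 0.9, w), indexing="ij")
    uv_a = np.stack([xx, yy], axis=-1)
    uv_b = np.roll(uv_a, shift=3, axis=1)           # the surface shifted 3 pixels to the right
    corr = uv_correspondence(uv_a, uv_b)            # (32, 32, 2) frame-A pixel for each frame-B pixel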

Technical Details

The system integrates 3D workflows with text-to-image models to create 4D-guided animations. Correspondence information determined from the input meshes drives UV-space noise initialization, which promotes temporal consistency. Self-attention layers are extended to maintain a consistent spatial appearance across frames, while depth cues supplied to a control model provide structural guidance. The framework employs a pre-trained text-to-image diffusion model as a multi-frame renderer.
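
A minimal sketch of UV-space noise initialization follows, assuming a UV coordinate map has been rasterized for every frame (synthetic panning maps stand in for them here): one noise texture is sampled in UV space, and each frame looks up its initial latent noise through its UV map, so pixels showing the same surface point start from identical noise.

    import numpy as np

    rng = np.random.default_rng(0)
    TEX = 256                                     # resolution of the shared UV noise texture
    F, H, W = 4, 64, 64                           # frame count, frame height, frame width

    uv_noise = rng.standard_normal((TEX, TEX, 4)) # one noise texture for all frames (4 latent channels)

    def init_latent_from_uv(uv_map):
        """Gather per-pixel noise from the shared UV texture.

        uv_map: (H, W, 2) UV coordinates in [0, 1) rendered from the mesh for one frame.
        """
        u = (uv_map[..., 0] * (TEX - 1)).astype(int)
        v = (uv_map[..., 1] * (TEX - 1)).astype(int)
        return uv_noise[v, u]                     # (H, W, 4) initial latent noise

    # Synthetic UV maps with simple panning motion; a real pipeline would render
    # them from the dynamic mesh for every frame.
    yy, xx = np.meshgrid(np.linspace(0, 0.99, H), np.linspace(0, 0.99, W), indexing="ij")
    latents = np.stack([init_latent_from_uv(np.stack([(xx + 0.01 * f) % 1.0, yy], axis=-1))
                        for f in range(F)])       # (F, H, W, 4), aligned across frames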

Implementation and Advantages

The system converts scene-level proxy meshes into 4D animations using depth and UV coordinate maps as guiding channels. During the diffusion process, these inputs steer the generation of high-fidelity, consistent animations. The method improves on previous solutions by introducing a UV-space noise initialization mechanism and correspondence-aware attention, and it is applicable to a variety of machine learning systems. The approach enables efficient and controllable video generation.
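
To illustrate the correspondence-aware attention idea, the sketch below implements an extended self-attention step in which a frame's queries attend over keys and values drawn from reference keyframes as well as from the frame itself. The token layout, projection weights, and function name are assumptions made for this example, not the patent's actual architecture.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def extended_self_attention(frame_tokens, ref_tokens, Wq, Wk, Wv):
        """frame_tokens: (N, d) tokens of the current frame.
        ref_tokens: (M, d) tokens pooled from reference keyframes."""
        q = frame_tokens @ Wq                            # queries come from the current frame
        kv_src = np.concatenate([frame_tokens, ref_tokens], axis=0)
        k, v = kv_src @ Wk, kv_src @ Wv                  # keys/values span frame + keyframes
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return attn @ v                                  # appearance information shared across frames

    rng = np.random.default_rng(0)
    d = 32
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    cur = rng.standard_normal((64, d))       # current-frame tokens
    ref = rng.standard_normal((128, d))      # tokens from two keyframes
    out = extended_self_attention(cur, ref, Wq, Wk, Wv)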