US20250378674
2025-12-11
Physics
G06V10/44
Embodiments introduce systems and techniques for generating four-dimensional (4D) videos using diffusion models. The system receives a text prompt and an input mesh, renders depth and UV coordinate maps from the mesh, and generates keyframes from those maps. The keyframes undergo feature extraction, and the extracted features are injected into a diffusion model that produces the 4D video. UV-guided noise initialization and feature injection together improve the temporal consistency and quality of the result.
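The overall flow can be read as four stages. The sketch below traces them with hypothetical stub functions; the names, shapes, and interfaces are assumptions for illustration, not details from the filing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stubs; the filing does not name these interfaces.
def render_guide_maps(mesh, n_frames=8, h=64, w=64):
    """Rasterize the animated input mesh into the two guiding channels:
    per-frame depth maps and UV coordinate maps."""
    depth = rng.random((n_frames, h, w, 1))
    uv = rng.random((n_frames, h, w, 2))       # (u, v) per pixel
    return depth, uv

def generate_keyframes(prompt, depth, uv):
    """Stand-in keyframe generator conditioned on the guide maps."""
    n, h, w, _ = depth.shape
    return rng.random((n, h, w, 3))

def extract_features(keyframes):
    """Feature extraction on the keyframes; here a toy descriptor."""
    return keyframes.mean(axis=(1, 2))

def diffusion_model(prompt, depth, uv, features):
    """Stand-in diffusion model with feature injection, producing the
    final 4D video frames."""
    n, h, w, _ = depth.shape
    return rng.random((n, h, w, 3))

mesh = object()                                 # placeholder dynamic mesh
depth, uv = render_guide_maps(mesh)             # 1. guiding channels
keys = generate_keyframes("walking robot", depth, uv)        # 2. keyframes
feats = extract_features(keys)                  # 3. feature extraction
video = diffusion_model("walking robot", depth, uv, feats)   # 4. 4D video
print(video.shape)                              # (8, 64, 64, 3)
```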
Diffusion models are generative frameworks, widely used in computer vision and machine learning, that gradually corrupt data with noise and learn to reverse that corruption. By iteratively transforming an initial noise distribution into a target data distribution, they enable realistic data generation and are particularly effective for tasks such as image synthesis.
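The iterative transformation can be made concrete with a toy example. The sketch below is not the patent's model; it runs DDPM-style ancestral sampling against a 1D Gaussian target whose ideal denoiser is known in closed form, showing pure noise being transformed into the target distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution the reverse process should recover: N(mu, sigma^2).
mu, sigma = 3.0, 0.5
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)                        # cumulative alpha-bar_t

def x0_posterior_mean(x_t, t):
    """E[x_0 | x_t] in closed form for a Gaussian target; in practice a
    trained network approximates this (or the noise eps) instead."""
    a = np.sqrt(abar[t])
    var = abar[t] * sigma**2 + (1.0 - abar[t])
    return mu + a * sigma**2 * (x_t - a * mu) / var

# Start from pure noise and iteratively denoise (ancestral sampling).
x = rng.standard_normal(10_000)
for t in range(T - 1, -1, -1):
    x0_hat = x0_posterior_mean(x, t)
    eps_hat = (x - np.sqrt(abar[t]) * x0_hat) / np.sqrt(1.0 - abar[t])
    x = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(round(x.mean(), 2), round(x.std(), 2))     # approx. 3.0 and 0.5
```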
Traditional 3D content creation is labor-intensive and difficult to automate. Emerging text-to-video diffusion models offer automation but provide little control over scene layout and motion. The described systems combine dynamic 3D meshes with the generative capabilities of diffusion models, enabling high-quality, temporally consistent frame generation. The approach leverages the ground-truth correspondence provided by dynamic meshes to enhance video generation.
The system integrates 3D workflows with text-to-image models to create 4D-guided animations. By determining correspondence information from the input meshes, the system ensures temporal consistency through UV-space noise initialization. Self-attention layers are extended to maintain consistent spatial appearance across frames, while depth cues, supplied through a control model, provide structural guidance. The framework employs a pre-trained text-to-image diffusion model as a multi-frame renderer.
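One way to read the UV-space noise initialization: sample noise once in the mesh's texture (UV) space, then scatter it to each frame through that frame's UV coordinate map, so corresponding surface points start from identical noise. A minimal numpy sketch, where the texture resolution, channel count, and nearest-neighbor lookup are assumptions rather than details from the filing:

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared noise texture in UV space; resolution is an assumption.
TEX = 256
noise_tex = rng.standard_normal((TEX, TEX, 4))   # 4 latent channels

def init_noise_from_uv(uv_map):
    """UV-guided noise initialization (sketch): each pixel fetches its
    noise from the shared texture at its (u, v) coordinate, so the same
    surface point receives the same noise in every frame."""
    iu = np.clip((uv_map[..., 0] * (TEX - 1)).astype(int), 0, TEX - 1)
    iv = np.clip((uv_map[..., 1] * (TEX - 1)).astype(int), 0, TEX - 1)
    return noise_tex[iv, iu]                     # (h, w, 4)

# Two frames where the mesh has moved: the UV maps differ in screen
# space but reference the same surface, so their noise stays aligned.
uv_frame0 = rng.random((64, 64, 2))
uv_frame1 = np.roll(uv_frame0, shift=5, axis=1)  # toy screen-space motion
z0 = init_noise_from_uv(uv_frame0)
z1 = init_noise_from_uv(uv_frame1)
assert np.allclose(np.roll(z0, shift=5, axis=1), z1)
```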
The system converts scene-level proxy meshes into 4D animations using depth and UV coordinate maps as guiding channels. Fed through a diffusion process, these inputs guide the generation of high-fidelity, consistent animations. The method improves on previous solutions by introducing a UV-space noise initialization mechanism and correspondence-aware attention, both of which are applicable to a range of machine learning systems. The approach enables efficient and controllable video generation.
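The correspondence-aware attention can be illustrated with an extended self-attention layer in which each frame's queries attend to keys and values pooled from all frames, so appearance information is shared across time. The numpy sketch below shows only that sharing pattern; how the filing injects mesh correspondence into the layer is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extended_self_attention(q, k, v):
    """Extended self-attention (sketch): queries of each frame attend
    over keys/values pooled from *all* frames, sharing appearance
    information across time instead of attending per frame."""
    n_frames, tokens, dim = k.shape
    k_all = k.reshape(1, n_frames * tokens, dim)  # pool across frames
    v_all = v.reshape(1, n_frames * tokens, dim)
    attn = softmax(q @ k_all.transpose(0, 2, 1) / np.sqrt(dim))
    return attn @ v_all                           # (n_frames, tokens, dim)

# Toy shapes: 4 frames, 16 tokens each, 8 channels.
q = rng.standard_normal((4, 16, 8))
k = rng.standard_normal((4, 16, 8))
v = rng.standard_normal((4, 16, 8))
out = extended_self_attention(q, k, v)
print(out.shape)   # (4, 16, 8)
```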