Invention Title:

PHOTOREALISTIC CONTENT GENERATION FROM ANIMATED CONTENT BY NEURAL RADIANCE FIELD DIFFUSION GUIDED BY VISION-LANGUAGE MODELS

Publication number:

US20260112104

Publication date:

2026-04-23

Section:

Physics

Class:

G06T15/205

Inventors:

Wei WANG 🇺🇸 San Jose, CA, United States

Wei JIANG 🇺🇸 Sunnyvale, CA, United States

Yue Chen 🇺🇸 Saratoga, CA, United States

Applicant:

Huawei Technologies Co., Ltd. 🇨🇳 Shenzhen, China

Drawings (4 of 11)

Drawings 01 through 04 (images not included)

Smart overview of the Invention

The patent application describes a method for generating photorealistic three-dimensional (3D) video content from animated two-dimensional (2D) or 3D video content, using neural radiance field (NeRF) diffusion guided by vision-language models. The method obtains animated video content, a text prompt, and view information, and uses them to generate photorealistic 2D image frames. These frames are then used to render 3D image frames, from which novel images can be generated at new viewpoints.

Technical Field

The application relates to video content generation, specifically to transforming animated content into photorealistic 3D video. Photorealism aims to reproduce images as realistically as possible, akin to photographs, and is often used in advertising and marketing to visualize products.

Methodology

The method comprises three steps: (1) obtaining animated video content, a text prompt, and view information; (2) generating photorealistic 2D frames using a vision-language model; and (3) rendering 3D frames from these 2D frames using a 3D representation model. The process can be iterated to train the 3D representation model, improving its ability to produce high-quality novel 3D images from different viewpoints.
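
The data flow of these steps can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the implementation claimed in the application: the names run_pipeline, SceneModel, toy_generator, and ToyScene are hypothetical, frames are reduced to flat lists of pixel values, view information is reduced to a single camera angle, and the generator and scene model are trivial placeholders standing in for the vision-language-guided generator and the 3D representation model.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence

Frame = List[float]  # toy stand-in for an image (flat list of pixel values)
View = float         # toy stand-in for view information (a camera angle)


class SceneModel(Protocol):
    """Hypothetical interface for the 3D representation model."""

    def fit(self, frames: Sequence[Frame], views: Sequence[View]) -> None: ...

    def render(self, view: View) -> Frame: ...


@dataclass
class PipelineOutput:
    frames_2d: List[Frame]  # photorealistic 2D frames
    frames_3d: List[Frame]  # frames rendered from the 3D representation
    novel: Frame            # image rendered at a new viewpoint


def run_pipeline(
    animated: Sequence[Frame],
    prompt: str,
    views: Sequence[View],
    generate_2d: Callable[[Frame, str, View], Frame],
    scene: SceneModel,
    novel_view: View,
) -> PipelineOutput:
    # Generate one photorealistic 2D frame per animated frame,
    # conditioned on the text prompt and the view information.
    frames_2d = [generate_2d(f, prompt, v) for f, v in zip(animated, views)]
    # Build the 3D representation from the 2D frames and render 3D frames.
    scene.fit(frames_2d, views)
    frames_3d = [scene.render(v) for v in views]
    # Render a novel image from a viewpoint not in the input.
    return PipelineOutput(frames_2d, frames_3d, scene.render(novel_view))


def toy_generator(frame: Frame, prompt: str, view: View) -> Frame:
    # Placeholder only: brightens pixels instead of running a real model.
    return [min(1.0, p + 0.1) for p in frame]


class ToyScene:
    # Placeholder only: returns the stored frame whose view is nearest.
    def fit(self, frames: Sequence[Frame], views: Sequence[View]) -> None:
        self.frames, self.views = list(frames), list(views)

    def render(self, view: View) -> Frame:
        nearest = min(range(len(self.views)), key=lambda k: abs(self.views[k] - view))
        return self.frames[nearest]
```

Calling run_pipeline with two animated frames, views 0.0 and 90.0, and a novel view of 80.0 returns the two generated 2D frames, two rendered 3D frames, and a novel image taken from the frame nearest to the requested view. A real system would replace both placeholders with learned models.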

Implementation Details

Various optional implementations are described, such as using side information to improve 2D frame generation or employing a multi-modal conditioned reverse diffusion module. The model can be trained in stages, computing a loss and backpropagating it to refine the photorealistic output, with high-quality appearance diffusion modules further enhancing the final output.
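
As a rough illustration of staged training driven by loss computation and gradient updates, the sketch below fits a toy set of per-pixel parameters to target frames in two stages. Everything in it is an assumption made for illustration: the summary does not specify the loss, the stages, or the model, so mean squared error, the two-stage split (coarse targets, then refined targets), and the helper names mse and train_stage are all hypothetical. The gradient is written out by hand, which is what backpropagation would compute for a model this simple.

```python
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a rendered frame and a target frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_stage(
    params: List[float],
    targets: Sequence[Sequence[float]],
    lr: float,
    steps: int,
) -> List[float]:
    """Fit per-pixel parameters to target frames by gradient descent on MSE."""
    params = list(params)
    n = len(params)
    for _ in range(steps):
        for target in targets:
            # Toy "render": the parameters are themselves the rendered pixels.
            rendered = params
            # Hand-derived gradient: d(MSE)/d(param_i) = 2 * (rendered_i - target_i) / n
            grads = [2.0 * (r - t) / n for r, t in zip(rendered, target)]
            params = [p - lr * g for p, g in zip(params, grads)]
    return params


# Assumed stage 1: fit to initial (coarse) photorealistic frames.
coarse_targets = [[0.2, 0.8, 0.5]]
# Assumed stage 2: fit to frames refined by a later appearance module.
refined_targets = [[0.25, 0.75, 0.5]]

params = train_stage([0.0, 0.0, 0.0], coarse_targets, lr=0.5, steps=50)
loss_after_stage1 = mse(params, refined_targets[0])

params = train_stage(params, refined_targets, lr=0.5, steps=50)
loss_after_stage2 = mse(params, refined_targets[0])
```

Here the loss against the refined targets drops between the two stages, which mirrors the idea of a later stage refining an earlier result. In a real system the parameters would belong to the 3D representation model and the gradients would come from an automatic differentiation framework.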

Computing Device and Medium

The application also describes a computing device and a non-transitory computer-readable medium for executing these methods. The device comprises a memory and processors configured to perform the steps of obtaining content, generating 2D frames, rendering 3D frames, and generating novel images from new views.