Invention Title:

SINGLE IMAGE TO REALISTIC 3D OBJECT GENERATION VIA SEMI-SUPERVISED 2D AND 3D JOINT TRAINING

Publication number:

US20250111592

Section:

Physics

Class:

G06T15/20

Smart overview of the Invention

The demand for 3D content creation is growing due to advancements in virtual reality and augmented reality. Traditionally, creating high-quality 3D content required manual effort by experts. Artificial intelligence-based methods have been developed to automate this process, but they often produce suboptimal results. Existing solutions either rely on limited 3D data, which hampers generalization to new objects, or use only 2D data, leading to poor geometric accuracy. Addressing these limitations requires a model trained on both 2D and 3D data to generate accurate 3D content from a single 2D image.

Technical Field

The disclosed method pertains to artificial intelligence techniques for generating three-dimensional (3D) content. By integrating both 2D and 3D data during training, the approach aims to improve the quality and accuracy of the generated 3D models. This technique is particularly relevant for applications in virtual reality and augmented reality, where realistic and detailed 3D representations are crucial.

Summary

The proposed method trains a machine learning model on both 2D and 3D datasets so that it can create 3D content from a single 2D image. The 2D dataset consists of single-view images labeled with texture information, while the 3D dataset contains multi-view images labeled with geometry information. By combining the two, the model learns to generate detailed and accurate 3D representations from a single input image.
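
As a rough illustration of the two corpora, the sketch below models a 2D sample as a single view with a texture label and a 3D sample as multiple views with a geometry label. The field names, and the inclusion of camera poses, are assumptions made for illustration; the publication does not specify a data schema.

    from dataclasses import dataclass
    from typing import List

    import numpy as np


    @dataclass
    class Sample2D:
        """Single-view sample from the 2D dataset (texture supervision)."""
        image: np.ndarray          # H x W x 3 RGB view of the object
        texture_label: np.ndarray  # ground-truth appearance/texture target


    @dataclass
    class Sample3D:
        """Multi-view sample from the 3D dataset (geometry supervision)."""
        images: List[np.ndarray]        # several views of the same object
        camera_poses: List[np.ndarray]  # 4 x 4 extrinsics per view (assumed)
        geometry_label: np.ndarray      # e.g. depth maps, SDF grid, or mesh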

Methodology

  • Accessing Datasets: The process begins by accessing a 2D dataset of single-view images with texture labels and a 3D dataset of multi-view images with geometry labels.
  • Joint Training: The model is trained using both datasets, allowing it to learn textures from the 2D data and geometries from the 3D data, improving overall output quality.
  • Model Components: Training pairs an encoder, which maps the input image to a 3D representation, with a renderer, which converts that representation into 2D renderings for comparison against ground truths (a training-loop sketch follows this list).
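
A minimal sketch of how these components might be wired together in one joint training step follows, assuming a PyTorch-style encoder and differentiable renderer. The interfaces, batch keys, and loss weights are placeholders rather than names from the publication.

    import torch


    def joint_training_step(encoder, renderer, batch_2d, batch_3d,
                            texture_loss_fn, geometry_loss_fn, optimizer):
        """One semi-supervised step over a 2D batch and a 3D batch."""
        optimizer.zero_grad()

        # 2D branch: encode the single view, re-render it, supervise texture.
        rep_2d = encoder(batch_2d["image"])             # latent 3D representation
        render_2d = renderer(rep_2d, batch_2d["view"])  # render from the input view
        loss_tex = texture_loss_fn(render_2d, batch_2d["texture_label"])

        # 3D branch: encode one view, render the labeled viewpoints,
        # and supervise against the geometry ground truth.
        rep_3d = encoder(batch_3d["images"][:, 0])      # first view as input
        renders = renderer(rep_3d, batch_3d["views"])
        loss_geo = geometry_loss_fn(renders, batch_3d["geometry_label"])

        # Joint objective; the equal weighting is a hypothetical choice.
        loss = loss_tex + loss_geo
        loss.backward()
        optimizer.step()
        return float(loss.detach())

In this reading, the 2D branch supervises appearance from the input viewpoint, while the 3D branch forces the representation to explain held-out viewpoints, which is what couples texture and geometry learning.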

Implementation Details

The joint training is semi-supervised: during training, an encoder produces 3D representations from samples in both datasets, and a renderer then generates sets of 2D renderings from those representations. A loss function compares the renderings against ground truths to refine the model's accuracy in capturing textures and geometries, so that the generated 3D content is both visually accurate and geometrically sound.
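
One concrete, hypothetical reading of such a loss is sketched below: a photometric term compares renderings against reference images, and a geometry term, applied only when a sample carries a geometry label, compares rendered depth against ground-truth depth. The depth-map formulation and the weights are assumptions, not claims from the publication.

    import torch
    import torch.nn.functional as F


    def rendering_loss(pred_rgb, gt_rgb, pred_depth=None, gt_depth=None,
                       w_tex=1.0, w_geo=1.0):
        """Compare 2D renderings against ground truths (hypothetical form).

        pred_rgb, gt_rgb:     (B, 3, H, W) rendered vs. reference images
        pred_depth, gt_depth: (B, 1, H, W) rendered vs. reference depth
        """
        # Texture term: photometric error between rendering and reference.
        loss = w_tex * F.l1_loss(pred_rgb, gt_rgb)

        # Geometry term: only for samples with geometry ground truth,
        # i.e. those drawn from the 3D dataset.
        if pred_depth is not None and gt_depth is not None:
            loss = loss + w_geo * F.l1_loss(pred_depth, gt_depth)

        return loss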