Invention Title:

GROUNDED HUMAN MOTION GENERATION WITH OPEN VOCABULARY SCENE-AND-TEXT CONTEXTS

Publication number:

US20250308156

Publication date:

Section:

Physics

Class:

G06T17/00

Inventors:

Assignees:

Applicant:

Smart overview of the Invention

The patent application outlines a method for generating human motion in 3D scenes using open vocabulary scene-and-text contexts. The approach receives a 3D point cloud of a scene together with natural language instructions that refer to a goal object within the scene. A text tokenizer processes the instructions, and a pre-trained vision-language model produces text features from the tokens. Scene features are produced by a pre-trained U-Net scene encoder and then downsampled. The downsampled scene features are fused with the text features to form a conditional latent, which is used to predict motion parameters for a parametric human body model.
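
A minimal sketch, assuming a PyTorch-style implementation, of how this flow could be wired together is shown below. The module dimensions, strided downsampling of scene features, and cross-attention fusion are illustrative assumptions rather than the specific design disclosed in the application.

```python
import torch
import torch.nn as nn

class GroundedMotionPipeline(nn.Module):
    """Minimal sketch of the described flow; sizes and fusion choice are assumptions."""

    def __init__(self, text_encoder, scene_encoder, d_text=512, d_scene=64,
                 d_latent=256, n_body_params=75):
        super().__init__()
        self.text_encoder = text_encoder        # frozen pre-trained vision-language text tower
        self.scene_encoder = scene_encoder      # pre-trained U-Net-style point cloud encoder
        self.text_proj = nn.Linear(d_text, d_latent)
        self.scene_proj = nn.Linear(d_scene, d_latent)
        self.fuse = nn.MultiheadAttention(d_latent, num_heads=4, batch_first=True)
        self.motion_head = nn.Linear(d_latent, n_body_params)   # per-frame body-model parameters

    def forward(self, points, token_ids, n_frames=60, stride=8):
        scene_feats = self.scene_encoder(points)                    # (B, N, d_scene) per-point features
        scene_feats = self.scene_proj(scene_feats[:, ::stride])     # downsample the scene features
        text_feats = self.text_proj(self.text_encoder(token_ids))   # (B, d_latent) text features
        # Fuse text with scene features via cross-attention to form a conditional latent
        cond, _ = self.fuse(text_feats.unsqueeze(1), scene_feats, scene_feats)  # (B, 1, d_latent)
        cond = cond.expand(-1, n_frames, -1)                        # broadcast over motion frames
        return self.motion_head(cond)                               # (B, n_frames, n_body_params)
```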

Challenges in Traditional Methods

Traditional methods of generating human motion in 3D scenes from textual descriptions face several challenges. Producing diverse and semantically consistent motions is costly and time-consuming. Conventional approaches also tend to be biased toward centering motions within scenes and rely on closed vocabularies, which limits their ability to handle diverse textual inputs. In addition, such methods may suffer from mismatches between text and image embeddings, leading to inaccurate motion generation.

Innovative Features

The proposed method addresses these challenges by enabling grounded human motion generation with open vocabulary contexts. This involves training the system to align text embeddings with 3D point cloud features and establishing a framework for text-and-scene-conditional motion generation. The system replaces closed vocabulary pretraining with open vocabulary knowledge distillation and introduces novel regularization losses to refine the grounding of text and scene data. These innovations aim to improve the accuracy and efficiency of human motion generation.
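
As one illustration of how text embeddings could be aligned with 3D point cloud features, the sketch below uses a symmetric contrastive loss against a frozen vision-language text encoder. The application's exact distillation and regularization losses are not specified here, so this is an assumption about the general form.

```python
import torch
import torch.nn.functional as F

def open_vocab_alignment_loss(point_feats, text_feats, temperature=0.07):
    """Illustrative contrastive alignment of point cloud features with frozen
    vision-language text embeddings (an assumed loss form, not the patent's)."""
    p = F.normalize(point_feats, dim=-1)   # (B, D) features of goal objects or regions
    t = F.normalize(text_feats, dim=-1)    # (B, D) matching open-vocabulary text embeddings
    logits = p @ t.t() / temperature       # pairwise cosine similarities
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric cross-entropy: matched pairs are pulled together, mismatched pairs pushed apart
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```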

System Architecture

The system architecture comprises several components: a pre-trained vision-language model, a U-Net scene encoder, a downsampler, a fusion module, and a conditional motion generator. Together, these elements process the inputs, a 3D point cloud and natural language instructions, and generate 3D human meshes for multiple motion frames. The system can be implemented on a variety of computing devices, including servers, smartphones, and gaming devices, providing flexibility in deployment.
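
The step from predicted motion parameters to per-frame meshes could look like the sketch below, which assumes an SMPL-style parametric body model accessed through the smplx library and a hypothetical per-frame parameter layout; the application itself does not name a specific body model.

```python
import torch
import smplx  # common parametric human body model library (an assumption; no specific model is named)

def params_to_meshes(motion_params, body_model):
    """Convert predicted per-frame parameters into 3D human meshes.
    Assumed layout per frame: 3 translation + 3 global orientation + 69 body pose values."""
    transl, global_orient, body_pose = motion_params.split([3, 3, 69], dim=-1)
    out = body_model(transl=transl, global_orient=global_orient, body_pose=body_pose)
    return out.vertices   # (n_frames, n_vertices, 3): one mesh per motion frame

# Usage sketch for 60 predicted frames:
# body_model = smplx.create("body_models/", model_type="smpl", gender="neutral", batch_size=60)
# meshes = params_to_meshes(predicted_params, body_model)   # predicted_params: (60, 75)
```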

Pre-trained Model Utilization

The method leverages pre-trained models like CLIP for semantic understanding of visual information and open vocabulary processing. These models map images and text into a common latent space for effective comparison and retrieval. By utilizing open vocabulary capabilities, the system can handle a broad range of terms not explicitly included in its training dataset, enhancing its adaptability and performance in diverse scenarios.
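
A short example of this common latent space idea, using the publicly available CLIP model through the Hugging Face transformers API; the specific checkpoint and instruction phrases are illustrative, not taken from the application.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Open-vocabulary instructions: these phrases need not appear in any fixed label set
texts = ["walk to the red armchair", "sit down on the wooden bench near the window"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    text_feats = model.get_text_features(**inputs)                 # (2, 512) text embeddings
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)    # unit-normalize for cosine similarity

# Image or scene embeddings projected into the same space can then be ranked by dot product
print(text_feats @ text_feats.t())
```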