US20260038179
2026-02-05
Physics
G06T13/205
The patent application outlines a system for creating photorealistic 3D talking faces using only audio input. It utilizes machine learning models to predict both face geometry and texture from audio signals, which are then combined to form a 3D mesh model. This model can be integrated into existing videos or virtual environments, providing a versatile tool for various multimedia applications.
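To make the overall pipeline concrete, the following is a minimal illustrative sketch (not the patent's actual architecture) of an audio-driven model with two heads, one regressing 3D face geometry and one predicting a 2D texture atlas, whose outputs would then be combined into a textured 3D mesh. All names and dimensions (AudioToFace, N_VERTICES, ATLAS_RES, layer sizes) are assumptions for illustration.

```python
import torch
import torch.nn as nn

N_VERTICES = 468   # assumed face-mesh size
ATLAS_RES = 128    # assumed texture-atlas resolution

class AudioToFace(nn.Module):
    """Toy two-head model: audio features -> (vertex positions, texture atlas)."""
    def __init__(self, audio_feat_dim=80, hidden=256):
        super().__init__()
        # shared encoder over per-frame audio features
        self.encoder = nn.Sequential(
            nn.Linear(audio_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # head 1: 3D face geometry (per-vertex xyz)
        self.geometry_head = nn.Linear(hidden, N_VERTICES * 3)
        # head 2: 2D texture atlas (RGB image in [0, 1])
        self.texture_head = nn.Sequential(
            nn.Linear(hidden, 3 * ATLAS_RES * ATLAS_RES), nn.Sigmoid()
        )

    def forward(self, audio_features):
        z = self.encoder(audio_features)                        # (B, hidden)
        verts = self.geometry_head(z).view(-1, N_VERTICES, 3)   # mesh geometry
        atlas = self.texture_head(z).view(-1, 3, ATLAS_RES, ATLAS_RES)
        return verts, atlas

# usage: one 80-dimensional audio feature vector per frame
model = AudioToFace()
verts, atlas = model(torch.randn(1, 80))
```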
The system employs a machine-learned model that decouples 3D geometry, head pose, and texture by decomposing faces from video into a normalized space. This separation allows for regression over the 3D face shape and the corresponding 2D texture atlas. An auto-regressive approach is used to stabilize temporal dynamics by conditioning the model on its previous visual state, improving the realism of generated sequences.
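The "normalized space" idea can be illustrated with a standard rigid (Procrustes/Kabsch) alignment: per-frame tracked face vertices are aligned to a canonical template so that head pose is factored out and only the speech-driven deformation remains for the model to regress. This is a sketch under that assumption, not the patent's specific normalization; the auto-regressive conditioning is sketched separately after the encoder-decoder paragraph below.

```python
import numpy as np

def remove_head_pose(frame_verts, template_verts):
    """Rigidly align frame vertices to a canonical template.

    frame_verts, template_verts: (N, 3) arrays of corresponding vertices.
    Returns pose-normalized vertices plus the estimated rigid pose (R, t).
    """
    mu_f = frame_verts.mean(axis=0)
    mu_t = template_verts.mean(axis=0)
    # cross-covariance between centered point sets
    H = (frame_verts - mu_f).T @ (template_verts - mu_t)
    U, _, Vt = np.linalg.svd(H)
    # correct for a possible reflection in the SVD solution
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T              # rotation taking frame pose -> canonical pose
    t = mu_t - R @ mu_f
    normalized = (R @ frame_verts.T).T + t
    return normalized, R, t
```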
The technology has numerous applications, such as creating personalized and voice-controlled avatars for games and VR, auto-translating and dubbing videos, and general video editing. It also offers potential for multimedia communication compression by transmitting only audio and recreating the visual content when needed. The generated 3D mesh can be used in both 2D video editing and 3D environments.
Models are trained using synchronized audio and video data, simplifying data preparation and model architecture. The framework supports personalized models that capture individual speaker traits, enhancing realism. This approach reduces the degrees of freedom to speech-related features, allowing for plausible model generation from short videos.
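A minimal sketch of how such synchronized training pairs might be assembled: each video frame is matched with the window of audio spectrogram frames centered on its timestamp, yielding aligned (audio window, visual target) tuples for supervision. The frame rate, hop length, and window size are illustrative assumptions, not values from the patent.

```python
def make_training_pairs(spectrogram, video_frames, fps=30.0,
                        hop_seconds=0.01, window=16):
    """Pair each video frame with its centered audio spectrogram window.

    spectrogram: array-like of shape (T_audio, n_mels)
    video_frames: sequence of per-frame visual targets (geometry, texture, ...)
    """
    pairs = []
    half = window // 2
    for i, frame in enumerate(video_frames):
        t = i / fps                               # frame timestamp in seconds
        center = int(round(t / hop_seconds))      # matching spectrogram column
        lo, hi = center - half, center + half
        if lo < 0 or hi > len(spectrogram):
            continue                              # skip frames lacking full audio context
        pairs.append((spectrogram[lo:hi], frame))
    return pairs
```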
To ensure temporal consistency and photorealism, the system uses an encoder-decoder framework that computes embeddings from audio spectrograms and predicts 3D geometry and texture from them. An auto-regressive framework further improves temporal smoothness by conditioning texture generation on both the audio and the previously generated output. The model also incorporates a 3D-normalized fixed texture atlas for consistent face illumination during video re-synthesis.
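The sketch below illustrates this encoder-decoder and auto-regressive conditioning under stated assumptions: an audio encoder embeds a spectrogram window, and a texture decoder is conditioned on both that embedding and the previously generated texture atlas, which is the temporal-smoothing mechanism described above. Module names, dimensions, and the fusion scheme are illustrative, not the patent's specification.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Embed a window of spectrogram frames into a fixed-size vector."""
    def __init__(self, n_mels=80, window=16, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * window, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, spec_window):            # (B, window, n_mels)
        return self.net(spec_window)           # (B, embed_dim)

class TextureDecoder(nn.Module):
    """Decode a texture atlas from the audio embedding and the previous atlas."""
    def __init__(self, embed_dim=128, atlas_res=64):
        super().__init__()
        self.atlas_res = atlas_res
        # the previous atlas is encoded and fused with the audio embedding
        self.prev_enc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * atlas_res * atlas_res, embed_dim), nn.ReLU(),
        )
        self.decode = nn.Sequential(
            nn.Linear(2 * embed_dim, 3 * atlas_res * atlas_res), nn.Sigmoid(),
        )

    def forward(self, audio_embed, prev_atlas):
        h = torch.cat([audio_embed, self.prev_enc(prev_atlas)], dim=-1)
        return self.decode(h).view(-1, 3, self.atlas_res, self.atlas_res)

# usage: roll out a short sequence auto-regressively
enc, dec = AudioEncoder(), TextureDecoder()
prev = torch.zeros(1, 3, 64, 64)               # initial "previous" texture
for spec in torch.randn(5, 1, 16, 80):         # five consecutive audio windows
    prev = dec(enc(spec), prev)                # each frame conditions the next
```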