Invention Title:

AUDIO-DRIVEN FACIAL ANIMATION SUPPORTING VARYING IDENTITIES AND SPEAKING STYLES

Publication number:

US20260105672

Publication date:

2026-04-16

Section:

Physics

Class:

G06T13/205

Inventors:

Yeongho Seol 🇰🇷 Seoul, South Korea

Zhengyu HUANG 🇨🇳 Shanghai, China

Dmitry KOROBCHENKO 🇬🇧 London, United Kingdom

Roger BLANCO RIBERA 🇰🇷 Seongnam-si, South Korea

Assignee:

NVIDIA Corporation 🇺🇸 Santa Clara, CA, United States

Applicant:

NVIDIA Corporation 🇺🇸 Santa Clara, CA, United States

Smart overview of the Invention

The patent application describes systems and methods for animating virtual actors or avatars using audio-driven animation. It focuses on improving the efficiency and accuracy of lip-synchronization in digital animations by employing machine-learning techniques. The system identifies animations for a mesh based on audio data and speaking styles, generating vertex deltas from these inputs. These deltas are used to update a machine-learning model, enabling it to produce animations that reflect different speaking styles and identities without excessive computational demands.

Background

Traditional methods for generating facial animations from audio data are resource-intensive and struggle to accurately map different speaking styles. These conventional approaches often require separate models for each actor or speaking style, leading to inefficiencies. The disclosed invention aims to overcome these limitations by using a single machine-learning model capable of handling multiple speaking styles, thereby enhancing computational efficiency and animation quality.

Technical Aspects

The system uses one or more processors or circuits to identify animations for a mesh and to generate vertex deltas, that is, per-vertex offsets measured from the mesh's neutral pose. These deltas are used to update the parameters of a machine-learning model so that the model outputs vertex deltas for a given input style and audio data. As part of this process, style vectors are generated from indications of speaking style, and the model parameters are updated accordingly. The updated model can then be executed on new audio data and style indications to create synchronized animations.
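To make the notion of vertex deltas concrete, here is a minimal sketch that assumes a mesh is a list of (x, y, z) vertex positions and that the neutral and animated meshes share the same vertex count and ordering. The function names vertex_deltas and apply_deltas are hypothetical and are not taken from the application.

```python
from typing import List, Tuple

# Assumption: a mesh is an ordered list of (x, y, z) vertex positions.
Vertex = Tuple[float, float, float]


def vertex_deltas(neutral: List[Vertex], animated: List[Vertex]) -> List[Vertex]:
    """Return per-vertex offsets of an animated frame from the neutral pose."""
    if len(neutral) != len(animated):
        raise ValueError("meshes must share the same vertex count and order")
    return [
        (ax - nx, ay - ny, az - nz)
        for (nx, ny, nz), (ax, ay, az) in zip(neutral, animated)
    ]


def apply_deltas(neutral: List[Vertex], deltas: List[Vertex]) -> List[Vertex]:
    """Rebuild a deformed mesh by adding deltas back onto the neutral pose."""
    if len(neutral) != len(deltas):
        raise ValueError("one delta is required per vertex")
    return [
        (nx + dx, ny + dy, nz + dz)
        for (nx, ny, nz), (dx, dy, dz) in zip(neutral, deltas)
    ]


# Hypothetical two-vertex "mesh" used only for illustration.
neutral_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
animated_frame = [(0.0, 0.5, 0.0), (1.0, -0.25, 0.0)]

deltas = vertex_deltas(neutral_pose, animated_frame)
rebuilt = apply_deltas(neutral_pose, deltas)
```

Storing deltas rather than absolute positions means the same offsets can, in principle, be added to any mesh with a matching vertex layout, which fits the summary's emphasis on a shared model.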

Implementation Details

The machine-learning model may include multiple layers, with audio data and style vectors provided as inputs to different layers. The system can modify meshes based on vertex deltas to create transformed meshes that match input audio data. Additionally, it supports blending meshes from different identities using weight values for speaking styles. The system can encode vertex deltas into data structures to further refine the machine-learning model.
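The idea of feeding audio data and style vectors into different layers can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not the architecture in the application: the two-layer layout, the ReLU activation, the concatenation of the style vector with the hidden activations, and every weight value below are hypothetical choices made for illustration.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Compute weights @ x + bias for a row-major weight matrix."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(x: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def forward(audio: Vector, style: Vector, params: Dict[str, list]) -> Vector:
    """Toy model: audio enters layer 1, the style vector enters layer 2."""
    # Layer 1 sees only the audio features.
    hidden = relu(linear(audio, params["w1"], params["b1"]))
    # Layer 2 sees the hidden activations concatenated with the style vector.
    return linear(hidden + style, params["w2"], params["b2"])


# Hypothetical hand-picked parameters: 2 audio features, 2 hidden units,
# a 2-element style vector, and 3 outputs (the delta of a single vertex).
params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]],
    "b1": [0.0, 0.0],
    "w2": [
        [1.0, 0.0, 0.5, 0.0],
        [0.0, 1.0, 0.0, 0.5],
        [0.0, 0.0, 1.0, 1.0],
    ],
    "b2": [0.0, 0.0, 0.0],
}

audio_features = [0.5, -1.0]
delta_style_a = forward(audio_features, [1.0, 0.0], params)
delta_style_b = forward(audio_features, [0.0, 1.0], params)
```

With the same audio features, the two style vectors yield different output deltas, which is the behavior the summary attributes to a single model serving several speaking styles.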

Applications

The system can be integrated into applications such as virtual reality, gaming, and digital content creation, where realistic facial animation is crucial. It can receive inputs from graphical user interface elements, such as sliders, that let users control speaking styles and generate animations synchronized with audio data. The ability to blend multiple facial meshes and handle diverse speaking styles makes the system a versatile tool for creating lifelike digital characters.
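One way slider input could drive a blend of speaking styles is sketched below. The application summary does not specify how slider values become weights, so the clamping to [0, 1], the normalization so that weights sum to one, and the names sliders_to_weights and blend_deltas are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]


def sliders_to_weights(sliders: Dict[str, float]) -> Dict[str, float]:
    """Clamp slider values to [0, 1] and normalize them to sum to 1."""
    clamped = {name: min(1.0, max(0.0, value)) for name, value in sliders.items()}
    total = sum(clamped.values())
    if total == 0.0:
        raise ValueError("at least one slider must be above zero")
    return {name: value / total for name, value in clamped.items()}


def blend_deltas(
    per_style_deltas: Dict[str, List[Vertex]],
    weights: Dict[str, float],
) -> List[Vertex]:
    """Return the weighted sum of per-style vertex deltas."""
    vertex_count = len(next(iter(per_style_deltas.values())))
    blended = [[0.0, 0.0, 0.0] for _ in range(vertex_count)]
    for style, weight in weights.items():
        deltas = per_style_deltas[style]
        if len(deltas) != vertex_count:
            raise ValueError("all styles must cover the same vertices")
        for i, (dx, dy, dz) in enumerate(deltas):
            blended[i][0] += weight * dx
            blended[i][1] += weight * dy
            blended[i][2] += weight * dz
    return [(x, y, z) for x, y, z in blended]


# Hypothetical slider positions and single-vertex deltas for two styles.
sliders = {"calm": 0.25, "excited": 0.75}
per_style_deltas = {
    "calm": [(0.0, 1.0, 0.0)],
    "excited": [(0.0, 2.0, 0.0)],
}

weights = sliders_to_weights(sliders)
blended = blend_deltas(per_style_deltas, weights)
```

Normalizing the weights keeps the blended deltas within the range spanned by the individual styles; a tool that wants exaggerated styles could skip that step.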