Invention Title:

ROBUST FACIAL ANIMATION FROM VIDEO AND AUDIO

Publication number:

US20260134604

Publication date:

2026-05-14

Section:

Physics

Class:

G06T13/40

Inventors:

Kiran BHAT 🇺🇸 San Francisco, CA, United States

William Welch 🇺🇸 San Francisco, CA, United States

Ian Sachs 🇺🇸 Corte Madera, CA, United States

Inaki NAVARRO 🇨🇭 Zurich, Switzerland

Tijmen VERHULSDONCK 🇸🇪 Göteborg, Sweden

Eloi DU BOIS 🇺🇸 Austin, TX, United States

Dario KNEUBUEHLER 🇨🇭 Zurich, Switzerland

Charles SHANG 🇺🇸 San Mateo, CA, United States

Assignee:

Roblox Corporation 🇺🇸 San Mateo, CA, United States

Applicant:

Roblox Corporation 🇺🇸 San Mateo, CA, United States

Smart overview of the Invention

Implementations generate real-time facial animation for 3D avatars from video and audio input. A camera captures video of a user's face, and a trained face detection model and a regression model process the frames to produce video FACS (Facial Action Coding System) weights, head poses, and facial landmarks. Concurrently, audio captured by a microphone is analyzed by a trained facial movement detection model and a regression model to output audio FACS weights. A blending term helps identify lapses in the audio and guides the fusion of the video and audio FACS weights into the final FACS weights used to animate the avatar.

The technical field is facial animation for virtual experiences, covering methods, systems, and computer-readable media. Users of online platforms, such as gaming or media exchange platforms, interact in virtual environments through avatars. Traditionally, avatar animation relied on explicit user input to trigger gestures and movements, which limited its expressiveness. The described approach automates facial animation by combining video and audio input, aiming for more dynamic and accurate results.

Key aspects include receiving video and audio frames and processing them through machine learning models. Video frames yield video FACS weights, while audio frames produce audio FACS weights. These are combined using a blending term to drive the facial animation of a 3D model. The models are trained using a semi-supervised process and include multiple task-specific decoders for detailed outputs like head poses and facial landmarks.
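The fusion step can be illustrated with a minimal sketch. The summary does not give the blending formula, so the linear interpolation below is an assumption, as are the function name `blend_facs_weights` and the convention that the blending term lies in [0, 1] with 0 meaning an audio lapse (fall back to video) and 1 meaning the audio estimate is fully trusted.

```python
from typing import List, Sequence


def blend_facs_weights(
    video_weights: Sequence[float],
    audio_weights: Sequence[float],
    blend: float,
) -> List[float]:
    """Fuse per-frame video and audio FACS weights (hypothetical sketch).

    Assumption: `blend` is in [0, 1]; 0 means the audio has lapsed and only
    the video weights are used, 1 means only the audio weights are used.
    """
    if len(video_weights) != len(audio_weights):
        raise ValueError("video and audio weight vectors must have equal length")
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must be in [0, 1]")

    fused = []
    for v, a in zip(video_weights, audio_weights):
        w = blend * a + (1.0 - blend) * v  # linear interpolation (assumed)
        fused.append(min(1.0, max(0.0, w)))  # keep weights in the valid range
    return fused


# Example: two FACS controls, audio judged half as reliable as video.
final_weights = blend_facs_weights([0.2, 0.8], [0.6, 0.0], blend=0.5)
```

In this sketch a single scalar blends every control equally. An actual implementation might apply the term only to controls that audio can inform, such as those around the mouth, but the summary does not say.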

System components include memory with stored instructions and a processing device to execute these instructions. The system processes video and audio inputs to obtain and combine FACS weights, ultimately animating a 3D model. The models involved are structured with encoders and task-specific decoders, each designed to handle specific outputs. The blending of video and audio data ensures cohesive and realistic avatar animations.
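The encoder and task-specific decoder layout can be sketched structurally as follows. This is only an illustration of the shape of such a model, not the patented architecture: the class `MultiHeadModel`, the toy encoder, and the three lambda decoders are hypothetical stand-ins for trained networks.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class MultiHeadModel:
    """One shared encoder feeding several task-specific decoders (sketch)."""

    def __init__(
        self,
        encoder: Callable[[Sequence[float]], Vector],
        decoders: Dict[str, Callable[[Vector], Vector]],
    ) -> None:
        self.encoder = encoder
        self.decoders = decoders

    def __call__(self, frame: Sequence[float]) -> Dict[str, Vector]:
        features = self.encoder(frame)  # computed once, shared by all heads
        return {name: decode(features) for name, decode in self.decoders.items()}


def toy_encoder(frame: Sequence[float]) -> Vector:
    # Stand-in for a learned encoder: mean and range of the input values.
    mean = sum(frame) / len(frame)
    return [mean, max(frame) - min(frame)]


model = MultiHeadModel(
    encoder=toy_encoder,
    decoders={
        # Each head maps the shared features to one kind of output.
        "facs_weights": lambda f: [min(1.0, max(0.0, f[0]))],
        "head_pose": lambda f: [f[1], 0.0, 0.0],
        "landmarks": lambda f: [f[0], f[1]],
    },
)

outputs = model([0.2, 0.4, 0.6])
```

Sharing one encoder means the expensive feature extraction runs once per frame, while each decoder stays small and dedicated to a single output.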

The approach also encompasses a non-transitory computer-readable medium storing instructions that, when executed, perform the described operations. The description allows for various implementations and modifications. By integrating video and audio through trained models and a blending term, the approach aims to produce more lifelike animation for virtual avatars.