US20260141749
2026-05-21
Physics
G06V40/171
The patent application introduces a unified model for real-time makeup virtual try-on (VTO) on resource-constrained platforms such as mobile devices and web browsers. This model efficiently combines facial landmark detection and occlusion-aware segmentation, improving performance and reducing complexity compared to traditional approaches that use separate models. The unified model is specifically designed to enhance accuracy around critical areas like the eyes and lips, utilizing temporal information to optimize real-time performance by leveraging predictions from previous video frames.
The invention pertains to computer image processing, focusing on an occlusion-aware real-time tiny facial alignment model for makeup virtual try-on applications. These applications require precise face alignment to render makeup effects accurately, especially in dynamic environments like video calls and live streams where occlusions such as hands or objects can interfere with performance.
Real-time makeup VTO applications must keep facial landmark predictions accurate while handling occlusions effectively. Traditional methods integrate multiple separate models, which increases system complexity and slows inference. The proposed solution is a compact model that unifies the face alignment and segmentation tasks and offers two lightweight segmentation modules, one for the lips and one for the face. Users can choose the module that fits their needs, which provides a smoother and more robust VTO experience.
The model architecture features a novel facial alignment network structure tailored for VTO tasks, with a focus on the eye and lip regions. It includes a lightweight unified face alignment and segmentation model that operates in real time and is described as faster and smaller than multi-model approaches without sacrificing accuracy. The model also leverages temporal information, using predictions from the previous frame to guide predictions for the current frame, to enhance parallelism and reduce inference time.
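One common way to let a previous frame guide the current one is to derive the current face crop from the previous frame's landmarks instead of running a separate face detector on every frame. The sketch below illustrates that idea only; the summary does not state how the application uses temporal information, so the function name `crop_box_from_previous`, the `margin` parameter, and the full-frame fallback are all assumptions.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]          # (x, y) in pixel coordinates
Box = Tuple[int, int, int, int]      # (left, top, right, bottom), right/bottom exclusive


def crop_box_from_previous(
    prev_landmarks: Optional[List[Point]],
    frame_w: int,
    frame_h: int,
    margin: float = 0.25,            # hypothetical padding, as a fraction of the face size
) -> Box:
    """Return the region of the current frame to feed to the alignment model.

    Assumption: the face moves little between consecutive frames, so the
    padded bounding box of the previous landmarks still contains the face.
    """
    # No previous prediction (first frame or tracking lost): use the whole frame.
    if not prev_landmarks:
        return (0, 0, frame_w, frame_h)

    xs = [p[0] for p in prev_landmarks]
    ys = [p[1] for p in prev_landmarks]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)

    # Pad the tight landmark box, then clamp it to the frame bounds.
    left = max(0, int(min(xs) - margin * width))
    top = max(0, int(min(ys) - margin * height))
    right = min(frame_w, int(max(xs) + margin * width) + 1)
    bottom = min(frame_h, int(max(ys) + margin * height) + 1)
    return (left, top, right, bottom)
```

In a video loop, the landmarks predicted for frame t would be passed back in as `prev_landmarks` for frame t+1, and reset to `None` whenever the face is lost.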
The model is trained on a dataset of 6,000 subjects with diverse characteristics, using a 65-landmark system for detailed facial feature representation. To handle occlusions, the dataset is augmented with realistic scenarios involving common occluding objects such as hands and masks. The architecture comprises a shared backbone and multiple branches, including segmentation branches and point prediction branches. Makeup effects are rendered on the intersection of the region defined by the predicted points and the segmentation masks.
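The intersection step can be illustrated with a small, dependency-free sketch: a pixel receives makeup only if it lies inside the polygon traced by the predicted landmarks and is also marked visible by the segmentation mask. This is an illustration under stated assumptions, not the application's implementation; the helper names, the even-odd ray-casting test, and the 0/1 list-of-lists mask layout are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) in pixel coordinates
Mask = List[List[int]]        # mask[row][col], 1 = visible region pixel, 0 = not


def point_in_polygon(x: float, y: float, polygon: List[Point]) -> bool:
    """Even-odd ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def render_region(polygon: List[Point], seg_mask: Mask) -> Mask:
    """Return the pixels that are inside the landmark polygon AND set in seg_mask."""
    height = len(seg_mask)
    width = len(seg_mask[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Test the pixel centre so edge pixels are classified consistently.
            if seg_mask[row][col] and point_in_polygon(col + 0.5, row + 0.5, polygon):
                out[row][col] = 1
    return out


# Usage: a square "lip" polygon with one column occluded (e.g. by a finger).
lip_polygon = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)]
visible = [[1, 1, 0, 1, 1] for _ in range(5)]   # column 2 is occluded
makeup_pixels = render_region(lip_polygon, visible)
```

The occluded column is excluded from `makeup_pixels` even though it falls inside the landmark polygon, which is the behaviour that keeps makeup from being drawn over a hand or other object covering the face.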