US20250315631
2025-10-09
Physics
G06F40/58
The patent application describes a neural system that translates speech in videos while preserving the speaker's facial expressions and voice characteristics. The system integrates multiple models to produce a video of the speaker speaking in a target language, with lip movements synchronized to the translated speech. It carries the original speech's emphases and prosody into the translated audio, and voice conversion keeps the speaker's voice recognizable. The system is particularly useful for applications such as video conferencing, dubbing, and assistive technologies.
The system operates through a series of stages. Automatic speech recognition (ASR) first transcribes the original speech and detects which words are emphasized. A translation model then converts the recognized text into the target language. A text-to-speech (TTS) model synthesizes the translated text while reproducing the original emphases. A voice conversion model adapts the synthesized speech to match the original speaker's voice. Finally, a generative model creates new video frames whose lip movements are synchronized to the translated audio.
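To make the stages concrete, the following is a minimal orchestration sketch in Python. The class and method names (asr.transcribe, translator.translate, tts.synthesize, voice_conv.convert, lip_gen.generate) are hypothetical stand-ins for the patent's modules, not interfaces named in the filing.

```python
# A minimal sketch of the five-stage pipeline described above. All model
# objects and their methods are hypothetical placeholders, not the patent's
# actual implementation.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float       # onset in seconds within the source audio
    end: float         # offset in seconds within the source audio
    emphasized: bool   # emphasis flag produced by the ASR stage

def translate_video(frames, audio, target_lang, asr, translator, tts,
                    voice_conv, lip_gen):
    # 1. ASR: transcribe the source speech and mark emphasized words.
    words: list[Word] = asr.transcribe(audio)
    # 2. Translation: convert the transcript into the target language,
    #    carrying the per-word emphasis markers across.
    translated: list[Word] = translator.translate(words, target_lang)
    # 3. TTS: synthesize target-language speech that honors the markers.
    synth_audio = tts.synthesize(translated)
    # 4. Voice conversion: map the synthetic voice onto the original
    #    speaker's timbre, using the source audio as a reference.
    adapted_audio = voice_conv.convert(synth_audio, reference=audio)
    # 5. Generative video: re-render the frames so the lip movements
    #    track the translated audio.
    new_frames = lip_gen.generate(frames, adapted_audio)
    return new_frames, adapted_audio
```

The point the sketch illustrates is that emphasis markers stay attached to individual words from ASR through translation, so the TTS stage can reproduce them in the target language.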
Traditional dubbing is labor-intensive, costly, and often produces mismatches between lip movements and audio. The proposed system addresses these issues with an automated pipeline that preserves the speaker's voice while keeping lip movements synchronized to the translated audio. The approach is faster and more efficient than manual dubbing, making it suitable for dynamic content such as real-time events and low-budget productions.
The system comprises two main subsystems: an audio processing subsystem with machine learning modules for language translation and voice conversion, and a video processing subsystem for face detection and lip generation. Together they generate output videos in which the speaker appears to speak the target language naturally. A video generation module combines the new video frames with the adapted audio into a single seamless output video.
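As an illustration of this decomposition, here is a sketch of the two subsystem interfaces and the module that joins their outputs. The interfaces are assumptions made for this sketch, not the filing's actual API.

```python
# An illustrative decomposition into the two subsystems named above; the
# interface names and signatures are assumptions, not taken from the filing.
from typing import Protocol, Sequence

class AudioSubsystem(Protocol):
    def process(self, audio: bytes, target_lang: str) -> bytes:
        """Translate the speech and convert it to the speaker's voice."""
        ...

class VideoSubsystem(Protocol):
    def process(self, frames: Sequence[bytes], audio: bytes) -> list[bytes]:
        """Detect the speaker's face and regenerate lip movements."""
        ...

def generate_output(frames: Sequence[bytes], audio: bytes, target_lang: str,
                    audio_sub: AudioSubsystem, video_sub: VideoSubsystem):
    # Audio subsystem: language translation plus voice conversion.
    adapted_audio = audio_sub.process(audio, target_lang)
    # Video subsystem: face detection plus lip generation for the new audio.
    new_frames = video_sub.process(frames, adapted_audio)
    # Video generation module: pair the regenerated frames with the adapted
    # audio to form the output video.
    return new_frames, adapted_audio
```

Splitting the audio and video paths this way lets the lip-generation stage consume the fully adapted audio, which is what keeps the rendered mouth movements consistent with the voice the viewer actually hears.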
The technology has applications across several domains. It can enhance video conferencing by providing real-time translation while preserving speaker identity, improve movie dubbing by maintaining natural lip sync and voice consistency, and assist hearing-impaired viewers by pairing content with synchronized visual cues. It also facilitates low-bandwidth transmission by reducing the need for multiple language tracks.