Invention Title:

REAL-TIME VOICE MIXING AND GENERATION SYSTEM WITH ARTIFICIAL INTELLIGENCE

Publication number:

US20260188296

Publication date:

2026-07-02

Section:

Physics

Class:

G10L13/027

Inventors:

Steve Gu 🇺🇸 Lafayette, CA, United States

Mehmet Efe Akengin 🇺🇸 El Cerrito, CA, United States

Assignee:

BitHuman Inc 🇺🇸 San Francisco, CA, United States

Applicant:

BitHuman Inc 🇺🇸 San Francisco, CA, United States

Smart overview of the Invention

The patent application describes a system for real-time voice mixing and generation using artificial intelligence. The system comprises a processor and a multi-modal user interface input unit. It accepts a range of input types, including text prompts, voice personality descriptions, images, existing voice samples, documents, websites, videos, and multi-language personality profiles. These inputs are used to create personalized voice outputs tailored to specific audiences and content requirements.

Voice Mixing Engine

The core of the system is an artificial intelligence voice-mixing engine. This engine receives inputs from the multi-modal user interface and uses them to mix characteristics from multiple high-quality base voices in real time. A voice library within the engine holds the base voices and voice characteristics used in the mixing process. The process consists of three steps: voice vector selection, fine-tuning, and new voice embedding. Its result is a set of outputs ready for synthesis.
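The mixing step can be pictured as a weighted blend of base-voice vectors. The following is a minimal sketch under stated assumptions, not the method claimed in the application: the library contents, the three-dimensional vectors, and the names VOICE_LIBRARY and mix_voices are all hypothetical, and a plain weighted average stands in for whatever selection and fine-tuning the engine actually performs.

```python
# Hypothetical voice library: base-voice name -> embedding vector.
# The names and the toy 3-D values are illustrative assumptions only.
VOICE_LIBRARY = {
    "warm_narrator": [0.9, 0.1, 0.3],
    "bright_host": [0.2, 0.8, 0.5],
    "calm_guide": [0.4, 0.4, 0.9],
}


def mix_voices(weights, library=VOICE_LIBRARY):
    """Blend selected base-voice vectors into one new voice embedding.

    `weights` maps base-voice names to non-negative mixing weights.
    The weights are rescaled to sum to 1 before blending.
    """
    unknown = set(weights) - set(library)
    if unknown:
        raise KeyError(f"unknown base voices: {sorted(unknown)}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    dim = len(next(iter(library.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        share = weight / total  # normalized contribution of this base voice
        for i, component in enumerate(library[name]):
            mixed[i] += share * component
    return mixed


if __name__ == "__main__":
    # Equal parts of two base voices.
    print(mix_voices({"warm_narrator": 1.0, "bright_host": 1.0}))
```

Rescaling the weights keeps the blended vector inside the region spanned by the selected base voices, which is one simple way to keep a mixed voice close to the known-good voices in the library.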

Voice Generation Engine

Coupled with the voice-mixing engine is an artificial intelligence voice generation engine. This component synthesizes the voice from the outputs of the mixing engine and generates audio that reflects the characteristics and styles specified in the input. Synthesis is dynamic: it can adjust to different stylistic and emotional requirements so that the final output matches the requested specification.

Voice Characteristics and Processing

The system includes a voice characteristic analyzer that examines features such as voice timbre, pitch range, speaking rate, articulation patterns, and emotional expressiveness. These characteristics are mapped to vector representations, facilitating the manipulation and combination of voice features. The feature extraction module captures acoustic and prosodic elements, creating normalized feature sets and voice signatures that are embedded into a continuous vector space for further processing.
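One way to read the normalization and embedding steps is as scaling each measured feature to a common range and then rescaling the result to unit length. The sketch below assumes min-max scaling and three example features. The feature names, the value ranges, and the functions normalize_features and voice_signature are hypothetical and are not taken from the application.

```python
from math import sqrt

# Hypothetical per-feature ranges for min-max scaling (illustrative values).
FEATURE_RANGES = {
    "pitch_hz": (50.0, 400.0),
    "speaking_rate_wpm": (80.0, 240.0),
    "expressiveness": (0.0, 1.0),
}


def normalize_features(raw):
    """Scale each raw feature into [0, 1], clamping out-of-range values."""
    normalized = []
    for name, (low, high) in FEATURE_RANGES.items():
        value = min(max(raw[name], low), high)
        normalized.append((value - low) / (high - low))
    return normalized


def voice_signature(raw):
    """Return a unit-length vector built from the normalized features."""
    vector = normalize_features(raw)
    length = sqrt(sum(x * x for x in vector))
    if length == 0.0:
        return vector  # all-zero features: nothing to rescale
    return [x / length for x in vector]


if __name__ == "__main__":
    sample = {"pitch_hz": 225.0, "speaking_rate_wpm": 160.0, "expressiveness": 0.5}
    print(normalize_features(sample))
    print(voice_signature(sample))
```

Putting every feature on the same scale before embedding prevents a feature with large raw units, such as pitch in hertz, from dominating distance comparisons between voices.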

Dynamic Voice Manipulation

The system supports dynamic manipulation of voice characteristics, allowing continuous transitions and interpolation between voice styles. Target voices are selected according to the desired characteristics and a priority weighting over those characteristics. Selection relies on a similarity-based search, and voices are combined in a style-preserving way, with the stated aim of maintaining real-time performance and consistent quality. This flexibility allows unique voice outputs to be tailored to specific requirements.
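The selection and transition steps described above can be sketched as a priority-weighted similarity search followed by linear interpolation. This is an illustration under assumptions rather than the application's algorithm: cosine similarity and linear interpolation are stand-in techniques, and the names weighted_cosine, select_target, and interpolate are hypothetical.

```python
from math import sqrt


def weighted_cosine(a, b, priority):
    """Cosine similarity in which each dimension is scaled by a priority weight."""
    dot = sum(p * x * y for p, x, y in zip(priority, a, b))
    norm_a = sqrt(sum(p * x * x for p, x in zip(priority, a)))
    norm_b = sqrt(sum(p * y * y for p, y in zip(priority, b)))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_target(desired, library, priority):
    """Return the name of the library voice most similar to `desired`."""
    return max(
        library,
        key=lambda name: weighted_cosine(desired, library[name], priority),
    )


def interpolate(start, end, t):
    """Linearly interpolate between two voice vectors for t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    return [(1.0 - t) * x + t * y for x, y in zip(start, end)]


if __name__ == "__main__":
    # Hypothetical two-dimensional library for demonstration.
    library = {"voice_a": [1.0, 0.0], "voice_b": [0.0, 1.0]}
    target = select_target([0.9, 0.1], library, priority=[1.0, 1.0])
    print(target)
    print(interpolate(library["voice_a"], library["voice_b"], 0.5))
```

Stepping t from 0 to 1 over successive audio frames yields a gradual transition between two voices, and raising the priority weight on one dimension makes the search favor voices that match the desired value on that characteristic.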