US20260065913
2026-03-05
Physics
G10L17/02
The disclosed technology pertains to systems and methods for detecting synthetic singing vocals within audio signals using a multi-stage machine-learning architecture. This approach specifically addresses the challenges posed by machine-generated vocals, which have grown increasingly convincing due to advances in generative audio modeling. The system is designed to differentiate between human-generated and machine-generated singing by analyzing acoustic features unique to singing, as opposed to speech.
The system comprises several key components, including a singing detector, a singer detector, and a liveness detector. The singing detector identifies segments of audio that contain singing vocals. The singer detector extracts vocalprint embeddings to identify singer-specific characteristics and potentially recognize specific singers. The liveness detector focuses on identifying fakeprint embeddings, which are indicative of machine-generated artifacts, to determine the likelihood that the audio is synthetic.
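The first-stage component above, the singing detector, can be illustrated with a minimal sketch. This is not the patent's implementation; the class, its energy-threshold heuristic, and all parameter values are hypothetical, standing in for whatever learned model the system actually uses to locate singing segments.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds

class SingingDetector:
    """Illustrative stand-in: flags contiguous runs of frames whose
    per-frame score exceeds a threshold as singing segments. A real
    detector would use a trained model rather than a fixed threshold."""

    def detect(self, frame_scores: List[float], frame_len: float = 0.5,
               threshold: float = 0.3) -> List[Segment]:
        segments, start = [], None
        for i, score in enumerate(frame_scores):
            if score >= threshold and start is None:
                start = i * frame_len          # segment opens
            elif score < threshold and start is not None:
                segments.append(Segment(start, i * frame_len))
                start = None                   # segment closes
        if start is not None:                  # audio ends mid-segment
            segments.append(Segment(start, len(frame_scores) * frame_len))
        return segments
```

Given per-frame scores `[0.1, 0.6, 0.7, 0.2, 0.8, 0.9]`, this yields two segments, 0.5–1.5 s and 2.0–3.0 s, which downstream detectors would then analyze.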
The detection process involves multiple stages. Initially, the system identifies vocal segments in the audio using the singing detector. It then extracts fakeprint embeddings to analyze machine-related artifacts, generating a singing liveness score. This score helps classify the vocals as either human or machine-generated. Additionally, the system may use score-level fusion to combine results from the singer and liveness detectors, enhancing the accuracy of the final classification.
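The score-level fusion step described above can be sketched as a weighted combination of the two detector outputs. The weights, threshold, and label strings below are illustrative assumptions, not values from the patent.

```python
def fuse_scores(singer_score: float, liveness_score: float,
                w_singer: float = 0.4, w_liveness: float = 0.6) -> float:
    """Score-level fusion: a weighted sum of the singer detector's score
    and the singing liveness score. Weights are hypothetical."""
    return w_singer * singer_score + w_liveness * liveness_score

def classify(fused_score: float, threshold: float = 0.5) -> str:
    """Map the fused score to a final label; higher means more likely human."""
    return "human" if fused_score >= threshold else "machine-generated"
```

For example, `classify(fuse_scores(0.9, 0.8))` returns `"human"`, while `classify(fuse_scores(0.2, 0.1))` returns `"machine-generated"`.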
The system undergoes a training phase where it learns to generate liveness scores using a labeled corpus of audio signals. This phase involves updating model parameters based on a loss function. An enrollment phase allows the system to extract and store fakeprint embeddings from known synthetic audio, which aids in future comparisons. This approach ensures the system remains adept at recognizing both genuine and synthetic vocals.
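The enrollment-and-comparison idea above can be sketched with a small store of fakeprint embeddings and a cosine-similarity lookup. The class name, the use of cosine similarity, and the max-over-enrolled rule are assumptions for illustration; the patent does not specify this particular comparison scheme.

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class FakeprintEnrollment:
    """Illustrative enrollment store: keeps fakeprint embeddings
    extracted from known synthetic audio for later comparison."""

    def __init__(self) -> None:
        self.enrolled: List[List[float]] = []

    def enroll(self, embedding: List[float]) -> None:
        self.enrolled.append(embedding)

    def synthetic_similarity(self, embedding: List[float]) -> float:
        # Highest similarity to any enrolled fakeprint; a larger value
        # suggests the query audio resembles known synthetic audio.
        return max(cosine(embedding, e) for e in self.enrolled)
```

A query embedding close to an enrolled fakeprint scores near 1.0, while one orthogonal to every enrolled embedding scores 0.0.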
The technology leverages specific acoustic features for its analysis, such as pitch contours, timbral texture, and vibrato patterns for vocalprint embeddings, and pitch smoothing or phoneme distortion for fakeprint embeddings. To enhance model robustness, the system incorporates singing-specific data augmentation techniques like pitch shifting and tempo perturbation. These strategies improve the system's ability to accurately detect and classify singing vocals across various scenarios.
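The augmentation techniques named above can be sketched in simplified form. The naive index-resampling below changes both pitch and duration together; a production system would use a proper time-scale/pitch-modification method (e.g. a phase vocoder), so these functions are illustrative only.

```python
from typing import List

def tempo_perturb(samples: List[float], rate: float) -> List[float]:
    """Naive tempo perturbation by linear index resampling: rate > 1
    shortens the signal, rate < 1 lengthens it (illustrative only)."""
    n = int(len(samples) / rate)
    return [samples[min(int(i * rate), len(samples) - 1)] for i in range(n)]

def pitch_shift(samples: List[float], semitones: float) -> List[float]:
    """Naive pitch shift: resample by 2**(semitones/12), then trim or
    pad back to the original length. A real augmentation pipeline would
    preserve tempo properly; this is a simplified sketch."""
    rate = 2 ** (semitones / 12.0)
    shifted = tempo_perturb(samples, rate)
    if len(shifted) >= len(samples):
        return shifted[:len(samples)]
    return shifted + [shifted[-1]] * (len(samples) - len(shifted))
```

For instance, `tempo_perturb(list(range(10)), 2.0)` halves the signal to `[0, 2, 4, 6, 8]`, and `pitch_shift(x, 0)` returns `x` unchanged.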