Invention Title:

SINGING VOICE DEEPFAKE DETECTION

Publication number:

US20260065913

Publication date:
Section:

Physics

Class:

G10L17/02

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The disclosed technology pertains to systems and methods for detecting synthetic singing vocals within audio signals using a multi-stage machine-learning architecture. This approach addresses the challenge posed by machine-generated vocals, which have become increasingly convincing as audio modeling advances. The system differentiates human-generated from machine-generated singing by analyzing acoustic features unique to singing, as opposed to speech.

Components and Functionality

The system comprises several key components, including a singing detector, a singer detector, and a liveness detector. The singing detector identifies segments of audio that contain singing vocals. The singer detector extracts vocalprint embeddings to identify singer-specific characteristics and potentially recognize specific singers. The liveness detector focuses on identifying fakeprint embeddings, which are indicative of machine-generated artifacts, to determine the likelihood that the audio is synthetic.
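The three components above can be pictured as a small pipeline. The sketch below is illustrative only: the class names, the energy-threshold segmenter, and the placeholder embedding functions are assumptions for exposition, not the patent's actual models, which would be trained neural networks.

```python
import numpy as np

class SingingDetector:
    """Flags audio frames likely to contain singing vocals (stub heuristic)."""
    def segment(self, frames):
        # Hypothetical rule: keep frames whose mean energy exceeds a threshold.
        return [f for f in frames if np.mean(f ** 2) > 0.01]

class SingerDetector:
    """Maps a vocal segment to a fixed-size vocalprint embedding."""
    def embed(self, segment):
        # Placeholder statistics; a real system would use a trained encoder.
        return np.array([np.mean(segment), np.std(segment)])

class LivenessDetector:
    """Maps a vocal segment to a fakeprint embedding and a liveness score."""
    def score(self, segment):
        emb = np.array([np.max(np.abs(segment)), np.mean(np.abs(segment))])
        # Squash to (0, 1); higher = more likely human. Placeholder scorer.
        return emb, float(1.0 / (1.0 + np.exp(-emb.sum())))
```

The design point is the separation of concerns: segmentation, identity embedding, and liveness scoring are independent stages whose outputs can later be fused.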

Detection Process

The detection process involves multiple stages. Initially, the system identifies vocal segments in the audio using the singing detector. It then extracts fakeprint embeddings to analyze machine-related artifacts, generating a singing liveness score. This score helps classify the vocals as either human or machine-generated. Additionally, the system may use score-level fusion to combine results from the singer and liveness detectors, enhancing the accuracy of the final classification.
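Score-level fusion of the singer and liveness detectors can be as simple as a weighted average followed by a threshold. The weight and threshold values below are arbitrary illustrations, not figures from the disclosure.

```python
def fuse_scores(singer_score, liveness_score, w_liveness=0.7):
    """Weighted score-level fusion of the two detector outputs (weights assumed)."""
    return w_liveness * liveness_score + (1 - w_liveness) * singer_score

def classify(fused_score, threshold=0.5):
    """Map a fused score to a final label (threshold assumed)."""
    return "human" if fused_score >= threshold else "machine-generated"
```

In practice the fusion weights would themselves be tuned on validation data rather than fixed by hand.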

Training and Enrollment

The system undergoes a training phase where it learns to generate liveness scores from a labeled corpus of audio signals, updating model parameters based on a loss function. An enrollment phase allows the system to extract and store fakeprint embeddings from known synthetic audio, which serve as references for future comparisons. Together, these phases keep the system effective against both genuine vocals and newly encountered synthetic ones.
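A minimal sketch of the two phases, assuming a logistic scorer trained with binary cross-entropy and cosine-similarity matching against enrolled fakeprints. The loss function, optimizer, and similarity measure are my assumptions; the patent does not specify them.

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy between labels and predicted liveness scores."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def train_step(w, X, y, lr=0.1):
    """One gradient step of a logistic scorer on labeled embeddings."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # liveness scores in (0, 1)
    grad = X.T @ (p - y) / len(y)      # gradient of BCE w.r.t. weights
    return w - lr * grad

# Enrollment: store unit-normalized fakeprints from known synthetic audio.
enrolled = []

def enroll(fakeprint):
    enrolled.append(fakeprint / np.linalg.norm(fakeprint))

def similarity_to_enrolled(fakeprint):
    """Best cosine similarity between a query fakeprint and enrolled ones."""
    q = fakeprint / np.linalg.norm(fakeprint)
    return max(float(q @ e) for e in enrolled)
```

A high similarity to an enrolled fakeprint would push the final decision toward "machine-generated" regardless of the trained scorer's output.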

Acoustic Features and Model Robustness

The technology leverages specific acoustic features for its analysis, such as pitch contours, timbral texture, and vibrato patterns for vocalprint embeddings, and pitch smoothing or phoneme distortion for fakeprint embeddings. To enhance model robustness, the system incorporates singing-specific data augmentation techniques like pitch shifting and tempo perturbation. These strategies improve the system's ability to accurately detect and classify singing vocals across various scenarios.
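The augmentation strategies above can be illustrated with naive waveform resampling. These are deliberately simplified: production systems would use a phase vocoder (e.g. librosa's `time_stretch` / `pitch_shift`) so that tempo and pitch can be changed independently, whereas plain resampling couples the two.

```python
import numpy as np

def tempo_perturb(y, rate):
    """Naive tempo change by linear-interpolation resampling.

    rate > 1 speeds playback up (shorter output). Note that when the
    resampled signal is played at the original sample rate, pitch shifts
    too; a phase vocoder would avoid that side effect.
    """
    n_out = int(len(y) / rate)
    old_idx = np.linspace(0, len(y) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(y)), y)

def pitch_shift(y, n_steps):
    """Naive pitch shift by n_steps semitones via resampling.

    The result is zero-padded back to the input length, so timing is
    slightly distorted; shown only to illustrate the augmentation idea.
    """
    factor = 2.0 ** (n_steps / 12.0)      # semitone ratio
    shifted = tempo_perturb(y, factor)    # faster playback raises pitch
    out = np.zeros_like(y)
    n = min(len(y), len(shifted))
    out[:n] = shifted[:n]
    return out
```

Applying such perturbations to training audio exposes the detectors to the pitch and tempo variability that distinguishes singing from speech.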