US20260088032
2026-03-26
Physics
G10L17/06
The patent application describes a system for detecting audio deepfakes, that is, synthetic speech generated by advanced AI techniques. Existing deepfake detection models often struggle to separate real from synthetic audio, and improving them typically requires extensive retraining and large datasets. The proposed system instead scores verified audio samples from a known speaker with an unmodified deepfake detection model and collects those scores into a reference distribution. A statistical test then compares the detection scores of unverified audio against this reference distribution, which gives accurate and efficient deepfake detection suitable for real-time applications.
The invention falls under the domain of deepfake detection, specifically targeting synthetic audio. Audio deepfakes, created through AI, pose risks such as fraud, misinformation, and impersonation, challenging data security and communication integrity. While some models use speaker recognition to compare audio samples, they are generally outperformed by those focusing on synthesis-related distortions. Existing methods often require computationally intensive retraining, limiting their scalability and real-time applicability.
The system improves deepfake detection by incorporating verified samples of a known speaker, which raises accuracy without modifying the model. It segments the verified audio into frames, applies a speaker-independent deepfake detector to each frame to obtain detection scores, and builds a statistical reference distribution from those scores. It then tests whether the scores from a test audio file are consistent with this distribution and uses the result to decide whether the file is authentic. The verified samples thus act as speaker-specific conditioning information: accuracy increases without retraining the model for each speaker, which keeps the method efficient enough for real-time use.
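The frame, score, and pool steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `frame_audio`, `build_reference_scores`, and `summarize` are hypothetical, and `toy_detector` is a stand-in for the pretrained detector, which the summary does not specify.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List, Sequence

Frame = Sequence[float]
Detector = Callable[[Frame], float]  # maps one frame to a detection score


def frame_audio(samples: Sequence[float], window: int, step: int) -> List[Frame]:
    """Cut one recording into fixed-length, possibly overlapping frames."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]


def build_reference_scores(verified: Sequence[Sequence[float]],
                           detector: Detector,
                           window: int, step: int) -> List[float]:
    """Score every frame of every verified recording and pool the scores."""
    scores: List[float] = []
    for samples in verified:
        for frame in frame_audio(samples, window, step):
            scores.append(detector(frame))
    return scores


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Basic summary of the pooled scores (assumes at least two scores)."""
    return {"n": float(len(scores)), "mean": mean(scores), "stdev": stdev(scores)}


def toy_detector(frame: Frame) -> float:
    # Hypothetical stand-in only: mean absolute amplitude, clipped to [0, 1].
    # A real system would call the unmodified pretrained detector here.
    return min(1.0, sum(abs(x) for x in frame) / len(frame))


verified_recordings = [[0.1] * 8, [0.2] * 6]
reference = build_reference_scores(verified_recordings, toy_detector, window=4, step=2)
summary = summarize(reference)
```

The detector is passed in as a plain callable, so the same code works with any scoring model and nothing in the model itself is changed, which mirrors the "no retraining" property the summary emphasizes.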
The system receives both unverified and verified audio samples, using multiple verified samples to generate deepfake detection scores for the known speaker. Pre-processing the audio samples involves normalization, filtering, and feature extraction. The system segments the audio using a sliding window to capture temporal variations, with shorter step sizes increasing detection accuracy but at a higher computational cost. Data augmentation strategies, such as noise injection and pitch shifting, are applied to enhance model robustness and improve detection accuracy.
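As an illustration of the noise-injection augmentation mentioned above, the sketch below adds Gaussian noise at a chosen signal-to-noise ratio. The function names, the choice of Gaussian noise, and the SNR parameterization are assumptions made for this example; the summary does not say how noise is injected.

```python
import math
import random
from typing import List, Sequence


def power(x: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty signal."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(samples: Sequence[float], snr_db: float, seed: int = 0) -> List[float]:
    """Return a copy of `samples` with Gaussian noise scaled to the target SNR (dB)."""
    if not samples:
        return []
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    signal_power = power(samples)
    noise_power = power(noise)
    if signal_power == 0.0 or noise_power == 0.0:
        return list(samples)  # SNR is undefined for a silent signal
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(samples, noise)]


# Example: a 440 Hz tone sampled at 16 kHz, augmented at 20 dB SNR.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(1600)]
noisy = add_noise_at_snr(clean, snr_db=20.0)
residual = [n - c for n, c in zip(noisy, clean)]
measured_snr_db = 10.0 * math.log10(power(clean) / power(residual))
```

Scaling the noise to its measured power, rather than to its expected power, makes the resulting SNR match the target for every draw, which is convenient when comparing detection scores across noise levels.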
A pretrained, speaker-independent deepfake detector processes the audio segments and assigns each one an authenticity score. The scores from the verified samples form a person-specific reference distribution, against which the scores from the unverified sample are compared. A statistical test evaluates whether the unverified scores fit within the reference distribution and thereby whether the sample is authentic. Because the reference distribution serves as conditioning information, detection accuracy improves without explicit model retraining, and the approach supports reliable operation in diverse real-world scenarios by accommodating variations in audio quality and recording environments.
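The summary does not name the statistical test. As one possible instance, the sketch below uses a two-sample Kolmogorov-Smirnov test with the standard large-sample critical value; the choice of test, the significance level, and all function names are assumptions for illustration only.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], test: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref = sorted(reference)
    tst = sorted(test)
    d = 0.0
    # Both empirical CDFs are step functions, so checking every data point suffices.
    for x in ref + tst:
        f_ref = bisect_right(ref, x) / len(ref)
        f_tst = bisect_right(tst, x) / len(tst)
        d = max(d, abs(f_ref - f_tst))
    return d


def ks_critical_value(n: int, m: int, alpha: float = 0.05) -> float:
    """Large-sample critical value: c(alpha) * sqrt((n + m) / (n * m))."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))


def is_consistent_with_reference(reference: Sequence[float],
                                 test: Sequence[float],
                                 alpha: float = 0.05) -> bool:
    """True if the test scores are not rejected as coming from the reference."""
    if not reference or not test:
        raise ValueError("both score lists must be non-empty")
    d = ks_statistic(reference, test)
    return d <= ks_critical_value(len(reference), len(test), alpha)


# Made-up scores for illustration: low scores resemble the known speaker's audio.
reference_scores = [0.05 + 0.001 * i for i in range(100)]
similar_scores = [0.0505 + 0.002 * i for i in range(50)]
shifted_scores = [0.8 + 0.002 * i for i in range(50)]

accept_similar = is_consistent_with_reference(reference_scores, similar_scores)
accept_shifted = is_consistent_with_reference(reference_scores, shifted_scores)
```

A distribution-level test of this kind compares the whole set of frame scores rather than a single average, so a recording whose scores are shifted or spread differently from the speaker's reference can be flagged even when a fixed global threshold would not separate it.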