Invention Title:

SELF-SUPERVISED AUDIO-VISUAL LEARNING FOR CORRELATING MUSIC AND VIDEO

Publication number:

US20250316062

Section:

Physics

Class:

G06V10/774

Smart overview of the Invention

The disclosed system correlates video and audio sequences through a media recommendation system built around a trained encoder network. The system receives a media sequence that includes both video and audio components, segments each component into smaller parts, and extracts visual and audio features from every segment. A pair of transformer networks, one visual and one audio, produces contextualized features from these extracted features. The system then predicts video and audio pairings from the contextualized features, and these predictions are used to train the transformer networks and improve their predictive capability.
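
As a rough illustration of the encoder described above, the sketch below pairs a visual transformer with an audio transformer to contextualize per-segment features. The module names, dimensions, and use of PyTorch are assumptions for illustration, not details taken from the disclosure.

import torch
import torch.nn as nn

class DualStreamEncoder(nn.Module):
    """Hypothetical dual transformer encoder: one stream per modality."""
    def __init__(self, feat_dim=512, n_layers=4, n_heads=8):
        super().__init__()
        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model=feat_dim, nhead=n_heads, batch_first=True)
        # Separate transformers contextualize per-segment visual and audio features.
        self.visual_transformer = nn.TransformerEncoder(make_layer(), num_layers=n_layers)
        self.audio_transformer = nn.TransformerEncoder(make_layer(), num_layers=n_layers)

    def forward(self, visual_feats, audio_feats):
        # Both inputs: (batch, num_segments, feat_dim) per-segment features.
        ctx_visual = self.visual_transformer(visual_feats)
        ctx_audio = self.audio_transformer(audio_feats)
        return ctx_visual, ctx_audio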

Background

Choosing appropriate music for videos is a challenging task, particularly for non-professionals. Traditional methods rely on text-based searches, which often fail to capture the nuanced "feel" of music. Matching video sequences with corresponding audio is also complex, especially when the best ordering of segments must be determined. Existing solutions often require manual annotation, which is labor-intensive and does not scale to large datasets.

System Functionality

The media recommendation system introduced here correlates video and audio sequences by identifying those that align temporally and artistically. It segments a given video into parts and analyzes each segment to produce visual embeddings. A transformer encoder network then generates contextualized visual features that account for each segment's own features as well as those of adjacent segments. These contextualized visual features are compared with contextualized audio features to find the most similar pairings, with the networks trained on data consisting of artistically paired audio-visual content.
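
One way to read the pairing step, under the assumption that segments are compared by cosine similarity between their contextualized embeddings, is the following sketch. The function name and similarity measure are illustrative choices, not terms specified by the disclosure.

import torch
import torch.nn.functional as F

def match_segments(ctx_visual, ctx_audio):
    # ctx_visual: (num_video_segments, dim) contextualized visual embeddings
    # ctx_audio:  (num_audio_segments, dim) contextualized audio embeddings
    v = F.normalize(ctx_visual, dim=-1)
    a = F.normalize(ctx_audio, dim=-1)
    sim = v @ a.T              # cosine similarity for every video/audio segment pair
    return sim.argmax(dim=-1)  # most similar audio segment per video segment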

Technical Approach

The system leverages the temporal synchronization of audio and video, using transformer networks to model this context. Unlike methods that focus on physical correspondences between modalities, this system emphasizes artistic alignment, such as visual style or musical mood. Existing solutions that rely on predefined mood categories or a cross-modal ranking loss are limited in scalability and accuracy; this system aims to overcome those limits by using context-aware embeddings derived from both modalities.
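
A plausible instantiation of the pairing-prediction objective, trained on temporally synchronized segment pairs, is a symmetric contrastive loss in which each video segment must identify its own audio segment among the others in the batch. The loss form and temperature below are assumptions for illustration, not claims from the disclosure.

import torch
import torch.nn.functional as F

def pairing_loss(ctx_visual, ctx_audio, temperature=0.07):
    # ctx_visual, ctx_audio: (N, D); row i of each tensor comes from the same
    # temporally aligned segment, which serves as the positive pair.
    v = F.normalize(ctx_visual, dim=-1)
    a = F.normalize(ctx_audio, dim=-1)
    logits = v @ a.T / temperature          # (N, N) similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    # Each video segment should pick out its own audio segment, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))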

Implementation

Upon receiving an input video sequence, the system generates context-aware visual embeddings for each segment. It retrieves corresponding audio embeddings from a pre-processed media catalog. By comparing these embeddings, the system identifies the audio segments most similar to each video segment's visual features. Because it learns from large collections of artistically paired data without manual labeling, the approach improves speed and scalability while strengthening the correspondence between video and audio clips.
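
At inference time, the retrieval step might look like the sketch below, which scores each catalog clip by its best match against any segment of the input video. The catalog layout (a tensor of precomputed audio embeddings plus identifiers) and the scoring rule are assumptions for illustration.

import torch
import torch.nn.functional as F

def recommend_audio(query_visual, catalog_audio, catalog_ids, top_k=5):
    # query_visual: (Nv, D) context-aware visual embeddings of the input video
    # catalog_audio: (M, D) precomputed audio embeddings from the media catalog
    # catalog_ids: list of M clip identifiers corresponding to catalog_audio rows
    v = F.normalize(query_visual, dim=-1)
    a = F.normalize(catalog_audio, dim=-1)
    sim = v @ a.T                         # (Nv, M) segment-to-clip similarities
    scores = sim.max(dim=0).values        # best match against any video segment
    best = torch.topk(scores, k=min(top_k, scores.numel())).indices
    return [catalog_ids[i] for i in best.tolist()]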