Invention Title:

Engaging Multimodal Content Generation System

Publication number:

US20250111569

Publication date:
Section:

Physics

Class:

G06T11/60

Inventors:

Applicants:

Smart overview of the Invention

The Engaging Multimodal Content Generation System (GEM) is a computing framework designed to create captivating multimodal image-text pairs, particularly useful for online advertisements. It integrates a pre-trained engagement discriminator with a method for learning effective prompts for a stable diffusion model, ensuring that the generated content is both engaging and contextually appropriate. GEM employs an iterative algorithm to produce coherent and engaging image-sentence pairs on specified topics, outperforming baseline approaches in engagement and alignment.

Background

With the rise of digital platforms, generating engaging multimodal content is crucial for applications like advertisements and educational materials. Multimodal content combines images and text to capture user interest effectively while maintaining relevance to the topic. For example, an advertisement might pair an eye-catching image with a catchy caption, while educational content might use striking visuals with informative text to convey complex ideas. This need for engaging content has driven the development of the GEM framework.

Technical Summary

The GEM framework consists of two main steps: combining a pre-trained engagement discriminator with a method to learn continuous prompts for a stable diffusion model, followed by an iterative algorithm for generating coherent image-text pairs. The first step ensures that generated images are engaging and contextually suitable, while the second step uses the trained engaging-image generator together with a pre-trained text-paraphrasing model (a "speaker") to create well-matched pairs. Experimental results show that GEM-generated pairs are more engaging than those produced by other methods.
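
The iterative second step can be illustrated with a minimal Python sketch. The functions diffusion_generate and paraphrase below are hypothetical placeholder stubs standing in for the prompt-tuned image generator and the pre-trained paraphraser; they are not the patent's actual components, and the alternation schedule is an assumption made purely for illustration.

```python
import torch

def diffusion_generate(sentence: str) -> torch.Tensor:
    # Placeholder for the engaging image generator (prompt-tuned stable diffusion).
    # Returns a dummy RGB image tensor; a real system would condition on `sentence`.
    return torch.rand(3, 512, 512)

def paraphrase(topic: str, image: torch.Tensor) -> str:
    # Placeholder for the pre-trained text-paraphrasing "speaker" model.
    # Rewrites the sentence so it stays on topic and matches the current image.
    return f"{topic} (rephrased to describe the current image)"

def generate_pair(topic: str, rounds: int = 3):
    """Iteratively refine an image-sentence pair for the given topic."""
    sentence = topic
    image = diffusion_generate(sentence)
    for _ in range(rounds):
        sentence = paraphrase(topic, image)   # align the text with the image
        image = diffusion_generate(sentence)  # regenerate the image from the text
    return image, sentence

if __name__ == "__main__":
    image, caption = generate_pair("eco-friendly running shoes")
    print(image.shape, caption)
```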

Related Work

Recent advancements in vision-and-language models have improved the generation of realistic images through techniques like Generative Adversarial Networks (GANs) and transformer-based models. However, these models often neglect how engaging the generated content is. The GEM framework introduces an engagement classifier to generate high-engagement image-caption pairs, addressing this gap. Additionally, learnable prompts have become important for guiding model outputs in both language modeling and computer vision, enhancing controllability and flexibility.

Methodology

The GEM model uses a stable diffusion model to generate engaging multimodal content. Topic text is passed through embedding layers to form continuous prompts, which condition a diffuser that generates images. The framework also includes fully connected layers that produce an engagement score used as a training loss. The iterative process leverages the trained image generator and a text paraphraser to produce aligned multimodal content. Combining engagement and reconstruction losses during training keeps the generated results both realistic and engaging.
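
As a rough illustration of the training objective described above, the following Python sketch optimizes a learnable continuous prompt against a combined engagement and reconstruction loss. All names, shapes, and loss weights, as well as the frozen_generator, engagement_score, and target-feature stand-ins, are assumptions made for this sketch rather than the patent's actual implementation; in particular, the frozen diffusion generator is reduced to a simple differentiable placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

prompt_len, embed_dim, feat_dim = 8, 768, 512
lambda_recon = 1.0  # assumed weight balancing the two losses

# Learnable continuous prompt; the diffusion backbone and the engagement
# discriminator are treated as frozen black boxes below.
prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
optimizer = torch.optim.Adam([prompt], lr=1e-3)

# Fixed random projection standing in for "frozen diffusion model + image encoder".
gen_matrix = torch.randn(embed_dim, feat_dim)

def frozen_generator(prompt_embeds: torch.Tensor) -> torch.Tensor:
    # Placeholder for the frozen image generator: maps the prompted text
    # embeddings to image features. Differentiable so gradients reach the prompt.
    return torch.tanh(prompt_embeds.mean(dim=0, keepdim=True) @ gen_matrix)

disc_head = nn.Linear(feat_dim, 1)  # stand-in for the pre-trained engagement discriminator
for p in disc_head.parameters():
    p.requires_grad_(False)

def engagement_score(image_feats: torch.Tensor) -> torch.Tensor:
    # Probability that the generated image is engaging, according to the frozen head.
    return torch.sigmoid(disc_head(image_feats))

topic_embeds = torch.randn(16, embed_dim)  # stand-in text embeddings of the topic
target_feats = torch.randn(1, feat_dim)    # stand-in reference features for reconstruction

for step in range(200):
    prompt_embeds = torch.cat([prompt, topic_embeds], dim=0)
    image_feats = frozen_generator(prompt_embeds)

    # Engagement loss: push the generated image toward a high engagement score.
    eng_loss = -torch.log(engagement_score(image_feats) + 1e-8).mean()
    # Reconstruction loss: keep the generated content faithful to the topic.
    rec_loss = F.mse_loss(image_feats, target_feats)

    loss = eng_loss + lambda_recon * rec_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```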