Invention Title:

GENERATING MULTI-MODAL RESPONSE(S) THROUGH UTILIZATION OF LARGE LANGUAGE MODEL(S) AND OTHER GENERATIVE MODEL(S)

Publication number:

US20250139379

Section:

Physics

Class:

G06F40/40

Smart overview of the Invention

The patent application describes a system for generating multi-modal responses using large language models (LLMs) and other generative models. This system processes natural language (NL) inputs to produce responses that combine text and multimedia content. The multimedia content can include images, videos, and audio generated by various generative models based on prompts derived from the LLM output. These multimedia elements are interleaved with the textual content to create a cohesive and contextually relevant user experience.
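The application does not publish source code; the Python sketch below illustrates one way such a pipeline could interleave text and generated media. The "[IMAGE: ...]" tag syntax and the call_llm / generate_image stand-ins are assumptions for illustration, not the filing's actual interfaces.

```python
import re
from dataclasses import dataclass

# Assumed tag format: the LLM emits "[IMAGE: <prompt>]" markers where
# generated media should appear. The filing does not specify a tag
# syntax, media types, or model APIs.

@dataclass
class Segment:
    kind: str      # "text" or "image"
    content: str   # raw text, or a handle to generated media

TAG = re.compile(r"\[IMAGE:\s*(.+?)\]")

def call_llm(nl_input: str) -> str:
    # Stand-in: a real system would query the fine-tuned LLM here.
    return ("The Elkbird is a mythical creature with elk antlers and "
            "songbird plumage. [IMAGE: an elkbird, field-guide style] "
            "It nests in alpine meadows.")

def generate_image(prompt: str) -> str:
    # Stand-in: a real system would invoke a text-to-image model here.
    return f"<generated image for {prompt!r}>"

def build_multimodal_response(nl_input: str) -> list[Segment]:
    """Scan LLM output for media tags, synthesize the media, and
    interleave it with the surrounding text segments."""
    llm_output = call_llm(nl_input)
    segments, cursor = [], 0
    for m in TAG.finditer(llm_output):
        if m.start() > cursor:
            segments.append(Segment("text", llm_output[cursor:m.start()]))
        segments.append(Segment("image", generate_image(m.group(1))))
        cursor = m.end()
    if cursor < len(llm_output):
        segments.append(Segment("text", llm_output[cursor:]))
    return segments
```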

Background

Traditional LLMs can produce responses that pair text with pre-existing multimedia content, but the two are typically poorly integrated, leading to a disjointed user experience. Users may need to navigate back and forth between text and images, consuming additional computational resources and time. Moreover, when asked for media depicting fictional entities, LLMs struggle because no existing content is available to retrieve. Users must then interact manually with separate generative models, which interrupts the flow of the interaction.

Innovation

The proposed system overcomes these limitations by integrating LLMs with other generative models to produce multimedia content on the fly. For instance, when tasked with creating an encyclopedia entry for a mythical creature such as an Elkbird, the system generates both descriptive text and corresponding media using generative models. This approach allows a seamless, one-shot interaction in which users receive both text and media without needing to engage separate systems.
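Using the hypothetical sketch above, the Elkbird request becomes a single call that returns text and media already interleaved; the toy stand-in backends make this a demonstration of the idea, not the filed method.

```python
segments = build_multimodal_response(
    "Write an encyclopedia entry for a mythical creature called an Elkbird."
)
for seg in segments:
    print(f"{seg.kind:>5}: {seg.content}")
#  text: The Elkbird is a mythical creature with elk antlers and songbird plumage.
# image: <generated image for 'an elkbird, field-guide style'>
#  text: It nests in alpine meadows.
```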

Implementation

The system can fine-tune LLMs to determine optimal placement of multimedia within textual responses. Training involves using curated or automatically generated instances that pair NL inputs with multi-modal outputs. This training enables the LLM to predict where generative or non-generative multimedia should be inserted into responses. The process ensures that multimedia is contextually relevant and enhances the overall interaction quality.
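The filing does not publish its training format; as a minimal sketch, a curated instance might pair an NL input with a target response in which tags mark where media belongs. The JSON Lines layout, field names, and tag syntax below are assumptions.

```python
import json

# Hypothetical fine-tuning instance: the target shows the LLM both what
# text to produce and where multimedia should be inserted.
instance = {
    "input": "Write an encyclopedia entry for a mythical creature "
             "called an Elkbird.",
    "target": ("The Elkbird is a mythical creature with elk antlers and "
               "songbird plumage. [IMAGE: an elkbird, field-guide style] "
               "It nests in alpine meadows."),
}

with open("multimodal_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(instance) + "\n")
```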

Technical Details

LLM outputs include probability distributions over sequences of tokens, which guide the generation of both text and multimedia content. The system uses these distributions to select either non-generative tags, which reference pre-existing media, or generative prompts, which drive on-demand media synthesis. This lets the system efficiently produce multi-modal responses tailored to user inputs, with all content types coherently integrated within the response.
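The filing does not disclose a decoding algorithm. One plausible reading, sketched below with invented marker tokens, is a decoder that follows the per-step distribution (argmax here) and switches between accumulating response text and accumulating a generative prompt; a non-generative tag could be handled the same way, referencing existing media instead of a generator.

```python
GEN_OPEN, GEN_CLOSE = "<gen>", "</gen>"   # invented marker tokens

def decode_spans(tokens):
    """Split a decoded token stream (top token of each per-step
    probability distribution) into response text plus prompts destined
    for generative models."""
    text, prompts, buf, in_prompt = [], [], [], False
    for tok in tokens:
        if tok == GEN_OPEN:
            in_prompt, buf = True, []
        elif tok == GEN_CLOSE:
            in_prompt = False
            prompts.append(" ".join(buf))
            text.append(f"[media #{len(prompts) - 1}]")  # placement marker
        elif in_prompt:
            buf.append(tok)
        else:
            text.append(tok)
    return " ".join(text), prompts

# Toy token stream standing in for greedy decoding of the LLM:
stream = ["The", "Elkbird", "nests", "here.", GEN_OPEN, "an", "elkbird",
          "nest", GEN_CLOSE, "See", "above."]
print(decode_spans(stream))
# ('The Elkbird nests here. [media #0] See above.', ['an elkbird nest'])
```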