Invention Title:

Multimodal Large Language Model That Learns to Correct Itself, Focusing on Automated Speech Recognition

Publication number:

US20260094600

Publication date:
Section:

Physics

Class:

G10L15/16

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

A novel method for enhancing large language models (LLMs) involves generating a sequence of output tokens from a given prompt. This sequence includes both correct and incorrect textual tokens, along with revision tokens that identify errors and suggest replacements. The process results in a refined sequence of output tokens, improving the accuracy of the LLM's responses, particularly in tasks such as automated speech recognition (ASR).

Technical Field

The method concerns multimodal LLMs, which are designed to handle diverse inputs such as text, audio, and video. These models typically employ autoregressive decoding, generating each token conditioned on the tokens before it. This approach can limit accuracy because each token is committed without knowledge of the tokens that follow. The disclosed method addresses this limitation by enabling the LLM to correct its own output using revision tokens.

Key Features

  • The method includes receiving a prompt and generating output tokens that may contain errors. Revision tokens are used to identify and correct these errors.
  • Output tokens are generated autoregressively, conditioned on previous tokens to enhance coherence.
  • The prompts can include acoustic frames or text in different languages, with the model providing speech recognition results or translations.
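To make the third feature concrete, the sketch below assembles a multimodal prompt from acoustic-frame placeholders plus a task instruction. All token names (`<audio>`, `<task:...>`, `<lang:...>`) and the function itself are illustrative assumptions, not the patent's actual prompt format.

```python
# Hypothetical multimodal prompt assembly: one placeholder slot per
# acoustic frame (which an audio encoder would embed), followed by
# control tokens selecting the task (e.g. transcription vs. translation)
# and the target language.

def build_prompt(num_frames, task="transcribe", language="en"):
    audio_slots = ["<audio>"] * num_frames   # one slot per acoustic frame
    instruction = [f"<task:{task}>", f"<lang:{language}>"]
    return audio_slots + instruction

prompt = build_prompt(3, task="translate", language="fr")
# -> ["<audio>", "<audio>", "<audio>", "<task:translate>", "<lang:fr>"]
```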

Implementation

Generating a revised sequence involves identifying incorrect tokens and replacing them as directed by the revision tokens. The model can flag an earlier token as inaccurate on the basis of tokens generated after it, allowing errors to be corrected dynamically during decoding. A revision token also indicates the position and number of erroneous tokens relative to itself, enabling precise adjustments.
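One plausible post-processing step is sketched below. The encoding is an assumption for illustration, not the patent's exact scheme: a token `<rev:N>` flags the N tokens immediately before it as incorrect, and the tokens that follow, up to `</rev>`, are the replacement.

```python
# Apply revision tokens to a raw output sequence to obtain the revised
# sequence. Encoding (assumed): "<rev:N>" marks the N preceding tokens
# as erroneous; tokens up to "</rev>" replace them.

def apply_revisions(tokens):
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("<rev:"):
            n = int(tok[len("<rev:"):-1])   # number of flagged tokens
            del out[len(out) - n:]          # drop the erroneous tokens
            i += 1
            while i < len(tokens) and tokens[i] != "</rev>":
                out.append(tokens[i])       # emit replacement tokens
                i += 1
            i += 1                          # skip the closing "</rev>"
        else:
            out.append(tok)
            i += 1
    return out

raw = ["i", "scream", "<rev:1>", "ice", "cream", "</rev>", "is", "cold"]
apply_revisions(raw)  # -> ["i", "ice", "cream", "is", "cold"]
```

Because the count N is relative to the revision token's own position, the correction is local and can be applied in a single left-to-right pass.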

Training and System Design

Training uses audio-transcription pairs in which the transcriptions are intentionally corrupted with errors; revision tokens that correct those errors are included in the training targets, teaching the model to self-correct. The system can be implemented with either an encoder-decoder or a decoder-only architecture, depending on requirements.
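A hedged sketch of constructing one such training target: substitute a transcript token with an error, then append a revision that flags and fixes it. The `<rev:N> ... </rev>` encoding (where `<rev:N>` flags the N preceding tokens and the tokens up to `</rev>` replace them) and the function names are assumptions for illustration only.

```python
# Build a self-correction training target from a clean transcript:
# corrupt the token at `idx`, then insert a revision that restores it.
# The model learns to emit such revisions when it detects its own errors.

def make_training_target(transcript, idx, corrupt):
    tokens = list(transcript)
    corrupted = corrupt(tokens[idx])  # the intentionally injected error
    return (
        tokens[:idx]
        + [corrupted, "<rev:1>", tokens[idx], "</rev>"]
        + tokens[idx + 1:]
    )

target = make_training_target(["the", "cat", "sat"], 1, lambda t: "cap")
# -> ["the", "cap", "<rev:1>", "cat", "</rev>", "sat"]
```

In a real pipeline the corruption would mimic plausible ASR errors (e.g. acoustically confusable words) and the error position would be sampled, so the model sees corrections throughout the sequence.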