US20260094600
2026-04-02
Physics
G10L15/16
A novel method for enhancing large language models (LLMs) involves generating a sequence of output tokens from a given prompt. The sequence includes correct and incorrect textual tokens, along with revision tokens that identify the errors and specify replacements. Applying the revision tokens yields a refined output sequence, improving the accuracy of the LLM's responses, particularly in tasks such as automatic speech recognition (ASR).
The method focuses on multimodal LLMs, which are designed to handle diverse inputs such as text, audio, and video. These models typically employ autoregressive decoding, generating each token conditioned only on the tokens before it. This can limit accuracy, since each token is committed before any future context is available. The disclosed method addresses this limitation by enabling the LLM to self-correct using revision tokens.
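The prefix-only conditioning described above can be sketched as a minimal greedy decoding loop. The helper names here (`decode`, `next_token`) are illustrative, not from the disclosure; the point is that each prediction sees only the tokens generated so far, so an early mistake cannot draw on later context.

```python
# Minimal sketch of autoregressive decoding. `next_token` stands in for
# the model's per-step prediction; it receives only the current prefix.

def decode(prompt_tokens, next_token, max_len=16, eos="<eos>"):
    tokens = list(prompt_tokens)
    while len(tokens) < max_len:
        tok = next_token(tokens)   # conditioned only on the prefix so far
        tokens.append(tok)
        if tok == eos:             # stop once the model emits end-of-sequence
            break
    return tokens

# Usage with a scripted stand-in predictor:
script = iter(["hello", "world", "<eos>"])
out = decode(["<bos>"], lambda prefix: next(script))
print(out)  # ['<bos>', 'hello', 'world', '<eos>']
```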
Generating a revised sequence involves identifying incorrect tokens and replacing them as directed by the revision tokens. The model can recognize an earlier inaccuracy from subsequently generated tokens, allowing correction during decoding. Each revision token also indicates the position and number of erroneous tokens relative to it, enabling precise adjustments.
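One way to picture this step is a post-processing pass that folds revision tokens into the surrounding text tokens. The concrete encoding below, a tuple of (marker, offset back from the revision token, number of erroneous tokens, replacement tokens), is an assumption for illustration; the disclosure does not fix a specific format.

```python
# Hypothetical sketch of applying revision tokens to a decoded sequence.
# Revision entries ("<rev>", offset, count, replacements) mean: counting
# back `offset` positions from the revision token, replace `count`
# tokens with `replacements`. Plain text tokens are strings.

def apply_revisions(tokens):
    """Produce the refined sequence by executing each revision token."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple) and tok[0] == "<rev>":
            _, offset, count, replacements = tok
            start = len(out) - offset          # first erroneous token
            out[start:start + count] = replacements
        else:
            out.append(tok)
    return out

raw = ["the", "cat", "sad", "on", "the", "mat",
       ("<rev>", 4, 1, ["sat"])]               # fixes "sad" -> "sat"
print(apply_revisions(raw))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Because the offset is measured relative to the revision token itself, the correction can be applied in a single left-to-right pass, which matches the autoregressive setting.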
Training uses audio-transcription pairs in which the transcriptions are intentionally corrupted with errors; revision tokens that correct those errors are included in the training targets, teaching the model to self-correct. The system can be implemented using an encoder-decoder or a decoder-only architecture, depending on specific requirements.
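The target-construction step might look like the sketch below, which corrupts one word of a reference transcription and appends a revision token that restores it. The corruption scheme (reversing a word) and the ("<rev>", offset, count, replacements) encoding are assumptions for illustration, not the patent's exact procedure.

```python
# Illustrative construction of a self-correction training target:
# corrupt one word, then append the revision token that undoes it.

import random

def make_training_target(transcript, rng):
    """Return token list: corrupted words plus a correcting revision token."""
    words = transcript.split()
    idx = rng.randrange(len(words))            # pick a word to corrupt
    original = words[idx]
    corrupted = words.copy()
    corrupted[idx] = original[::-1]            # simple synthetic error
    offset = len(corrupted) - idx              # counted back from the <rev> token
    return corrupted + [("<rev>", offset, 1, [original])]

target = make_training_target("speech recognition is hard", random.Random(0))
```

Training on such pairs exposes the model to both the error and its repair in one sequence, which is what lets it emit revision tokens at inference time.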