Invention Title:

AUDIO AND VIDEO TOKENIZATION FOR MULTIMODAL LARGE LANGUAGE MODELS

Publication number:

US20260099522

Publication date:

Section:

Physics

Class:

G06F16/33295

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application describes systems and methods for power-efficient continuous tokenization and long-context storage of audio and video data for use with multimodal large language models (LLMs). These systems include specialized subsystems that receive input signals, generate discrete tokens, and buffer those tokens for periods ranging from seconds to hours. When a trigger to engage a multimodal LLM is received, a subset of the buffered tokens is sent to an inference dispatcher, which distributes them to one or more inference engines for processing. The architecture supports multiple modalities, including audio, video, image, and text, and is intended to enable context-rich, privacy-preserving, low-latency AI interactions on client devices.

Technical Field

The disclosure relates to deep learning, focusing on power-efficient audio and video tokenization for multimodal LLMs. The rise in AI-based data processing, particularly with LLMs, has enabled systems to understand and generate human language across various modalities. However, continuous audio analysis is power-intensive, and LLMs require substantial compute power, often consuming the entire bandwidth of neural processing units. The proposed systems aim to mitigate these challenges by utilizing efficient token-based data encoding and low-power hardware to reduce power consumption and bandwidth usage.

Background

Multimodal LLMs are designed to reason over multiple modalities such as audio, text, and images, enabling natural voice interactions with minimal latency. Users expect these models to provide human-like intelligence, which requires processing over extended temporal contexts. In current systems, continuous audio analysis consumes significant power, making it impractical for battery-powered devices. Conventional methods, which rely on raw audio processing, impose heavy power and bandwidth demands. The proposed systems distribute components of the multimodal inference pipeline across specialized hardware subsystems, minimizing power consumption while preserving rich contextual data.

Detailed Description

The proposed systems and methods address the limitations of continuous multimodal LLM use by efficiently tokenizing inputs and storing the resulting tokens. Inputs are encoded into tokens: compact representations that preserve essential information while being smaller than raw samples or embeddings. An audio offload engine integrated into the system-on-chip (SoC) performs continuous audio tokenization, converting audio streams into highly compressed symbolic representations. These tokens are buffered locally, which enables long-context recall and real-time analysis without transmitting raw data, enhancing privacy and reducing bandwidth use.
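
The local buffering step can be illustrated with a small software sketch. This is not the implementation described in the application, which places tokenization in a hardware offload engine; it only shows one plausible way to retain a bounded window of discrete tokens and recall a recent slice. The names `Token`, `TokenBuffer`, `tokens_per_second`, and `retention_seconds` are hypothetical, and the fixed token rate is an assumption made to size the buffer.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """One discrete token (hypothetical layout, not from the application)."""
    timestamp: float  # seconds since the stream started
    modality: str     # e.g. "audio" or "video"
    code: int         # discrete index emitted by a tokenizer


class TokenBuffer:
    """Fixed-capacity ring buffer; the oldest tokens are evicted first."""

    def __init__(self, tokens_per_second: int, retention_seconds: int) -> None:
        # Assumption: a constant token rate, so capacity = rate * retention.
        self._buf = deque(maxlen=tokens_per_second * retention_seconds)

    def push(self, token: Token) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._buf.append(token)

    def recent(self, now: float, window_seconds: float) -> List[Token]:
        """Return buffered tokens no older than window_seconds before now."""
        cutoff = now - window_seconds
        return [t for t in self._buf if t.timestamp >= cutoff]

    def __len__(self) -> int:
        return len(self._buf)
```

Because only token indices are stored, the buffer never holds raw audio or video samples, which matches the privacy and bandwidth motivation given above. A retention of hours rather than seconds is simply a larger `retention_seconds` value in this sketch.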

Implementation

The systems incorporate a trigger mechanism activated by user input or predefined conditions to initiate interaction with a multimodal LLM. Buffered tokens are transmitted to an inference dispatcher, which optimizes token distribution across local or remote inference engines. By offloading tokenization to low-power subsystems and implementing scalable token buffering and distribution, these techniques enable continuous multimodal processing with minimal power overhead. The methods extend battery life while maintaining AI availability and can be applied to any input modality, including audio, video, text, and image.
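
The trigger-and-dispatch flow described above can be sketched as follows. The application does not specify how the dispatcher chooses between engines, so the size-based routing rule here is purely an illustrative assumption, and `InferenceDispatcher`, `on_trigger`, and `local_max_tokens` are hypothetical names. Inference engines are modeled as plain callables that take a token sequence and return a text result.

```python
from typing import Callable, List, Sequence

# An engine is modeled as a callable: token codes in, text result out.
Engine = Callable[[Sequence[int]], str]


class InferenceDispatcher:
    """Routes a token subset to a local or remote engine (assumed policy)."""

    def __init__(self, local: Engine, remote: Engine, local_max_tokens: int) -> None:
        self._local = local
        self._remote = remote
        self._local_max_tokens = local_max_tokens

    def dispatch(self, tokens: Sequence[int]) -> str:
        # Assumed rule: short contexts stay on-device, long ones go remote.
        if len(tokens) <= self._local_max_tokens:
            return self._local(tokens)
        return self._remote(tokens)


def on_trigger(buffered: List[int], window: int, dispatcher: InferenceDispatcher) -> str:
    """On a trigger, send only the most recent `window` tokens for inference."""
    subset = buffered[-window:] if window > 0 else []
    return dispatcher.dispatch(subset)
```

In this sketch the tokenizer and buffer keep running regardless of the trigger; only the call to `on_trigger` wakes an inference engine, which is where the power saving in the described architecture would come from.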