Invention Title:

METHODS AND SYSTEMS FOR GENERATING TASK-SPECIFIC OUTPUT USING A SINGLE QUANTIZED MODEL GRAPH

Publication number:

US20260141230

Publication date:

Section:

Physics

Class:

G06N3/0495

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application describes a method and system for optimizing the deployment of generative artificial intelligence (AI) models using a single quantized model graph. The approach targets parameter-efficient fine-tuning (PEFT) models, which adapt a foundation model to specific tasks without full retraining. The method involves obtaining a pre-trained base model, performing a quantization sensitivity analysis across multiple PEFT models, and selecting one model based on its sensitivity score. A fixed quantization configuration is then determined from the selected model. That configuration is used to adjust the model weights and to generate a single quantized model graph that produces task-specific outputs.

Challenges Addressed

Generative AI models, such as large language and vision models, are difficult to deploy on embedded devices because of memory, power, and latency constraints. Each PEFT model typically requires its own quantization parameters, which increases memory usage and task-switching latency. The application addresses these issues by unifying the quantization parameters across multiple PEFT models, which reduces memory consumption and deployment complexity and improves scalability.

Technical Approach

The process begins with a quantization sensitivity analysis that yields a sensitivity score for each PEFT model. Based on these scores, a quantization-sensitive model is selected, and a fixed quantization configuration is determined from it. This configuration includes scale and zero-point parameters, which are used to adjust the weights of all PEFT models. The adjusted models are then integrated into a single quantized model graph, which allows a PEFT model to be selected dynamically at inference time to produce task-specific outputs.
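The flow described above can be illustrated with a minimal sketch. It is not the application's implementation: the summary does not say how the sensitivity score is computed or which model is chosen, so the sketch assumes a round-trip quantization error as the score, assumes the most sensitive model is selected, and assumes 8-bit asymmetric quantization. All names (`affine_params`, `sensitivity_score`, `peft_weights`, `model_graph`, `adapter_for`) and the toy weights are hypothetical, and the weights are quantized directly rather than fine-tuned to fit the shared configuration.

```python
from typing import List, Tuple

QMIN, QMAX = 0, 255  # unsigned 8-bit range (assumed bit width)


def affine_params(weights: List[float]) -> Tuple[float, int]:
    """Scale and zero-point of an asymmetric quantizer covering the weight range."""
    lo, hi = min(min(weights), 0.0), max(max(weights), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero weights
    zero_point = round(QMIN - lo / scale)
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(weights: List[float], scale: float, zero_point: int) -> List[int]:
    """Map float weights to integer codes, clamped to the 8-bit range."""
    return [max(QMIN, min(QMAX, round(w / scale) + zero_point)) for w in weights]


def dequantize(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def sensitivity_score(weights: List[float]) -> float:
    """Assumed score: mean squared round-trip error under the model's own quantizer."""
    scale, zp = affine_params(weights)
    restored = dequantize(quantize(weights, scale, zp), scale, zp)
    return sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)


# Toy stand-ins for the weights of three PEFT models (hypothetical values).
peft_weights = {
    "summarize": [-0.5, 0.1, 0.25, 0.4],
    "translate": [-2.0, -0.3, 0.7, 1.9],
    "caption": [-0.1, 0.0, 0.05, 0.12],
}

# 1. Score every PEFT model; 2. pick the most sensitive one (assumption).
scores = {task: sensitivity_score(w) for task, w in peft_weights.items()}
selected = max(scores, key=scores.get)

# 3. Derive one fixed configuration from the selected model.
shared_scale, shared_zero_point = affine_params(peft_weights[selected])

# 4. Quantize all PEFT models with that single configuration and keep them
#    side by side, as a stand-in for the single quantized model graph.
model_graph = {
    "scale": shared_scale,
    "zero_point": shared_zero_point,
    "adapters": {
        task: quantize(w, shared_scale, shared_zero_point)
        for task, w in peft_weights.items()
    },
}


def adapter_for(task: str) -> List[float]:
    """Select a task at runtime; no per-task quantization parameters are loaded."""
    return dequantize(
        model_graph["adapters"][task], model_graph["scale"], model_graph["zero_point"]
    )
```

Calling `adapter_for("caption")` and then `adapter_for("translate")` switches tasks while reusing the same scale and zero-point, which is the memory and latency saving the summary attributes to the unified configuration. In this toy setup the selected model also has the widest weight range, so the other models fit inside its range without clipping; that property is an artifact of the assumed score, not something the summary states.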

Implementation Details

The method includes a preprocessing step that adjusts the weights of each PEFT model, using techniques such as fine-tuning with a knowledge distillation (KD) loss. This fine-tuning minimizes a divergence metric between the output distributions of the original model and its quantization-constrained counterpart. The adjusted weights are then used for inference without further modification, which allows switching among PEFT models at runtime. The system supports a range of devices, including edge devices, mobile neural processing units, and cloud accelerators.
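As an illustration of the KD objective mentioned above, the following sketch computes a distillation loss between the outputs of an original model and a quantization-constrained one. The summary names only "divergence metrics", so the choice of Kullback-Leibler divergence, the temperature scaling, and the function names (`softmax`, `kl_divergence`, `kd_loss`) are assumptions made for this example.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max logit for numerical stability."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_loss(
    original_logits: List[float],
    constrained_logits: List[float],
    temperature: float = 2.0,  # assumed value, not taken from the application
) -> float:
    """Distillation loss: divergence from the original model's output distribution
    to that of the quantization-constrained model."""
    p = softmax(original_logits, temperature)
    q = softmax(constrained_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)


# Logits for one input (hypothetical values).
original = [2.0, 0.5, -1.0]
constrained = [1.7, 0.8, -0.9]
loss = kd_loss(original, constrained)
```

During fine-tuning, a loss of this kind would be minimized with respect to the PEFT weights while the scale and zero-point stay fixed, so that the constrained model's outputs track those of the original. The loss is zero when the two sets of logits are identical and positive otherwise.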

Applications and Benefits

The approach applies to a wide range of generative AI tasks, including language generation, speech synthesis, image generation, and multimodal reasoning. Because all PEFT models share one quantized model graph, the system improves efficiency, lowers deployment costs, and scales to more tasks. It is particularly useful in large-scale deployments that require many PEFT models and in resource-constrained environments, where per-model quantization parameters would otherwise add memory and switching overhead.