Invention Title:

METHOD FOR PARALLEL EXECUTION OF MULTIPLE DEEP-LEARNING MODELS AND APPARATUS THEREFOR

Publication number:

US20260148062

Publication date:

2026-05-28

Section:

Physics

Class:

G06N3/08

Inventors:

Mi-Sun YU 🇰🇷 Daejeon, South Korea

Yong-In Kwon 🇰🇷 Daejeon, South Korea

Assignee:

Electronics and Telecommunications Research Institute 🇰🇷 Daejeon, South Korea

Applicant:

ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE 🇰🇷 Daejeon, South Korea

Smart overview of the Invention

The patent application describes a method and apparatus for executing multiple deep-learning models in parallel on several hardware accelerators. Each model is transformed into executable partitions, which are then deployed across the accelerators according to their execution order and dependencies. By processing these partitions concurrently, the system aims to improve inference efficiency and reduce response times.

Technical Field and Challenges

The technology concerns inference scheduling and compilation for deep-learning models, with the goal of minimizing latency and maximizing resource utilization. Traditional compilers often transform a model for a single accelerator, which limits parallel execution across heterogeneous devices. The invention addresses the challenge of executing multiple models concurrently on diverse accelerators, such as NVIDIA Jetson Nano and Google Coral Edge TPU, where existing approaches often allocate resources inefficiently.

Objectives

Key objectives include maximizing system throughput, reducing response times, and improving execution management. The technology automatically partitions deep-learning models into units executable by different accelerators, considering computational characteristics and performance. This reduces AI application development time and enhances performance, enabling concurrent model execution even without high-performance GPUs.

Methodology

The method transforms each model into partitions executable on accelerators, deploys the partitions according to their execution order and dependencies, and executes them in parallel. Each partition includes code optimized for a specific accelerator, generated through hardware-independent graph optimization. A partition performance model accounts for execution time, data transmission time, and retrieval time, with the aim of minimizing wait times and using resources efficiently.
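
The deployment step described above can be illustrated with a minimal sketch. It is not the patented algorithm: it assumes a simple greedy rule (place each partition on whichever accelerator lets it finish earliest), and the `Partition` class, the `place_partitions` function, the accelerator names, and all timing values are hypothetical. The cost estimate combines per-accelerator execution time with a data transmission cost that is charged only when a dependency ran on a different accelerator.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    name: str
    deps: list[str] = field(default_factory=list)            # partitions that must finish first
    exec_ms: dict[str, float] = field(default_factory=dict)  # estimated run time per accelerator
    transfer_ms: float = 0.0                                 # estimated cost of moving inputs across accelerators


def place_partitions(partitions: list[Partition]) -> dict[str, tuple[str, float, float]]:
    """Greedy placement; returns {partition: (accelerator, start_ms, finish_ms)}.

    Assumes `partitions` is already in a valid execution order,
    i.e. every dependency appears before the partitions that need it.
    """
    free_at: dict[str, float] = {}  # time at which each accelerator becomes idle
    plan: dict[str, tuple[str, float, float]] = {}
    for p in partitions:
        best = None
        for acc, run_ms in p.exec_ms.items():
            ready = 0.0
            for d in p.deps:
                dep_acc, _, dep_finish = plan[d]
                # Pay the transfer cost only when data crosses accelerators.
                arrive = dep_finish + (p.transfer_ms if dep_acc != acc else 0.0)
                ready = max(ready, arrive)
            start = max(ready, free_at.get(acc, 0.0))
            finish = start + run_ms
            if best is None or finish < best[2]:
                best = (acc, start, finish)
        if best is None:
            raise ValueError(f"no accelerator can run partition {p.name!r}")
        plan[p.name] = best
        free_at[best[0]] = best[2]
    return plan


# Two hypothetical models (a and b), each split into two partitions.
demo = [
    Partition("a0", [], {"gpu": 4.0, "tpu": 6.0}),
    Partition("b0", [], {"gpu": 5.0, "tpu": 5.0}),
    Partition("a1", ["a0"], {"gpu": 3.0, "tpu": 1.5}, transfer_ms=1.0),
    Partition("b1", ["b0"], {"gpu": 2.0}, transfer_ms=1.0),
]
plan = place_partitions(demo)
for name, (acc, start, finish) in plan.items():
    print(f"{name}: {acc} {start:.1f} -> {finish:.1f} ms")
```

In this toy input, the first partitions of both models start at time zero on different accelerators, which is the kind of overlap the method is meant to achieve. A production scheduler would likely need a stronger search than this greedy rule.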

Apparatus and Execution

The apparatus includes a deep-learning compiler, a partition deployment module, and a multi-model execution module. These components work together to manage and execute model partitions across accelerators. The system monitors accelerator performance and uses the measurements to refine the partition performance model, supporting efficient parallel execution. A model's result is produced once its last partition completes, and the design aims to maximize concurrency and system responsiveness.
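
One way to picture the refinement of the partition performance model from monitored measurements is the sketch below. The summary does not state how the refinement is done, so the exponential moving average used here is an assumption, and the `PartitionPerformanceModel` class, its methods, and the partition and accelerator names are hypothetical.

```python
from typing import Optional


class PartitionPerformanceModel:
    """Keeps a running estimate of execution time per (partition, accelerator).

    Assumption: estimates are refined with an exponential moving average,
    so recent measurements count more than older ones.
    """

    def __init__(self, alpha: float = 0.25) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self._estimate_ms: dict[tuple[str, str], float] = {}

    def observe(self, partition: str, accelerator: str, measured_ms: float) -> float:
        """Fold one monitored execution time into the estimate and return it."""
        key = (partition, accelerator)
        old = self._estimate_ms.get(key)
        if old is None:
            new = measured_ms  # the first measurement seeds the estimate
        else:
            new = (1.0 - self.alpha) * old + self.alpha * measured_ms
        self._estimate_ms[key] = new
        return new

    def estimate(self, partition: str, accelerator: str) -> Optional[float]:
        """Current estimate in milliseconds, or None if nothing was observed yet."""
        return self._estimate_ms.get((partition, accelerator))


model = PartitionPerformanceModel(alpha=0.5)
model.observe("a0", "gpu", 10.0)
model.observe("a0", "gpu", 20.0)
print(model.estimate("a0", "gpu"))
```

A scheduler could consult `estimate` when deciding where to place the next partition, so that placement decisions track the accelerators' observed behavior instead of fixed compile-time figures.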