US20260148062
2026-05-28
Physics
G06N3/08
The patent application details a method and apparatus for executing multiple deep-learning models in parallel using various hardware accelerators. This approach involves transforming deep-learning models into executable partitions, which are then managed and deployed across different accelerators based on specific dependencies. By processing these partitions concurrently, the system aims to enhance inference efficiency and reduce response times.
The technology targets inference scheduling and compilation for deep-learning models, with the goal of minimizing latency and maximizing resource utilization. Conventional compilers typically transform a model for a single target accelerator, which limits parallel execution across heterogeneous devices. As a result, running multiple models concurrently on diverse accelerators, such as the NVIDIA Jetson Nano and Google Coral Edge TPU, often leads to inefficient resource allocation. The invention addresses this problem.
Key objectives include maximizing system throughput, reducing response times, and improving execution management. The technology automatically partitions each deep-learning model into units that different accelerators can execute, taking into account the computational characteristics of the model and the performance of each accelerator. According to the application, this shortens AI application development time and improves performance, enabling concurrent model execution even without high-performance GPUs.
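The partitioning idea can be illustrated with a minimal Python sketch. It is not the method claimed in the application: the support table `SUPPORTED`, the per-operator costs in `COST_MS`, the accelerator names, and the functions `pick_accelerator` and `partition_model` are all hypothetical, and the sketch ignores data-transfer cost between accelerators. It assigns each operator of a linear model to the cheapest accelerator that supports it and groups consecutive operators with the same target into one partition.

```python
from itertools import groupby

# Hypothetical table of which operators each accelerator can run.
SUPPORTED = {
    "gpu": {"conv2d", "relu", "matmul", "softmax"},
    "edge_tpu": {"conv2d", "relu"},
}

# Hypothetical per-operator execution cost in milliseconds.
COST_MS = {
    ("conv2d", "gpu"): 4.0, ("conv2d", "edge_tpu"): 2.0,
    ("relu", "gpu"): 0.3, ("relu", "edge_tpu"): 0.2,
    ("matmul", "gpu"): 1.0,
    ("softmax", "gpu"): 0.1,
}


def pick_accelerator(op):
    """Return the cheapest accelerator that supports the operator."""
    candidates = [accel for accel, ops in SUPPORTED.items() if op in ops]
    if not candidates:
        raise ValueError(f"no accelerator supports operator {op!r}")
    return min(candidates, key=lambda accel: COST_MS[(op, accel)])


def partition_model(ops):
    """Group consecutive operators that map to the same accelerator."""
    return [(accel, list(group)) for accel, group in groupby(ops, key=pick_accelerator)]


model_ops = ["conv2d", "relu", "conv2d", "relu", "matmul", "softmax"]
partitions = partition_model(model_ops)
for accel, ops in partitions:
    print(accel, ops)
```

A real compiler would operate on a graph rather than a flat operator list and would weigh transfer overhead before splitting, so this sketch only shows the grouping step.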
The method transforms each model into partitions executable on the available accelerators, deploys the partitions according to their execution order and dependencies, and executes them in parallel. Each partition contains code optimized for a specific accelerator, generated after hardware-independent graph optimization. A partition performance model accounts for execution time, data transmission time, and result retrieval time, with the aim of minimizing wait times and using resources efficiently.
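The deployment step described above can be sketched as a small scheduling simulation. This is an illustrative assumption, not the scheduler disclosed in the application: the `Partition` fields, the greedy earliest-start policy, and all timing values are hypothetical. The cost of a partition is modeled as the sum of the three terms the performance model names (transmission, execution, retrieval), a partition may start only after its dependencies finish, and each accelerator runs one partition at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    accelerator: str
    exec_ms: float       # execution time on the accelerator
    transfer_ms: float   # time to send inputs to the accelerator
    retrieve_ms: float   # time to fetch outputs back
    deps: tuple = ()     # names of partitions that must finish first

    @property
    def cost_ms(self):
        return self.transfer_ms + self.exec_ms + self.retrieve_ms


def simulate_schedule(partitions):
    """Return {name: (start_ms, finish_ms)} for a greedy dependency-aware schedule."""
    finish = {}        # partition name -> finish time
    accel_free = {}    # accelerator -> time at which it becomes idle
    timeline = {}
    pending = list(partitions)

    def earliest_start(p):
        deps_done = max((finish[d] for d in p.deps), default=0.0)
        return max(deps_done, accel_free.get(p.accelerator, 0.0))

    while pending:
        ready = [p for p in pending if all(d in finish for d in p.deps)]
        if not ready:
            raise ValueError("cyclic or missing dependency")
        # Pick the ready partition that can start soonest; break ties by name.
        nxt = min(ready, key=lambda p: (earliest_start(p), p.name))
        start = earliest_start(nxt)
        end = start + nxt.cost_ms
        finish[nxt.name] = end
        accel_free[nxt.accelerator] = end
        timeline[nxt.name] = (start, end)
        pending.remove(nxt)
    return timeline


# Two hypothetical models, each split into two partitions on different accelerators.
parts = [
    Partition("a0", "gpu", exec_ms=4.0, transfer_ms=1.0, retrieve_ms=1.0),
    Partition("a1", "edge_tpu", exec_ms=2.0, transfer_ms=1.0, retrieve_ms=1.0, deps=("a0",)),
    Partition("b0", "edge_tpu", exec_ms=3.0, transfer_ms=0.5, retrieve_ms=0.5),
    Partition("b1", "gpu", exec_ms=2.0, transfer_ms=1.0, retrieve_ms=1.0, deps=("b0",)),
]
timeline = simulate_schedule(parts)
makespan = max(end for _, end in timeline.values())
serial = sum(p.cost_ms for p in parts)
print(f"parallel makespan: {makespan} ms, serial total: {serial} ms")
```

In this toy workload the two models interleave across the two accelerators, so the simulated makespan is shorter than running all four partitions back to back.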
The apparatus includes a deep-learning compiler, a partition deployment module, and a multi-model execution module, which together manage and execute model partitions across the accelerators. The system monitors accelerator performance and uses the measurements to refine the partition performance model, keeping parallel execution efficient. A model's result is generated once its last partition completes, which maximizes concurrency and system responsiveness.
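One way to picture the monitoring feedback loop is a running estimate that is updated after each measured execution. The sketch below is an assumption for illustration only: the application does not specify the update rule, so an exponential moving average is swapped in here, and the class `PartitionPerformanceModel`, its methods, and the smoothing factor `alpha` are hypothetical names.

```python
class PartitionPerformanceModel:
    """Keeps a smoothed latency estimate per (partition, accelerator) pair."""

    def __init__(self, alpha=0.5):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.estimates = {}  # (partition, accelerator) -> estimated latency in ms

    def observe(self, partition, accelerator, measured_ms):
        """Fold one monitored measurement into the estimate (exponential moving average)."""
        key = (partition, accelerator)
        prev = self.estimates.get(key)
        if prev is None:
            self.estimates[key] = measured_ms
        else:
            self.estimates[key] = (1 - self.alpha) * prev + self.alpha * measured_ms
        return self.estimates[key]

    def best_accelerator(self, partition):
        """Return the accelerator with the lowest current estimate for the partition."""
        options = {a: ms for (p, a), ms in self.estimates.items() if p == partition}
        if not options:
            raise KeyError(f"no measurements for partition {partition!r}")
        return min(options, key=options.get)


model = PartitionPerformanceModel(alpha=0.5)
model.observe("conv_block", "gpu", 10.0)
model.observe("conv_block", "gpu", 20.0)       # estimate moves halfway toward the new value
model.observe("conv_block", "edge_tpu", 12.0)
print(model.estimates)
print(model.best_accelerator("conv_block"))
```

A deployment module could consult such estimates when choosing where to place the next partition; the smoothing factor trades responsiveness to recent measurements against stability.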