US20260044380
2026-02-12
Physics
G06F9/5038
The patent application introduces a method to enhance the execution of artificial intelligence workloads, specifically focusing on transformer models. These models are structured as chains of cells, each containing specific tasks. The method involves generating a parallel schedule that efficiently distributes these tasks across a cluster of devices, aiming to optimize performance and resource utilization. This approach addresses the challenge of managing the growing computational demands of large AI models by leveraging mixed parallelism techniques.
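The "chain of cells" structure described above can be sketched as a small data model. All class and field names here are illustrative assumptions; the application does not specify concrete data structures.

```python
from dataclasses import dataclass

# Hypothetical internal representation of a transformer workload:
# an ordered chain of cells, each cell holding the tasks it executes.
# Names (Task, Cell, TransformerModel) are assumptions for illustration.

@dataclass
class Task:
    name: str
    flops: float  # estimated compute cost of the task

@dataclass
class Cell:
    tasks: list  # tasks executed within this cell

@dataclass
class TransformerModel:
    cells: list  # ordered chain of cells

# A toy model: four identical cells, each with an attention and an MLP task.
model = TransformerModel(cells=[
    Cell(tasks=[Task("attention", 2.0e9), Task("mlp", 4.0e9)])
    for _ in range(4)
])
print(len(model.cells))  # number of cells in the chain
```

A parallel schedule, in these terms, is an assignment of each cell's tasks to devices in the cluster.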
Generative AI models, particularly large language models, have grown too large and computationally demanding for a single device, necessitating distributed execution. Partitioning their work across multiple devices is intricate: the partitions must be synchronized and their interdependencies managed carefully. Because available computational resources can change over time, adaptive parallelism strategies are needed to keep distributed execution efficient.
The method involves receiving internal representations of the transformer model, the device cluster, and the workload. From these representations, multiple candidate execution plans are generated, each describing a different parallel schedule for partitioning work across the devices. Each candidate's resource usage is then evaluated by simulation, and the plan with the lowest simulated resource consumption is selected for executing the transformer model on the device cluster.
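The enumerate-then-simulate selection step can be sketched as follows. The cost and memory models here are toy stand-ins for the simulation described in the application; all function names and constants are assumptions for illustration.

```python
# Hedged sketch: enumerate candidate (pipeline_stages, replicas) plans that
# exactly cover a device cluster, filter out plans whose per-stage model
# shard would not fit in device memory, and pick the plan with the lowest
# simulated resource usage. The cost model is a toy stand-in.

def candidate_plans(num_devices):
    """Yield (stages, replicas) pairs whose product equals num_devices."""
    for stages in range(1, num_devices + 1):
        if num_devices % stages == 0:
            yield stages, num_devices // stages

def feasible(stages, replicas, model_mem=40.0, device_mem=16.0):
    # Each replica shards the model across its pipeline stages;
    # every shard must fit within one device's memory.
    return model_mem / stages <= device_mem

def simulate_cost(stages, replicas, work=1000.0, comm=5.0):
    # Toy cost: per-replica work shrinks with replication; each extra
    # pipeline stage adds inter-stage communication overhead.
    return work / replicas + comm * (stages - 1)

def best_plan(num_devices):
    feas = [p for p in candidate_plans(num_devices) if feasible(*p)]
    return min(feas, key=lambda p: simulate_cost(*p))

print(best_plan(8))  # → (4, 2): 4 stages fit in memory, 2 replicas share work
```

With these toy numbers, pure replication is infeasible (the whole model exceeds one device's memory), so the search settles on a mix of pipelining and replication, illustrating why simulation over many candidates is needed rather than a fixed rule.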
The invention can be implemented in a computing system comprising memory, a processor, and storage media with executable instructions. These components work together to generate and execute the parallel schedule, dividing the transformer model into sequential stages and creating replicas as needed. Tasks within each cell are mapped to devices, facilitating efficient parallel execution of the model according to the workload requirements.
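The stage-division and task-mapping steps above can be sketched as follows. This is a minimal sketch assuming contiguous stage partitioning and one device per (replica, stage) pair; the function names are hypothetical.

```python
# Hedged sketch of the mapping step: divide a chain of cells into contiguous
# pipeline stages, then give each (replica, stage) pair its own device.

def partition_cells(num_cells, num_stages):
    """Split cell indices into num_stages contiguous, near-even groups."""
    base, extra = divmod(num_cells, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

def map_to_devices(num_cells, num_stages, num_replicas):
    """Return a mapping: device id -> (replica id, cells of its stage)."""
    stages = partition_cells(num_cells, num_stages)
    mapping, device = {}, 0
    for r in range(num_replicas):
        for cells in stages:
            mapping[device] = (r, cells)
            device += 1
    return mapping

# 6 cells, 2 pipeline stages, 2 replicas -> 4 devices.
print(map_to_devices(6, 2, 2))
```

Each replica here is a full copy of the pipeline, so the cluster is consumed replica by replica, stage by stage, matching the sequential-stages-with-replicas layout described above.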
Further aspects of the method involve searching for parallel schedules by determining the number of model and cell replicas needed. This includes partitioning the device cluster into these replicas and stages, ensuring that tasks are appropriately mapped to devices. This systematic approach to partitioning and scheduling aims to optimize the execution of transformer models, enhancing the performance and scalability of AI workloads across distributed systems.
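The search space implied above, in which devices are divided among model replicas, pipeline stages, and cell replicas simultaneously, can be sketched as an enumeration of three-way factorizations of the device count. This is an assumed formulation for illustration; the application does not give the search procedure in this form.

```python
# Sketch of a mixed-parallelism search space: every ordered triple
# (model_replicas, stages, cell_replicas) whose product equals the
# device count is one candidate partitioning of the cluster.

def mixed_parallel_plans(num_devices):
    """Yield (model_replicas, stages, cell_replicas) covering all devices."""
    for model_reps in range(1, num_devices + 1):
        if num_devices % model_reps:
            continue
        rest = num_devices // model_reps
        for stages in range(1, rest + 1):
            if rest % stages:
                continue
            yield model_reps, stages, rest // stages

plans = list(mixed_parallel_plans(8))
print(len(plans))  # → 10 ordered factorizations of 8 into three factors
```

Each triple from this generator would then be scored by the simulation step described earlier, so the overall method is a search over factorizations followed by cost-based selection.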