US20260044380
2026-02-12
Physics
G06F9/5038
The patent application introduces a method to enhance the execution of artificial intelligence workloads, specifically focusing on transformer models. These models are structured as chains of cells, each containing specific tasks. The method involves generating a parallel schedule that efficiently distributes these tasks across a cluster of devices, aiming to optimize performance and resource utilization. This approach addresses the challenge of managing the growing computational demands of large AI models by leveraging mixed parallelism techniques.
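The "chain of cells" structure described above can be sketched as a small data model. All class and field names here are illustrative assumptions; the application does not specify concrete data structures.

```python
from dataclasses import dataclass

# Hypothetical internal representation of a transformer workload:
# an ordered chain of cells, each cell holding the tasks it executes.
# Names (Task, Cell, TransformerModel) are assumptions for illustration.

@dataclass
class Task:
    name: str
    flops: float  # estimated compute cost of the task

@dataclass
class Cell:
    tasks: list  # tasks executed within this cell

@dataclass
class TransformerModel:
    cells: list  # ordered chain of cells

# A toy model: four identical cells, each with an attention and an MLP task.
model = TransformerModel(cells=[
    Cell(tasks=[Task("attention", 2.0e9), Task("mlp", 4.0e9)])
    for _ in range(4)
])
print(len(model.cells))  # number of cells in the chain
```

A parallel schedule, in these terms, is an assignment of each cell's tasks to devices in the cluster.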
Generative AI models, particularly large language models, have grown too large and computationally demanding for a single device, necessitating distributed execution. Partitioning their work across multiple devices is intricate: the partitions must be synchronized and their interdependencies managed carefully. Because available computational resources can change over time, adaptive parallelism strategies are needed to keep distributed execution efficient.
The method involves receiving internal representations of the transformer model, the device cluster, and the workload. From these representations, multiple candidate execution plans are generated, each describing a different parallel schedule for partitioning work across the devices. Each candidate's resource usage is then evaluated by simulation, and the plan with the lowest simulated resource consumption is selected for executing the transformer model on the device cluster.
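The enumerate-then-simulate selection step can be sketched as follows. The cost and memory models here are toy stand-ins for the simulation described in the application; all function names and constants are assumptions for illustration.

```python
# Hedged sketch: enumerate candidate (pipeline_stages, replicas) plans that
# exactly cover a device cluster, filter out plans whose per-stage model
# shard would not fit in device memory, and pick the plan with the lowest
# simulated resource usage. The cost model is a toy stand-in.

def candidate_plans(num_devices):
    """Yield (stages, replicas) pairs whose product equals num_devices."""
    for stages in range(1, num_devices + 1):
        if num_devices % stages == 0:
            yield stages, num_devices // stages

def feasible(stages, replicas, model_mem=40.0, device_mem=16.0):
    # Each replica shards the model across its pipeline stages;
    # every shard must fit within one device's memory.
    return model_mem / stages <= device_mem

def simulate_cost(stages, replicas, work=1000.0, comm=5.0):
    # Toy cost: per-replica work shrinks with replication; each extra
    # pipeline stage adds inter-stage communication overhead.
    return work / replicas + comm * (stages - 1)

def best_plan(num_devices):
    feas = [p for p in candidate_plans(num_devices) if feasible(*p)]
    return min(feas, key=lambda p: simulate_cost(*p))

print(best_plan(8))  # → (4, 2): 4 stages fit in memory, 2 replicas share work
```

With these toy numbers, pure replication is infeasible (the whole model exceeds one device's memory), so the search settles on a mix of pipelining and replication, illustrating why simulation over many candidates is needed rather than a fixed rule.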
The invention can be implemented in a computing system comprising memory, a processor, and storage media with executable instructions. These components work together to generate and execute the parallel schedule, dividing the transformer model into sequential stages and creating replicas as needed. Tasks within each cell are mapped to devices, facilitating efficient parallel execution of the model according to the workload requirements.
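The stage-division and task-mapping steps above can be sketched as follows. This is a minimal sketch assuming contiguous stage partitioning and one device per (replica, stage) pair; the function names are hypothetical.

```python
# Hedged sketch of the mapping step: divide a chain of cells into contiguous
# pipeline stages, then give each (replica, stage) pair its own device.

def partition_cells(num_cells, num_stages):
    """Split cell indices into num_stages contiguous, near-even groups."""
    base, extra = divmod(num_cells, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

def map_to_devices(num_cells, num_stages, num_replicas):
    """Return a mapping: device id -> (replica id, cells of its stage)."""
    stages = partition_cells(num_cells, num_stages)
    mapping, device = {}, 0
    for r in range(num_replicas):
        for cells in stages:
            mapping[device] = (r, cells)
            device += 1
    return mapping

# 6 cells, 2 pipeline stages, 2 replicas -> 4 devices.
print(map_to_devices(6, 2, 2))
```

Each replica here is a full copy of the pipeline, so the cluster is consumed replica by replica, stage by stage, matching the sequential-stages-with-replicas layout described above.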
Further aspects of the method involve searching for parallel schedules by determining the number of model and cell replicas needed. This includes partitioning the device cluster into these replicas and stages, ensuring that tasks are appropriately mapped to devices. This systematic approach to partitioning and scheduling aims to optimize the execution of transformer models, enhancing the performance and scalability of AI workloads across distributed systems.
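The search space implied above, in which devices are divided among model replicas, pipeline stages, and cell replicas simultaneously, can be sketched as an enumeration of three-way factorizations of the device count. This is an assumed formulation for illustration; the application does not give the search procedure in this form.

```python
# Sketch of a mixed-parallelism search space: every ordered triple
# (model_replicas, stages, cell_replicas) whose product equals the
# device count is one candidate partitioning of the cluster.

def mixed_parallel_plans(num_devices):
    """Yield (model_replicas, stages, cell_replicas) covering all devices."""
    for model_reps in range(1, num_devices + 1):
        if num_devices % model_reps:
            continue
        rest = num_devices // model_reps
        for stages in range(1, rest + 1):
            if rest % stages:
                continue
            yield model_reps, stages, rest // stages

plans = list(mixed_parallel_plans(8))
print(len(plans))  # → 10 ordered factorizations of 8 into three factors
```

Each triple from this generator would then be scored by the simulation step described earlier, so the overall method is a search over factorizations followed by cost-based selection.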