Invention Title:

INTERACTIVE INTERFACE TASK AUTOMATION UTILIZING GENERATIVE ARTIFICIAL INTELLIGENCE (AI) ACTION MODELS IMPROVED WITH RETRIEVAL-AUGMENTED GENERATION (RAG)

Publication number:

US20260037318

Publication date:

2026-02-05

Section:

Physics

Class:

G06F9/5038

Inventors:

Ravi Theja YADA 🇺🇸 Renton, WA, United States

Amr Mahmoud Ahmed Bekhiet ALY 🇺🇸 Redmond, WA, United States

Sarvesh NAGPAL 🇺🇸 Kirkland, WA, United States

Sharon PENG 🇺🇸 Miami, FL, United States

Aamir JAWAID 🇺🇸 Renton, WA, United States

Assignee:

Microsoft Technology Licensing, LLC 🇺🇸 Redmond, WA, United States

Applicant:

Microsoft Technology Licensing, LLC 🇺🇸 Redmond, WA, United States

Smart Overview of the Invention

The patent application describes a task execution system that automates user-requested actions across interactive interfaces using machine learning models. Central to the system is a generative AI action model enhanced by retrieval-augmented generation (RAG), which improves the accuracy and efficiency of task completion. The system addresses limitations of large action models (LAMs), such as error-prone actions and the need for continuous user input, by creating a session plan that guides task execution across interface segments.

Technical Background

Recent advances in AI have led to the development of LAMs, which simulate user interactions with software interfaces. Despite their capabilities, LAMs struggle when an action fails or when a task requires a sequence of interactions. The task execution system described in this application uses generative AI and RAG to address these problems. Using visual context information and prior user session data, the system can self-correct and adjust session plans dynamically, so that a task can proceed after a failed action.
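
The retrieval step can be illustrated with a minimal Python sketch. It is not the application's method: the PriorSession schema, the function names, and the token-overlap ranking are assumptions made for illustration, and a production system would more likely rank stored sessions with learned embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorSession:
    """A sanitized record of an earlier session (hypothetical schema)."""
    task: str
    actions: tuple[str, ...]


def _tokens(text: str) -> set[str]:
    # Lowercased whitespace tokens; a stand-in for a real text encoder.
    return set(text.lower().split())


def retrieve(task: str, store: list[PriorSession], k: int = 2) -> list[PriorSession]:
    """Return up to k prior sessions ranked by Jaccard token overlap with the task."""
    query = _tokens(task)

    def score(session: PriorSession) -> float:
        other = _tokens(session.task)
        union = query | other
        return len(query & other) / len(union) if union else 0.0

    ranked = sorted(store, key=score, reverse=True)
    # Drop sessions that share no tokens with the task at all.
    return [s for s in ranked[:k] if score(s) > 0.0]


def build_prompt(task: str, retrieved: list[PriorSession]) -> str:
    """Assemble the augmented prompt text a generative model might receive."""
    lines = [f"Task: {task}", "Prior sessions:"]
    for s in retrieved:
        lines.append(f"- {s.task}: {' > '.join(s.actions)}")
    return "\n".join(lines)
```

The retrieved sessions are appended to the task description before a plan is generated, which is the general shape of RAG: the model's input is augmented with stored examples rather than the model being retrained.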

System Implementation

The task execution system employs a generative AI action model and a visual-based generative AI model, both enhanced with sanitized user session data from a RAG database. This setup allows the system to generate and execute session plans autonomously, without needing additional user input. The system can adjust to obstacles by generating updated plans or alternative actions. This approach not only improves accuracy and efficiency but also enhances the flexibility of automated task execution across various interfaces.
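
The plan, execute, and replan behavior described above can be sketched as a simple control loop. This is an illustrative sketch only: run_task, the planner and executor signatures, and the stub functions are hypothetical stand-ins for the generative models and the interface, none of which are specified in this summary.

```python
from typing import Callable

# Hypothetical signatures: a planner returns the remaining actions given what
# has already succeeded and failed; an executor reports success or failure.
Planner = Callable[[str, list[str], list[str]], list[str]]
Executor = Callable[[str], bool]


def run_task(task: str, plan_fn: Planner, execute_fn: Executor,
             max_replans: int = 2) -> tuple[list[str], bool]:
    """Execute a session plan, requesting an updated plan when an action fails."""
    completed: list[str] = []
    failed: list[str] = []
    replans = 0
    plan = plan_fn(task, completed, failed)
    while plan:
        action, plan = plan[0], plan[1:]
        if execute_fn(action):
            completed.append(action)
            continue
        failed.append(action)
        if replans >= max_replans:
            return completed, False  # give up once the replan budget is spent
        replans += 1
        plan = plan_fn(task, completed, failed)  # updated plan, no user input
    return completed, True


# Stubs standing in for the models and the interface (assumptions).
STEPS = ["open_menu", "click_export", "confirm"]
FALLBACKS = {"click_export": "press_ctrl_e"}


def stub_planner(task: str, completed: list[str], failed: list[str]) -> list[str]:
    remaining = []
    for step in STEPS:
        # Swap in an alternative action for any step that has already failed.
        choice = FALLBACKS.get(step, step) if step in failed else step
        if choice not in completed:
            remaining.append(choice)
    return remaining


def stub_executor(action: str) -> bool:
    return action != "click_export"  # simulate one element that cannot be clicked
```

With these stubs, run_task("export report", stub_planner, stub_executor) fails once on click_export, obtains an updated plan that uses the alternative action, and finishes without asking the user. The max_replans bound is a design choice of this sketch to guarantee termination.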

Benefits and Applications

By integrating prior user session information, the task execution system creates tailored session plans that increase the precision of actions performed on interactive interfaces. This leads to fewer errors and more efficient task completion. The use of grounding information helps the system identify and interact with the correct elements on an interface, further enhancing accuracy. The system's ability to self-correct and adapt to different interfaces makes it a versatile tool for automating complex tasks without user intervention.
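
One way to picture grounding is as a matching step between a described target and the elements present on the interface. The sketch below is an assumption-laden illustration: the Element schema, the ground and click_point functions, and the 0.6 threshold are hypothetical, and plain string similarity from Python's difflib stands in for the visual model the application describes.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Element:
    """An interface element with a visible label and a bounding box (hypothetical)."""
    label: str
    box: tuple[int, int, int, int]  # x, y, width, height in pixels


def ground(target: str, elements: list[Element],
           threshold: float = 0.6) -> Optional[Element]:
    """Return the element whose label best matches the target, or None if no label is close enough."""
    def similarity(element: Element) -> float:
        return SequenceMatcher(None, target.lower(), element.label.lower()).ratio()

    best = max(elements, key=similarity, default=None)
    if best is None or similarity(best) < threshold:
        return None  # refusing to act is safer than clicking the wrong element
    return best


def click_point(element: Element) -> tuple[int, int]:
    """Return the center of the element's bounding box as a click location."""
    x, y, width, height = element.box
    return (x + width // 2, y + height // 2)
```

Returning None below the threshold reflects the accuracy goal stated above: when no element matches well, an automated system is better served by replanning than by acting on a poor match.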

Terminology and Concepts

Key terms within this application include:

Actionable Task: An objective that can be accomplished through interactions with an interface.
Interactive Interface: A graphical user interface with elements that respond to user actions.
Session Plan: A strategy comprising actions to achieve a task across interface segments.
Action: An interaction performed on an interface, such as selecting or clicking an element.
Heatmap: A visual representation of user interactions on an interface, used as a RAG input.
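
As an illustration of the heatmap term, the data behind such a visualization can be approximated by counting interactions per grid cell. This is a minimal sketch under stated assumptions: build_heatmap, hottest_cells, and the 100-pixel cell size are hypothetical, and the application summary does not say how its heatmaps are computed.

```python
from collections import Counter


def build_heatmap(clicks: list[tuple[int, int]],
                  cell_size: int = 100) -> Counter:
    """Count interactions per grid cell, keyed by (column, row)."""
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    return Counter((x // cell_size, y // cell_size) for x, y in clicks)


def hottest_cells(heatmap: Counter, n: int = 1) -> list[tuple[int, int]]:
    """Return the n grid cells with the most recorded interactions."""
    return [cell for cell, _ in heatmap.most_common(n)]
```

For example, the clicks (10, 10), (15, 12), and (250, 90) with the default cell size fall into cells (0, 0), (0, 0), and (2, 0), so cell (0, 0) is the most heavily used region.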