US20260011061
2026-01-08
Physics
G06T11/60
A media application modifies images using a diffusion model conditioned on text and depth maps. The process begins with an initial image and user input selecting specific objects within that image; a textual request then guides how the selected objects are modified. The application generates a mask highlighting the selected objects and feeds it, together with a depth map, into the diffusion model. The output is an image that fulfills the user's textual request without altering human subjects.
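The conditioning inputs can be pictured as a single bundle assembled from the image, the user's taps, and the text request. The sketch below is illustrative only; the names (EditRequest, build_request, estimate_depth, segment_at) are assumptions, not terms from the patent, and the depth-estimation and segmentation components are left abstract.

```python
# Hypothetical sketch of the conditioning inputs described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class EditRequest:
    """Bundle of inputs the diffusion model conditions on."""
    image: np.ndarray           # initial image, HxWx3
    text_request: str           # textual modification request
    selection_mask: np.ndarray  # HxW boolean, True where objects are selected
    depth_map: np.ndarray       # HxW per-pixel depth estimate

def build_request(image, taps, text, estimate_depth, segment_at):
    """Assemble conditioning inputs from an initial image and user taps.

    `estimate_depth` and `segment_at` stand in for whatever depth-estimation
    and segmentation components an implementation might use.
    """
    mask = np.zeros(image.shape[:2], dtype=bool)
    for (x, y) in taps:
        mask |= segment_at(image, x, y)  # boolean mask of object under tap
    return EditRequest(image, text, mask, estimate_depth(image))
```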
Generative AI models are increasingly used to create images from text prompts or to modify existing images based on such prompts. These models, however, struggle to render human features accurately, particularly fine details such as fingers and faces, which often makes edited images of people look unrealistic. The invention addresses this limitation by restricting modifications to non-human elements, preserving the realism of the output image.
The method receives an initial image and user input selecting objects for modification, and generates a user-selection mask identifying those objects. The diffusion model then uses the textual request, the depth map, and the mask to produce an output image. Because the model is trained not to modify pixels depicting humans, human subjects remain unchanged while other elements are modified according to the user's request.
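At inference time, one plausible reading of this step is a single conditioned model call followed by a hard restore of human pixels. In the sketch below, diffusion_model and human_mask are hypothetical stand-ins; the patent does not specify how the no-modification guarantee is enforced, and copying original pixels back is just one simple mechanism.

```python
# Minimal sketch, assuming a callable `diffusion_model` that accepts the
# conditioning inputs and a `human_mask` marking pixels depicting people.
import numpy as np

def edit_image(req, diffusion_model, human_mask):
    """Run the conditioned diffusion model, then restore human pixels."""
    edited = diffusion_model(
        image=req.image,
        text=req.text_request,
        mask=req.selection_mask,
        depth=req.depth_map,
    )
    keep = human_mask.astype(bool)[..., None]  # HxWx1, broadcast over RGB
    return np.where(keep, req.image, edited)   # humans unchanged, rest edited
```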
Training data for the diffusion model pairs initial images and selected objects with textual requests and depth maps. The model is trained to generate images that satisfy these conditions without altering pixels that depict humans: output images are generated and compared against ground-truth images iteratively until a predefined accuracy threshold is met. Training also incorporates segmentation masks and classifier-free guidance to maintain structural integrity.
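A training loop matching this description might look as follows. The model, data loader, optimizer, metric, and loss terms are all assumptions rather than details from the patent, and classifier-free guidance is approximated in the usual way by randomly dropping the text condition during training.

```python
# Hedged PyTorch sketch of the iterative train-and-compare loop.
import random
import torch

def train(model, loader, metric, opt, threshold=0.95, p_uncond=0.1):
    """Iterate until generated images match ground truth closely enough."""
    accuracy = 0.0
    while accuracy < threshold:            # predefined accuracy threshold
        scores = []
        for image, text, depth, mask, human_mask, truth in loader:
            if random.random() < p_uncond:
                text = [""] * len(text)    # drop text condition for CFG
            out = model(image, text, depth, mask)
            # reconstruction loss plus a penalty on changing human pixels
            loss = ((out - truth) ** 2).mean() \
                 + ((out - image) ** 2 * human_mask).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            scores.append(metric(out, truth).item())
        accuracy = sum(scores) / len(scores)
    return model
```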
The technology is implemented as a computer-readable medium containing instructions for executing the method. User interactions, such as tapping on the image or identifying objects by text, select the objects to modify. The system can also perform object recognition to identify human elements and apply preserving masks that block their modification, so the output image retains the overall structure and human subjects of the initial image while satisfying the user's modification request.
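The preserving mask could be derived from any person detector or segmenter. The sketch below assumes a hypothetical detect_people callable returning boolean per-person masks and adds a small dilation so the edges around people are also protected; none of these details come from the patent.

```python
# Illustrative sketch of deriving a preserving mask via object recognition.
import numpy as np

def preserving_mask(image, detect_people, dilate=3):
    """Union of detected person segments, dilated to protect their edges."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    for seg in detect_people(image):  # each seg: HxW boolean person mask
        mask |= seg
    # cheap binary dilation with a square structuring element
    padded = np.pad(mask, dilate)
    out = np.zeros_like(mask)
    for dy in range(-dilate, dilate + 1):
        for dx in range(-dilate, dilate + 1):
            out |= padded[dilate + dy : dilate + dy + h,
                          dilate + dx : dilate + dx + w]
    return out
```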