US20260011061
2026-01-08
Physics
G06T11/60
A media application modifies images using a diffusion model conditioned on text and depth maps. The process begins with an initial image and user input selecting specific objects within that image; a textual request then guides how the selected objects are modified. The application generates a mask highlighting the selected objects and feeds it, together with a depth map, into the diffusion model. The output is an image that fulfills the user's textual request without altering human subjects.
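The conditioning inputs can be pictured as a single bundle assembled from the image, the user's taps, and the text request. The sketch below is illustrative only; the names (EditRequest, build_request, estimate_depth, segment_at) are assumptions, not terms from the patent, and the depth-estimation and segmentation components are left abstract.

```python
# Hypothetical sketch of the conditioning inputs described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class EditRequest:
    """Bundle of inputs the diffusion model conditions on."""
    image: np.ndarray           # initial image, HxWx3
    text_request: str           # textual modification request
    selection_mask: np.ndarray  # HxW boolean, True where objects are selected
    depth_map: np.ndarray       # HxW per-pixel depth estimate

def build_request(image, taps, text, estimate_depth, segment_at):
    """Assemble conditioning inputs from an initial image and user taps.

    `estimate_depth` and `segment_at` stand in for whatever depth-estimation
    and segmentation components an implementation might use.
    """
    mask = np.zeros(image.shape[:2], dtype=bool)
    for (x, y) in taps:
        mask |= segment_at(image, x, y)  # boolean mask of object under tap
    return EditRequest(image, text, mask, estimate_depth(image))
```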
Generative AI models are increasingly used to create images from text prompts or to modify existing images based on such prompts. These models, however, struggle to render human features accurately, particularly fine details such as fingers and faces, which often makes edited images of people look unrealistic. The invention addresses this limitation by restricting modifications to non-human elements, preserving the realism of the output image.
The method receives an initial image and user input selecting objects for modification, and generates a user-selection mask identifying those objects. The diffusion model then uses the textual request, the depth map, and the mask to produce an output image. Because the model is trained not to modify pixels depicting humans, human subjects remain unchanged while other elements are modified according to the user's request.
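At inference time, one plausible reading of this step is a single conditioned model call followed by a hard restore of human pixels. In the sketch below, diffusion_model and human_mask are hypothetical stand-ins; the patent does not specify how the no-modification guarantee is enforced, and copying original pixels back is just one simple mechanism.

```python
# Minimal sketch, assuming a callable `diffusion_model` that accepts the
# conditioning inputs and a `human_mask` marking pixels depicting people.
import numpy as np

def edit_image(req, diffusion_model, human_mask):
    """Run the conditioned diffusion model, then restore human pixels."""
    edited = diffusion_model(
        image=req.image,
        text=req.text_request,
        mask=req.selection_mask,
        depth=req.depth_map,
    )
    keep = human_mask.astype(bool)[..., None]  # HxWx1, broadcast over RGB
    return np.where(keep, req.image, edited)   # humans unchanged, rest edited
```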
Training data for the diffusion model pairs initial images and selected objects with textual requests and depth maps. The model is trained to generate images that satisfy these conditions without altering pixels that depict humans: output images are generated and compared against ground-truth images iteratively until a predefined accuracy threshold is met. Training also incorporates segmentation masks and classifier-free guidance to maintain structural integrity.
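A training loop matching this description might look as follows. The model, data loader, optimizer, metric, and loss terms are all assumptions rather than details from the patent, and classifier-free guidance is approximated in the usual way by randomly dropping the text condition during training.

```python
# Hedged PyTorch sketch of the iterative train-and-compare loop.
import random
import torch

def train(model, loader, metric, opt, threshold=0.95, p_uncond=0.1):
    """Iterate until generated images match ground truth closely enough."""
    accuracy = 0.0
    while accuracy < threshold:            # predefined accuracy threshold
        scores = []
        for image, text, depth, mask, human_mask, truth in loader:
            if random.random() < p_uncond:
                text = [""] * len(text)    # drop text condition for CFG
            out = model(image, text, depth, mask)
            # reconstruction loss plus a penalty on changing human pixels
            loss = ((out - truth) ** 2).mean() \
                 + ((out - image) ** 2 * human_mask).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            scores.append(metric(out, truth).item())
        accuracy = sum(scores) / len(scores)
    return model
```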
The technology is implemented as a computer-readable medium containing instructions for executing the method. User interactions, such as tapping on the image or identifying objects by text, select the objects to modify. The system can also perform object recognition to identify human elements and apply preserving masks that block their modification, so the output image retains the overall structure and human subjects of the initial image while satisfying the user's modification request.
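The preserving mask could be derived from any person detector or segmenter. The sketch below assumes a hypothetical detect_people callable returning boolean per-person masks and adds a small dilation so the edges around people are also protected; none of these details come from the patent.

```python
# Illustrative sketch of deriving a preserving mask via object recognition.
import numpy as np

def preserving_mask(image, detect_people, dilate=3):
    """Union of detected person segments, dilated to protect their edges."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    for seg in detect_people(image):  # each seg: HxW boolean person mask
        mask |= seg
    # cheap binary dilation with a square structuring element
    padded = np.pad(mask, dilate)
    out = np.zeros_like(mask)
    for dy in range(-dilate, dilate + 1):
        for dx in range(-dilate, dilate + 1):
            out |= padded[dilate + dy : dilate + dy + h,
                          dilate + dx : dilate + dx + w]
    return out
```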