Invention Title:

Style-Aware Drag-and-Drop Insertion of Subjects into Images

Publication number:

US20250378609

Publication date:
Section:

Physics

Class:

G06T11/60

Inventors:

Applicant:

Smart overview of the Invention

The patent application describes a method for inserting a subject, such as a person or an animal, from one image into another so that the inserted subject matches the style of the target image. The process preserves the subject's original pose and identity while integrating it into the new environment with appropriate shadows and occlusions. The method fine-tunes a diffusion model to recover the subject's image conditioned on a learned auxiliary input description, then imposes style information from the target image onto the model to produce a style-translated image of the subject. The translated subject is then seamlessly inserted into the target image by a subject insertion model.
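As a minimal sketch of the fine-tuning step described above, the snippet below uses a small toy denoiser to stand in for the diffusion model and a learned embedding to stand in for the auxiliary input that conditions reconstruction of the subject from noise. All class names, shapes, and hyperparameters here are illustrative assumptions, not details taken from the application.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Predicts the noise added to an image, conditioned on an auxiliary embedding."""
    def __init__(self, channels=3, cond_dim=64):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, channels)
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, noisy_image, aux_embedding):
        # Broadcast the auxiliary conditioning over spatial positions.
        cond = self.cond_proj(aux_embedding)[:, :, None, None]
        return self.net(noisy_image + cond)

subject_image = torch.rand(1, 3, 64, 64)          # the received subject image
model = ToyDenoiser()
aux_embedding = nn.Parameter(torch.randn(1, 64))  # learned auxiliary input

optimizer = torch.optim.Adam(list(model.parameters()) + [aux_embedding], lr=1e-4)

for step in range(100):
    noise = torch.randn_like(subject_image)
    t = torch.rand(1, 1, 1, 1)                    # toy "noise level"
    noisy = (1 - t) * subject_image + t * noise   # noised version of the subject
    pred_noise = model(noisy, aux_embedding)
    loss = nn.functional.mse_loss(pred_noise, noise)  # denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because both the model parameters and the auxiliary embedding are optimized, the embedding can absorb a compact description of the subject, which is the role the application assigns to the learned auxiliary input.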

Current machine learning models struggle to translate the style of a specific subject within an image while maintaining that subject's identity, and doing so often incurs high computational cost. Integrating translated subjects into different backgrounds with accurate environmental effects is also difficult. The proposed method addresses these challenges by using a diffusion model and a learned auxiliary input to achieve accurate style translation and integration at lower computational cost than existing approaches such as inpainting, which often produce poor-quality results.

The method involves several steps: receiving images of the subject and the target, fine-tuning a diffusion model to predict the subject's image from a noisy version of it, and learning an auxiliary input that conditions the model's output. The fine-tuned model, with style information imposed from the target image, generates a second image of the subject rendered in the target style. Finally, a subject insertion model integrates the subject into the target image environment. Additionally, a method for training the subject insertion model is described, in which subjects are removed from stylized images to create training data and the model is fine-tuned on filtered images.
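The following sketch illustrates, under stated assumptions, how training data for the subject insertion model could be assembled as the paragraph above describes: a subject is removed from a stylized image (here a simple mask-based cutout stands in for a learned removal or inpainting step), low-quality samples are filtered out, and the remaining pairs form (background, subject) inputs whose target is the original stylized image. All function names are hypothetical.

```python
import torch

def remove_subject(stylized_image: torch.Tensor, subject_mask: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for subject removal: blank out the masked region.
    A real system would reconstruct the background behind the subject."""
    return stylized_image * (1.0 - subject_mask)

def quality_ok(image: torch.Tensor) -> bool:
    """Placeholder filter; the application describes filtering images before fine-tuning."""
    return torch.isfinite(image).all().item()

def build_insertion_dataset(stylized_images, subject_masks):
    dataset = []
    for img, mask in zip(stylized_images, subject_masks):
        background = remove_subject(img, mask)
        subject_crop = img * mask
        if quality_ok(background):
            # Input: background plus cut-out subject; target: the original stylized
            # image, which already contains consistent shadows and occlusions.
            dataset.append(((background, subject_crop), img))
    return dataset

# Example usage with random tensors standing in for real images and masks.
images = [torch.rand(3, 64, 64) for _ in range(4)]
masks = [(torch.rand(1, 64, 64) > 0.8).float() for _ in range(4)]
train_pairs = build_insertion_dataset(images, masks)
```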

A non-transitory computer-readable medium is provided, storing program instructions executable by a processor to perform the described methods. A system is also described, including a processor and a computer-readable medium with instructions for executing these methods. These components facilitate efficient and accurate identity-preserving style translation of subjects, leveraging both auxiliary descriptive input and fine-tuning of diffusion model parameters to maintain subject identity while applying the target image style.

Using a diffusion model allows training data to be generated by adding noise to the subject image and running inference iterations, which provides rich loss information during training. The auxiliary inputs that are learned may vary in nature, which helps the model represent both generic and subtle aspects of the subject's identity. This approach reduces computational cost while ensuring accurate style translation and integration of subjects into target images.
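A minimal sketch of this idea, assuming the ToyDenoiser and aux_embedding from the earlier fine-tuning sketch, is shown below: the subject image is noised and then denoised over several inference iterations, and a loss term is accumulated at every step rather than only at the final output. The update rule and step counts are illustrative assumptions, not the application's procedure.

```python
import torch

def iterative_denoising_loss(model, subject_image, aux_embedding, num_steps=4):
    # Noised starting point for inference (toy mixing of image and noise).
    noise = torch.randn_like(subject_image)
    x = 0.5 * subject_image + 0.5 * noise
    total_loss = 0.0
    for _ in range(num_steps):
        pred_noise = model(x, aux_embedding)
        x = x - 0.25 * pred_noise  # toy denoising update
        # Accumulate a loss term at each inference iteration, giving a richer
        # training signal than comparing only the final reconstruction.
        total_loss = total_loss + torch.nn.functional.mse_loss(x, subject_image)
    return total_loss / num_steps
```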