US20250308116
2025-10-02
Physics
G06T11/60
The patent application introduces a method and system for generating customized textual images using diffusion models. It addresses the limitations of traditional diffusion-based methods in rendering text with complex font attributes. The process begins by receiving an input image, a textual prompt, and several control parameters. From these inputs, a character mask and a conditional mask are extracted; together they guide the generation of an accurate customized textual image.
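As a minimal sketch of the mask-extraction step described above: assuming the control parameters include a target text string, a font file, and a bounding box, the character mask can be rasterized from the glyphs while the conditional mask marks the editable region. The function name extract_masks and its parameters are hypothetical illustrations, not names from the patent.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def extract_masks(image_size, text, font_path, bbox, font_size=64):
    """Return (character_mask, conditional_mask) as binary numpy arrays.

    character_mask   - 1 where glyph pixels should appear
    conditional_mask - 1 over the region the diffusion model may edit
    """
    w, h = image_size
    # Character mask: rasterize the target text at the bounding-box origin.
    char_img = Image.new("L", (w, h), 0)
    draw = ImageDraw.Draw(char_img)
    font = ImageFont.truetype(font_path, font_size)
    draw.text((bbox[0], bbox[1]), text, fill=255, font=font)
    character_mask = (np.asarray(char_img) > 127).astype(np.uint8)

    # Conditional mask: the full editable region given by the bounding box.
    conditional_mask = np.zeros((h, w), dtype=np.uint8)
    x0, y0, x1, y1 = bbox
    conditional_mask[y0:y1, x0:x1] = 1
    return character_mask, conditional_mask
```

In this reading, the character mask pins down glyph shapes (and hence font attributes), while the coarser conditional mask localizes where the model is permitted to modify the input image.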
This innovation falls within the field of image processing, specifically the generation of customized textual images using diffusion models. The method improves the quality of text rendering in images, which is important for industries such as entertainment, advertising, and education. By automating the creation of high-quality text images, it reduces the need for professional design skills and iterative manual revision.
Text-to-image synthesis has advanced significantly with the advent of diffusion models, which offer advantages over earlier approaches such as GANs. However, existing models often lack fine-grained control over text generation, particularly for complex fonts and small text sizes. Prior works such as GlyphDraw and TextDiffuser have made progress in this area but still struggle to render dense and small text accurately.
The proposed method involves several key steps: receiving input data including an image and a textual prompt, generating character and conditional masks based on the control parameters, and using these masks to guide a diffusion model. The model is initialized with random Gaussian noise and iteratively refines an intermediate image into a latent vector image. Finally, a trained consistency model generates the customized textual image from this latent vector image.
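A schematic sketch of that generation loop, under stated assumptions: denoise_step and consistency_decode are hypothetical stand-ins for the patent's mask-conditioned diffusion model and trained consistency model, and the latent shape and step count are illustrative.

```python
import torch

@torch.no_grad()
def generate(denoise_step, consistency_decode, char_mask, cond_mask,
             prompt_embedding, latent_shape, num_steps=50):
    # Initialize the intermediate latent with random Gaussian noise.
    z = torch.randn(latent_shape)
    for t in reversed(range(num_steps)):
        # Each step refines the latent, conditioned on the text prompt
        # and the two masks that localize and shape the glyph region.
        z = denoise_step(z, t, prompt_embedding, char_mask, cond_mask)
    # A trained consistency model maps the final latent vector image
    # to the customized textual image.
    return consistency_decode(z)
```

The design point this sketch highlights is the split of roles: the iterative diffusion loop produces the latent vector image, and the consistency model performs the final decoding into pixel space in a single pass.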
The system is implemented through hardware processors configured to execute programmed instructions stored in memory. These processors perform tasks such as generating the masks and refining images through the diffusion model. A computer program product is also provided, enabling devices to perform these operations autonomously. This approach gives precise control over font attributes and improves the clarity of text generated within images.