Image-to-Image Models

Explore the best AI models for image-to-image generation and editing

15 models found

ControlNet

Lvmin Zhang|1.4B

ControlNet is a conditional control framework for Stable Diffusion models that enables precise structural guidance during image generation through conditioning inputs such as edge maps, depth maps, human pose skeletons, segmentation masks, and normal maps. Developed by Lvmin Zhang and Maneesh Agrawala at Stanford University, ControlNet attaches a trainable copy of the diffusion model's encoder blocks to the frozen base model through zero-initialized convolutions, allowing it to learn spatial conditioning without altering the original model's capabilities. This architecture preserves the base model's generation quality while adding fine-grained control over the composition, structure, and spatial layout of generated images. ControlNet supports multiple conditioning types simultaneously, enabling complex multi-condition workflows where users combine pose, depth, and edge information to guide generation precisely. By solving the long-standing problem of maintaining consistent spatial structure across generated images, it has become an essential tool for professional artists and designers who need exact control over character poses, architectural layouts, product placements, and scene compositions. ControlNet is open-source and available on Hugging Face with pre-trained models for various Stable Diffusion versions including SD 1.5 and SDXL, and it integrates with ComfyUI and Automatic1111. Concept artists, character designers, architectural visualizers, fashion designers, and animation studios rely on it in production workflows, and its influence has extended beyond Stable Diffusion, inspiring similar control mechanisms in FLUX.1 and other modern image generation models.
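
The zero-initialization trick is what makes the trainable copy safe to bolt onto a frozen model. A toy numpy sketch (not the actual implementation, and with stand-in feature maps) shows why: with a zero-initialized "zero convolution" on the control branch, the combined output is exactly the base model's output at the start of training.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w, b):
    # 1x1 convolution as a channel-mixing matrix multiply: (C_out, C_in) over (C_in, H, W)
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

# Stand-in for a frozen base-encoder block's output feature map.
x = rng.standard_normal((8, 16, 16))          # (channels, H, W)
base_out = x

# The trainable copy processes the conditioning signal (e.g. an edge map),
# then feeds back through a ZERO-initialized 1x1 conv ("zero convolution").
cond_feat = rng.standard_normal((8, 16, 16))
zero_w = np.zeros((8, 8))
zero_b = np.zeros(8)

controlled = base_out + conv1x1(cond_feat, zero_w, zero_b)

# At initialization the control branch contributes exactly nothing, so the
# base model's behavior is untouched until training moves the weights.
assert np.allclose(controlled, base_out)
```

As the zero convolutions learn non-zero weights, the control branch gradually injects spatial guidance while the frozen branch keeps its generation quality.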

Open Source
4.8

InstantID

InstantX Team|N/A

InstantID is a zero-shot identity-preserving image generation framework developed by InstantX Team that can generate images of a specific person in various styles, poses, and contexts using only a single reference photograph. Unlike traditional face-swapping or personalization methods that require multiple reference images or time-consuming fine-tuning, InstantID achieves accurate identity preservation from just one facial photograph through an innovative architecture combining a face encoder, IP-Adapter, and ControlNet for facial landmark guidance. The system extracts detailed facial identity features from the reference image and injects them into the generation process, ensuring that the generated person maintains recognizable facial features, proportions, and characteristics across diverse output scenarios. InstantID supports various creative applications including generating portraits in different artistic styles, placing the person in imagined scenes or contexts, creating profile pictures and avatars, and producing marketing materials featuring consistent character representations. The model works with Stable Diffusion XL as its base and is open-source, available on GitHub and Hugging Face for local deployment. It integrates with ComfyUI through community-developed nodes and can be accessed through cloud APIs. Portrait photographers, social media content creators, marketing teams creating personalized campaigns, game developers designing character variants, and digital artists exploring identity-based creative work all use InstantID. The framework has influenced subsequent identity-preservation models and remains one of the most effective solutions for single-image identity transfer in the open-source ecosystem.
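
Identity preservation in models like InstantID is typically evaluated by cosine similarity between face-recognition embeddings (e.g. ArcFace vectors from InsightFace) of the reference photo and each generated image. A toy check with made-up stand-in embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings (real face-recognition vectors are 512-dimensional).
reference = np.array([0.2, 0.9, 0.1, 0.4])
generated = np.array([0.22, 0.88, 0.12, 0.41])   # close -> identity preserved
unrelated = np.array([-0.7, 0.1, 0.9, -0.2])     # far -> different person

assert cosine_similarity(reference, generated) > 0.99
assert cosine_similarity(reference, unrelated) < 0.5
```

A similarity near 1.0 across styles and poses is exactly what "zero-shot identity preservation" promises.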

Open Source
4.7

IP-Adapter

Tencent|22M

IP-Adapter is an image prompt adapter developed by Tencent AI Lab that enables image-guided generation for text-to-image diffusion models without requiring any fine-tuning of the base model. The adapter works by extracting visual features from reference images using a CLIP image encoder and injecting these features into the diffusion model's cross-attention layers through a decoupled attention mechanism. This allows users to provide reference images as visual prompts alongside text prompts, guiding the generation process to produce images that share stylistic elements, compositional features, or visual characteristics with the reference while still following the text description. IP-Adapter supports multiple modes of operation including style transfer, where the generated image adopts the artistic style of the reference, and content transfer, where specific subjects or elements from the reference appear in the output. The adapter is lightweight, adding minimal computational overhead to the base model's inference process. It can be combined with other control mechanisms like ControlNet for multi-modal conditioning, enabling sophisticated workflows where pose, style, and content can each be controlled independently. IP-Adapter is open-source and available for various Stable Diffusion versions including SD 1.5 and SDXL. It integrates with ComfyUI and Automatic1111 through community extensions. Digital artists, product designers, brand managers, and content creators who need to maintain visual consistency across generated images or transfer specific aesthetic qualities from reference material particularly benefit from IP-Adapter's capabilities.
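
The decoupled attention mechanism can be sketched in a few lines of numpy. This is a simplification (the released adapter uses separate learned key/value projections per branch), but it shows the core idea: text and image features get independent attention computations whose outputs are summed, with the image branch weighted by a scale knob analogous to the `ip_adapter_scale` setting in common integrations.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no masking).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 64
q = rng.standard_normal((16, d))         # queries from UNet latent tokens
text_kv = rng.standard_normal((77, d))   # stand-in CLIP text features
image_kv = rng.standard_normal((4, d))   # stand-in CLIP image features

# Decoupled cross-attention: separate attention over each modality, summed.
scale = 0.6
out = attention(q, text_kv, text_kv) + scale * attention(q, image_kv, image_kv)

# scale = 0 recovers the original text-only cross-attention exactly,
# which is why the adapter leaves the base model's behavior intact.
text_only = attention(q, text_kv, text_kv)
assert np.allclose(text_only + 0.0 * attention(q, image_kv, image_kv), text_only)
```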

Open Source
4.6

IP-Adapter FaceID

Tencent|22M (adapter)

IP-Adapter FaceID is a specialized adapter module developed by Tencent AI Lab that injects facial identity information into the diffusion image generation process, enabling the creation of new images that faithfully preserve a specific person's facial features. Unlike traditional face-swapping approaches, IP-Adapter FaceID extracts face recognition feature vectors from the InsightFace library and feeds them into the diffusion model through cross-attention layers, allowing the model to generate diverse scenes, styles, and compositions while maintaining consistent facial identity. With only approximately 22 million adapter parameters layered on top of existing Stable Diffusion models, FaceID achieves remarkable identity preservation without requiring per-subject fine-tuning or multiple reference images. A single clear face photo is sufficient to generate the person in various artistic styles, different clothing, diverse environments, and novel poses. The adapter supports both SDXL and SD 1.5 base models and can be combined with other ControlNet adapters for additional control over pose, depth, and composition. IP-Adapter FaceID Plus variants incorporate additional CLIP image features alongside face embeddings for improved likeness and detail preservation. Released under the Apache 2.0 license, the model is fully open source and widely integrated into ComfyUI workflows and the Diffusers library. Common applications include personalized avatar creation, custom portrait generation in various artistic styles, character consistency in storytelling and comic creation, personalized marketing content, and social media content creation where maintaining a recognizable likeness across multiple generated images is essential.

Open Source
4.5

FLUX Redux

Black Forest Labs|12B

FLUX Redux is the specialized image variation model within the FLUX model family developed by Black Forest Labs, designed for generating creative variations of reference images while preserving their core style, color palette, and compositional essence. Built on the 12-billion parameter Diffusion Transformer architecture, FLUX Redux takes a reference image as input and produces new images that maintain the visual DNA of the original while introducing controlled variations in content, composition, or perspective. The model captures high-level stylistic attributes including artistic technique, color harmony, lighting mood, and textural qualities, then applies them to generate fresh compositions that feel aesthetically consistent with the source material. FLUX Redux can be combined with text prompts to guide the direction of variation, allowing users to request specific changes like 'same style but with a mountain landscape' or 'similar color palette with an urban scene.' This makes it particularly powerful for brand consistency workflows where marketing teams need multiple visuals sharing a unified aesthetic. The model also supports image-to-image workflows where the reference serves as a strong stylistic prior while text prompts define new content. As a proprietary model, FLUX Redux is accessible through Black Forest Labs' API and partner platforms including Replicate and fal.ai with usage-based pricing. Key applications include generating cohesive visual content series for social media campaigns, creating style-consistent variations for A/B testing in advertising, producing product imagery in consistent brand aesthetics, and creative exploration where artists iterate on a visual direction without starting from scratch.

Proprietary
4.6

GFPGAN

Tencent ARC|N/A

GFPGAN is a practical face restoration algorithm developed by Tencent ARC that leverages generative facial priors embedded in a pre-trained StyleGAN2 model to restore severely degraded face images with remarkable quality. First released in December 2021, GFPGAN addresses the challenging problem of blind face restoration where input images may suffer from unknown combinations of low resolution, blur, noise, compression artifacts, and other forms of degradation. The model's architecture combines a degradation removal module with a StyleGAN2-based generative prior, using a novel channel-split spatial feature transform layer that balances fidelity to the original face with the high-quality facial details provided by the generative model. This approach allows GFPGAN to restore fine facial details including skin textures, eye clarity, hair strands, and tooth definition that are completely lost in the degraded input. The model processes faces through a U-Net encoder that extracts multi-resolution features from the degraded image, which then modulate the StyleGAN2 decoder's feature maps to produce a restored output that preserves the original identity while dramatically enhancing quality. GFPGAN excels in old photo restoration, enhancing low-resolution surveillance footage, improving video call quality, recovering damaged family photographs, and preparing low-quality source material for professional use. The model is open source under Apache 2.0, available on Hugging Face and Replicate, and has become a foundational component integrated into numerous creative AI tools and pipelines. Its ability to handle real-world degradation patterns rather than just synthetic corruption makes it particularly valuable for practical restoration tasks encountered by photographers, archivists, and content creators.
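
The channel-split spatial feature transform can be illustrated with a toy numpy sketch (stand-in feature maps, not the trained layers): half the channels pass through untouched, anchoring fidelity to the input face, while the other half is spatially modulated by scale and shift maps derived from the degraded image, pulling in the StyleGAN2 prior's detail.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in feature map, plus scale/shift maps that would be predicted
# from the degraded input to modulate the generative prior's features.
feat = rng.standard_normal((8, 16, 16))
scale = rng.standard_normal((4, 16, 16))
shift = rng.standard_normal((4, 16, 16))

def channel_split_sft(x, scale, shift):
    # Channel-split spatial feature transform (toy version):
    # one half of the channels is identity (-> fidelity to the input),
    # the other half is spatially modulated (-> generative detail).
    identity, modulated = np.split(x, 2, axis=0)
    return np.concatenate([identity, modulated * scale + shift], axis=0)

out = channel_split_sft(feat, scale, shift)
assert out.shape == feat.shape
assert np.allclose(out[:4], feat[:4])   # identity half preserved exactly
```

The split ratio is effectively the fidelity/realism trade-off the paper describes.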

Open Source
4.5

FaceSwap ROOP

s0md3v|N/A

FaceSwap ROOP is an open-source face swapping tool created by s0md3v that enables one-click face replacement in images and videos using InsightFace detection combined with the inswapper neural network. Released in May 2023, the tool gained popularity for its simplicity, allowing users to swap faces with just a single source image and a target media file without any dataset preparation or model training. The architecture leverages InsightFace for accurate facial detection and landmark recognition, while the inswapper model handles the actual face replacement by mapping facial features from the source onto the target while preserving natural lighting, skin tone, and expression characteristics. ROOP operates as a hybrid system combining traditional computer vision techniques with deep learning models to achieve seamless blending between swapped faces and their surrounding context. The tool supports both image and video processing, handling frame-by-frame face replacement in video content with temporal consistency. Common use cases include creative content production, film and video post-production, social media entertainment, privacy protection through face anonymization, and educational demonstrations of AI capabilities. Available under the MIT license, ROOP can be run locally or accessed through cloud platforms like Replicate and fal.ai. The tool includes built-in NSFW filtering and ethical usage guidelines to prevent misuse. Its combination of ease of use, open-source accessibility, and zero training requirement makes it one of the most widely adopted face swapping tools in the AI community.
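
The "seamless blending" step common to face-swap pipelines like ROOP can be sketched as a soft-masked paste-back: the swapped crop is composited into the frame through a feathered mask so edges transition smoothly rather than leaving a hard seam. Toy grayscale numbers below; the real tool uses detected landmarks and a blurred face mask.

```python
import numpy as np

target = np.full((6, 6), 0.2)    # grayscale target frame
swapped = np.full((4, 4), 0.9)   # swapped face crop

mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0             # hard core of the face...
# ...feathered at the border (stand-in for a Gaussian-blurred mask)
soft = mask.copy()
soft[0, 1:3] = soft[3, 1:3] = soft[1:3, 0] = soft[1:3, 3] = 0.5

# Alpha-composite the crop back into the frame through the soft mask.
frame = target.copy()
region = frame[1:5, 1:5]
frame[1:5, 1:5] = soft * swapped + (1 - soft) * region

assert np.isclose(frame[2, 2], 0.9)    # face core fully replaced
assert np.isclose(frame[0, 0], 0.2)    # pixels outside the crop untouched
assert np.isclose(frame[1, 2], 0.55)   # feathered edge is a 50/50 blend
```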

Open Source
4.3

IC-Light

Lvmin Zhang|1B+

IC-Light (short for Imposing Consistent Light) is an AI relighting model developed by Lvmin Zhang, the creator of ControlNet, that manipulates and transforms lighting conditions in photographs with remarkable realism. Built on a Stable Diffusion backbone with specialized lighting conditioning, the model with over one billion parameters can take any photograph of an object or person and completely alter the light source direction, color temperature, intensity, and ambient lighting while maintaining photorealistic shadows, highlights, and surface reflections. IC-Light operates in two distinct modes: foreground relighting where the subject is extracted and relit independently, and background-compatible relighting where the lighting is adjusted to match a new background environment. The model understands physical light behavior including specular reflections, subsurface scattering on skin, metallic surfaces, and transparent materials, producing results that respect real-world optical properties. IC-Light accepts text descriptions or reference images to define the target lighting setup, offering intuitive control over the final appearance. Released under the Apache 2.0 license, the model is fully open source and has been integrated into ComfyUI with dedicated workflow nodes. Professional photographers, product photographers, digital artists, and e-commerce teams use IC-Light for correcting unfavorable lighting in existing photos, creating studio-quality lighting from casual snapshots, matching product lighting across catalog images, generating dramatic cinematic lighting for creative projects, and preparing composited images with consistent illumination across elements.
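
IC-Light itself is a diffusion model, not a renderer, but the physical relationship its outputs are expected to respect is simple Lambertian shading: brightness follows the dot product between the surface normal and the light direction. A toy numpy example of what "changing the light source direction" means physically:

```python
import numpy as np

# Toy Lambertian shading: intensity = albedo * max(0, normal . light_dir).
# This is NOT how IC-Light computes anything internally; it is the optics
# that plausible relighting must reproduce.
normals = np.zeros((4, 4, 3))
normals[..., 2] = 1.0                  # flat surface facing the camera
normals[0, 0] = [0.7071, 0.0, 0.7071]  # one patch tilted toward the right

def relight(albedo, normals, light_dir):
    light_dir = np.asarray(light_dir, dtype=float)
    light_dir /= np.linalg.norm(light_dir)
    shading = np.clip(normals @ light_dir, 0.0, None)
    return albedo * shading

albedo = np.full((4, 4), 0.8)
front_lit = relight(albedo, normals, [0, 0, 1])  # light from the camera
side_lit = relight(albedo, normals, [1, 0, 0])   # light from the right

assert np.isclose(front_lit[1, 1], 0.8)  # facing the light: full albedo
assert np.isclose(side_lit[1, 1], 0.0)   # grazing light: in shadow
```

Moving the light vector darkens camera-facing patches and brightens tilted ones, the same qualitative behavior a relit IC-Light output should show.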

Open Source
4.5

PhotoMaker

Tencent|N/A

PhotoMaker is a personalized photo generation model developed by TencentARC that creates realistic and diverse human portraits from reference images using a novel Stacked ID Embedding approach. Unlike traditional fine-tuning methods such as DreamBooth that require lengthy training processes, PhotoMaker achieves identity-preserving generation in seconds by extracting and stacking embeddings from multiple reference photos through CLIP and specialized identity encoders. Built on the SDXL pipeline, the model injects identity representations via modified cross-attention layers, enabling high-quality outputs that maintain facial features while allowing creative freedom in style, pose, and setting variations. PhotoMaker supports identity mixing, allowing users to blend features from multiple people to create unique composite faces with adjustable contribution weights. The model excels in personalized portrait generation, identity-consistent story illustration for comics and visual novels, virtual try-on applications, and advertising content creation. PhotoMaker V2 brought significant improvements in identity preservation accuracy, natural generation quality, and text alignment, particularly in challenging scenarios like extreme pose changes and age transformations. As an open-source model released under the Apache 2.0 license, PhotoMaker is freely available on Hugging Face with community integrations in ComfyUI and other popular creative tools. It requires only one to four reference images to produce compelling results, making it one of the most accessible and efficient identity-preserving generation solutions available for both individual creators and professional production workflows.
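
Roughly, the Stacked ID Embedding idea is that identity embeddings from several reference photos are stacked into a short token sequence that takes the place of a trigger-word embedding in the prompt. The numpy sketch below is a loose illustration with stand-in vectors (the real model uses learned encoders and fusion layers), intended only to show the shape of the mechanism:

```python
import numpy as np

rng = np.random.default_rng(3)

d = 64
# Identity embeddings extracted from 1-4 reference photos (stand-ins here).
ref_embeddings = [rng.standard_normal(d) for _ in range(4)]
stacked = np.stack(ref_embeddings)   # (num_refs, d): stacked, not averaged

# Stand-in prompt embedding, e.g. "a photo of a man img", with a trigger
# token (hypothetical position 6) that the ID tokens will replace.
text_tokens = rng.standard_normal((77, d))
trigger_pos = 6

conditioned = np.concatenate(
    [text_tokens[:trigger_pos], stacked, text_tokens[trigger_pos + 1:]]
)

# The sequence grows by (num_refs - 1) tokens; cross-attention then sees
# identity information wherever the trigger word appeared.
assert conditioned.shape == (77 - 1 + 4, d)
```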

Open Source
4.5

Img2Img SDXL

Stability AI|6.6B

Img2Img SDXL is the image-to-image pipeline of Stability AI's Stable Diffusion XL model, enabling users to transform existing images through style conversion, enhancement, and creative modification while maintaining structural coherence with the original input. Built on SDXL's latent diffusion architecture with dual text encoders (about 6.6 billion parameters across the base model and refiner), the img2img pipeline takes an input image along with a text prompt and denoising strength parameter to produce variations ranging from subtle refinements to dramatic transformations. The denoising strength controls how much the model departs from the original image, with lower values preserving more of the source composition. The SDXL base produces high-resolution 1024x1024 outputs natively without the quality degradation seen in earlier Stable Diffusion versions. Key capabilities include artistic style transfer where photographs can be converted into paintings or illustrations, image enhancement, concept iteration where designers rapidly explore variations of an existing visual, and creative compositing where elements are reimagined within new contexts. The pipeline supports ControlNet integration for precise structural guidance, LoRA models for style customization, and various schedulers for fine-tuning the generation process. Released under the CreativeML Open RAIL++-M license, Img2Img SDXL is available through Stability AI's platform, fal.ai, Replicate, and Hugging Face, and can be run locally with a minimum of 8GB VRAM. It serves as an essential tool for designers, digital artists, and creative professionals who need to iterate quickly on visual concepts while maintaining specific compositional elements from their source material.
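
The strength parameter maps directly onto the diffusion schedule: it decides how far into the noise schedule the input image is pushed, and only the remaining steps are denoised. A small sketch mirroring the step arithmetic used by diffusers-style img2img pipelines:

```python
# strength = 1.0 -> input fully noised, behaves like text-to-image (all steps run);
# strength near 0 -> little noise added, input passes through nearly unchanged.
def img2img_steps(num_inference_steps: int, strength: float) -> int:
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start  # denoising steps actually executed

assert img2img_steps(50, 1.0) == 50  # dramatic transformation
assert img2img_steps(50, 0.3) == 15  # subtle refinement: only 15 of 50 steps
assert img2img_steps(50, 0.0) == 0   # source image returned essentially as-is
```

This is why low-strength runs are both faster and more faithful to the source: they simply skip most of the denoising trajectory.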

Open Source
4.4

InstructPix2Pix

Tim Brooks|1B

InstructPix2Pix is an innovative image editing model developed by researchers at UC Berkeley that enables users to edit images using natural language instructions without requiring manual masks, sketches, or reference images. The model was trained on a dataset of paired image edits generated by combining GPT-3's language capabilities with Stable Diffusion's image generation, learning to translate text-based editing instructions into precise visual modifications. Users can provide an input image along with a text instruction such as 'make it snowy,' 'turn the cat into a dog,' or 'add dramatic sunset lighting,' and InstructPix2Pix applies the requested changes while preserving the overall structure and unaffected elements of the original image. The model operates in a single forward pass, making edits quickly without iterative optimization. It handles a wide range of editing operations including style transfer, object replacement, lighting changes, season and weather modifications, material changes, and artistic transformations. InstructPix2Pix is built on the Stable Diffusion architecture and is open-source, available on Hugging Face with integration into the Diffusers library. It runs on consumer GPUs with 6GB or more VRAM. Photographers, digital artists, content creators, and developers building image editing applications use InstructPix2Pix for rapid creative editing workflows. While it may not match the precision of manual editing in complex scenarios, its natural language interface makes sophisticated image edits accessible to users without any image editing expertise.
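
At sampling time, InstructPix2Pix uses classifier-free guidance with two separate scales, one for the input image and one for the text instruction, which is why its integrations expose both an image guidance scale and a text guidance scale. With stand-in noise predictions, the combination rule looks like this:

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in noise predictions from three conditioning configurations:
e_uncond = rng.standard_normal(8)  # eps(z)              no image, no text
e_img = rng.standard_normal(8)     # eps(z, image)       image only
e_full = rng.standard_normal(8)    # eps(z, image, text) both conditions

def guided_eps(s_img, s_txt):
    # Dual classifier-free guidance: each scale amplifies the direction
    # contributed by its own condition.
    return e_uncond + s_img * (e_img - e_uncond) + s_txt * (e_full - e_img)

# With both scales at 1.0 the formula collapses to the plain conditional
# prediction; raising s_txt pushes harder toward the edit instruction,
# raising s_img pulls the result closer to the input image.
assert np.allclose(guided_eps(1.0, 1.0), e_full)
```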

Open Source
4.3

PuLID

ByteDance|N/A

PuLID is an identity-preserving image generation model developed by ByteDance that introduces a Pure and Lightning ID customization approach for creating personalized portraits with exceptional speed and fidelity. Released in April 2024, PuLID addresses the core challenge of maintaining a person's identity features across different generated images without requiring lengthy fine-tuning processes. The model achieves this through a novel contrastive alignment loss and accurate ID loss mechanism that works directly with pre-trained diffusion models, specifically integrating with SDXL and FLUX architectures. PuLID's key innovation lies in its ability to decouple identity features from other image attributes such as pose, expression, and background, enabling highly controllable generation where the subject's identity remains consistent while all other aspects can be freely modified. The model processes reference images through an InsightFace-based identity encoder to extract robust facial feature representations, which are then injected into the generation pipeline through specialized adapter layers. This approach enables real-time personalization without any per-subject training, making it significantly faster than alternatives like DreamBooth or textual inversion. PuLID excels in applications including personalized avatar creation, social media content generation, virtual try-on scenarios, and identity-consistent multi-scene illustration. As an open-source project released under the Apache 2.0 license, PuLID is available on Hugging Face and supported through platforms like fal.ai, offering both researchers and creators a powerful tool for identity-preserving image generation with minimal computational overhead.

Open Source
4.4

Instant Style

InstantX Team|N/A

Instant Style is a style transfer model developed by the InstantX Team that applies the artistic style of a reference image to generated content while preserving the original content structure and semantics. Released in April 2024, the model introduces a Decoupled Style Adapter architecture built on IP-Adapter, which separates style information from content information to enable clean style injection without contaminating the subject matter of the generated image. This decoupling is achieved through specialized attention mechanisms that process style features independently from content features, allowing the model to capture color palettes, brushwork patterns, texture characteristics, and overall aesthetic qualities from the reference while maintaining compositional integrity. Instant Style works within the Stable Diffusion ecosystem, making it compatible with existing SDXL checkpoints, LoRA models, and ControlNet conditions for maximum creative flexibility. The model requires only a single reference image to extract style information, with no fine-tuning needed, enabling instant style application in real-time workflows. Key applications include artistic content creation, brand-consistent visual asset generation, game art production with unified aesthetic styles, illustration series maintaining visual coherence, and rapid prototyping of visual concepts in different artistic treatments. Available as an open-source project under the Apache 2.0 license on Hugging Face, Instant Style can also be accessed through Replicate and fal.ai. The model represents a significant advancement in controllable style transfer, offering superior content preservation compared to earlier approaches that often distorted subject matter when applying strong stylistic transformations.
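
One of the decoupling ideas can be shown with a deliberately idealized numpy sketch: if the reference image's embedding is treated as content plus style, subtracting an embedding of the content (obtained in practice by embedding a text description of the subject in the same CLIP space) leaves a feature dominated by style. Real CLIP embeddings do not decompose this cleanly; the sketch only illustrates the arithmetic:

```python
import numpy as np

rng = np.random.default_rng(6)

d = 64
content = rng.standard_normal(d)  # stand-in CLIP text embedding of the subject
style = rng.standard_normal(d)    # "pure style" component (unknown in practice)
image_emb = content + style       # the image embedding mixes both (idealized)

# Subtract the content direction to isolate a style-dominant feature,
# which is then injected through the adapter's style branch.
style_feature = image_emb - content

assert np.allclose(style_feature, style)
```

Keeping content out of the injected feature is what prevents the reference's subject matter from leaking into the generated image.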

Open Source
4.3

T2I-Adapter

Tencent ARC|77M

T2I-Adapter is a lightweight conditioning framework for text-to-image diffusion models developed by Tencent ARC Lab that provides structural control over generated images through various guidance signals including sketch, depth, segmentation, color, and style inputs. Unlike ControlNet which adds substantial computational overhead by creating full copies of the encoder, T2I-Adapter uses a compact adapter architecture that achieves similar conditioning capabilities with significantly less memory usage and faster inference times. The adapter extracts multi-scale features from conditioning images and injects them into the diffusion model's intermediate feature maps, guiding the generation process to follow the desired spatial structure while maintaining the model's creative freedom in unspecified areas. T2I-Adapter supports multiple conditioning types that can be combined for complex multi-condition generation, allowing users to specify both structural layout and stylistic direction simultaneously. Each adapter type is trained independently and can be mixed and matched at inference time, providing flexible compositional control. The framework is particularly effective for professional workflows requiring consistent spatial layouts across multiple variations, such as architectural visualization, product design iteration, and character sheet generation. T2I-Adapter is open-source and available for Stable Diffusion 1.5 and SDXL on Hugging Face, compatible with the Diffusers library and ComfyUI. Its lightweight nature makes it especially valuable for deployment on resource-constrained hardware and for applications requiring real-time or near-real-time conditioning. Designers, architects, product developers, and animation studios use T2I-Adapter for production workflows where precise structural guidance is needed without the computational cost of heavier control solutions.
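
The multi-scale injection can be sketched in numpy: the adapter extracts features from the conditioning image at several resolutions and simply adds them to the UNet's intermediate feature maps at the matching scales. The real adapter uses small learned conv blocks rather than average pooling, but the additive, no-encoder-copy structure below is why it stays around 77M parameters:

```python
import numpy as np

rng = np.random.default_rng(5)

def downsample2x(x):
    # 2x average-pool downsampling of a (C, H, W) feature map.
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

# Stand-in features from the conditioning image (e.g. a sketch), at three scales.
cond = rng.standard_normal((4, 32, 32))
adapter_feats = [cond]
for _ in range(2):
    adapter_feats.append(downsample2x(adapter_feats[-1]))

# Inject by adding to the UNet's intermediate maps at the matching scales.
unet_feats = [rng.standard_normal(f.shape) for f in adapter_feats]
conditioned = [u + a for u, a in zip(unet_feats, adapter_feats)]

assert [f.shape for f in conditioned] == [(4, 32, 32), (4, 16, 16), (4, 8, 8)]
```

Because each adapter is trained independently, the per-scale features from different adapters (sketch, depth, color) can simply be summed for multi-condition control.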

Open Source
4.2

Pix2Pix

UC Berkeley|54M

Pix2Pix is a pioneering image-to-image translation framework developed at UC Berkeley that introduced the concept of using conditional generative adversarial networks for paired image translation tasks. Introduced in November 2016 in the landmark paper "Image-to-Image Translation with Conditional Adversarial Networks" and presented at CVPR 2017, Pix2Pix demonstrated that a single general-purpose architecture could learn mappings between different visual domains when provided with paired training examples. The architecture consists of a U-Net-based generator that preserves spatial information through skip connections and a PatchGAN discriminator that evaluates image quality at the patch level rather than globally, enabling the model to capture fine-grained texture details while maintaining structural coherence. With approximately 54 million parameters, Pix2Pix is relatively lightweight compared to modern diffusion models, enabling fast inference and efficient training. The model excels at diverse translation tasks including converting semantic label maps to photorealistic scenes, transforming architectural facades from sketches, colorizing black-and-white photographs, converting edge maps to realistic images, and translating satellite imagery to street maps. The BSD-licensed open-source implementation has become one of the most influential works in generative AI, establishing fundamental principles that influenced subsequent models like CycleGAN, SPADE, and modern diffusion-based image editing approaches. Despite being superseded by newer techniques in terms of raw output quality, Pix2Pix remains widely used in educational contexts, rapid prototyping, and applications where paired training data is available and deterministic translation behavior is desired. Available on Hugging Face and Replicate, the model continues to serve as a foundational reference for understanding conditional image generation and adversarial training dynamics.
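
The commonly cited "70x70 PatchGAN" name comes straight from the discriminator's receptive field, which can be computed from its conv stack (kernel size 4 throughout; strides 2, 2, 2, 1, plus a stride-1 output conv) by walking backwards through the layers:

```python
# Receptive-field recurrence: rf_in = (rf_out - 1) * stride + kernel,
# applied from the last layer back to the input.
def receptive_field(layers):
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# (kernel, stride) for the PatchGAN discriminator's conv layers.
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
assert receptive_field(patchgan) == 70  # each output "pixel" judges a 70x70 patch
```

Scoring overlapping 70x70 patches instead of the whole image is what lets the discriminator enforce sharp local texture while the L1 reconstruction loss handles global structure.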

Open Source
4.0