ControlNet
ControlNet is a conditional control framework for Stable Diffusion models that enables precise structural guidance during image generation through various conditioning inputs such as edge maps, depth maps, human pose skeletons, segmentation masks, and normal maps. Developed by Lvmin Zhang and Maneesh Agrawala at Stanford University, ControlNet adds trainable copy branches to frozen diffusion model encoders, allowing the model to learn spatial conditioning without altering the original model's capabilities. This architecture preserves the base model's generation quality while adding fine-grained control over composition, structure, and spatial layout of generated images. ControlNet supports multiple conditioning types simultaneously, enabling complex multi-condition workflows where users can combine pose, depth, and edge information to guide generation with extraordinary precision. The framework revolutionized professional AI image generation workflows by solving the fundamental challenge of maintaining consistent spatial structures across generated images. It has become an essential tool for professional artists and designers who need precise control over character poses, architectural layouts, product placements, and scene compositions. ControlNet is open-source and available on Hugging Face with pre-trained models for various Stable Diffusion versions including SD 1.5 and SDXL. It integrates seamlessly with ComfyUI and Automatic1111. Concept artists, character designers, architectural visualizers, fashion designers, and animation studios rely on ControlNet for production workflows. Its influence has extended beyond Stable Diffusion, inspiring similar control mechanisms in FLUX.1 and other modern image generation models.
Key Highlights
14+ Control Mode Support
Precisely guide image generation with Canny edge, OpenPose, depth map, segmentation, scribble, normal map, and many more conditioning inputs.
Zero-Convolution Architecture
Trains a copy of the encoder connected to the base model through zero-initialized convolutions while the original diffusion model weights stay locked, preventing loss of previously learned capabilities.
Multi-Model Compatibility
Compatible with SD 1.5, SDXL, and FLUX architectures, with dedicated ControlNet weights available for each base model version.
Multi-ControlNet Stacking
Stack multiple ControlNet modules simultaneously to combine different conditions like pose and depth in a single generation pipeline.
About
ControlNet is a groundbreaking neural network architecture developed by Lvmin Zhang and Maneesh Agrawala at Stanford University, first introduced in February 2023 through the paper "Adding Conditional Control to Text-to-Image Diffusion Models." The model adds conditional control to large pretrained text-to-image diffusion models such as Stable Diffusion by creating a trainable copy of the encoding layers. This innovative approach allows users to guide image generation using various spatial conditioning inputs including Canny edges, human pose skeletons (via OpenPose), depth maps (via MiDaS), segmentation maps, scribbles, and normal maps. The emergence of ControlNet fundamentally shifted the controllability paradigm in AI-assisted image generation, enabling the transition from random generation to precisely guided output.
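To make the conditioning flow concrete, here is a minimal sketch (not taken from the paper) of Canny-conditioned generation using the Hugging Face diffusers library; the checkpoint IDs and file paths are illustrative placeholders for commonly published weights.

```python
# Minimal sketch: Canny-edge-conditioned generation with diffusers.
# Checkpoint IDs and file paths are illustrative, not prescriptive.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# 1. Extract the spatial condition: a Canny edge map from a reference photo.
reference = np.array(Image.open("reference.png").convert("RGB"))
edges = cv2.Canny(reference, 100, 200)            # low/high hysteresis thresholds
edges = np.stack([edges, edges, edges], axis=-1)  # 1-channel map -> 3-channel image
condition = Image.fromarray(edges)

# 2. Load the frozen SD 1.5 base model plus a Canny-trained ControlNet branch.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# 3. Generate: the prompt sets content and style, the edge map fixes structure.
result = pipe(
    "a futuristic glass pavilion at sunset",
    image=condition,
    num_inference_steps=30,
    controlnet_conditioning_scale=1.0,  # how strongly the condition is enforced
).images[0]
result.save("controlled_output.png")
```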
The core architecture locks the original model weights and trains a connected copy of the encoder, ensuring that the pretrained capabilities are preserved while the new conditional control is learned. Because the original weights never change, ControlNet can be trained on relatively small task-specific datasets without catastrophic forgetting. With approximately 1.4 billion parameters per control model, mirroring the SD 1.5 encoder, ControlNet achieves remarkable structural fidelity. The trainable copy is attached through zero-convolution layers that initially produce zero output, which preserves the base model's behavior at the start of training and lets the control signal be learned gradually. This elegant design forms the foundation for stable, predictable training behavior.
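The zero-convolution mechanism can be illustrated with a toy PyTorch sketch. This is a deliberately simplified illustration, not the official implementation: it assumes the block keeps its channel count unchanged and reduces the condition injection to a single residual path.

```python
# Toy illustration of the zero-convolution idea (not the reference implementation).
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution whose weights and bias start at exactly zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlBranch(nn.Module):
    """Trainable copy of a pretrained block, gated by zero convolutions.

    Assumes the block maps `channels` input channels to `channels` output channels.
    """
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.trainable_copy = copy.deepcopy(pretrained_block)  # learns the control
        self.frozen_block = pretrained_block
        for p in self.frozen_block.parameters():
            p.requires_grad_(False)              # original weights stay locked
        self.zero_in = zero_conv(channels)       # injects the spatial condition
        self.zero_out = zero_conv(channels)      # gates the control residual

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        base = self.frozen_block(x)              # unchanged base-model path
        control = self.trainable_copy(x + self.zero_in(condition))
        # Both zero convs output 0 before training, so this equals `base` at step 0.
        return base + self.zero_out(control)
```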
ControlNet offers unparalleled flexibility with support for over 14 different control types. It can preserve object boundaries with Canny edge detection, reproduce human body poses with OpenPose, transfer three-dimensional spatial structure with MiDaS depth maps, enable regional content control with segmentation maps, and manage surface details with normal maps. Achieving an SSIM score of 0.89 on Canny conditioning tasks, the model has become the industry standard for structural fidelity. Additionally, the ability to use multiple ControlNet models simultaneously enables complex scenarios such as concurrent depth and pose control.
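A stacked (Multi-ControlNet) workflow of this kind might be wired up as in the sketch below, using diffusers together with the community controlnet_aux annotators; checkpoint IDs, conditioning scales, and file names are assumptions for illustration.

```python
# Hedged sketch: stacking pose and depth ControlNets in one diffusers pipeline.
import torch
from PIL import Image
from controlnet_aux import MidasDetector, OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

reference = Image.open("reference_person.png").convert("RGB")

# Extract both spatial conditions from the same reference image.
pose_map = OpenposeDetector.from_pretrained("lllyasviel/Annotators")(reference)
depth_map = MidasDetector.from_pretrained("lllyasviel/Annotators")(reference)

# One ControlNet per condition, passed to the pipeline as a list.
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

result = pipe(
    "a knight in ornate armor, cinematic lighting",
    image=[pose_map, depth_map],               # one conditioning image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.6],  # per-condition strength
    num_inference_steps=30,
).images[0]
result.save("pose_plus_depth.png")
```

Lowering the per-condition scales is the usual remedy when stacked conditions start fighting each other or produce overly rigid results.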
In terms of applications, ControlNet spans a wide range from AI art creation to industrial design. It has been integrated into professional workflows for architectural visualization — generating photorealistic renders from sketches — fashion design for garment visualization in specific poses, game development for concept art generation, and the film industry for storyboard visualization. Interior designers can produce different style alternatives while preserving spatial layout through depth maps, while illustrators can transform rough sketches into detailed visuals.
ControlNet has become a foundational tool in the AI art and design community, integrated into popular interfaces like ComfyUI, Automatic1111, and Fooocus. It supports SD 1.5, SDXL, and FLUX architectures with dedicated weight sets for each. Its Apache 2.0 license makes it freely available for both research and commercial applications, and it has been deployed across platforms including Hugging Face, Replicate, and fal.ai. Community-created custom ControlNet models have also formed a continuously expanding ecosystem.
Compared to its alternatives, ControlNet offers more precise control than T2I-Adapter while requiring more computational resources. Unlike style-based adapters such as IP-Adapter, ControlNet focuses entirely on structural and spatial control. This complementary nature has encouraged the use of ControlNet alongside other adapters, making it an indispensable component of modern AI image generation workflows. Thanks to its open-source nature and strong community support, ControlNet continues to be the most widely adopted solution in the controllable image generation space.
Use Cases
Pose-Based Character Generation
Creating character visuals in desired poses using human pose references.
Architectural Visualization
Creating architectural renders and visualizations from edge maps.
Depth-Based Scene Generation
Generating new scenes and styles while preserving three-dimensional structure via depth maps.
Product Photography Control
Generating different styles while preserving composition and structure of product images.
Pros & Cons
Pros
- Offers various control methods including pose skeletons, depth maps, edge detection, and segmentation masks
- Adds spatial control while preserving visual quality of large pretrained diffusion models
- Learns from small task-specific datasets without losing general capabilities
- Improves repeatability and alignment, valuable for artists, production teams, and prototyping workflows
- ControlNet++ reports 7-13% improvements in conditioning fidelity for segmentation, line-art, and depth conditions
Cons
- Poorly prepared control inputs (inconsistent sizes, noisy masks) can confuse the conditioning process
- Setting control strength too high produces rigid, artifact-prone results
- May struggle with edge detection on images with lots of noise or complex edges
- Typically 20-50% longer processing time per ControlNet added
- Can struggle with multiple people or complex poses
Technical Details
Parameters
1.4B
Architecture
Conditional Diffusion (encoder copy)
Training Data
Various conditioning datasets
License
Apache 2.0
Features
- Pose Control (OpenPose)
- Canny Edge Detection
- Depth Map Conditioning (MiDaS)
- Segmentation Map Control
- Scribble/Sketch Guidance
- Normal Map Support
- Lineart Control
- Multi-ControlNet Stacking
Benchmark Results
| Metric | Value | Compared To | Source |
|---|---|---|---|
| Parameter Count | 1.4B (copy of SD 1.5 encoder) | T2I-Adapter: 77M | ControlNet Paper (arXiv) |
| Supported Control Types | 14+ (Canny, Depth, Pose, etc.) | T2I-Adapter: 8+ | ControlNet GitHub |
| SSIM (Canny condition) | 0.89 | — | ControlNet Paper (arXiv) |
| Inference Time Overhead | +15-25% (vs. base model) | T2I-Adapter: +5-10% | ControlNet GitHub |
Related Models
InstantID
InstantID is a zero-shot identity-preserving image generation framework developed by InstantX Team that can generate images of a specific person in various styles, poses, and contexts using only a single reference photograph. Unlike traditional face-swapping or personalization methods that require multiple reference images or time-consuming fine-tuning, InstantID achieves accurate identity preservation from just one facial photograph through an innovative architecture combining a face encoder, IP-Adapter, and ControlNet for facial landmark guidance. The system extracts detailed facial identity features from the reference image and injects them into the generation process, ensuring that the generated person maintains recognizable facial features, proportions, and characteristics across diverse output scenarios. InstantID supports various creative applications including generating portraits in different artistic styles, placing the person in imagined scenes or contexts, creating profile pictures and avatars, and producing marketing materials featuring consistent character representations. The model works with Stable Diffusion XL as its base and is open-source, available on GitHub and Hugging Face for local deployment. It integrates with ComfyUI through community-developed nodes and can be accessed through cloud APIs. Portrait photographers, social media content creators, marketing teams creating personalized campaigns, game developers designing character variants, and digital artists exploring identity-based creative work all use InstantID. The framework has influenced subsequent identity-preservation models and remains one of the most effective solutions for single-image identity transfer in the open-source ecosystem.
IP-Adapter
IP-Adapter is an image prompt adapter developed by Tencent AI Lab that enables image-guided generation for text-to-image diffusion models without requiring any fine-tuning of the base model. The adapter works by extracting visual features from reference images using a CLIP image encoder and injecting these features into the diffusion model's cross-attention layers through a decoupled attention mechanism. This allows users to provide reference images as visual prompts alongside text prompts, guiding the generation process to produce images that share stylistic elements, compositional features, or visual characteristics with the reference while still following the text description. IP-Adapter supports multiple modes of operation including style transfer, where the generated image adopts the artistic style of the reference, and content transfer, where specific subjects or elements from the reference appear in the output. The adapter is lightweight, adding minimal computational overhead to the base model's inference process. It can be combined with other control mechanisms like ControlNet for multi-modal conditioning, enabling sophisticated workflows where pose, style, and content can each be controlled independently. IP-Adapter is open-source and available for various Stable Diffusion versions including SD 1.5 and SDXL. It integrates with ComfyUI and Automatic1111 through community extensions. Digital artists, product designers, brand managers, and content creators who need to maintain visual consistency across generated images or transfer specific aesthetic qualities from reference material particularly benefit from IP-Adapter's capabilities.
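As an illustration of that combination (a sketch under assumed checkpoint names, not an official recipe), the snippet below attaches an IP-Adapter to a ControlNet pipeline so that a pose map fixes the structure while a reference image supplies the visual style.

```python
# Hedged sketch: IP-Adapter (style/content reference) + ControlNet (pose structure).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Attach IP-Adapter weights on top of the existing pipeline.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.7)  # how strongly the reference image steers the output

style_reference = Image.open("style_reference.png").convert("RGB")
pose_map = Image.open("pose_map.png").convert("RGB")

result = pipe(
    "a portrait of a traveler in a misty forest",
    image=pose_map,                    # ControlNet: fixes pose / spatial structure
    ip_adapter_image=style_reference,  # IP-Adapter: supplies style and content cues
    num_inference_steps=30,
).images[0]
result.save("styled_and_posed.png")
```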
IP-Adapter FaceID
IP-Adapter FaceID is a specialized adapter module developed by Tencent AI Lab that injects facial identity information into the diffusion image generation process, enabling the creation of new images that faithfully preserve a specific person's facial features. Unlike traditional face-swapping approaches, IP-Adapter FaceID extracts face recognition feature vectors from the InsightFace library and feeds them into the diffusion model through cross-attention layers, allowing the model to generate diverse scenes, styles, and compositions while maintaining consistent facial identity. With only approximately 22 million adapter parameters layered on top of existing Stable Diffusion models, FaceID achieves remarkable identity preservation without requiring per-subject fine-tuning or multiple reference images. A single clear face photo is sufficient to generate the person in various artistic styles, different clothing, diverse environments, and novel poses. The adapter supports both SDXL and SD 1.5 base models and can be combined with other ControlNet adapters for additional control over pose, depth, and composition. IP-Adapter FaceID Plus variants incorporate additional CLIP image features alongside face embeddings for improved likeness and detail preservation. Released under the Apache 2.0 license, the model is fully open source and widely integrated into ComfyUI workflows and the Diffusers library. Common applications include personalized avatar creation, custom portrait generation in various artistic styles, character consistency in storytelling and comic creation, personalized marketing content, and social media content creation where maintaining a recognizable likeness across multiple generated images is essential.
FLUX Redux
FLUX Redux is the specialized image variation model within the FLUX model family developed by Black Forest Labs, designed for generating creative variations of reference images while preserving their core style, color palette, and compositional essence. Built on the 12-billion parameter Diffusion Transformer architecture, FLUX Redux takes a reference image as input and produces new images that maintain the visual DNA of the original while introducing controlled variations in content, composition, or perspective. The model captures high-level stylistic attributes including artistic technique, color harmony, lighting mood, and textural qualities, then applies them to generate fresh compositions that feel aesthetically consistent with the source material. FLUX Redux can be combined with text prompts to guide the direction of variation, allowing users to request specific changes like 'same style but with a mountain landscape' or 'similar color palette with an urban scene.' This makes it particularly powerful for brand consistency workflows where marketing teams need multiple visuals sharing a unified aesthetic. The model also supports image-to-image workflows where the reference serves as a strong stylistic prior while text prompts define new content. As a proprietary model, FLUX Redux is accessible through Black Forest Labs' API and partner platforms including Replicate and fal.ai with usage-based pricing. Key applications include generating cohesive visual content series for social media campaigns, creating style-consistent variations for A/B testing in advertising, producing product imagery in consistent brand aesthetics, and creative exploration where artists iterate on a visual direction without starting from scratch.