InstructPix2Pix v2
InstructPix2Pix v2 is an advanced diffusion model developed at UC Berkeley that edits images based on natural language instructions, building on the original InstructPix2Pix by Tim Brooks and collaborators. The model takes an input image and a text instruction such as 'make it sunset' or 'turn the cat into a dog' and generates the edited result while preserving unrelated parts of the image. Built on a Stable Diffusion backbone with instruction tuning, the v2 version introduces significant improvements in instruction comprehension, output quality, and editing precision compared to its predecessor.
The architecture learns to follow complex multi-step instructions and handles nuanced editing requests, including style changes, object modifications, color adjustments, weather transformations, and compositional alterations. Unlike mask-based editing approaches, InstructPix2Pix v2 requires no manual region selection: it automatically identifies which parts of the image to modify based on the text instruction. With approximately 1.5 billion parameters, the model runs efficiently on consumer GPUs with 8GB or more of VRAM.
Released under the MIT license, it is fully open source and has been integrated into popular creative tools and workflows, including ComfyUI and the Diffusers library. Professional photographers, digital artists, e-commerce teams, and content creators use InstructPix2Pix v2 for rapid iterative editing, product photo enhancement, creative experimentation, and batch processing of visual content where traditional manual editing would be time-prohibitive.
Key Highlights
Text-Based Image Editing
Edits existing images with natural language commands, without requiring any masking
Structure Preservation
Preserves the original image's overall structure, composition, and unedited regions during editing
Enhanced Instruction Understanding
Substantially better instruction comprehension than v1, yielding more accurate, intent-aligned edits
Wide Editing Range
Supports a wide range of edits, including style changes, object addition/removal, color editing, and environment changes
About
InstructPix2Pix v2 is an advanced diffusion model capable of editing images using natural language instructions, developed as an improved version of the original InstructPix2Pix model. Building on the success of the original InstructPix2Pix developed by Tim Brooks and his team at UC Berkeley, the v2 version offers significant improvements in understanding and applying more complex editing instructions. Thanks to an expanded training dataset and optimized architecture, it has achieved notable performance gains particularly in multi-step and contextual editing tasks. These improvements have made the model a reliable tool for professional editing workflows.
The model's operating principle is highly intuitive: given a source image and a text instruction, the model applies the instruction to the image. Commands like "make the weather snowy," "change the outfit to blue," "add mountains to the background," or "change the photo to sunset lighting" can be given in natural language. The model automatically identifies the region to be modified and preserves the rest of the image. The v2 version's most important improvement is strengthened regional awareness — the model can now more accurately understand spatial references like "remove the flower in the top left corner" and handle complex instruction chains.
The technical architecture uses a dual conditioning mechanism built on Stable Diffusion infrastructure. The original image is fed into the U-Net as additional input channels, and the text instruction is processed through the CLIP text encoder. Two key parameters control the editing: image guidance scale determines how much of the original image is preserved, while text guidance scale adjusts how strongly the instruction is followed. The v2 version expands the optimal ranges of these parameters, producing stable results across a wider editing spectrum. The classifier-free guidance mechanism enables independent control in both dimensions.
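The combination of image and text guidance described above can be sketched numerically. The snippet below follows the dual classifier-free guidance formulation from the InstructPix2Pix paper, assuming three U-Net noise predictions (unconditional, image-conditioned, and fully conditioned) have already been computed; the guidance values are illustrative, not tuned recommendations.

```python
import numpy as np

def dual_cfg(eps_uncond, eps_img, eps_full, s_img=1.5, s_txt=7.5):
    """Combine three U-Net noise predictions with separate image/text guidance.

    eps_uncond: prediction with neither image nor text conditioning
    eps_img:    prediction conditioned on the input image only
    eps_full:   prediction conditioned on both the image and the instruction
    s_img:      image guidance scale (how much the source image is preserved)
    s_txt:      text guidance scale (how strongly the instruction is followed)
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))
```

Raising `s_img` pulls the prediction toward the image-conditioned branch (more faithful to the source photo), while raising `s_txt` amplifies the instruction's effect; with both scales at 1 the expression collapses to the fully conditioned prediction.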
InstructPix2Pix v2's greatest strength is maintaining the image's overall structure and identity during editing. When told to "add sunglasses" to a portrait photo, only the glasses are added while facial features, lighting, and background remain undisturbed. When given the instruction "change the season to winter" on a landscape photo, only seasonal elements are modified while composition and perspective are preserved. When told to "change the wall color to blue" in an interior photo, furniture and decoration elements are maintained. This level of accuracy makes it reliable for professional photo editing and content production workflows.
Use cases are extraordinarily diverse, spanning a broad industrial range: background replacement and product color adjustment in e-commerce photo editing, season changes and interior decoration visualization in real estate photo enhancement, iterative style exploration in creative design, rapid visual editing for social media content production, and variation generation for advertising campaign visuals.
Available as open source on Hugging Face, the model can be integrated with popular interfaces like ComfyUI and Automatic1111. Compared to the original InstructPix2Pix, the v2 version produces more consistent and higher-quality results particularly for complex instructions, regional edits, and style transformations. Compared to alternative methods such as MagicBrush and InstructDiffusion, InstructPix2Pix v2 stands out with its ease of setup, broad community support, and deep integration with the Stable Diffusion ecosystem.
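As an illustration of the Diffusers integration, the sketch below wraps the library's `StableDiffusionInstructPix2PixPipeline` in a small helper. The checkpoint id shown is the original `timbrooks/instruct-pix2pix` release; substitute the v2 weights' id where published. The default guidance values and file names are illustrative assumptions.

```python
# Hedged usage sketch for instruction-based editing via the Diffusers library.
# Requires: pip install diffusers transformers torch pillow (and a CUDA GPU).

DEFAULT_STEPS = 20
DEFAULT_TEXT_GUIDANCE = 7.5    # how strongly the instruction is followed
DEFAULT_IMAGE_GUIDANCE = 1.5   # how strongly the source image is preserved

def edit_image(image_path, instruction,
               model_id="timbrooks/instruct-pix2pix"):
    """Load the pipeline and apply a natural-language edit to one image."""
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInstructPix2PixPipeline

    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    image = Image.open(image_path).convert("RGB")
    result = pipe(
        instruction,
        image=image,
        num_inference_steps=DEFAULT_STEPS,
        guidance_scale=DEFAULT_TEXT_GUIDANCE,
        image_guidance_scale=DEFAULT_IMAGE_GUIDANCE,
    )
    return result.images[0]

# Example (assumed file names):
#   edited = edit_image("photo.jpg", "make it look like sunset")
#   edited.save("photo_sunset.jpg")
```

Looping `edit_image` over a folder of files gives the kind of batch workflow described above; for multi-turn editing, feed each output back in as the next call's input image.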
Use Cases
Photographic Style Change
Applying different artistic and photographic styles to a photo with text commands
Content Editing
Changing objects, colors, or environments in photos with text instructions
Product Image Variations
Creating color, material, and environment variations in e-commerce product images with text commands
Creative Visual Experiments
Creative editing and experimentation on existing images for artists and designers
Pros & Cons
Pros
- Image editing with natural language instructions — simple commands like 'make it sunny'
- Makes targeted changes while preserving original image structure
- More precise and consistent editing results compared to the first version
- High-quality outputs with diffusion-based architecture
Cons
- Success rate can drop with complex, multi-step editing instructions
- Sometimes makes unwanted changes to unintended areas
- Can struggle to preserve photographic details
- Weak in some editing types due to limited training data
Technical Details
Parameters
1.5B
Architecture
Stable Diffusion + Instruction Tuning
Training Data
GPT-4 generated instructions + Stable Diffusion pairs
License
MIT
Features
- Instruction-Based Editing
- Structure Preservation
- No Masking Required
- Multi-Turn Editing
- Open Source
- Diffusion-Based
Benchmark Results
| Metric | Value | Compared To | Source |
|---|---|---|---|
| CLIP Directional Similarity | 0.132 | SDEdit: 0.084 | InstructPix2Pix Paper (CVPR 2023) |
| Editing Accuracy (CLIP Text-Image) | 0.276 | Prompt-to-Prompt: 0.248 | Papers With Code |
| Content Preservation (LPIPS) | 0.12 | Null-Text Inversion: 0.08 (lower is better) | Hugging Face Model Card |
| Processing Time (512×512) | ~3.5 seconds (A100) | SDEdit: ~2.8 seconds | GitHub Repository |
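The CLIP directional similarity metric in the table measures whether the change in CLIP image-embedding space matches the change described by the captions. A minimal numpy sketch, assuming the four CLIP embeddings (source/edited image, source/edited caption) have already been extracted; the embedding computation itself is out of scope here.

```python
import numpy as np

def directional_similarity(img_emb_src, img_emb_out,
                           txt_emb_src, txt_emb_out):
    """Cosine similarity between the image-edit direction and the
    caption-edit direction in CLIP space (higher = edit follows the text)."""
    d_img = img_emb_out - img_emb_src   # how the image embedding moved
    d_txt = txt_emb_out - txt_emb_src   # how the caption embedding moved
    denom = np.linalg.norm(d_img) * np.linalg.norm(d_txt) + 1e-8
    return float(np.dot(d_img, d_txt) / denom)
```

A score near 1 means the edit moved the image in exactly the direction the caption change describes; a score near 0 means the edit is unrelated to the instruction.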