InstructPix2Pix v2

Open Source
4.4
UC Berkeley

InstructPix2Pix v2 is an advanced diffusion model developed at UC Berkeley that edits images based on natural language instructions, building on the original InstructPix2Pix by Tim Brooks and collaborators. The model takes an input image and a text instruction such as 'make it sunset' or 'turn the cat into a dog' and generates the edited result while preserving unrelated parts of the image. Built on a Stable Diffusion backbone with instruction tuning, the v2 version introduces significant improvements over its predecessor in instruction comprehension, output quality, and editing precision. The architecture learns to follow complex multi-step instructions and handles nuanced editing requests including style changes, object modifications, color adjustments, weather transformations, and compositional alterations. Unlike mask-based editing approaches, InstructPix2Pix v2 requires no manual region selection: it automatically identifies which parts of the image to modify based on the text instruction. With approximately 1.5 billion parameters, the model runs efficiently on consumer GPUs with 8 GB or more of VRAM. Released under the MIT license, it is fully open source and has been integrated into popular creative tools and workflows, including ComfyUI and the Diffusers library. Professional photographers, digital artists, e-commerce teams, and content creators use InstructPix2Pix v2 for rapid iterative editing, product photo enhancement, creative experimentation, and batch processing of visual content where traditional manual editing would be time-prohibitive.

Image Editing

Key Highlights

Text-Based Image Editing

Edits existing images with natural language commands, with no masking required

Structure Preservation

Preserves the original image's overall structure, composition, and unedited regions during editing

Enhanced Instruction Understanding

Substantially better instruction comprehension than v1, yielding more accurate, intent-aligned edits

Wide Editing Range

Supports diverse edit types, including style changes, object addition and removal, color edits, and environment changes

About

InstructPix2Pix v2 is an advanced diffusion model that edits images using natural language instructions. Building on the original InstructPix2Pix developed by Tim Brooks and his team at UC Berkeley, the v2 version offers significant improvements in understanding and applying complex editing instructions. Thanks to an expanded training dataset and an optimized architecture, it achieves notable performance gains, particularly in multi-step and contextual editing tasks, making it a reliable tool for professional editing workflows.

The model's operating principle is highly intuitive: given a source image and a text instruction, the model applies the instruction to the image. Commands like "make the weather snowy," "change the outfit to blue," "add mountains to the background," or "change the photo to sunset lighting" can be given in natural language. The model automatically identifies the region to be modified and preserves the rest of the image. The v2 version's most important improvement is strengthened regional awareness — the model can now more accurately understand spatial references like "remove the flower in the top left corner" and handle complex instruction chains.
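For reference, here is a minimal sketch of this workflow using the Hugging Face Diffusers pipeline. The checkpoint ID below is the original v1 release; a dedicated v2 checkpoint ID would be substituted once published, so treat the model ID as a placeholder:

```python
# Minimal InstructPix2Pix editing sketch with Diffusers.
# "timbrooks/instruct-pix2pix" is the original v1 checkpoint; swap in a
# v2 checkpoint ID here if one is published (placeholder assumption).
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("interior.jpg")  # any RGB source photo
edited = pipe("change the wall color to blue", image=image).images[0]
edited.save("interior_blue.jpg")
```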

The technical architecture uses a dual conditioning mechanism built on Stable Diffusion infrastructure. The original image is fed into the U-Net as additional input channels, and the text instruction is processed through the CLIP text encoder. Two key parameters control the editing: image guidance scale determines how much of the original image is preserved, while text guidance scale adjusts how strongly the instruction is followed. The v2 version expands the optimal ranges of these parameters, producing stable results across a wider editing spectrum. The classifier-free guidance mechanism enables independent control in both dimensions.
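The interplay of the two scales is easiest to see empirically. Below is a sketch that sweeps both knobs, reusing `pipe` and `image` from the example above; the specific values are common starting points, not documented v2 optima:

```python
# Sweep both guidance scales to visualize the trade-off between preserving
# the source image and enforcing the instruction. Values are illustrative.
for image_scale in (1.0, 1.5, 2.0):      # higher = closer to the original image
    for text_scale in (5.0, 7.5, 10.0):  # higher = stronger instruction following
        out = pipe(
            "make the weather snowy",
            image=image,
            num_inference_steps=30,
            image_guidance_scale=image_scale,  # image preservation knob
            guidance_scale=text_scale,         # instruction strength knob
        ).images[0]
        out.save(f"snowy_img{image_scale}_txt{text_scale}.png")
```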

InstructPix2Pix v2's greatest strength is maintaining the image's overall structure and identity during editing. When told to "add sunglasses" to a portrait photo, only the glasses are added while facial features, lighting, and background remain undisturbed. When given the instruction "change the season to winter" on a landscape photo, only seasonal elements are modified while composition and perspective are preserved. When told to "change the wall color to blue" in an interior photo, furniture and decoration elements are maintained. This level of accuracy makes it reliable for professional photo editing and content production workflows.
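This structure preservation is also what makes multi-turn editing practical: each output can be fed back in as the next input so that instructions accumulate. A sketch reusing the pipeline from the first example (the instructions are illustrative):

```python
# Multi-turn editing: chain instructions by feeding each result back in.
# Untouched regions carry over between turns because the model preserves
# the input's overall structure.
result = load_image("portrait.jpg")
for instruction in ("add sunglasses", "make the lighting golden hour"):
    result = pipe(instruction, image=result, image_guidance_scale=1.6).images[0]
result.save("portrait_edited.png")
```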

Use cases span a broad industrial range: background replacement and product color adjustment in e-commerce photo editing; season changes and interior decoration visualization in real estate photography; iterative style exploration in creative design; rapid visual editing for social media content; and variation generation for advertising campaign visuals.
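For catalog-scale work such as the e-commerce scenario above, the same pipeline loops naturally over a folder of images. A sketch with placeholder directory names and instruction, reusing `pipe` from the first example:

```python
# Batch editing sketch: apply one instruction to every photo in a folder.
# Directory names and the instruction are placeholders.
from pathlib import Path

out_dir = Path("products_edited")
out_dir.mkdir(exist_ok=True)
for path in sorted(Path("products").glob("*.jpg")):
    img = load_image(str(path))
    edited = pipe("change the product color to matte black", image=img).images[0]
    edited.save(out_dir / f"{path.stem}_edited.png")
```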

Available as open source on Hugging Face, the model can be integrated with popular interfaces like ComfyUI and Automatic1111. Compared to the original InstructPix2Pix, the v2 version produces more consistent and higher-quality results particularly for complex instructions, regional edits, and style transformations. Compared to alternative methods such as MagicBrush and InstructDiffusion, InstructPix2Pix v2 stands out with its ease of setup, broad community support, and deep integration with the Stable Diffusion ecosystem.

Use Cases

1

Photographic Style Change

Changing a photo's style with text commands to apply different artistic and photographic looks

2

Content Editing

Changing objects, colors, or environments in photos with text instructions

3

Product Image Variations

Creating color, material, and environment variations in e-commerce product images with text commands

4

Creative Visual Experiments

Creative editing and experimentation on existing images for artists and designers

Pros & Cons

Pros

  • Image editing with natural language instructions — simple commands like 'make it sunny'
  • Makes targeted changes while preserving original image structure
  • More precise and consistent editing results compared to first version
  • High-quality outputs with diffusion-based architecture

Cons

  • Success rate can drop with complex, multi-step editing instructions
  • Sometimes makes unwanted changes to unintended areas
  • Can struggle to preserve photographic details
  • Weaker on edit types underrepresented in its training data

Technical Details

Parameters

1.5B

Architecture

Stable Diffusion + Instruction Tuning

Training Data

GPT-4 generated instructions + Stable Diffusion pairs

License

MIT

Features

  • Instruction-Based Editing
  • Structure Preservation
  • No Masking Required
  • Multi-Turn Editing
  • Open Source
  • Diffusion-Based

Benchmark Results

Metric | Value | Compared To | Source
CLIP Direction Similarity | 0.132 | SDEdit: 0.084 | InstructPix2Pix Paper (CVPR 2023)
Editing Accuracy (CLIP Text-Image) | 0.276 | Prompt-to-Prompt: 0.248 | Papers With Code
Content Preservation (LPIPS, lower is better) | 0.12 | Null-Text Inversion: 0.08 | Hugging Face Model Card
Processing Time (512×512) | ~3.5 s on A100 | SDEdit: ~2.8 s | GitHub Repository

Available Platforms

GitHub
HuggingFace
Replicate

Related Models

IC-Light

Lvmin Zhang | 1B+

IC-Light (Intrinsic Compositing Light) is an AI relighting model developed by Lvmin Zhang, the creator of ControlNet, that manipulates and transforms lighting conditions in photographs with remarkable realism. Built on a Stable Diffusion backbone with specialized lighting conditioning, the model with over one billion parameters can take any photograph of an object or person and completely alter the light source direction, color temperature, intensity, and ambient lighting while maintaining photorealistic shadows, highlights, and surface reflections. IC-Light operates in two distinct modes: foreground relighting where the subject is extracted and relit independently, and background-compatible relighting where the lighting is adjusted to match a new background environment. The model understands physical light behavior including specular reflections, subsurface scattering on skin, metallic surfaces, and transparent materials, producing results that respect real-world optical properties. IC-Light accepts text descriptions or reference images to define the target lighting setup, offering intuitive control over the final appearance. Released under the Apache 2.0 license, the model is fully open source and has been integrated into ComfyUI with dedicated workflow nodes. Professional photographers, product photographers, digital artists, and e-commerce teams use IC-Light for correcting unfavorable lighting in existing photos, creating studio-quality lighting from casual snapshots, matching product lighting across catalog images, generating dramatic cinematic lighting for creative projects, and preparing composited images with consistent illumination across elements.

Open Source
4.5

Quick Info

Parameters: 1.5B
Type: Diffusion
License: MIT
Released: 2024-06
Architecture: Stable Diffusion + Instruction Tuning
Rating: 4.4 / 5
Creator: UC Berkeley

Tags

editing
instruction
image
diffusion