
PowerPaint

Open Source
4.3
Tencent ARC

PowerPaint is a versatile open-source inpainting model developed by researchers at Tsinghua University and HKUST under the Tencent ARC umbrella, introducing the innovative concept of learnable task prompts that enable multiple inpainting functions within a single unified model. Rather than requiring separate specialized models for each editing task, PowerPaint uses learnable task vectors that activate different behaviors within shared model weights, supporting four distinct modes: text-guided object insertion, object removal, shape-guided inpainting, and image outpainting. Built upon a Stable Diffusion backbone enriched with a ControlNet-like control mechanism, the model allows users to describe desired content through text prompts for contextual generation, cleanly remove objects while preserving surrounding textures, generate content within specific mask shapes, or extend images beyond their original boundaries. This multi-task flexibility eliminates the need to switch between different tools or models during editing workflows.

In benchmark evaluations, PowerPaint achieves competitive results against separately optimized task-specific models, with its object removal quality rivaling specialized models like LaMa and MAT. Applications span photography editing, graphic design mockups, e-commerce product image preparation, digital art canvas extension, and social media content adaptation for different platform dimensions.

The model is PyTorch-based and publicly available through Hugging Face with a Gradio demo interface and Diffusers library integration. GPU requirements are similar to standard Stable Diffusion models, with 8GB or more VRAM recommended. PowerPaint has established a new paradigm in multi-task inpainting and continues to inspire research in unified visual editing systems.

Inpainting

Key Highlights

Learnable Task Prompts

An innovative mechanism that learns a specialized prompt embedding for each inpainting task, enabling a single model to handle all of them

Unified Multi-Task Model

Combines object removal, text-guided inpainting, shape-guided insertion, and outpainting in a single model

Benchmark-Leading Performance

Matches or exceeds specialized single-task models across multiple inpainting benchmarks

Stable Diffusion Based

Built on Stable Diffusion, compatible with the existing SD ecosystem and accessible with open-source model weights

About

PowerPaint is a versatile AI inpainting model developed by researchers at Tsinghua University and HKUST that introduces the concept of learnable task prompts, enabling multiple inpainting functions within a single unified model. PowerPaint successfully performs diverse inpainting tasks including object removal, text-guided content generation, shape-guided inpainting, and image outpainting, all under one model architecture, eliminating the need for separate specialized models for each task. This multi-task approach establishes a new paradigm in the image editing domain.

The model's technical innovation centers on learnable task vectors (learnable task prompts) that enable behavioral specialization within shared weights. Each inpainting task — object removal, content addition, shape-based editing, and outpainting — has specially optimized task vectors that enable the model to exhibit different behaviors using the same underlying weights. This approach eliminates the traditional paradigm requiring separate models for each task and provides significant resource efficiency. Built upon a Stable Diffusion backbone, the architecture is enriched with a ControlNet-like control mechanism for precise task guidance. Task vectors are automatically learned during training and activated based on the user's selected task during inference, providing a seamless multi-task experience.
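The task-vector idea can be illustrated with a toy PyTorch sketch. This is not PowerPaint's released code: the class name, token count, and embedding dimension are assumptions chosen to mirror a CLIP-style text encoder. The point is only the mechanism described above: each task owns a small set of trainable embedding tokens that get appended to the text-prompt embeddings before they condition the shared denoiser.

```python
import torch
import torch.nn as nn

# The four task modes described in the text.
TASKS = ["text_guided", "object_removal", "shape_guided", "outpainting"]

class TaskPromptBank(nn.Module):
    """Toy sketch (not PowerPaint's actual implementation): one learnable
    embedding per task, appended to the text-prompt embeddings so the same
    shared denoiser weights can specialize per task."""
    def __init__(self, embed_dim=768, tokens_per_task=1):
        super().__init__()
        self.banks = nn.ParameterDict({
            t: nn.Parameter(torch.randn(tokens_per_task, embed_dim) * 0.02)
            for t in TASKS
        })

    def forward(self, text_embeds, task):
        # text_embeds: (batch, seq_len, embed_dim)
        b = text_embeds.shape[0]
        task_tok = self.banks[task].unsqueeze(0).expand(b, -1, -1)
        # Conditioning = text tokens followed by the learned task token(s).
        return torch.cat([text_embeds, task_tok], dim=1)

bank = TaskPromptBank()
text = torch.randn(2, 77, 768)        # stand-in for CLIP text embeddings
cond = bank(text, "object_removal")
print(cond.shape)                     # torch.Size([2, 78, 768])
```

During training, gradients flow into whichever task's tokens were used for that example; at inference, the user's selected mode simply picks which tokens to append.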

PowerPaint supports four distinct and complementary inpainting modes. In text-guided object addition mode, users describe the desired content through text prompts and the model generates contextually appropriate content in the masked region that blends naturally with surroundings. Object removal mode cleanly fills the masked area with surrounding texture-consistent content while preserving image integrity. Shape-guided inpainting mode preserves the mask shape while generating content within it for controlled editing. Outpainting mode creates natural extensions beyond the image boundaries for canvas expansion. The ability to switch freely between these four modes makes PowerPaint an exceptionally versatile and practical tool for diverse editing scenarios.
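Of the four modes, outpainting is the one most often reduced to ordinary inpainting: pad the canvas in the desired direction, then mask the newly added border so the model fills only the extension. A minimal NumPy sketch of that input preparation (the function name and padding convention are illustrative, not taken from PowerPaint's codebase):

```python
import numpy as np

def outpaint_inputs(image, pad):
    """Turn an outpainting request into an inpainting problem.
    image: (H, W, 3) uint8 array; pad: (top, bottom, left, right) in pixels.
    Returns the padded canvas and a mask where 255 marks pixels to generate."""
    top, bottom, left, right = pad
    h, w, _ = image.shape
    canvas = np.zeros((h + top + bottom, w + left + right, 3), dtype=np.uint8)
    canvas[top:top + h, left:left + w] = image
    mask = np.full(canvas.shape[:2], 255, dtype=np.uint8)  # 255 = fill here
    mask[top:top + h, left:left + w] = 0                   # 0 = keep original
    return canvas, mask

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
canvas, mask = outpaint_inputs(img, (0, 0, 32, 32))
print(canvas.shape, mask.shape)   # (64, 128, 3) (64, 128)
```

The canvas and mask pair can then be passed to any mask-conditioned inpainting pipeline with the outpainting task mode selected.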

Application scenarios span a broad range of professional and creative workflows, serving different professional groups. Photographers can use the same tool for both removing unwanted objects and adding creative content elements without switching applications. Graphic designers can perform rapid visual generation for mockups and concept visuals during the ideation phase. In the e-commerce sector, product image background cleanup and alternative background generation can be accomplished in a single streamlined workflow. Digital artists can utilize the outpainting mode for canvas extension and content variations to expand their creative compositions. Social media content creators can leverage outpainting to adapt visuals for different platform dimensions and aspect ratio requirements.

In the academic landscape, PowerPaint is recognized as a successful demonstration of the multi-task inpainting approach and is cited as a reference in image editing research. In benchmark evaluations, it achieves competitive and sometimes superior results when compared against separately optimized task-specific models across standard datasets. Its object removal quality rivals specialized models such as LaMa and MAT in head-to-head comparisons. For text-guided content generation, it delivers performance comparable to Stable Diffusion Inpainting while offering the additional task modes that single-purpose models cannot provide, giving it a distinct versatility advantage.

The model is PyTorch-based and publicly available through Hugging Face for open access. A Gradio-based demo interface and Diffusers library integration facilitate easy experimentation and production deployment across different environments. GPU requirements are similar to standard Stable Diffusion models, with 8GB or more VRAM recommended for optimal performance. PowerPaint is regarded as a reference model in the multi-task inpainting domain and is recognized as an important work shaping the future direction of image editing technologies. Its unified model approach continues to inspire future research in multi-task visual processing and generation systems.

Use Cases

1

Smart Object Removal

Seamlessly removing objects from images, filling the region with natural-looking background via the context-aware removal task prompt

2

Creative Content Insertion

Adding new objects and elements to images guided by text description and mask shape

3

Image Extension

Extending images consistently in any direction using the outpainting task prompt

4

Research and Development

Using as a baseline model and comparison point in multi-task inpainting research

Pros & Cons

Pros

  • Multi-task inpainting — object removal, addition, replacement, and outpainting
  • Optimized results for each task with task-adaptive prompt encoding
  • Precise control with text-guided inpainting
  • Open-source research project

Cons

  • Not offered as a commercial product
  • Requires a dedicated GPU (8 GB+ VRAM) to run locally
  • Research-stage project without a stable release
  • Sparse official documentation

Technical Details

Parameters

N/A

Architecture

Stable Diffusion based with task-specific learnable prompt tokens

Training Data

Custom curated dataset with task-specific annotations for different inpainting modes

License

Apache 2.0

Features

  • Learnable Task Prompt (LTP) Mechanism
  • Context-Aware Object Removal
  • Text-Guided Inpainting
  • Shape-Guided Object Insertion
  • Image Outpainting
  • Stable Diffusion Architecture Base

Benchmark Results

FID Score (Object Removal): 8.73 (vs. SD Inpainting: 12.6) [PowerPaint Paper, ECCV 2024]
CLIP Score (Text-guided): 27.4 (vs. SD Inpainting: 25.8) [PowerPaint Paper, ECCV 2024]
Supported Tasks: Removal, Fill, Shape-guided, Outpainting [PowerPaint GitHub]
Inference Time (512x512): ~4 s (50 steps, A100) [PowerPaint GitHub]
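For context on the CLIP score reported above: it is conventionally computed as 100 times the cosine similarity between CLIP embeddings of the generated image and the guiding text, floored at zero. A minimal sketch of the formula (producing real embeddings requires an actual CLIP model; the vectors below are toy stand-ins):

```python
import numpy as np

def clip_score(image_embed, text_embed):
    """CLIP score as commonly reported: 100 * cosine similarity between
    the CLIP image and text embeddings, clamped at a minimum of 0."""
    a = image_embed / np.linalg.norm(image_embed)
    b = text_embed / np.linalg.norm(text_embed)
    return max(100.0 * float(a @ b), 0.0)

# Toy vectors: cosine similarity here is 1/sqrt(2).
print(round(clip_score(np.array([1.0, 0.0]), np.array([1.0, 1.0])), 1))  # 70.7
```

Higher values indicate the generated content aligns more closely with the text prompt, which is why a 27.4-vs-25.8 gap favors PowerPaint in the text-guided setting.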

Available Platforms

Hugging Face
Replicate


Related Models


GPT Image 1

OpenAI|Unknown

GPT Image 1 is OpenAI's latest image generation model that integrates natively within the GPT architecture, combining language understanding with visual generation in a unified autoregressive framework. Unlike diffusion-based competitors, GPT Image 1 generates images token by token through an autoregressive process similar to text generation, enabling a conversational interface where users iteratively refine outputs through dialogue. The model excels at text rendering within images, producing legible and accurately placed typography that has historically been a weakness of diffusion models. It supports both generation from text descriptions and editing through natural language instructions, allowing users to upload images and describe desired modifications. GPT Image 1 understands complex compositional prompts with multiple subjects, spatial relationships, and specific attributes, producing coherent scenes accurately reflecting described elements. It handles diverse styles from photorealism to illustration, painting, graphic design, and technical diagrams. Editing capabilities include inpainting, style transformation, background replacement, object addition or removal, and color adjustment, all through conversational input. The model is accessible through the OpenAI API for application integration and through ChatGPT for consumer use. Safety systems prevent harmful content generation. Generated images belong to the user with full commercial rights under OpenAI's terms. GPT Image 1 represents a significant step toward multimodal AI systems seamlessly blending language and visual capabilities, making AI image creation more intuitive through natural conversation.

Proprietary
4.8

Adobe Generative Fill

Adobe|N/A

Adobe Generative Fill is a generative AI feature integrated directly into Adobe Photoshop, powered by Adobe's proprietary Firefly image generation model. Introduced in 2023, it enables users to add, modify, or remove content in images using natural language text prompts within the familiar Photoshop interface. The feature works by selecting a region with any Photoshop selection tool, typing a descriptive prompt in the contextual task bar, and receiving three AI-generated variations within seconds. Generated content is placed on a separate layer, preserving Photoshop's non-destructive editing workflow that professionals rely on. A key differentiator is Firefly's training data approach, which uses exclusively licensed Adobe Stock imagery, openly licensed content, and public domain materials, providing commercial safety and IP indemnification that competing solutions cannot match. Generative Fill automatically maintains coherence with surrounding color, lighting, perspective, and texture for seamless blending. The companion Generative Expand feature enables extending images beyond their original canvas boundaries. Professional applications span advertising campaign iteration, photography post-production, real estate staging, product photography background replacement, fashion color modification, and editorial visual preparation. The feature is accessible through Photoshop's Creative Cloud subscription with a monthly generative credits system, and also available through Adobe Express and the web-based Firefly application. Content Credentials metadata indicates when AI was used, supporting transparency standards. Adobe Generative Fill represents the most commercially safe and professionally integrated approach to AI-powered image editing available today.

Proprietary
4.7

FLUX Fill

Black Forest Labs|12B

FLUX Fill is the specialized inpainting and outpainting model within the FLUX model family developed by Black Forest Labs, designed for professional-grade region editing, content filling, and image extension. Built on the 12-billion parameter Diffusion Transformer architecture that powers all FLUX models, FLUX Fill takes an input image along with a binary mask indicating the region to be modified and generates seamlessly blended content that matches the surrounding context in style, lighting, perspective, and detail level. The model excels at both inpainting tasks where masked areas within an image are filled with contextually appropriate content and outpainting tasks where image boundaries are extended to create larger compositions. FLUX Fill leverages the superior prompt adherence of the FLUX architecture, allowing users to guide the generation with text descriptions of what should appear in the masked region, providing precise creative control over the output. The model handles complex scenarios including filling regions that span multiple materials and textures, maintaining structural continuity of architectural elements, and generating photorealistic human features in masked face areas. As a proprietary model, FLUX Fill is accessible through Black Forest Labs' API and partner platforms including Replicate and fal.ai, with usage-based pricing. Professional photographers use FLUX Fill for removing unwanted elements and extending compositions, e-commerce teams employ it for product background replacement, digital artists leverage it for creative compositing, and marketing professionals use it for adapting images to different aspect ratios and formats without losing content quality.

Proprietary
4.7

SD Inpainting

Stability AI|1B

Stable Diffusion Inpainting is a specialized variant of Stability AI's Stable Diffusion model fine-tuned specifically for image inpainting tasks, enabling users to fill masked regions of an image with contextually coherent content guided by text prompts. Released in 2022, the model builds upon the latent diffusion architecture but extends it with additional input channels for mask-aware processing, where the original image, mask, and masked image are fed as extra channels to the U-Net. The v1.5 inpainting model was trained on 595K curated inpainting examples in collaboration with RunwayML, while community-developed SDXL variants have since extended capabilities with higher resolution output. Common applications include removing unwanted objects from photographs, completing damaged image regions, modifying content such as adding elements to scenes, and cleaning watermarks or text overlays. Professional use cases span photography post-production, advertising visual preparation, real estate staging, product photography background replacement, and digital art workflows. The model is accessible through popular open-source interfaces including AUTOMATIC1111 WebUI, ComfyUI, InvokeAI, and the Hugging Face Diffusers library. Users can create masks manually with brush tools or automatically through segmentation models like SAM. ControlNet integration adds additional control layers for more precise output guidance. Released under the CreativeML Open RAIL-M license, the model runs on GPUs with 8GB VRAM and supports optimizations like xFormers for reduced memory usage, making it one of the most widely adopted open-source inpainting solutions available.

Open Source
4.4

Quick Info

Parameters: N/A
Type: Diffusion
License: Apache 2.0
Released: 2023-12
Architecture: Stable Diffusion based with task-specific learnable prompt tokens
Rating: 4.3 / 5
Creator: Tencent ARC


Tags

powerpaint
versatile
inpainting