PhotoMaker

Open Source
4.5
Tencent

PhotoMaker is a personalized photo generation model developed by TencentARC that creates realistic and diverse human portraits from reference images using a novel Stacked ID Embedding approach. Unlike traditional fine-tuning methods such as DreamBooth that require lengthy training processes, PhotoMaker achieves identity-preserving generation in seconds by extracting and stacking embeddings from multiple reference photos through CLIP and specialized identity encoders. Built on the SDXL pipeline, the model injects identity representations via modified cross-attention layers, enabling high-quality outputs that maintain facial features while allowing creative freedom in style, pose, and setting variations. PhotoMaker supports identity mixing, allowing users to blend features from multiple people to create unique composite faces with adjustable contribution weights. The model excels in personalized portrait generation, identity-consistent story illustration for comics and visual novels, virtual try-on applications, and advertising content creation. PhotoMaker V2 brought significant improvements in identity preservation accuracy, natural generation quality, and text alignment, particularly in challenging scenarios like extreme pose changes and age transformations. As an open-source model released under the Apache 2.0 license, PhotoMaker is freely available on Hugging Face with community integrations in ComfyUI and other popular creative tools. It requires only one to four reference images to produce compelling results, making it one of the most accessible and efficient identity-preserving generation solutions available for both individual creators and professional production workflows.

Image to Image

Key Highlights

Stacked ID Embedding System

Merges information from 1-4 reference images into a unified identity representation for strong and consistent personalization.

Personalization in Seconds

Unlike DreamBooth's minutes-long training, PhotoMaker produces personalized outputs in seconds through an efficient embedding-based approach.

Identity Mixing Capability

Offers the ability to blend facial features from multiple people to create new and unique identities for creative applications.

Diverse Pose and Expression Support

Consistently preserves identity across different poses, expressions, ages, and artistic styles with natural-looking results.

About

PhotoMaker is a personalized photo generation model developed by TencentARC, introduced in December 2023 through the paper "PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding." The model creates realistic and diverse photos of specific individuals by extracting stacked ID embeddings from multiple reference images. Unlike fine-tuning approaches such as DreamBooth that require lengthy training runs, PhotoMaker achieves personalization in seconds through an efficient embedding-based approach; this rapid inference makes it highly attractive for practical applications and has redefined the balance between speed and quality in personalized content generation.

The architecture introduces a Stacked ID Embedding mechanism that merges information from multiple input photos (typically 1-4 images) into a unified identity representation. This mechanism extracts separate embeddings from each reference image through a CLIP image encoder and a specialized identity encoder, then stacks them to create a rich identity representation. This representation is injected into the SDXL generation pipeline through modified cross-attention layers. When multiple reference images are used, the model synthesizes information from different angles, expressions, and lighting conditions to develop a more comprehensive understanding of identity. The stacked structure allows combining a variable number of embeddings rather than a single fixed-size vector, providing more flexible identity encoding.
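The stacking step described above can be sketched in a few lines of NumPy. This is a toy illustration under assumed shapes and a hypothetical elementwise-sum fusion, not PhotoMaker's actual implementation:

```python
import numpy as np

def stacked_id_embedding(ref_images, clip_encode, id_encode):
    """Toy sketch: fuse per-image CLIP and identity features, then stack.

    Each reference image yields one fused embedding; stacking N images
    gives an (N, dim) identity representation whose length varies with
    the number of references, rather than a single fixed-size vector.
    """
    embeddings = []
    for img in ref_images:
        # Assumed fusion: elementwise sum of the two feature vectors.
        fused = clip_encode(img) + id_encode(img)
        embeddings.append(fused)
    return np.stack(embeddings)  # shape: (num_refs, dim)

# Stand-in encoders: random projections of a flattened 4x4 "image".
rng = np.random.default_rng(0)
dim = 8
W_clip = rng.standard_normal((16, dim))
W_id = rng.standard_normal((16, dim))
clip_encode = lambda img: img.reshape(-1) @ W_clip
id_encode = lambda img: img.reshape(-1) @ W_id

refs = [rng.standard_normal((4, 4)) for _ in range(3)]  # 3 reference "images"
stacked = stacked_id_embedding(refs, clip_encode, id_encode)
print(stacked.shape)  # (3, 8): one row per reference image
```

Because the result is a stack rather than an average, downstream cross-attention can attend to each reference separately, which is what lets the model reconcile different angles and lighting conditions.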

PhotoMaker was trained on a large dataset of celebrity images with diverse poses and expressions, enabling it to generalize well to unseen identities. The model also supports identity mixing — offering the ability to blend features from multiple people to create new, unique faces. This feature is particularly valuable in character design and creative exploration processes. Identity mixing weights are adjustable, allowing control over each source identity's contribution to the final result and enabling blending at the desired ratio.
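As a sketch, the adjustable mixing weights can be modeled as a convex combination of identity embeddings. The `mix_identities` helper and the three-dimensional vectors below are illustrative stand-ins, not the model's internals:

```python
import numpy as np

def mix_identities(emb_a, emb_b, weight_a=0.5):
    """Toy sketch of identity mixing: a convex combination of two
    identity embeddings, weighted by each identity's contribution.
    (Illustrative only; the real model blends stacked embeddings
    inside the diffusion pipeline.)"""
    assert 0.0 <= weight_a <= 1.0
    return weight_a * emb_a + (1.0 - weight_a) * emb_b

a = np.array([1.0, 0.0, 0.0])  # stand-in embedding for identity A
b = np.array([0.0, 1.0, 0.0])  # stand-in embedding for identity B
blended = mix_identities(a, b, weight_a=0.7)  # a 70/30 blend of A and B
print(blended)
```

Raising `weight_a` toward 1.0 pulls the composite face toward identity A, which is the "adjustable contribution" behavior the text describes.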

Use cases range from professional production to individual use. Personalized portrait generation is the most fundamental: users can create portraits from their own photos in different artistic styles, settings, and poses. Identity-consistent story illustration serves comic, children's book, and visual novel production, where a character must appear consistently across scenes. In virtual try-on scenarios, users can visualize different outfits or hairstyles. It is also widely used in advertising, marketing, and social media for generating model imagery.

PhotoMaker V2 significantly improved upon the original, offering better identity preservation, more natural generation quality, and enhanced text alignment. The V2 version provides more consistent identity preservation particularly in challenging scenarios — extreme pose changes, age transformations, and stylistic transfers. Both versions are SDXL-based and use the CLIP ViT-L/14 image encoder.

PhotoMaker is open source under the Apache 2.0 license and available on Hugging Face, with community integrations in ComfyUI and other popular tools. Compared to its competitors, while InstantID offers higher identity fidelity in single-image scenarios, PhotoMaker provides more comprehensive identity representation and better style diversity when multiple reference images are used. Its greatest advantage over DreamBooth is that training time is practically zero, making it indispensable in scenarios requiring rapid iteration and batch production.

Use Cases

1

Personalized Portrait Generation

Creating personalized portraits in various styles and scenes from individual photographs.

2

Story Illustration

Supporting storytelling by creating consistent visuals of the same character across different scenes.

3

Identity Mixing and Creative Experiments

Designing original characters by blending facial features from multiple people.

4

Virtual Try-On and Fashion

Visualizing different clothing, hairstyle, and accessory combinations while preserving the person's facial identity.

Pros & Cons

Pros

  • Rapid personalization within seconds without requiring any LoRA training or fine-tuning
  • Strong identity preservation from one to four input photos with high-fidelity face generation
  • Supports gender/age transformation and multi-ID mixing while maintaining identity consistency
  • Integrates with ControlNet, T2I-Adapter, and IP-Adapter for enhanced control capabilities
  • CVPR 2024 accepted; better ID preservation than test-time fine-tuning methods with significant speed advantage

Cons

  • Customization performance degrades significantly on Asian male faces (acknowledged by developers)
  • Still struggles with accurately rendering human hands in generated images
  • Higher style strength reduces ID fidelity; trade-off between stylization and identity preservation
  • Minimum 11GB GPU memory required, limiting accessibility on consumer hardware

Technical Details

Parameters

N/A

Architecture

Stacked ID Embedding + Diffusion

Training Data

Face identity dataset (filtered)

License

Apache 2.0

Features

  • Stacked ID Embedding
  • Multi-Reference Input (1-4 images)
  • Identity Mixing/Blending
  • SDXL Base Architecture
  • Zero-Shot Personalization
  • Multi-Style Generation
  • Age/Expression Variation
  • PhotoMaker V2 Enhanced Quality

Benchmark Results

Metric | Value | Compared To | Source
Face Similarity Score | 65% (FaceNet cosine) | InstantID: 72% | PhotoMaker Paper (arXiv)
Reference Images Required | 1-4 | InstantID: 1 | PhotoMaker GitHub
Identity Preservation | 0.58 (DINO score) | IP-Adapter-Face: 0.41 | PhotoMaker Paper (arXiv)
Inference Time | ~10 seconds (A100) | InstantID: ~5 seconds | PhotoMaker GitHub
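The face-similarity figures in the benchmark table are cosine similarities between face-recognition embeddings of the reference face and the generated face. A minimal sketch of the metric, using made-up three-dimensional stand-ins for the embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two face embeddings, as used for
    face-similarity scoring (higher = more similar identity)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

ref = np.array([0.6, 0.8, 0.0])  # embedding of the reference face (toy)
gen = np.array([0.8, 0.6, 0.0])  # embedding of the generated face (toy)
print(round(cosine_similarity(ref, gen), 2))  # 0.96
```

Real evaluations extract these embeddings with a face-recognition network such as FaceNet and average the score over many generated images.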

Available Platforms

Hugging Face
Replicate
fal.ai

Related Models


ControlNet

Lvmin Zhang|1.4B

ControlNet is a conditional control framework for Stable Diffusion models that enables precise structural guidance during image generation through various conditioning inputs such as edge maps, depth maps, human pose skeletons, segmentation masks, and normal maps. Developed by Lvmin Zhang and Maneesh Agrawala at Stanford University, ControlNet adds trainable copy branches to frozen diffusion model encoders, allowing the model to learn spatial conditioning without altering the original model's capabilities. This architecture preserves the base model's generation quality while adding fine-grained control over composition, structure, and spatial layout of generated images. ControlNet supports multiple conditioning types simultaneously, enabling complex multi-condition workflows where users can combine pose, depth, and edge information to guide generation with extraordinary precision. The framework revolutionized professional AI image generation workflows by solving the fundamental challenge of maintaining consistent spatial structures across generated images. It has become an essential tool for professional artists and designers who need precise control over character poses, architectural layouts, product placements, and scene compositions. ControlNet is open-source and available on Hugging Face with pre-trained models for various Stable Diffusion versions including SD 1.5 and SDXL. It integrates seamlessly with ComfyUI and Automatic1111. Concept artists, character designers, architectural visualizers, fashion designers, and animation studios rely on ControlNet for production workflows. Its influence has extended beyond Stable Diffusion, inspiring similar control mechanisms in FLUX.1 and other modern image generation models.

Open Source
4.8

InstantID

InstantX Team|N/A

InstantID is a zero-shot identity-preserving image generation framework developed by InstantX Team that can generate images of a specific person in various styles, poses, and contexts using only a single reference photograph. Unlike traditional face-swapping or personalization methods that require multiple reference images or time-consuming fine-tuning, InstantID achieves accurate identity preservation from just one facial photograph through an innovative architecture combining a face encoder, IP-Adapter, and ControlNet for facial landmark guidance. The system extracts detailed facial identity features from the reference image and injects them into the generation process, ensuring that the generated person maintains recognizable facial features, proportions, and characteristics across diverse output scenarios. InstantID supports various creative applications including generating portraits in different artistic styles, placing the person in imagined scenes or contexts, creating profile pictures and avatars, and producing marketing materials featuring consistent character representations. The model works with Stable Diffusion XL as its base and is open-source, available on GitHub and Hugging Face for local deployment. It integrates with ComfyUI through community-developed nodes and can be accessed through cloud APIs. Portrait photographers, social media content creators, marketing teams creating personalized campaigns, game developers designing character variants, and digital artists exploring identity-based creative work all use InstantID. The framework has influenced subsequent identity-preservation models and remains one of the most effective solutions for single-image identity transfer in the open-source ecosystem.

Open Source
4.7

IP-Adapter

Tencent|22M

IP-Adapter is an image prompt adapter developed by Tencent AI Lab that enables image-guided generation for text-to-image diffusion models without requiring any fine-tuning of the base model. The adapter works by extracting visual features from reference images using a CLIP image encoder and injecting these features into the diffusion model's cross-attention layers through a decoupled attention mechanism. This allows users to provide reference images as visual prompts alongside text prompts, guiding the generation process to produce images that share stylistic elements, compositional features, or visual characteristics with the reference while still following the text description. IP-Adapter supports multiple modes of operation including style transfer, where the generated image adopts the artistic style of the reference, and content transfer, where specific subjects or elements from the reference appear in the output. The adapter is lightweight, adding minimal computational overhead to the base model's inference process. It can be combined with other control mechanisms like ControlNet for multi-modal conditioning, enabling sophisticated workflows where pose, style, and content can each be controlled independently. IP-Adapter is open-source and available for various Stable Diffusion versions including SD 1.5 and SDXL. It integrates with ComfyUI and Automatic1111 through community extensions. Digital artists, product designers, brand managers, and content creators who need to maintain visual consistency across generated images or transfer specific aesthetic qualities from reference material particularly benefit from IP-Adapter's capabilities.

Open Source
4.6

IP-Adapter FaceID

Tencent|22M (adapter)

IP-Adapter FaceID is a specialized adapter module developed by Tencent AI Lab that injects facial identity information into the diffusion image generation process, enabling the creation of new images that faithfully preserve a specific person's facial features. Unlike traditional face-swapping approaches, IP-Adapter FaceID extracts face recognition feature vectors from the InsightFace library and feeds them into the diffusion model through cross-attention layers, allowing the model to generate diverse scenes, styles, and compositions while maintaining consistent facial identity. With only approximately 22 million adapter parameters layered on top of existing Stable Diffusion models, FaceID achieves remarkable identity preservation without requiring per-subject fine-tuning or multiple reference images. A single clear face photo is sufficient to generate the person in various artistic styles, different clothing, diverse environments, and novel poses. The adapter supports both SDXL and SD 1.5 base models and can be combined with other ControlNet adapters for additional control over pose, depth, and composition. IP-Adapter FaceID Plus variants incorporate additional CLIP image features alongside face embeddings for improved likeness and detail preservation. Released under the Apache 2.0 license, the model is fully open source and widely integrated into ComfyUI workflows and the Diffusers library. Common applications include personalized avatar creation, custom portrait generation in various artistic styles, character consistency in storytelling and comic creation, personalized marketing content, and social media content creation where maintaining a recognizable likeness across multiple generated images is essential.

Open Source
4.5

Quick Info

Parameters: N/A
Type: diffusion
License: Apache 2.0
Released: 2023-12
Architecture: Stacked ID Embedding + Diffusion
Rating: 4.5 / 5
Creator: Tencent

Tags

photomaker
face
realistic
image-to-image