SUPIR
SUPIR is an advanced AI image restoration and upscaling model developed by Tencent ARC researchers in 2024 that harnesses the generative power of SDXL, a large-scale Stable Diffusion model, for photo-realistic image enhancement. SUPIR stands for Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration in the Wild. The model introduces a degradation-aware encoder that analyzes the specific types of quality loss present in an input image and generates intelligent text prompts to guide the restoration process, effectively telling the diffusion model what kind of content needs to be restored and how. This intelligent prompting approach enables SUPIR to produce remarkably detailed and natural-looking upscaled results that go beyond simple pixel interpolation to generate semantically meaningful detail. The model leverages the vast visual knowledge embedded in SDXL's pre-trained weights to synthesize realistic textures, facial features, text, and fine patterns during upscaling. SUPIR excels particularly at restoring severely degraded images where traditional upscaling methods fail, including old photographs, heavily compressed web images, and low-resolution captures. The model supports high upscaling factors while maintaining coherent content and natural appearance. Released under a research-only license, SUPIR's code and weights are publicly available on GitHub. While computationally intensive due to its SDXL backbone, the model produces results that represent the current frontier of AI-powered image restoration quality. SUPIR is particularly valuable for professional photographers restoring archival images, forensic analysts enhancing surveillance footage, and digital artists who need maximum quality from limited source material.
Key Highlights
SDXL Generative Backbone
Uses Stable Diffusion XL as a generative prior, enabling restoration with exceptional perceptual quality and realistic detail generation
Semantically-Aware Restoration
A semantically-aware restoration process that generates content-appropriate details with automatic image captioning via the LLaVA language model
Extreme Degradation Handling
Can handle severe degradations including extreme blur, heavy noise and very low resolution where traditional methods fail
State-of-the-Art Quality
Represents the current frontier of image restoration, achieving state-of-the-art perceptual quality in benchmark evaluations
About
SUPIR (Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration in the Wild) is an advanced AI model that harnesses the power of large-scale generative models for high-quality image restoration and upscaling. Developed by researchers in 2024, SUPIR delivers groundbreaking results in photo-realistically reconstructing severely degraded, low-resolution, and poor-quality images, setting a new standard for what is achievable in real-world image restoration scenarios. It stands among the first works to successfully apply model scaling principles to the image restoration domain.
The technical foundation of SUPIR is built upon SDXL (Stable Diffusion XL) as its base diffusion model, adapted for image restoration through specialized training strategies and architectural modifications. SUPIR's most distinctive feature is its ability to guide the restoration process through text prompts provided by the user. Users can supply textual descriptions of the image content to steer the model toward more accurate and detailed reconstructions that align with the semantic meaning of the scene. This text-image alignment mechanism operates through CLIP vision encoders and language models, substantially enhancing the model's semantic understanding of scene content and enabling contextually appropriate detail generation that goes beyond pixel-level pattern matching.
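The conditioning text described above combines an automatically generated caption with the user's own description. As a rough illustration of the idea only (the function name and quality tags below are invented for this sketch; SUPIR's actual prompt handling lives inside its LLaVA/CLIP pipeline), the assembly can be thought of as simple string composition:

```python
def build_restoration_prompt(auto_caption: str,
                             user_prompt: str = "",
                             quality_tags: str = "highly detailed, photo-realistic") -> str:
    # Merge the automatic caption, the user's description, and generic
    # quality tags into one conditioning string for the diffusion prior.
    parts = [p.strip() for p in (auto_caption, user_prompt, quality_tags) if p.strip()]
    return ", ".join(parts)

# e.g. caption from the vision-language model plus a user hint:
prompt = build_restoration_prompt("a portrait of a woman", "sharp eyes")
```

The user-supplied portion is what steers the model toward semantically accurate detail; leaving it empty falls back to the automatic caption alone.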
With over 2 billion parameters, SUPIR's large model capacity enables the generation of realistic textures, sharp edges, and structurally consistent details even in severely degraded inputs where other models produce artifacts or hallucinations. Negative prompting support allows users to suppress unwanted artifacts and visual anomalies, further refining output quality. The model provides flexibility through configurable sampling steps and CFG (Classifier-Free Guidance) scale settings, enabling users to balance restoration quality against processing speed according to their specific requirements. Lower step counts produce faster results while higher step counts yield more detailed and refined outputs.
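The quality-versus-speed tradeoff above can be captured in a small settings sketch. The knob names here are illustrative, not SUPIR's actual interface, but they mirror the parameters the text describes (sampling steps, CFG scale, negative prompt):

```python
from dataclasses import dataclass, replace

@dataclass
class RestoreConfig:
    # Illustrative knob names; SUPIR's real interface may differ.
    steps: int = 50              # more sampling steps -> more refined detail, slower
    cfg_scale: float = 7.5       # strength of text-prompt guidance (CFG)
    negative_prompt: str = "blurry, oversmoothed, artifacts"

def fast_preview(cfg: RestoreConfig) -> RestoreConfig:
    # Halve the step count (floored at 10) for a quick low-fidelity pass.
    return replace(cfg, steps=max(10, cfg.steps // 2))
```

A common workflow is to iterate on prompts with a fast preview configuration, then rerun the final image at the full step count.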
Application domains encompass both professional and personal needs across a broad spectrum. Common scenarios include vintage family photo restoration, enhancement of low-resolution security camera footage, upscaling small web-sourced images to high resolution, and enriching detail in digital artwork. SUPIR excels particularly in face restoration, naturally reconstructing eye details, mouth features, and skin textures in portrait photographs with remarkable fidelity. Professional use cases extend to real estate photography enhancement, e-commerce product image preparation, and pre-print quality assurance workflows where maximum visual fidelity is essential. Historical photograph archive digitization and museum collection enhancement represent additional institutional applications.
The model does present certain resource considerations that users should understand. Its large size demands substantial VRAM: a minimum of 12GB, with 24GB or more recommended for optimal full-resolution performance. Processing times are longer than those of lightweight super-resolution models, reflecting the computational cost of its generative approach, but this investment is repaid by the unmatched realism and detail richness of the output. Batch processing with tiling support is available to manage memory constraints when processing high-resolution inputs, and reducing tile sizes makes the model workable on more memory-constrained systems.
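Tiled processing of the kind mentioned above is conventionally implemented by sliding overlapping windows across each image axis and blending the seams afterward. A small sketch of that coordinate math (an assumed generic scheme, not SUPIR's exact implementation):

```python
def tile_spans(length: int, tile: int, overlap: int) -> list[tuple[int, int]]:
    # Cover `length` pixels with `tile`-sized windows that overlap by
    # `overlap` pixels, so each window fits in GPU memory on its own.
    if tile >= length:
        return [(0, length)]
    stride = tile - overlap
    spans, start = [], 0
    while True:
        end = start + tile
        if end >= length:
            # Snap the final window to the image edge.
            spans.append((length - tile, length))
            break
        spans.append((start, end))
        start += stride

    return spans

# A 2048-pixel axis with 1024-pixel tiles and 128 pixels of overlap:
spans = tile_spans(2048, 1024, 128)
```

Smaller tiles mean more windows and more passes through the model, which is exactly the memory-for-time trade the text describes.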
SUPIR represents a significant milestone in demonstrating the potential of large language and vision models for image restoration tasks. Its text-guided restoration approach provides users with unprecedented control over the enhancement process, enabling fine-grained direction of output characteristics through natural language. The model has garnered substantial attention from the research community and continues to influence the future direction of super-resolution and image restoration research. Available as open source on GitHub, SUPIR serves as an ideal tool for academic research and experimental applications pushing the boundaries of diffusion-based restoration quality.
Use Cases
Severe Image Restoration
Restoring extremely degraded, blurry or very low-resolution images to high quality
Archive Photo Recovery
Significantly improving historical or archive photographs to approach modern quality standards
Forensic Image Enhancement
Revealing details in security camera footage and other low-quality images for forensic analysis
Professional Print Preparation
Preparing professional prints by elevating low-resolution images to quality usable in large print formats
Pros & Cons
Pros
- SDXL-based image restoration and upscaling with photorealistic results
- Upscaling while preserving facial details and texture information
- Restoration guidance with text prompts
- Open source and widely used in research community
Cons
- Very high VRAM requirement — 12GB minimum, 24GB+ recommended
- Long processing time — not suitable for real-time use
- Hallucination in some cases — may add non-existent details
- Complex setup — requires multiple model downloads
Technical Details
Parameters
2B+ (SDXL backbone)
Architecture
SDXL-based diffusion model with degradation-aware encoder
Training Data
Large-scale dataset with synthetic degradation pairs
License
Research Only
Features
- SDXL-Based Generative Restoration
- LLaVA Semantic Image Captioning
- Degradation-Aware Encoding
- Extreme Low-Resolution Restoration
- Photo-Realistic Detail Generation
- Multi-Type Degradation Handling
Benchmark Results
| Metric | Value | Compared To | Source |
|---|---|---|---|
| PSNR (DIV2K-Val, ×4) | 27.80 dB | StableSR: 26.50 dB | arXiv 2401.13627 |
| LPIPS (DIV2K-Val) | 0.195 | StableSR: 0.250 | arXiv 2401.13627 |
| Maximum Upscale | ×4 (SDXL-based) | SwinIR: ×4 | GitHub Fanghua-Yu/SUPIR |
| Supported Input | 512×512 → 2048×2048 | — | SUPIR Docs |
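The PSNR figures in the table derive directly from mean squared error between the restored image and the ground truth. A minimal helper makes the relationship concrete (256-level images assumed, so the peak value is 255):

```python
import math

def psnr(mse: float, max_val: float = 255.0) -> float:
    # Peak signal-to-noise ratio in dB from mean squared error:
    # PSNR = 10 * log10(max_val^2 / MSE). Higher is better;
    # identical images have zero error and infinite PSNR.
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Note that PSNR rewards pixel-level fidelity, while LPIPS (also reported above) measures perceptual similarity via deep features, which is why generative restorers like SUPIR are usually judged on both.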
Related Models
Real-ESRGAN
Real-ESRGAN is an open-source image upscaling and restoration model developed by Xintao Wang and collaborators at Tencent ARC Lab that enhances low-resolution, degraded, or compressed images to high-resolution outputs with remarkable detail recovery. Released in 2021 under the BSD license, Real-ESRGAN builds on the original ESRGAN architecture by introducing a high-order degradation modeling approach that simulates the complex, unpredictable quality loss found in real-world images, including compression artifacts, noise, blur, and downsampling. The model uses a U-Net architecture with Residual-in-Residual Dense Blocks as its generator network, trained with a combination of perceptual loss, GAN loss, and pixel loss to produce sharp, natural-looking upscaled results. Real-ESRGAN supports upscaling factors of 2x, 4x, and higher, and includes specialized model variants for anime and illustration content alongside the general-purpose photographic model. The model handles real-world degradations far better than its predecessor ESRGAN, which was trained only on synthetic degradation patterns. Real-ESRGAN has become one of the most widely deployed AI upscaling solutions, integrated into numerous applications including desktop tools, web services, mobile apps, and professional image editing workflows. The model runs efficiently on both CPU and GPU, with the lighter RealESRGAN-x4plus-anime variant optimized for consumer hardware. As a fully open-source project available on GitHub with pre-trained weights, it serves as the backbone for popular tools like Upscayl and various ComfyUI nodes. Real-ESRGAN is essential for photographers, content creators, game developers, and anyone who needs to enhance image resolution while preserving natural appearance and adding realistic detail.
Topaz Gigapixel AI
Topaz Gigapixel AI is a commercial desktop application for AI-powered image upscaling and enhancement developed by Topaz Labs, positioned as an industry-standard tool for professional photographers, graphic designers, and image processing specialists. Available on Windows and macOS, the software uses a proprietary hybrid neural network architecture that combines multiple AI models to upscale images by up to 600 percent while preserving and even enhancing fine details, textures, and sharpness. Topaz Gigapixel AI includes specialized processing modes for different content types including faces, standard photography, computer graphics, and low-resolution sources, with each mode optimized to produce the best possible results for its target content. The software features intelligent face detection and enhancement that improves facial details during upscaling, producing natural-looking results even from very low-resolution source images. Topaz Gigapixel AI supports batch processing for handling large volumes of images and integrates with Adobe Lightroom and Photoshop as a plugin, fitting seamlessly into professional photography workflows. The application processes images locally on the user's machine using GPU acceleration, ensuring privacy and fast processing without requiring an internet connection. Output quality is widely regarded as among the best available in commercial upscaling software, with particular strength in preserving natural textures and avoiding the artificial smoothing common in many AI upscalers. As a proprietary product with a one-time purchase or subscription model, Topaz Gigapixel AI is particularly valued by professional photographers enlarging prints, real estate photographers enhancing property images, forensic analysts improving evidence imagery, and archivists restoring historical photographs to modern resolution standards.
Upscayl
Upscayl is a free and open-source desktop application for AI-powered image upscaling, built on top of Real-ESRGAN and other super-resolution models. Developed by Nayam Amarshe and TGS963, Upscayl provides a user-friendly graphical interface that makes advanced AI image upscaling accessible to non-technical users on Windows, macOS, and Linux platforms. The application wraps multiple AI upscaling models in an Electron-based desktop app, allowing users to enhance image resolution with just a few clicks without any command-line knowledge or Python environment setup. Upscayl includes several pre-installed upscaling models optimized for different content types including general photography, digital art, anime, and sharpening, with each model producing different aesthetic characteristics suited to its target content. Users can select upscaling factors of 2x, 3x, or 4x and process individual images or entire folders through batch processing. The application supports common image formats including PNG, JPG, and WebP, and provides options for output format and quality settings. Upscayl also supports custom model loading, allowing users to import additional NCNN-compatible upscaling models from the community. Released under the AGPL-3.0 license, Upscayl is fully open source with its code available on GitHub and has accumulated a large community of users and contributors. The application runs entirely locally with no internet connection required, ensuring privacy for sensitive images. Upscayl is particularly popular among photographers, graphic designers, content creators, and hobbyists who need a simple, free solution for enhancing image quality without subscriptions or cloud processing dependencies.
CodeFormer
CodeFormer is a state-of-the-art blind face restoration model developed by researchers at Nanyang Technological University in collaboration with Tencent ARC, presented at NeurIPS 2022. The model employs a unique Transformer-based architecture with a discrete codebook lookup mechanism to restore severely degraded facial images with exceptional fidelity. Its most distinguishing feature is an adjustable w parameter ranging from 0.0 to 1.0 that gives users precise control over the balance between identity preservation and restoration quality. Architecturally, CodeFormer consists of three core components: a VQGAN encoder-decoder that learns discrete visual codes from high-quality face datasets, a codebook that stores these learned representations, and a Transformer module that predicts optimal code combinations during restoration. This approach enables the model to produce plausible facial details even under extreme degradation because it draws information from learned priors rather than solely from the corrupted input. In benchmark evaluations on CelebA-HQ and WIDER-Face datasets, CodeFormer achieves superior results across FID, NIQE, and identity similarity metrics compared to previous methods. Practical applications include restoring old family photographs, enhancing faces in AI-generated images, extracting facial details from low-resolution video frames, and professional photo retouching. The model is open source, integrates with popular tools like ComfyUI, AUTOMATIC1111 WebUI, and Fooocus, and offers cloud inference through Replicate API and Hugging Face Spaces demos for accessible experimentation.