What makes SwinIR different from CNN-based upscaling models?

SwinIR uses the Swin Transformer architecture, with shifted window self-attention in place of stacked convolutional layers for deep feature extraction. Because the window grid shifts between successive layers, information propagates across window boundaries, so the model can capture long-range dependencies (how distant parts of an image relate to each other) that CNN-based models with limited receptive fields struggle to learn. SwinIR nonetheless remains computationally efficient, because each self-attention operation is restricted to a local window rather than the whole image.
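To make the shifted window idea concrete, here is a minimal sketch in plain Python. It is not SwinIR's actual code: the function names `window_partition` and `cyclic_shift`, the 4x4 toy grid, and the window size of 2 are all assumptions chosen for illustration. It only shows how shifting the grid before partitioning places pixels from different regular windows into the same attention window.

```python
def window_partition(grid, window):
    """Split an H x W grid (list of rows) into non-overlapping window x window tiles."""
    height, width = len(grid), len(grid[0])
    assert height % window == 0 and width % window == 0
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window] for row in grid[top:top + window]])
    return tiles


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` pixels, wrapping at the borders."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]


# 4x4 toy "image" whose values are simply the pixel indices 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Regular layer: attention would run inside each 2x2 tile independently.
regular = window_partition(image, window=2)

# Shifted layer: shift by half a window first, then partition again.
shifted = window_partition(cyclic_shift(image, shift=1), window=2)

print(regular[0])  # [[0, 1], [4, 5]]
print(shifted[0])  # [[5, 6], [9, 10]]
```

In the shifted layer, pixels 5, 6, 9 and 10 share a window even though each of them belonged to a different window in the regular layer, which is how information crosses window boundaries. A full implementation would also mask attention between pixels that only became neighbours through the wrap-around; that step is omitted here.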

How does SwinIR compare to Real-ESRGAN?

SwinIR and Real-ESRGAN serve complementary roles in image restoration. SwinIR is a research-oriented model that achieves superior scores on synthetic benchmarks with clean degradation patterns (bicubic downsampling). Real-ESRGAN is more practically oriented, trained on complex real-world degradations that include blur, noise, and compression artifacts combined. For real photographs with unknown degradations, Real-ESRGAN typically produces better visual results, while SwinIR excels on controlled benchmark evaluations.

Can SwinIR handle real-world degraded images?

Most SwinIR checkpoints are trained on synthetic degradations such as bicubic downsampling and Gaussian noise, so they may not perform optimally on real-world images with complex, combined degradation types. A real-world super-resolution variant trained for unknown degradations is also provided, and SwinIR's architecture has been adopted and retrained by subsequent models for real-world scenarios. For practical real-world restoration, models trained with more diverse degradation pipelines, such as the one used by Real-ESRGAN, are generally more effective.
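As a rough illustration of what "synthetic degradation" means, the sketch below builds a low-quality input from a clean image by downsampling and then adding Gaussian noise. It is a simplification under stated assumptions: block averaging stands in for bicubic downsampling, the helper names are hypothetical, and the image is a small synthetic gradient rather than real training data.

```python
import random


def average_pool_downsample(image, scale):
    """Downsample by averaging scale x scale blocks (a simple stand-in for bicubic)."""
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height - height % scale, scale):
        row = []
        for left in range(0, width - width % scale, scale):
            block = [image[top + dy][left + dx]
                     for dy in range(scale) for dx in range(scale)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clamp the result to the 0..255 range."""
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 8x8 synthetic grayscale gradient used as the "clean" high-resolution image.
high_res = [[float((x * 16 + y * 8) % 256) for x in range(8)] for y in range(8)]

low_res = average_pool_downsample(high_res, scale=4)            # 2x2 result
noisy_low_res = add_gaussian_noise(low_res, sigma=15.0, rng=rng)
```

A model trained on pairs like `(noisy_low_res, high_res)` only ever sees this one clean, known degradation. Real photographs typically combine blur, sensor noise, and compression in unknown proportions, which is why training on more varied pipelines tends to transfer better.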

What are the hardware requirements for SwinIR?

SwinIR is relatively lightweight compared to many transformer-based models, with the super-resolution variant containing approximately 11.8 million parameters. For inference, a GPU with 4GB or more VRAM is sufficient for processing standard-resolution images. The model also runs on CPU, though processing times will be longer. For batch processing or training, 8GB or more VRAM is recommended. SwinIR's efficiency makes it accessible on consumer-grade hardware including laptops with discrete GPUs.
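The parameter counts translate into weight storage with simple arithmetic. The sketch below assumes 4 bytes per parameter for 32-bit floats and 2 bytes for 16-bit floats; the helper name `weight_memory_mib` is hypothetical. It covers the weights only: activations, which grow with image size, usually account for most of the memory used during inference and are not estimated here.

```python
def weight_memory_mib(num_parameters, bytes_per_parameter):
    """Memory needed to hold the weights alone, in mebibytes."""
    return num_parameters * bytes_per_parameter / (1024 ** 2)


FULL_MODEL_PARAMS = 11_800_000   # full super-resolution variant, as quoted above
LIGHTWEIGHT_PARAMS = 878_000     # lightweight variant, as quoted elsewhere on this page

full_fp32 = weight_memory_mib(FULL_MODEL_PARAMS, 4)    # roughly 45 MiB
full_fp16 = weight_memory_mib(FULL_MODEL_PARAMS, 2)    # roughly 22.5 MiB
light_fp32 = weight_memory_mib(LIGHTWEIGHT_PARAMS, 4)  # a little over 3 MiB

print(f"full model, fp32 weights:  {full_fp32:.1f} MiB")
print(f"full model, fp16 weights:  {full_fp16:.1f} MiB")
print(f"lightweight, fp32 weights: {light_fp32:.1f} MiB")
```

The weights are small next to a 4GB budget, so in practice the image resolution, not the model size, tends to decide whether a given GPU is sufficient.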

What super-resolution scales does SwinIR support?

SwinIR supports three standard super-resolution scales: 2x, 3x, and 4x upscaling. Each scale requires a separately trained model checkpoint optimized for that specific magnification factor. The 2x model preserves the most original detail with minimal artifacts, while the 4x model adds the most synthesized detail but with higher risk of hallucinating incorrect textures. Pre-trained weights for all three scales are available in the official repository for both classical and lightweight model variants.

Is SwinIR open source?

Yes, SwinIR is fully open source with code and pre-trained model weights available on GitHub under the Apache 2.0 license. The repository includes training scripts, evaluation code, and pre-trained checkpoints for all supported tasks (super-resolution, denoising, and JPEG artifact removal) and all supported scales. The permissive Apache 2.0 license allows both academic and commercial use, making SwinIR freely available for integration into any project.

SwinIR

Open Source

4.4

ETH Zurich

SwinIR is a Transformer-based image restoration model developed by Jingyun Liang and the research team at ETH Zurich that achieves state-of-the-art performance across multiple restoration tasks including super-resolution, image denoising, and JPEG compression artifact removal. Released in August 2021 under the Apache 2.0 license, SwinIR adapts the Swin Transformer architecture for image processing, using shifted window attention to capture both local detail and global context efficiently. The model consists of three main modules: a shallow feature extraction layer, a deep feature extraction module built from Swin Transformer blocks with residual connections, and a reconstruction module that produces the restored high-quality output. With roughly 11.8 million parameters in its full super-resolution variant, SwinIR is lightweight compared to many competing models while delivering superior or comparable results. The model supports 2x, 3x, and 4x super-resolution, classical and lightweight variants for different quality-speed trade-offs, and separate configurations for denoising at various noise levels and JPEG artifact removal at different quality factors. SwinIR demonstrated that Transformer architectures could outperform CNN-based approaches in low-level image processing tasks, marking an important milestone in the field. The model is fully open source, with pre-trained weights available on GitHub, and integrates well with standard deep learning frameworks. SwinIR is widely used in academic research as a baseline for image restoration benchmarks and in practical applications by photographers, graphic designers, and content creators who need high-quality image enhancement. Its efficient architecture makes it suitable for deployment on consumer hardware without specialized GPU requirements.

Image Upscale

Key Highlights

Swin Transformer Architecture

Efficiently captures both local texture details and long-range structural dependencies with shifted window attention mechanism

Multiple Restoration Tasks

Supports various image restoration tasks including super-resolution, image denoising and JPEG compression artifact removal

Efficient Computation

An efficient transformer architecture providing superior performance with fewer parameters and computations compared to CNN-based methods

Benchmark Leader

Results outperforming CNN-based methods on standard benchmarks including Set5, Set14, BSD100, Urban100 and Manga109

About

SwinIR (Swin Transformer for Image Restoration) is a Transformer-based image restoration model that achieves state-of-the-art performance across multiple restoration tasks including super-resolution, image denoising, and JPEG compression artifact removal. Developed by Jingyun Liang and the research team at ETH Zurich in 2021, SwinIR represents a pivotal shift from CNN-based approaches to Transformer architectures in the image restoration domain, demonstrating that vision transformer technology is equally effective for low-level image processing tasks that were traditionally dominated by convolutional networks.

The technical foundation of SwinIR relies on Swin Transformer blocks that employ a shifted window mechanism for computing self-attention. Restricting attention to fixed-size windows reduces the quadratic computational complexity of global self-attention to linear complexity in the number of pixels, enabling efficient processing of high-resolution images that would be prohibitively expensive with global attention. The architecture comprises three main components: a shallow feature extraction layer using a single convolutional layer, a deep feature extraction module consisting of multiple Residual Swin Transformer Blocks (RSTB) with residual connections, and an image reconstruction module tailored to each specific task. Each RSTB stacks several Swin Transformer layers and ends with a convolutional layer, combining the attention-based features with the inductive bias of convolution.
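The difference between quadratic and linear scaling can be checked numerically. The sketch below uses the per-layer cost formulas given in the original Swin Transformer paper: 4hwC² + 2(hw)²C for global self-attention and 4hwC² + 2M²hwC for attention inside M x M windows. The channel width of 180 and window size of 8 are illustrative values assumed here, and the function names are hypothetical.

```python
def global_attention_flops(height, width, channels):
    """Approximate cost of self-attention computed over all height*width tokens."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * tokens ** 2 * channels


def window_attention_flops(height, width, channels, window):
    """Approximate cost when attention is restricted to window x window tiles."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * window ** 2 * tokens * channels


CHANNELS, WINDOW = 180, 8  # illustrative settings, assumed for this sketch

for side in (64, 128, 256):
    g = global_attention_flops(side, side, CHANNELS)
    w = window_attention_flops(side, side, CHANNELS, WINDOW)
    print(f"{side}x{side}: global/window cost ratio = {g / w:.1f}")
```

Doubling the image side quadruples the pixel count. The windowed cost grows by exactly that factor of four, while the global cost grows faster because of its (hw)² term, so the ratio printed above keeps widening as the image gets larger.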

SwinIR has been trained and evaluated across five distinct restoration tasks: classical super-resolution with bicubic downsampling, lightweight super-resolution with reduced parameters for resource-constrained deployment, real-world super-resolution handling unknown degradations, JPEG compression artifact removal at various quality levels, and both color and grayscale image denoising at multiple noise levels. Pre-trained weights are provided for each task configuration individually. The lightweight variant achieves impressive results with only 878K parameters, while the full model with 11.8M parameters delivers maximum quality, providing deployment flexibility ranging from mobile devices to server environments.

Practical applications span diverse industries and use cases with broad professional relevance. Photographers and restoration specialists use SwinIR for recovering degraded vintage photographs and enhancing scan quality from archival materials. Media companies employ it for archival footage restoration and broadcast quality improvement. Web platforms integrate it into upload pipelines for automatic image enhancement of user-generated content. In scientific domains, SwinIR finds applications in medical imaging for enhancing MRI and CT scan resolution, satellite imagery processing for remote sensing analysis, and microscopy image enhancement. Its JPEG artifact removal capability is particularly valuable for rescuing images that have suffered quality degradation through repeated social media sharing and compression cycles. Educational publishing also benefits from its ability to enhance visual materials.

In the academic landscape, SwinIR serves as a benchmark reference model for image restoration research worldwide. It surpasses CNN-based methods on traditional metrics like PSNR and SSIM while remaining competitive on perceptual quality measures such as LPIPS and FID. The model is implemented in PyTorch and can be exported to ONNX format for cross-platform deployment flexibility across different inference frameworks. Its widespread adoption by the research community has spawned numerous variants, adaptations, and extensions that continue to push the boundaries of restoration quality in competitions and real-world applications.

One of SwinIR's most significant advantages is its ability to handle multiple restoration tasks within a single architectural framework, reducing the need to deploy and maintain separate specialized models in production environments. Released under the Apache 2.0 license, it is freely available for both academic research and commercial applications without restriction. As a foundational work in Transformer-based image restoration, SwinIR has directly inspired next-generation models including HAT, Restormer, and SRFormer, cementing its enduring legacy as a transformative contribution to the image processing research field.

Use Cases

Academic Image Restoration

Using as a baseline architecture and benchmark model in image restoration research

Photo Upscaling

Enhancing detail and sharpness by upscaling low-resolution photos at 2x, 3x or 4x

JPEG Artifact Removal

Cleaning up blocking and blurring artifacts caused by heavy JPEG compression

Image Denoising

Removing noise from images shot in low light or with high ISO values

Pros & Cons

Pros

Outperforms prior state-of-the-art methods by up to 0.14-0.45 dB across tasks while reducing the total parameter count by up to 67%
Exceptional parameter efficiency with 11.8M parameters vs IPT's 115M+
Produces visually pleasing images with clear and sharp edges; avoids artifacts common in other methods
Strong performance across multiple restoration tasks including super-resolution, denoising, and JPEG compression reduction

Cons

Newer models like HAT have surpassed SwinIR in PSNR and SSIM scores across all scales
Room for improvement in handling periodic noise and combining local-global features
As a 2021 model, may struggle to compete with most current architectures
4.2% improvement seen when merged with Lewin architecture; standalone may be insufficient

Technical Details

Parameters

12M

Architecture

Swin Transformer with residual and convolutional layers

Training Data

DIV2K and Flickr2K datasets for training, Set5/Set14/Urban100 for evaluation

License

Apache 2.0

Features

Shifted Window Self-Attention
2x/3x/4x Super-Resolution
JPEG Artifact Removal
Image Denoising
Residual Swin Transformer Blocks
Lightweight Model Architecture

Benchmark Results

Metric	Value	Compared To	Source
PSNR (Set5, ×4)	32.92 dB	RCAN: 32.63 dB	ICCV 2021 Workshop Paper
SSIM (Set5, ×4)	0.9044	RCAN: 0.9002	ICCV 2021 Workshop Paper
PSNR (Urban100, ×4)	27.45 dB	RCAN: 26.82 dB	ICCV 2021 Workshop Paper
Parameter Count	11.8M	EDSR: 43M	GitHub JingyunLiang/SwinIR
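For readers unfamiliar with the PSNR figures in the table, the sketch below shows how the metric is defined: 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference and restored images. It is a minimal pure-Python version on a toy 2x2 grayscale image, with a hypothetical function name. Published super-resolution benchmarks typically evaluate on the luminance channel with some border cropping, which this sketch does not reproduce.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    squared_errors = [(a - b) ** 2
                      for ref_row, out_row in zip(reference, restored)
                      for a, b in zip(ref_row, out_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[100, 150], [200, 250]]
restored = [[101, 149], [202, 250]]   # small per-pixel errors: 1, 1, 2, 0

print(f"PSNR: {psnr(reference, restored):.2f} dB")

# A PSNR gain of d dB corresponds to the MSE shrinking by a factor of 10 ** (-d / 10).
gain_db = 32.92 - 32.63               # SwinIR vs RCAN on Set5 x4, from the table above
mse_factor = 10 ** (-gain_db / 10)
print(f"{gain_db:.2f} dB gain -> MSE multiplied by {mse_factor:.3f}")
```

Because the scale is logarithmic, the 0.29 dB gap between SwinIR and RCAN on Set5 corresponds to a mean squared error that is lower by roughly 6 to 7 percent.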

Available Platforms

Hugging Face

Replicate

Frequently Asked Questions

Related Models

Real-ESRGAN

Tencent ARC|N/A

Real-ESRGAN is an open-source image upscaling and restoration model developed by Xintao Wang and collaborators at Tencent ARC Lab that enhances low-resolution, degraded, or compressed images to high-resolution outputs with remarkable detail recovery. Released in 2021 under the BSD license, Real-ESRGAN builds on the original ESRGAN architecture by introducing a high-order degradation modeling approach that simulates the complex, unpredictable quality loss found in real-world images, including compression artifacts, noise, blur, and downsampling. The model uses a generator built from Residual-in-Residual Dense Blocks (RRDB), paired with a U-Net discriminator, and is trained with a combination of perceptual loss, GAN loss, and pixel loss to produce sharp, natural-looking upscaled results. Real-ESRGAN supports upscaling factors of 2x, 4x, and higher, and includes specialized model variants for anime and illustration content alongside the general-purpose photographic model. The model handles real-world degradations far better than its predecessor ESRGAN, which was trained only on synthetic degradation patterns. Real-ESRGAN has become one of the most widely deployed AI upscaling solutions, integrated into numerous applications including desktop tools, web services, mobile apps, and professional image editing workflows. The model runs efficiently on both CPU and GPU, with the lighter RealESRGAN-x4plus-anime variant optimized for consumer hardware. As a fully open-source project available on GitHub with pre-trained weights, it serves as the backbone for popular tools like Upscayl and various ComfyUI nodes. Real-ESRGAN is essential for photographers, content creators, game developers, and anyone who needs to enhance image resolution while preserving natural appearance and adding realistic detail.

Open Source

4.7

Topaz Gigapixel AI

Topaz Labs|N/A

Topaz Gigapixel AI is a commercial desktop application for AI-powered image upscaling and enhancement developed by Topaz Labs, positioned as an industry-standard tool for professional photographers, graphic designers, and image processing specialists. Available on Windows and macOS, the software uses a proprietary hybrid neural network architecture that combines multiple AI models to upscale images by up to 600 percent while preserving and even enhancing fine details, textures, and sharpness. Topaz Gigapixel AI includes specialized processing modes for different content types including faces, standard photography, computer graphics, and low-resolution sources, with each mode optimized to produce the best possible results for its target content. The software features intelligent face detection and enhancement that improves facial details during upscaling, producing natural-looking results even from very low-resolution source images. Topaz Gigapixel AI supports batch processing for handling large volumes of images and integrates with Adobe Lightroom and Photoshop as a plugin, fitting seamlessly into professional photography workflows. The application processes images locally on the user's machine using GPU acceleration, ensuring privacy and fast processing without requiring an internet connection. Output quality is widely regarded as among the best available in commercial upscaling software, with particular strength in preserving natural textures and avoiding the artificial smoothing common in many AI upscalers. As a proprietary product with a one-time purchase or subscription model, Topaz Gigapixel AI is particularly valued by professional photographers enlarging prints, real estate photographers enhancing property images, forensic analysts improving evidence imagery, and archivists restoring historical photographs to modern resolution standards.

Proprietary

4.6

Upscayl

Upscayl Team|N/A

Upscayl is a free and open-source desktop application for AI-powered image upscaling, built on top of Real-ESRGAN and other super-resolution models. Developed by Nayam Amarshe and TGS963, Upscayl provides a user-friendly graphical interface that makes advanced AI image upscaling accessible to non-technical users on Windows, macOS, and Linux platforms. The application wraps multiple AI upscaling models in an Electron-based desktop app, allowing users to enhance image resolution with just a few clicks without any command-line knowledge or Python environment setup. Upscayl includes several pre-installed upscaling models optimized for different content types including general photography, digital art, anime, and sharpening, with each model producing different aesthetic characteristics suited to its target content. Users can select upscaling factors of 2x, 3x, or 4x and process individual images or entire folders through batch processing. The application supports common image formats including PNG, JPG, and WebP, and provides options for output format and quality settings. Upscayl also supports custom model loading, allowing users to import additional NCNN-compatible upscaling models from the community. Released under the AGPL-3.0 license, Upscayl is fully open source with its code available on GitHub and has accumulated a large community of users and contributors. The application runs entirely locally with no internet connection required, ensuring privacy for sensitive images. Upscayl is particularly popular among photographers, graphic designers, content creators, and hobbyists who need a simple, free solution for enhancing image quality without subscriptions or cloud processing dependencies.

Open Source

4.5

CodeFormer

S-Lab, Nanyang Technological University|N/A

CodeFormer is a state-of-the-art blind face restoration model developed by researchers at S-Lab, Nanyang Technological University, and presented at NeurIPS 2022. The model employs a Transformer-based architecture with a discrete codebook lookup mechanism to restore severely degraded facial images with exceptional fidelity. Its most distinguishing feature is an adjustable w parameter ranging from 0.0 to 1.0 that gives users precise control over the balance between identity preservation and restoration quality. Architecturally, CodeFormer consists of three core components: a VQGAN encoder-decoder that learns discrete visual codes from high-quality face datasets, a codebook that stores these learned representations, and a Transformer module that predicts optimal code combinations during restoration. This approach enables the model to produce plausible facial details even under extreme degradation because it draws information from learned priors rather than solely from the corrupted input. In benchmark evaluations on CelebA-HQ and WIDER-Face datasets, CodeFormer achieves superior results across FID, NIQE, and identity similarity metrics compared to previous methods. Practical applications include restoring old family photographs, enhancing faces in AI-generated images, extracting facial details from low-resolution video frames, and professional photo retouching. The model is open source, integrates with popular tools like ComfyUI, AUTOMATIC1111 WebUI, and Fooocus, and offers cloud inference through Replicate API and Hugging Face Spaces demos for accessible experimentation.

Open Source

4.6

Quick Info

Parameters: 12M

Type: Transformer

License: Apache 2.0

Released: 2021-08

Architecture: Swin Transformer with residual and convolutional layers

Rating: 4.4 / 5

Creator: ETH Zurich
