
Wonder3D

Open Source
4.1
Tsinghua University

Wonder3D is a single-image 3D reconstruction model developed by researchers at Tsinghua University that generates both multi-view color images and corresponding normal maps from a single input image for high-quality 3D mesh reconstruction. Accepted at CVPR 2024, Wonder3D introduces a cross-domain diffusion approach that simultaneously produces RGB color views and geometric normal maps, ensuring that the generated views are both visually consistent and geometrically accurate. This dual-output strategy provides significantly richer information for downstream 3D reconstruction compared to methods that generate only color images. The model uses a multi-view cross-domain attention mechanism that enforces consistency between the color and normal map domains during the diffusion process, resulting in coherent multi-view outputs that faithfully represent the 3D structure of the input object. Wonder3D can reconstruct a complete textured 3D mesh from a single photograph in approximately two to three minutes. The output meshes feature clean geometry with well-defined surface details, making them suitable for use in professional 3D workflows. Released under the Apache 2.0 license, the model is fully open source with code and pre-trained weights available on GitHub. Wonder3D handles diverse object categories including characters, animals, furniture, and manufactured objects with consistent quality. The model is particularly valuable for applications in game development, animation, product visualization, and virtual reality where high-quality 3D assets need to be created from limited reference imagery. Its cross-domain approach has influenced subsequent research in multi-view generation for 3D reconstruction.

Image to 3D

Key Highlights

Dual Color + Normal Map Generation

Simultaneously generates both multi-view color images and surface normal maps, providing explicit geometric information that significantly improves 3D reconstruction accuracy

Cross-Domain Attention Mechanism

Novel attention design enables information sharing between color and normal map branches during diffusion, ensuring geometric consistency between appearance and shape outputs

Fine Surface Detail Recovery

Normal map supervision provides strong geometric constraints that recover fine surface details and sharp features typically lost in color-only 3D reconstruction methods

Apache 2.0 Open Research

Fully open source from Tsinghua University under the Apache 2.0 license, with reproducible code and weights, advancing the state of the art in geometric 3D reconstruction

About

Released in 2023 and accepted at CVPR 2024, Wonder3D introduces a cross-domain diffusion architecture that significantly advanced image-to-3D conversion quality. Its results stand out particularly in geometric consistency and texture quality.

Wonder3D's technical architecture rests on two key innovations. The first is a cross-domain attention mechanism that enables information exchange between RGB color images and normal maps, letting each domain strengthen the other. The second is a multi-view consistency module that keeps images generated from different viewing angles geometrically compatible. The model is built on Stable Diffusion and fine-tuned on the Objaverse dataset; during generation it produces six viewing angles and their corresponding normal maps simultaneously.
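To make the cross-domain idea concrete, here is a minimal PyTorch sketch assuming the published intuition: RGB and normal-map tokens are concatenated along the sequence axis so a single self-attention pass lets each domain condition the other. The class, names, and tensor shapes are hypothetical illustrations, not Wonder3D's actual implementation.

```python
import torch
import torch.nn as nn

class CrossDomainAttention(nn.Module):
    """Illustrative joint attention over RGB and normal-map tokens."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, normal_tokens: torch.Tensor):
        # Concatenate along the sequence axis so one self-attention pass
        # lets color tokens attend to normal tokens and vice versa.
        joint = self.norm(torch.cat([rgb_tokens, normal_tokens], dim=1))
        out, _ = self.attn(joint, joint, joint)
        n = rgb_tokens.shape[1]
        return out[:, :n], out[:, n:]  # split back into the two domains

# Hypothetical shapes: (batch, tokens, dim) for flattened latent patches.
rgb = torch.randn(2, 64, 320)
nrm = torch.randn(2, 64, 320)
rgb_out, nrm_out = CrossDomainAttention(320)(rgb, nrm)
```

In the real model this sharing happens inside the diffusion U-Net at every denoising step, which is what keeps appearance and geometry consistent.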

In terms of performance, Wonder3D achieves 18.6 dB PSNR and 0.862 SSIM on the GSO (Google Scanned Objects) dataset. While this PSNR trails Unique3D's 20.1 dB, Wonder3D remains competitive in normal map generation quality and geometric consistency, with particularly strong results in preserving fine geometric details and accurately reconstructing complex structures. The model produces a 3D mesh from a single image in approximately 2-3 minutes.
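For context on the PSNR figure, the metric is a log-scaled mean squared error between a generated view and a ground-truth render. A minimal implementation of the standard definition (not the paper's exact evaluation script), assuming float images in [0, 1]:

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    # PSNR = 10 * log10(MAX^2 / MSE); higher is better.
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

pred = np.random.rand(256, 256, 3)    # placeholder generated view
target = np.random.rand(256, 256, 3)  # placeholder ground-truth render
print(f"PSNR: {psnr(pred, target):.2f} dB")
```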

Wonder3D finds applications in game development, product design, e-commerce visualization, virtual reality, augmented reality, and digital content creation. It serves as a valuable tool for designers, 3D artists, and engineers who need rapid 3D model creation from a single photograph or render, and it saves significant time during prototyping, where quick 3D visualization is often required.

Wonder3D is available as open-source under the Apache 2.0 license. Model weights, training code, and inference pipeline are accessible via GitHub. Built on PyTorch, it is optimized for NVIDIA GPUs. Demos and pre-trained models are available through Hugging Face. While the model can run on consumer GPUs, optimal performance is achieved on A100 GPUs.

Wonder3D is a significant work that successfully applies cross-domain attention to single-image 3D reconstruction. While subsequent models like TRELLIS and SPA3D have adopted different approaches, Wonder3D's joint generation of RGB and normal map domains remains distinctive. Compared to other single-view 3D models like Zero123++ and One-2-3-45, its normal map integration yields higher geometric accuracy.

A closer look at Wonder3D's internals clarifies how the cross-domain attention mechanism operates. Information flows between the RGB and normal map domains at every diffusion step, so the two domains continuously guide each other throughout generation. Normal maps encode surface orientations and therefore contribute critical geometric detail, which the mesh reconstruction stage exploits to produce more accurate, detailed 3D models. Training on the Objaverse dataset enables generalization across a wide variety of objects, and output quality is particularly notable for organic forms and complex geometries. The model also integrates with NeuS-based mesh extraction, converting the multi-view and normal map information into high-quality 3D meshes. Its impact in the academic community is evident in its use as a reference point for subsequent research in single-image 3D reconstruction.
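To illustrate the geometric signal a normal map carries, the conventional encoding maps each RGB pixel to a unit surface normal via v = 2c - 1 (so the flat "facing the camera" color is roughly (128, 128, 255)). The small decoding sketch below is illustrative and not taken from Wonder3D's code.

```python
import numpy as np

def decode_normal_map(img_uint8: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 normal map to unit surface-normal vectors."""
    v = img_uint8.astype(np.float32) / 255.0 * 2.0 - 1.0  # map [0,255] -> [-1,1]
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.clip(norm, 1e-6, None)                  # renormalize to unit length

# A flat map encoding normals that point straight at the camera.
flat = np.full((256, 256, 3), (128, 128, 255), dtype=np.uint8)
print(decode_normal_map(flat)[0, 0])  # approximately [0, 0, 1]
```

Reconstruction pipelines can compare such decoded normals against normals rendered from the evolving surface, which is the kind of supervision that recovers sharp features color-only methods miss.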

Use Cases

1

High-Fidelity 3D Reconstruction

Reconstruct 3D objects with accurate geometry and surface details from single photographs for applications requiring geometric precision

2

Normal Map-Enhanced Asset Creation

Generate both 3D meshes and corresponding normal maps from images for assets that require detailed surface geometry in rendering pipelines

3

Research in Geometric Reconstruction

Use as a baseline and reference for academic research exploring the role of geometric supervision in improving single-image 3D reconstruction quality

4

Digital Asset Documentation

Create accurate 3D digital records of physical objects from photographs for heritage preservation, inventory management, and archival purposes

Pros & Cons

Pros

  • Reconstructs highly detailed textured meshes from a single image in only 2-3 minutes
  • Achieves the lowest Chamfer Distance (0.0199) and highest Volume IoU (0.6244) on the Google Scanned Objects dataset
  • Robust generalization across diverse image styles including sketches, cartoons, and real photographs
  • Generates both consistent multi-view images and corresponding normal maps for accurate geometry
  • Handles diverse lighting conditions and geometric complexities in input images

Cons

  • Sensitive to input image facing direction — front-facing images produce significantly better results
  • Limited to 6 views at 256x256 resolution due to computational resource constraints
  • Cannot accurately reconstruct objects with very thin structures and severe occlusions
  • Background segmentation using rembg is imperfect, and mask quality significantly affects mesh quality (see the sketch after this list)
  • Expanding to more views would demand increased computational resources during training
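Because the final mesh inherits any defects in the segmentation mask, it is worth inspecting the rembg cutout before running reconstruction. A minimal preprocessing sketch using rembg's public remove() call; the file names are placeholders.

```python
from PIL import Image
from rembg import remove

image = Image.open("input.jpg")   # placeholder input photograph
cutout = remove(image)            # returns an RGBA image with background removed
cutout.save("input_rgba.png")
# Visually check the alpha channel: a ragged or leaky mask here
# usually translates into a ragged mesh downstream.
```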

Technical Details

Parameters

N/A

License

Apache 2.0

Features

  • Single Image to 3D
  • Normal Map Generation
  • Color Image Multi-View
  • Cross-Domain Diffusion
  • Geometry and Texture Quality
  • Open-Source Apache 2.0
  • Tsinghua University Research
  • Mesh Reconstruction Pipeline

Benchmark Results

Metric | Value | Compared To | Source
Novel View PSNR | 18.6 dB (GSO) | Unique3D: 20.1 dB | CVPR 2024 Paper
SSIM (GSO) | 0.862 | InstantMesh: 0.880 | CVPR 2024 Paper
Generation Time | ~3 minutes (6 views + mesh) | InstantMesh: ~10 seconds | GitHub xxlong0/Wonder3D
Normal Map Quality | Cross-domain diffusion | N/A | CVPR 2024 Paper

Available Platforms

Hugging Face
Replicate

Related Models


TripoSR

Stability AI & Tripo|N/A

TripoSR is a fast feed-forward 3D reconstruction model jointly developed by Stability AI and Tripo AI that generates detailed 3D meshes from single input images in under one second. Unlike optimization-based methods that require minutes of processing per object, TripoSR uses a transformer-based architecture built on the Large Reconstruction Model framework to predict 3D geometry directly from a single 2D photograph in a single forward pass. The model accepts any standard image as input and produces a textured 3D mesh suitable for use in game engines, 3D modeling software, and augmented reality applications. TripoSR excels at reconstructing everyday objects, furniture, vehicles, characters, and organic shapes with impressive geometric accuracy and surface detail. Released under the MIT license in March 2024, the model is fully open source and can run on consumer-grade GPUs without specialized hardware. It supports batch processing for efficient conversion of multiple images and integrates seamlessly with popular 3D pipelines including Blender, Unity, and Unreal Engine. The model is particularly valuable for game developers, product designers, and e-commerce teams who need rapid 3D asset creation from product photographs. Output meshes can be exported in OBJ and GLB formats with configurable resolution settings. TripoSR represents a significant step toward democratizing 3D content creation by making high-quality reconstruction accessible without expensive scanning equipment or manual modeling expertise.

Open Source
4.5

TRELLIS

Microsoft Research|Unknown

TRELLIS is a revolutionary AI model developed by Microsoft Research that generates high-quality 3D assets from text descriptions or single 2D images using a novel Structured Latent Diffusion architecture. Released in December 2024, TRELLIS represents a fundamental advancement in 3D content generation by operating in a structured latent space that encodes geometry, texture, and material properties simultaneously rather than treating them as separate stages. The model produces complete 3D meshes with detailed PBR (Physically Based Rendering) textures, enabling direct use in game engines, 3D rendering pipelines, and AR/VR applications without extensive manual post-processing. TRELLIS supports both text-to-3D generation where users describe desired objects in natural language and image-to-3D reconstruction where a single photograph is converted into a full 3D model with inferred geometry from occluded viewpoints. The structured latent representation ensures geometric consistency and prevents the common artifacts seen in other 3D generation approaches such as floating geometry, texture seams, and unrealistic proportions. TRELLIS outputs standard 3D formats including GLB and OBJ with UV-mapped textures, making integration with professional tools like Blender, Unity, and Unreal Engine straightforward. Released under the MIT license, the model is fully open source and available on GitHub. Key applications include rapid 3D asset prototyping for game development, architectural visualization, product design mockups, virtual staging for real estate, educational 3D content creation, and metaverse asset generation. The model particularly benefits indie developers and small studios who lack resources for traditional 3D modeling workflows.

Open Source
4.5

Stable Point Aware 3D (SPA3D)

Stability AI|Unknown

Stable Point Aware 3D (SPA3D) is an advanced feed-forward 3D reconstruction model developed by Stability AI that generates high-quality textured 3D meshes from a single input image in seconds. Unlike iterative optimization-based approaches that require minutes of processing, SPA3D uses a direct feed-forward architecture that predicts 3D geometry and texture in a single pass, making it practical for interactive workflows and production pipelines. The model employs point cloud alignment techniques that significantly improve geometric consistency compared to other single-view reconstruction methods, ensuring that generated 3D models maintain accurate proportions and structural integrity from multiple viewpoints. SPA3D produces industry-standard mesh outputs with clean topology and UV-mapped textures, enabling direct import into 3D software including Blender, Unity, Unreal Engine, and professional CAD tools. The model handles diverse object categories from organic shapes like characters and animals to hard-surface objects like furniture and vehicles, adapting its reconstruction approach to the structural characteristics of each input. Released under the Stability AI Community License, the model is open source for personal and commercial use with revenue-based restrictions. Key applications include rapid 3D asset creation for game development, augmented reality content production, 3D printing preparation, virtual product photography, architectural visualization, and e-commerce 3D product displays. SPA3D is particularly valuable for creative professionals who need quick 3D mockups from concept sketches or photographs without investing hours in manual modeling. The model runs on consumer GPUs and is available through cloud APIs for scalable deployment.

Open Source
4.3

Zero123++

Stability AI|N/A

Zero123++ is a multi-view image generation model developed by Stability AI that generates six consistent canonical views of an object from a single input image. Released in 2023 under the Apache 2.0 license, the model extends the original Zero123 approach with significantly improved view consistency and serves as a critical component in modern 3D reconstruction pipelines. Zero123++ takes a single photograph or rendered image of an object and produces six evenly spaced views covering the full 360-degree range around the object, all maintaining consistent geometry, lighting, and appearance. The model is built on a fine-tuned Stable Diffusion backbone with specialized conditioning mechanisms that ensure multi-view coherence. Unlike the original Zero123 which generates views independently and often produces inconsistent results, Zero123++ generates all six views simultaneously in a single diffusion process, dramatically improving 3D consistency. The generated multi-view images serve as input for downstream 3D reconstruction methods like NeRF, Gaussian Splatting, or direct mesh reconstruction, enabling high-quality 3D model creation from a single photograph. Zero123++ is fully open source with pre-trained weights available on Hugging Face, making it accessible to researchers and developers building 3D generation systems. The model has become a foundational component in many state-of-the-art 3D generation pipelines and is widely used in academic research. It is particularly valuable for applications in game development, product visualization, and virtual reality where converting 2D images to 3D assets is a frequent workflow requirement.

Open Source
4.3

Quick Info

Parameters: N/A
Type: diffusion
License: Apache 2.0
Released: 2023-10
Rating: 4.1 / 5
Creator: Tsinghua University

Tags

wonder3d
3d
geometry
image-to-3d