
Depth Anything v2

Open Source
4.6
TikTok / ByteDance

Depth Anything v2 is a state-of-the-art monocular depth estimation model developed by researchers at TikTok (ByteDance) and the University of Hong Kong as a significant upgrade to the original Depth Anything. It extracts precise depth maps from single RGB images without requiring stereo pairs or specialized depth sensors. Built on a DINOv2 vision foundation model backbone combined with a DPT (Dense Prediction Transformer) decoder head, Depth Anything v2 achieves marked improvements in fine-grained detail preservation and edge sharpness over its predecessor. The model comes in three scale variants ranging from 25 million to 335 million parameters, offering flexible trade-offs between accuracy and inference speed for different deployment scenarios.

A key innovation in v2 is its training data: large-scale synthetic images with precise ground-truth depth, combined with pseudo-labeled real images, which significantly reduces the noise and artifacts common in earlier monocular depth models. The model produces both relative and metric depth estimates, making it suitable for diverse applications from 3D scene reconstruction and augmented reality to autonomous navigation and robotics.

The code and the Small model are released under the Apache 2.0 license (the Base and Large checkpoints use CC-BY-NC-4.0), and pre-trained weights are available through Hugging Face. Depth Anything v2 integrates naturally with creative AI workflows, including ControlNet depth conditioning for Stable Diffusion and FLUX, enabling artists and developers to generate depth-aware compositions. It also supports video depth estimation with temporal consistency, making it valuable for visual effects production and spatial computing applications.
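As a sketch of the Hugging Face route mentioned above, single-image inference can be run through the `transformers` depth-estimation pipeline. The checkpoint name `depth-anything/Depth-Anything-V2-Small-hf` is assumed from the hub layout, and `normalize_depth` is a hypothetical helper for scaling the output (e.g. for visualization or ControlNet input):

```python
# Minimal sketch: relative depth from a single RGB image via the
# Hugging Face `transformers` depth-estimation pipeline.
import numpy as np

def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Scale a raw depth map to [0, 1] (hypothetical helper)."""
    d = depth.astype(np.float32)
    rng = d.max() - d.min()
    return (d - d.min()) / rng if rng > 0 else np.zeros_like(d)

if __name__ == "__main__":
    # Assumes `transformers`, `torch`, and `Pillow` are installed and the
    # checkpoint name matches the current Hugging Face hub layout.
    from transformers import pipeline
    from PIL import Image

    pipe = pipeline("depth-estimation",
                    model="depth-anything/Depth-Anything-V2-Small-hf")
    image = Image.open("scene.jpg")
    depth = np.array(pipe(image)["depth"])   # pipeline returns a PIL depth image
    depth01 = normalize_depth(depth)
    Image.fromarray((depth01 * 255).astype("uint8")).save("depth.png")
```

The resulting single-channel map can be fed directly to a ControlNet depth conditioner or used for the editing effects described below.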

Depth Estimation

Key Highlights

Universal Depth Estimation

Domain-agnostic monocular depth estimation that generates high-quality depth maps from any image

Multiple Model Sizes

Wide hardware support from mobile devices to servers by offering models in Small, Base, and Large sizes

Real-Time Performance

Optimized architecture capable of depth estimation at speeds sufficient to process video frames in real time

Superior Accuracy

Results that significantly surpass previous monocular depth estimation models on multiple benchmarks

About

Depth Anything v2 is one of the most advanced models for extracting depth maps from a single image (monocular depth estimation). Developed as a successor to the original Depth Anything model, the v2 version offers significant improvements particularly in fine details and edge accuracy. Created by researchers at the University of Hong Kong and TikTok's research team, this model has established a new performance standard in the depth estimation field, significantly improving the reliability of monocular depth estimation in practical real-world applications.

The model is trained with a smart combination of synthetic and real-world data using a carefully designed curriculum. This approach, leveraging labeled synthetic data and unlabeled real-world data, enables the model to both produce accurate depth values and generalize to diverse real-world scenes without domain-specific fine-tuning. It delivers consistent results across all types of scenes including indoor, outdoor, natural, and urban environments. The v2 training strategy leverages high-quality depth labels from synthetic data while strengthening the model's generalization capacity through the diversity of real-world data. This hybrid training approach improves both absolute depth accuracy and relative depth ordering simultaneously.

Depth Anything v2's most notable improvements over the previous version are seen in edge precision and fine detail preservation throughout the depth map. Object boundaries, thin structures, and complex geometries are represented with sharper and more accurate depth transitions that closely match ground truth. This improvement makes a significant difference particularly in applications where edge quality is critical, such as 3D scene reconstruction and augmented reality overlay alignment. The model also demonstrates improved performance in traditionally challenging scenarios such as reflective surfaces, transparent objects, and repetitive textures. It produces consistent predictions even in regions with ambiguous depth cues such as skies, water surfaces, and glass.

Depth Anything v2 is available in three model sizes (Small, Base, and Large), each targeting a different deployment scenario. The smallest variant suits real-time applications with tight latency requirements, while the Large variant provides the highest accuracy for offline processing tasks. Its DINOv2-based image encoder forms the foundation of the model's strong feature extraction capacity, drawing on rich visual representations learned through self-supervised pretraining. Developers can therefore select the balance of accuracy, computational budget, and latency that best fits their application and hardware constraints.
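The size trade-off above can be captured in a small selection helper. This is a hypothetical sketch: the checkpoint names follow the assumed Hugging Face hub layout, and the parameter counts are approximate figures from the paper:

```python
# Hypothetical helper: pick a Depth Anything v2 checkpoint by parameter budget.
# Checkpoint names and rounded parameter counts (in millions) are assumptions.
CHECKPOINTS = {
    "small": ("depth-anything/Depth-Anything-V2-Small-hf", 25),
    "base":  ("depth-anything/Depth-Anything-V2-Base-hf", 97),
    "large": ("depth-anything/Depth-Anything-V2-Large-hf", 335),
}

def pick_checkpoint(max_params_m: int) -> str:
    """Return the largest checkpoint fitting a parameter budget (millions)."""
    fitting = [(p, name) for name, p in CHECKPOINTS.values() if p <= max_params_m]
    if not fitting:
        raise ValueError(f"no variant fits a {max_params_m}M parameter budget")
    return max(fitting)[1]
```

A mobile app might call `pick_checkpoint(30)` to stay on the Small variant, while a server-side batch job with no latency constraint would take the Large one.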

Depth Anything v2's practical applications are extremely broad and span multiple industries. 3D scene creation, augmented reality effects, robotic navigation, autonomous vehicle perception, and photo editing are among the most common applications driving adoption. In photography apps, it provides essential input for portrait mode background blurring, depth-based focus effects, and layered editing capabilities that simulate professional camera results. It can also be used as input for 3D reconstruction techniques like NeRF and Gaussian Splatting, significantly improving the quality of these reconstruction methods. In video games and film production, it is used for virtual camera effects and depth-based post-production compositing.
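As an illustration of the portrait-mode use, a depth map can drive a simple background blur by blending a sharp image with a blurred copy, weighted per pixel by depth. This is a minimal sketch, not a production pipeline: `portrait_blur` is a hypothetical helper, and it assumes a normalized depth map where larger values mean farther away:

```python
# Sketch of depth-driven portrait blur: near pixels stay sharp,
# far pixels take on a Gaussian blur in proportion to their depth.
import numpy as np
from PIL import Image, ImageFilter

def portrait_blur(image: Image.Image, depth01: np.ndarray,
                  radius: float = 6.0) -> Image.Image:
    """Blend sharp and blurred copies using depth (0 = near, 1 = far) as mask."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    mask = np.clip(depth01, 0.0, 1.0)[..., None]   # HxWx1 blend weight
    sharp = np.asarray(image, dtype=np.float32)
    soft = np.asarray(blurred, dtype=np.float32)
    out = sharp * (1.0 - mask) + soft * mask       # near stays sharp
    return Image.fromarray(out.astype(np.uint8))
```

Real portrait modes additionally refine the depth mask around hair and object edges, which is exactly where v2's improved edge precision helps.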

Released as open source, the model is accessible via Hugging Face and available in PyTorch, ONNX, and TensorRT formats for flexible deployment. A Gradio-based demo interface allows quick hands-on evaluation. Video depth estimation is supported frame by frame and can be combined with temporal consistency techniques to produce smooth video depth maps suitable for production use. The model has rapidly gained acceptance as a standard depth estimation solution in the computer vision research community, serving as a foundation for numerous downstream applications and research directions.
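The frame-by-frame video workflow can be stabilized with an exponential moving average across frames. This is one generic temporal-consistency technique offered as a sketch, not the specific method used by the model's authors:

```python
# Sketch: smooth per-pixel depth across video frames with an EMA
# to reduce flicker in frame-by-frame depth estimation.
import numpy as np

class DepthEMA:
    """Exponential moving average over successive depth maps."""
    def __init__(self, alpha: float = 0.8):
        self.alpha = alpha   # weight given to the incoming frame
        self.state = None

    def update(self, depth: np.ndarray) -> np.ndarray:
        d = depth.astype(np.float32)
        if self.state is None:
            self.state = d   # first frame initializes the running average
        else:
            self.state = self.alpha * d + (1.0 - self.alpha) * self.state
        return self.state
```

A lower `alpha` gives smoother but laggier depth; production VFX pipelines typically add optical-flow warping so the average follows camera and object motion.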

Use Cases

1

3D Scene Reconstruction

Creating virtual environments and 3D models by extracting 3D scene geometry from a single photograph

2

Augmented Reality

Depth-aware placement of virtual objects into real-world scenes in AR applications

3

Robotic Navigation

Obstacle avoidance and path planning by estimating the distance of objects in robots' surroundings

4

Photographic Depth Effects

Creating depth maps for portrait mode, bokeh effect, and depth-based image editing

Pros & Cons

Pros

  • More than 10x faster and more accurate than models built on Stable Diffusion for depth estimation
  • ViT-B model surpasses the larger MiDaS ViT-L model; efficient for computationally constrained environments
  • Superior performance on KITTI and NYUv2 benchmarks without training on their images (true zero-shot)
  • Model sizes ranging from 25M to 1.3B parameters (the Giant variant is described in the paper), covering a wide range of deployment scenarios
  • Pseudo-labels from teacher model are superior in quality to manual labels in existing real-world datasets

Cons

  • Distribution gap between synthetic and real-world data can limit generalization
  • Constrained variety of scenes from rendering engines may lead to suboptimal real-world performance
  • Struggles with rotated images; can misinterpret reflections and paintings
  • Common failure cases include hallucinated depth at strong edges and missed thin structures
  • Incorrect relative depths between disconnected objects and blurred backgrounds due to limited resolution

Technical Details

Parameters

25M-335M

Architecture

DINOv2 + DPT

Training Data

Synthetic + real-world depth data

License

Apache 2.0 (Small) / CC-BY-NC-4.0 (Base, Large)

Features

  • Monocular Depth Estimation
  • Multi-Scale Architecture
  • Real-Time Inference
  • Multi-Size Models
  • Zero-Shot Generalization
  • Metric Depth Support

Benchmark Results

| Metric | Value | Compared To | Source |
| --- | --- | --- | --- |
| Absolute Relative Error (NYUv2) | 0.043 | Depth Anything v1: 0.056 | Depth Anything v2 Paper (arXiv:2406.09414) |
| δ1 Accuracy (NYUv2) | 0.982 | MiDaS v3.1: 0.955 | Depth Anything v2 Paper (arXiv:2406.09414) |
| Supported Resolution | 518×518 (native), arbitrary input | — | Hugging Face Model Card |
| Processing Speed (A100) | ~30 FPS (ViT-S), ~12 FPS (ViT-L) | ZoeDepth: ~8 FPS | GitHub Repository Benchmarks |

Available Platforms

GitHub
HuggingFace


Quick Info

Parameters: 25M-335M
Type: Vision Transformer
License: Apache 2.0 (Small) / CC-BY-NC-4.0 (Base, Large)
Released: 2024-06
Architecture: DINOv2 + DPT
Version: 2
Rating: 4.6 / 5
Creator: TikTok / ByteDance


Tags

depth
estimation
3d
computer-vision