
Depth Anything v2

Open Source
4.6
TikTok / ByteDance

Depth Anything v2 is a state-of-the-art monocular depth estimation model developed by researchers at TikTok (ByteDance) and the University of Hong Kong as a significant upgrade to the original Depth Anything. It extracts precise depth maps from single RGB images without requiring stereo pairs or specialized depth sensors. Built on a DINOv2 vision foundation model backbone combined with a DPT (Dense Prediction Transformer) decoder head, Depth Anything v2 achieves marked improvements in fine-grained detail preservation and edge sharpness over its predecessor. The model comes in three scale variants ranging from 25 million to 335 million parameters, offering flexible trade-offs between accuracy and inference speed for different deployment scenarios.

A key innovation in v2 is its training data: large-scale synthetic images with precise ground-truth depth, combined with pseudo-labeled real images, which significantly reduces the noise and artifacts common in earlier monocular depth models. The model produces both relative and metric depth estimates, making it suitable for diverse applications from 3D scene reconstruction and augmented reality to autonomous navigation and robotics.

The code and the Small model are released under the Apache 2.0 license (the Base and Large checkpoints use CC-BY-NC-4.0), and pre-trained weights are available through Hugging Face. Depth Anything v2 integrates naturally with creative AI workflows, including ControlNet depth conditioning for Stable Diffusion and FLUX, enabling artists and developers to generate depth-aware compositions. It also supports video depth estimation with temporal consistency, making it valuable for visual effects production and spatial computing applications.
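As a sketch of the Hugging Face route mentioned above, single-image inference can be run through the `transformers` depth-estimation pipeline. The checkpoint name `depth-anything/Depth-Anything-V2-Small-hf` is assumed from the hub layout, and `normalize_depth` is a hypothetical helper for scaling the output (e.g. for visualization or ControlNet input):

```python
# Minimal sketch: relative depth from a single RGB image via the
# Hugging Face `transformers` depth-estimation pipeline.
import numpy as np

def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Scale a raw depth map to [0, 1] (hypothetical helper)."""
    d = depth.astype(np.float32)
    rng = d.max() - d.min()
    return (d - d.min()) / rng if rng > 0 else np.zeros_like(d)

if __name__ == "__main__":
    # Assumes `transformers`, `torch`, and `Pillow` are installed and the
    # checkpoint name matches the current Hugging Face hub layout.
    from transformers import pipeline
    from PIL import Image

    pipe = pipeline("depth-estimation",
                    model="depth-anything/Depth-Anything-V2-Small-hf")
    image = Image.open("scene.jpg")
    depth = np.array(pipe(image)["depth"])   # pipeline returns a PIL depth image
    depth01 = normalize_depth(depth)
    Image.fromarray((depth01 * 255).astype("uint8")).save("depth.png")
```

The resulting single-channel map can be fed directly to a ControlNet depth conditioner or used for the editing effects described below.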

Depth Estimation

Key Highlights

Universal Depth Estimation

Domain-agnostic monocular depth estimation that generates high-quality depth maps from any image

Multiple Model Sizes

Wide hardware support from mobile devices to servers by offering models in Small, Base, and Large sizes

Real-Time Performance

Optimized architecture capable of depth estimation at speeds sufficient to process video frames in real time

Superior Accuracy

Results that significantly surpass previous monocular depth estimation models on multiple benchmarks

About

Depth Anything v2 is one of the most advanced models for extracting depth maps from a single image (monocular depth estimation). Developed as a successor to the original Depth Anything model, the v2 version offers significant improvements particularly in fine details and edge accuracy. Created by researchers at the University of Hong Kong and TikTok's research team, this model has established a new performance standard in the depth estimation field, significantly improving the reliability of monocular depth estimation in practical real-world applications.

The model is trained with a smart combination of synthetic and real-world data using a carefully designed curriculum. This approach, leveraging labeled synthetic data and unlabeled real-world data, enables the model to both produce accurate depth values and generalize to diverse real-world scenes without domain-specific fine-tuning. It delivers consistent results across all types of scenes including indoor, outdoor, natural, and urban environments. The v2 training strategy leverages high-quality depth labels from synthetic data while strengthening the model's generalization capacity through the diversity of real-world data. This hybrid training approach improves both absolute depth accuracy and relative depth ordering simultaneously.

Depth Anything v2's most notable improvements over the previous version are seen in edge precision and fine detail preservation throughout the depth map. Object boundaries, thin structures, and complex geometries are represented with sharper and more accurate depth transitions that closely match ground truth. This improvement makes a significant difference particularly in applications where edge quality is critical, such as 3D scene reconstruction and augmented reality overlay alignment. The model also demonstrates improved performance in traditionally challenging scenarios such as reflective surfaces, transparent objects, and repetitive textures. It produces consistent predictions even in regions with ambiguous depth cues such as skies, water surfaces, and glass.

Depth Anything v2 is available in three model sizes (Small, Base, and Large), each targeting a different deployment scenario. The smallest variant suits real-time applications with tight latency requirements, while the Large variant provides the highest accuracy for offline processing tasks. Its DINOv2-based image encoder forms the foundation of the model's strong feature extraction capacity, drawing on rich visual representations learned through self-supervised pretraining. Developers can therefore select the balance of accuracy, computational budget, and latency that best fits their application and hardware constraints.
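The size trade-off above can be captured in a small selection helper. This is a hypothetical sketch: the checkpoint names follow the assumed Hugging Face hub layout, and the parameter counts are approximate figures from the paper:

```python
# Hypothetical helper: pick a Depth Anything v2 checkpoint by parameter budget.
# Checkpoint names and rounded parameter counts (in millions) are assumptions.
CHECKPOINTS = {
    "small": ("depth-anything/Depth-Anything-V2-Small-hf", 25),
    "base":  ("depth-anything/Depth-Anything-V2-Base-hf", 97),
    "large": ("depth-anything/Depth-Anything-V2-Large-hf", 335),
}

def pick_checkpoint(max_params_m: int) -> str:
    """Return the largest checkpoint fitting a parameter budget (millions)."""
    fitting = [(p, name) for name, p in CHECKPOINTS.values() if p <= max_params_m]
    if not fitting:
        raise ValueError(f"no variant fits a {max_params_m}M parameter budget")
    return max(fitting)[1]
```

A mobile app might call `pick_checkpoint(30)` to stay on the Small variant, while a server-side batch job with no latency constraint would take the Large one.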

Depth Anything v2's practical applications are extremely broad and span multiple industries. 3D scene creation, augmented reality effects, robotic navigation, autonomous vehicle perception, and photo editing are among the most common applications driving adoption. In photography apps, it provides essential input for portrait mode background blurring, depth-based focus effects, and layered editing capabilities that simulate professional camera results. It can also be used as input for 3D reconstruction techniques like NeRF and Gaussian Splatting, significantly improving the quality of these reconstruction methods. In video games and film production, it is used for virtual camera effects and depth-based post-production compositing.
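As an illustration of the portrait-mode use, a depth map can drive a simple background blur by blending a sharp image with a blurred copy, weighted per pixel by depth. This is a minimal sketch, not a production pipeline: `portrait_blur` is a hypothetical helper, and it assumes a normalized depth map where larger values mean farther away:

```python
# Sketch of depth-driven portrait blur: near pixels stay sharp,
# far pixels take on a Gaussian blur in proportion to their depth.
import numpy as np
from PIL import Image, ImageFilter

def portrait_blur(image: Image.Image, depth01: np.ndarray,
                  radius: float = 6.0) -> Image.Image:
    """Blend sharp and blurred copies using depth (0 = near, 1 = far) as mask."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    mask = np.clip(depth01, 0.0, 1.0)[..., None]   # HxWx1 blend weight
    sharp = np.asarray(image, dtype=np.float32)
    soft = np.asarray(blurred, dtype=np.float32)
    out = sharp * (1.0 - mask) + soft * mask       # near stays sharp
    return Image.fromarray(out.astype(np.uint8))
```

Real portrait modes additionally refine the depth mask around hair and object edges, which is exactly where v2's improved edge precision helps.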

Released as open source, the model is accessible via Hugging Face and available in PyTorch, ONNX, and TensorRT formats for flexible deployment. A Gradio-based demo interface allows quick hands-on evaluation. Video depth estimation is supported frame by frame and can be combined with temporal consistency techniques to produce smooth video depth maps suitable for production use. The model has rapidly gained acceptance as a standard depth estimation solution in the computer vision research community, serving as a foundation for numerous downstream applications and research directions.
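The frame-by-frame video workflow can be stabilized with an exponential moving average across frames. This is one generic temporal-consistency technique offered as a sketch, not the specific method used by the model's authors:

```python
# Sketch: smooth per-pixel depth across video frames with an EMA
# to reduce flicker in frame-by-frame depth estimation.
import numpy as np

class DepthEMA:
    """Exponential moving average over successive depth maps."""
    def __init__(self, alpha: float = 0.8):
        self.alpha = alpha   # weight given to the incoming frame
        self.state = None

    def update(self, depth: np.ndarray) -> np.ndarray:
        d = depth.astype(np.float32)
        if self.state is None:
            self.state = d   # first frame initializes the running average
        else:
            self.state = self.alpha * d + (1.0 - self.alpha) * self.state
        return self.state
```

A lower `alpha` gives smoother but laggier depth; production VFX pipelines typically add optical-flow warping so the average follows camera and object motion.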

Use Cases

1

3D Scene Reconstruction

Creating virtual environments and 3D models by extracting 3D scene geometry from a single photograph

2

Augmented Reality

Depth-aware placement of virtual objects into real-world scenes in AR applications

3

Robotic Navigation

Obstacle avoidance and path planning by estimating the distance of objects in robots' surroundings

4

Photographic Depth Effects

Creating depth maps for portrait mode, bokeh effect, and depth-based image editing

Pros & Cons

Pros

  • More than 10x faster and more accurate than models built on Stable Diffusion for depth estimation
  • ViT-B model surpasses the larger MiDaS ViT-L model; efficient for computationally constrained environments
  • Superior performance on KITTI and NYUv2 benchmarks without training on their images (true zero-shot)
  • Model sizes ranging from 25M to 1.3B parameters (the Giant variant is described in the paper), covering a wide range of deployment scenarios
  • Pseudo-labels from teacher model are superior in quality to manual labels in existing real-world datasets

Cons

  • Distribution gap between synthetic and real-world data can limit generalization
  • Constrained variety of scenes from rendering engines may lead to suboptimal real-world performance
  • Struggles with rotated images; can misinterpret reflections and paintings
  • Common failure cases include hallucinated depth at strong edges and missed thin structures
  • Incorrect relative depths between disconnected objects and blurred backgrounds due to limited resolution

Technical Details

Parameters

25M-335M

Architecture

DINOv2 + DPT

Training Data

Synthetic + real-world depth data

License

Apache 2.0 (Small) / CC-BY-NC-4.0 (Base, Large)

Features

  • Monocular Depth Estimation
  • Multi-Scale Architecture
  • Real-Time Inference
  • Multi-Size Models
  • Zero-Shot Generalization
  • Metric Depth Support

Benchmark Results

| Metric | Value | Compared To | Source |
| --- | --- | --- | --- |
| Absolute Relative Error (NYUv2) | 0.043 | Depth Anything v1: 0.056 | Depth Anything v2 Paper (arXiv:2406.09414) |
| δ1 Accuracy (NYUv2) | 0.982 | MiDaS v3.1: 0.955 | Depth Anything v2 Paper (arXiv:2406.09414) |
| Supported Resolution | 518×518 (native), arbitrary input | — | Hugging Face Model Card |
| Processing Speed (A100) | ~30 FPS (ViT-S), ~12 FPS (ViT-L) | ZoeDepth: ~8 FPS | GitHub Repository Benchmarks |

Available Platforms

GitHub
HuggingFace


Quick Info

Parameters: 25M-335M
Type: Vision Transformer
License: Apache 2.0 (Small) / CC-BY-NC-4.0 (Base, Large)
Released: 2024-06
Architecture: DINOv2 + DPT
Version: 2
Rating: 4.6 / 5
Creator: TikTok / ByteDance


Tags

depth
estimation
3d
computer-vision