Depth Anything 3: Recovering the Visual Space from Any Views

TL;DR: Depth Anything 3 recovers the visual space from any visual inputs, with superior geometry and 3DGS rendering.
The secret? No complex tasks! No special architecture!
Just a single, plain transformer trained with a depth-ray representation.

Abstract

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINOv2 encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
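To make the depth-ray target concrete, here is a minimal sketch of how a per-pixel depth and a per-pixel ray can be combined into a 3D point. It assumes each ray is stored as an origin and a direction, and that the point is `origin + depth * direction`. The exact parameterization used by DA3 (for example, whether directions are normalized) may differ. The names `point_from_ray` and `unproject` are hypothetical and are not part of any released API.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def point_from_ray(origin: Vec3, direction: Vec3, depth: float) -> Vec3:
    """Return origin + depth * direction for a single pixel (assumed convention)."""
    return tuple(o + depth * d for o, d in zip(origin, direction))


def unproject(origins: Sequence[Vec3],
              directions: Sequence[Vec3],
              depths: Sequence[float]) -> List[Vec3]:
    """Lift every pixel to 3D from its ray and its depth."""
    return [point_from_ray(o, d, z) for o, d, z in zip(origins, directions, depths)]


# Two pixels seen from a camera centred at (0, 0, 1); values are illustrative.
origins = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
directions = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
depths = [2.0, 3.0]

points = unproject(origins, directions, depths)
```

Because the ray already carries the camera centre and viewing direction, points from every view land in one shared coordinate frame without a separate pose-composition step, which is one way to see why a single prediction target can stand in for several tasks.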

Abilities

Video Reconstruction

DA3 recovers the visual space from any number of views, from a single view to many. This demo shows DA3 recovering the visual space from a difficult video.

SLAM for Large-Scale Scenes

Accurate visual geometry estimation improves SLAM performance. Quantitative results show that simply replacing VGGT in VGGT-Long with DA3 (DA3-Long) significantly reduces drift in large-scale environments, even outperforming COLMAP, which takes more than 48 hours to complete.
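The page does not say which metric is used to quantify drift. One common choice in SLAM evaluation is the absolute trajectory error (ATE), reported as an RMSE over camera positions. The sketch below assumes the estimated and ground-truth trajectories are already time-synchronized and aligned in the same frame. The function name `ate_rmse` and the sample trajectories are illustrative only.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def ate_rmse(estimated: Sequence[Vec3], ground_truth: Sequence[Vec3]) -> float:
    """Root-mean-square position error between two pre-aligned trajectories."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# A straight ground-truth path and an estimate that drifts sideways over time.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.2, 0.0), (3.0, 0.3, 0.0)]

drift = ate_rmse(estimated, ground_truth)
```

A lower value means the estimated trajectory stays closer to the reference over the whole sequence, which is the sense in which better per-frame geometry reduces accumulated drift.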

Feed-Forward 3D Gaussians Estimation

By freezing the entire backbone and training a DPT head to predict 3DGS parameters, our model achieves very strong and generalizable novel view synthesis capability.

Spatial Perception from Multiple Cameras

Given several images taken from different viewpoints on a vehicle (even without overlap), DA3 estimates stable and fusible depth maps, improving an autonomous vehicle's understanding of its environment.
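As an illustration of what "fusible" means in practice, the following sketch merges depth samples from two cameras into one point cloud in a shared vehicle frame. It assumes a pinhole camera model and known camera-to-vehicle extrinsics (as a calibrated rig would provide). None of the function names, the dictionary layout, or the numeric values come from DA3 itself.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def pixel_to_camera(u: float, v: float, z: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with z-depth into the camera frame (pinhole model)."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def camera_to_vehicle(p: Vec3, rotation, translation: Vec3) -> Vec3:
    """Apply a rigid transform: rotation (3x3, row-major) then translation."""
    return tuple(
        sum(r * x for r, x in zip(row, p)) + t
        for row, t in zip(rotation, translation)
    )


def fuse(cameras: List[Dict]) -> List[Vec3]:
    """Concatenate the back-projected points of every camera in the vehicle frame."""
    cloud = []
    for cam in cameras:
        for u, v, z in cam["pixels"]:
            p_cam = pixel_to_camera(u, v, z, *cam["intrinsics"])
            cloud.append(camera_to_vehicle(p_cam, cam["rotation"], cam["translation"]))
    return cloud


intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (hypothetical)
front = {
    "intrinsics": intrinsics,
    "rotation": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "translation": (0.0, 0.0, 1.5),
    "pixels": [(420.0, 240.0, 10.0)],  # (u, v, depth in metres)
}
rear = {
    "intrinsics": intrinsics,
    # 180 degree rotation about the vertical axis: the rear camera looks backwards.
    "rotation": [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]],
    "translation": (0.0, 0.0, -1.0),
    "pixels": [(320.0, 240.0, 10.0)],
}

cloud = fuse([front, rear])
```

Fusion of this kind only gives a coherent cloud if the per-camera depths agree in scale, which is why consistent depth across non-overlapping views matters for a multi-camera rig.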

Citation


@article{depthanything3,
  title={Depth Anything 3: Recovering the Visual Space from Any Views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:2511.10647},
  year={2025}
}
