Publication Details
Vision UFormer: Long-Range Monocular Absolute Depth Estimation
Čadík Martin, doc. Ing., Ph.D. (DCGM)
Keller Yosi, prof., M.Sc., Ph.D.
Beneš Bedřich
Absolute Depth Estimation, Monocular Depth Prediction, Long Range Distance,
Transformer, UNet, Staged Training
We introduce Vision UFormer (ViUT), a novel deep neural network for long-range
monocular depth estimation. The input is an RGB image, and the output is an image
whose per-pixel values store the absolute distance to the objects in the scene.
ViUT consists of a Transformer encoder and a ResNet decoder combined with
UNet-style skip connections. It is trained on 1M images across ten datasets in
a staged regime that starts with easier-to-predict data such as indoor
photographs and continues to more complex long-range outdoor scenes. We show that
ViUT provides comparable results for normalized relative distances and
short-range classical datasets such as NYUv2 and KITTI. We further show that it
successfully estimates absolute long-range depth in meters. We validate ViUT
on a wide variety of long-range scenes showing its high estimation capabilities
with a relative improvement of up to 23%. Absolute depth estimation finds
application in many areas, and we show its usability in image composition, range
annotation, defocus, and scene reconstruction.
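
The staged training regime described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' training code. The stage names, the epoch counts, the placement of KITTI as an intermediate stage, the placeholder dataset name "long_range_outdoor", and the choice to keep earlier datasets active in later stages are all assumptions made for illustration; the abstract only states that training starts with easier indoor data and continues to long-range outdoor scenes.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One curriculum stage (hypothetical structure, not from the paper)."""
    name: str
    datasets: Tuple[str, ...]
    epochs: int


def staged_schedule(
    stages: List[Stage], cumulative: bool = True
) -> Iterator[Tuple[int, str, Tuple[str, ...]]]:
    """Yield (global_epoch, stage_name, active_datasets) in curriculum order.

    With cumulative=True (an assumption), datasets from earlier stages stay
    active in later stages; with cumulative=False each stage uses only its own.
    """
    seen: Tuple[str, ...] = ()
    epoch = 0
    for stage in stages:
        active = seen + stage.datasets if cumulative else stage.datasets
        for _ in range(stage.epochs):
            yield epoch, stage.name, active
            epoch += 1
        seen = active


# Illustrative curriculum: easy indoor data first, long-range outdoor data last.
# Epoch counts and "long_range_outdoor" are placeholders, not values from the paper.
stages = [
    Stage("indoor", ("NYUv2",), epochs=2),
    Stage("driving", ("KITTI",), epochs=2),
    Stage("long_range", ("long_range_outdoor",), epochs=3),
]
plan = list(staged_schedule(stages))

for epoch, name, active in plan:
    print(f"epoch {epoch}: stage={name}, datasets={', '.join(active)}")
```

In a real training loop, each yielded tuple would select which data loaders feed the model for that epoch; the schedule itself is independent of the network architecture.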
@article{BUT185048,
author="Tomáš {Polášek} and Martin {Čadík} and Yosi {Keller} and Bedřich {Beneš}",
title="Vision UFormer: Long-Range Monocular Absolute Depth Estimation",
journal="COMPUTERS \& GRAPHICS-UK",
year="2023",
volume="111",
number="4",
pages="180--189",
doi="10.1016/j.cag.2023.02.003",
issn="0097-8493",
url="https://www.sciencedirect.com/science/article/pii/S0097849323000262"
}