CaBins: CLIP-based Adaptive Bins for Monocular Depth Estimation

  • Eunjin Son
  • Sang Jun Lee* (*Corresponding author for this work)

Jeonbuk National University

Research output: Contribution to conference › Conference paper › peer-review

Abstract

Traditional deep-learning models fine-tune knowledge pre-trained on large-scale datasets, a strategy that significantly improves performance on downstream tasks such as object detection and segmentation. Recently, vision-language (VL) models that jointly train an image encoder and a text encoder have gained attention. Notably, CLIP, which employs contrastive learning for classification, was instrumental in establishing the VL model paradigm. In depth estimation, several CLIP-based models have been proposed that pair images with text prompts called semantic bins. However, it is questionable whether these human-defined semantic bins are reasonable. In this work, we propose a network for monocular depth estimation that leverages CLIP's pre-trained knowledge. Our model employs a regression-classification formulation, predicting depth as a linear combination of depth candidates weighted by a probability map derived from the similarity scores between image and text embeddings. Unlike previous works that rely on human-defined semantic bins for the text embedding, our model converts the predicted depth candidates into distance classes using the CaBins module. Moreover, we modify CLIP's image encoder, originally designed for classification, to address the dense prediction task. Experiments were conducted on the NYU-Depth V2 and KITTI datasets, comparing our model against both CLIP-based and unimodal monocular depth estimation models. Our proposed model outperformed previous CLIP-based models across all evaluation metrics and produced high-quality boundary predictions on both datasets. Our model is available at https://github.com/EunjinSon1/CaBins.
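The regression-classification formulation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-pixel image embeddings and per-class text embeddings are already computed, and uses made-up shapes and bin centers purely for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_from_similarity(image_emb, text_emb, bin_centers):
    """Predict per-pixel depth as a linear combination of depth candidates.

    image_emb:   (N, D) per-pixel image embeddings (N = H*W pixels)
    text_emb:    (K, D) text embeddings, one per distance class
    bin_centers: (K,)   candidate depths (e.g. metres)
    """
    # Cosine similarity between each pixel embedding and each class embedding
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sim = img @ txt.T                      # (N, K) similarity scores
    probs = softmax(sim, axis=-1)          # probability map over depth classes
    return probs @ bin_centers             # (N,) expected depth per pixel

# Toy usage with random embeddings and four depth candidates
rng = np.random.default_rng(0)
depth = depth_from_similarity(
    rng.standard_normal((6, 8)),           # 6 "pixels", 8-dim embeddings
    rng.standard_normal((4, 8)),           # 4 distance classes
    np.array([1.0, 2.5, 5.0, 10.0]),       # hypothetical bin centers
)
```

Since the output is a probability-weighted average of the bin centers, every predicted depth necessarily lies within the range spanned by the candidates; in the full model the candidates themselves are predicted adaptively by the CaBins module rather than fixed as here.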

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
Publisher: IEEE Computer Society
Pages: 4557-4567
Number of pages: 11
ISBN (Electronic): 9798350365474
DOIs
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024 - Seattle, United States
Duration: 2024.06.16 - 2024.06.22

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
ISSN (Print): 2160-7508
ISSN (Electronic): 2160-7516

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
Country/Territory: United States
City: Seattle
Period: 24.06.16 - 24.06.22

Quacquarelli Symonds(QS) Subject Topics

  • Computer Science & Information Systems
  • Engineering - Electrical & Electronic
  • Engineering - Petroleum
  • Data Science
