TY - GEN
T1 - CaBins: CLIP-based Adaptive Bins for Monocular Depth Estimation
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
AU - Son, Eunjin
AU - Lee, Sang Jun
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Traditional deep-learning models are pre-trained on large-scale datasets and then fine-tuned for the target task. This strategy significantly improves performance on downstream tasks such as object detection and segmentation. Recently, vision-language (VL) models that jointly train an image encoder and a text encoder have gained attention. Notably, CLIP, which employs contrastive learning for classification, contributed significantly to establishing the foundation of the VL model paradigm. In depth estimation, several CLIP-based models have been proposed that pair images with text prompts called semantic bins. However, it is questionable whether these human-set semantic bins are reasonable. In this work, we propose a network for monocular depth estimation that leverages CLIP's pre-trained knowledge. Our model employs a regression-classification formulation, predicting depth as a linear combination of depth candidates weighted by a probability map derived from the similarity scores between the image embedding and the text embeddings. Unlike previous works that rely on human-set semantic bins for the text embedding, our model converts the predicted depth candidates into distance classes using the CaBins module. Moreover, we modify CLIP's image encoder, which is designed for classification, to address the dense prediction task. Experiments were conducted on the NYU-Depth V2 and KITTI datasets, comparing our model with both CLIP-based and unimodal monocular depth estimation models. Our model outperformed previous CLIP-based models across all evaluation metrics and produced high-quality boundary predictions on both datasets. Our model is available at https://github.com/EunjinSon1/CaBins.
UR - https://www.scopus.com/pages/publications/85206451437
U2 - 10.1109/CVPRW63382.2024.00458
DO - 10.1109/CVPRW63382.2024.00458
M3 - Conference paper
AN - SCOPUS:85206451437
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 4557
EP - 4567
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -