We study how vision foundation models enable networks to understand 2D and 3D scenes without relying on labelled data. To this end, we introduce Cross-modality Noisy Supervision (CNS), a framework that harnesses the complementary strengths of CLIP and SAM to train 2D and 3D networks simultaneously, achieving strong performance in both modalities. [Preprint Paper]
The codebase is adapted from CLIP2Scene. The code will be released later this year.
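Since the code is not yet public, the snippet below is only a minimal PyTorch sketch of the idea stated above: CLIP produces noisy per-pixel pseudo-labels, SAM's mask proposals denoise them by majority voting within each mask, and the cleaned labels jointly supervise the 2D and 3D networks, with a consistency term linking paired pixel and point predictions. Everything here (function names such as `denoise_with_sam` and `cns_loss`, tensor shapes, and the voting heuristic) is an illustrative assumption, not the authors' released implementation.

```python
# Hypothetical sketch of Cross-modality Noisy Supervision (CNS); the real
# implementation is unreleased, so shapes, names, and the denoising
# heuristic below are assumptions for illustration only.
import torch
import torch.nn.functional as F


def denoise_with_sam(clip_logits, sam_masks):
    """Turn per-pixel CLIP logits into pseudo-labels, then majority-vote
    the labels inside each SAM mask so every mask is labelled consistently."""
    labels = clip_logits.argmax(dim=-1)                # (P,) noisy pseudo-labels
    for mask in sam_masks:                             # each mask: (P,) bool
        if mask.any():
            labels[mask] = labels[mask].mode().values  # majority vote in mask
    return labels


def cns_loss(logits_2d, logits_3d, clip_logits, sam_masks, pix_of_point):
    """Supervise both networks with the denoised pseudo-labels and tie
    paired pixel/point predictions together with a consistency term."""
    pseudo = denoise_with_sam(clip_logits, sam_masks)           # (P,)
    loss_2d = F.cross_entropy(logits_2d, pseudo)                # 2D branch
    loss_3d = F.cross_entropy(logits_3d, pseudo[pix_of_point])  # 3D branch
    # Cross-modality consistency: 3D predictions follow the (detached)
    # 2D predictions at their corresponding pixels.
    consistency = F.kl_div(
        F.log_softmax(logits_3d, dim=-1),
        F.softmax(logits_2d[pix_of_point], dim=-1).detach(),
        reduction="batchmean",
    )
    return loss_2d + loss_3d + consistency


if __name__ == "__main__":
    P, N, C = 64 * 64, 2048, 16  # pixels, points, classes (dummy sizes)
    clip_logits = torch.randn(P, C)                      # assumed precomputed from CLIP
    sam_masks = [torch.rand(P) > 0.8 for _ in range(8)]  # assumed SAM mask proposals
    pix_of_point = torch.randint(0, P, (N,))             # pixel index of each 3D point
    logits_2d = torch.randn(P, C, requires_grad=True)    # stand-ins for network outputs
    logits_3d = torch.randn(N, C, requires_grad=True)
    cns_loss(logits_2d, logits_3d, clip_logits, sam_masks, pix_of_point).backward()
```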
Benchmarks: ScanNet 2D, ScanNet 3D, nuImages, nuScenes 2D, and nuScenes 3D.
@inproceedings{chen2023clip2scene,
  title={CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP},
  author={Chen, Runnan and Liu, Youquan and Kong, Lingdong and Zhu, Xinge and Ma, Yuexin and Li, Yikang and Hou, Yuenan and Qiao, Yu and Wang, Wenping},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7020--7030},
  year={2023}
}
@inproceedings{chen2023towards,
  title={Towards Label-free Scene Understanding by Vision Foundation Models},
  author={Chen, Runnan and Liu, Youquan and Kong, Lingdong and Chen, Nenglun and Zhu, Xinge and Ma, Yuexin and Liu, Tongliang and Wang, Wenping},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}