Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP)
This repo contains the source code of our ECCV 2022 paper MS-CLIP:
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
2022 European Conference on Computer Vision (ECCV 2022)
By Haoxuan You*, Luowei Zhou*, Bin Xiao*, Noel Codella*, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan.
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. Specifically, we question how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of shared parameters along a spectrum. Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Additionally, we find that lightweight modality-specific parallel modules further improve performance.
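For intuition, below is a minimal, hypothetical PyTorch-style sketch of the modality-sharing idea (not the repo's actual architecture): a single transformer block whose attention and MLP weights are shared by image and text tokens, with small per-modality LayerNorms standing in for the lightweight modality-specific components. All names here are illustrative.

```python
# Illustrative sketch only: shared attention/MLP weights across modalities,
# with modality-specific LayerNorms as the lightweight per-modality parameters.
import torch
import torch.nn as nn


class SharedBlock(nn.Module):
    """One transformer block reused for both image and text tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared
        self.mlp = nn.Sequential(                                         # shared
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Lightweight modality-specific parameters: one LayerNorm pair per modality.
        self.norm1 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("image", "text")})
        self.norm2 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("image", "text")})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        h = self.norm1[modality](x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2[modality](x))
        return x


if __name__ == "__main__":
    block = SharedBlock()
    img_tokens = torch.randn(2, 50, 512)   # e.g. ViT patch tokens
    txt_tokens = torch.randn(2, 77, 512)   # e.g. word tokens
    print(block(img_tokens, "image").shape, block(txt_tokens, "text").shape)
```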
- [07/20/2022] Released pretrained model and zero-shot evaluation on ImageNet-1k.
| Model | Training Set | Top-1 on IN-1K (%) | LP* on 24 datasets (avg. %) | Download |
| --- | --- | --- | --- | --- |
| MS-CLIP-S (ViT-B/32) | YFCC-22M | 36.7 | 68.5 | ckpt/config |
| MS-CLIP-S (ViT-B/16) | YFCC-22M | 39.0 | 70.4 | ckpt/config |
| MS-CLIP-S (ViT-B/32) | LAION-20M | 40.2 | 73.3 | ckpt/config |
*LP: Linear Probing
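As a rough illustration of the linear-probing protocol behind the LP column (this is not the evaluation code used to produce the numbers above), the sketch below freezes a pretrained image encoder, extracts features, and fits a linear classifier on top. `encode_image` and the data loaders are hypothetical placeholders.

```python
# Hedged sketch of linear probing: frozen features + linear classifier.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression


@torch.no_grad()
def extract_features(encode_image, loader, device="cuda"):
    feats, labels = [], []
    for images, targets in loader:
        f = encode_image(images.to(device))      # frozen encoder forward pass
        feats.append(f.cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)


def linear_probe(encode_image, train_loader, test_loader):
    x_tr, y_tr = extract_features(encode_image, train_loader)
    x_te, y_te = extract_features(encode_image, test_loader)
    clf = LogisticRegression(max_iter=1000, C=3.16)  # regularization is typically swept per dataset
    clf.fit(x_tr, y_tr)
    return clf.score(x_te, y_te)                     # top-1 accuracy on the test split
```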
Please follow INSTALL.md for installation.
Please follow DATA.md for data preparation.
Download the checkpoints from the links in the table above and put the weights under `./OUTPUT_MODEL/`.
To evaluate a pre-trained MS-CLIP-S on ImageNet Zero-shot Classification, run:
`CUDA_VISIBLE_DEVICES=0 python tools/eval_zeroshot.py --model <config-file>`

where `<config-file>` is the config YAML under `experiments/model/`, e.g. `experiments/model/b32-laion-msclips.yaml`.
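For reference, the sketch below shows the generic CLIP-style zero-shot classification recipe that this kind of evaluation follows; it is not `tools/eval_zeroshot.py` itself, and `encode_image`, `encode_text`, and `tokenize` are hypothetical stand-ins for the model's API.

```python
# Hedged sketch of CLIP-style zero-shot classification: embed one text prompt
# per class, embed each image, and pick the class with the highest cosine similarity.
import torch


@torch.no_grad()
def zeroshot_classify(encode_image, encode_text, tokenize, images, class_names):
    prompts = [f"a photo of a {name}." for name in class_names]
    text_feat = encode_text(tokenize(prompts))                    # (num_classes, dim)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    image_feat = encode_image(images)                             # (batch, dim)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    logits = image_feat @ text_feat.t()                           # cosine similarities
    return logits.argmax(dim=-1)                                  # predicted class index per image
```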
If you have any questions, please contact Haoxuan You or Luowei Zhou.