This repo is the official implementation of "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet".
Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is considered unsatisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine several key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that fine-tuning CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP.
| | ViT-Base/16 (224) | ViT-Base/16 (384) | ViT-Large/16 (384) | ViT-Large/14 (224) | ViT-Large/14 (336) |
|---|---|---|---|---|---|
| FLOPs | 17.5G | 55.4G | 190.7G | 80.7G | 190.6G |
| **Supervised Baseline** | | | | | |
| ImageNet-21K | 84.0 | 86.2 | 87.1 | ---- | ---- |
| JFT-300M | ---- | 86.7 | 88.0 | ---- | ---- |
| JFT-3B | ---- | 86.6 | 88.5 | ---- | ---- |
| **MIM with CLIP as prediction target** | | | | | |
| MVP | 84.4 | ---- | ---- | ---- | ---- |
| FD-CLIP | 84.9 | ---- | ---- | ---- | ---- |
| CAE-v2 | 85.3 | ---- | ---- | ---- | ---- |
| BEiT-2 | 85.5 | ---- | ---- | ---- | ---- |
| **Fine-tuning CLIP directly** | | | | | |
| FT-CLIP (ours) | 85.7 | 86.6 | ---- | 88.0 | 88.3 |
PyTorch, timm, and DeepSpeed are required. Differences in CUDA version or GPU hardware may slightly affect the results.
```bash
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install --user timm==0.4.12
pip install --user deepspeed==0.4.0
```
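After installation, a quick check like the one below (a minimal sketch, not part of this repo) should report the pinned versions and the visible GPUs:

```python
# Minimal environment sanity check (illustrative only, not part of this repo):
# verify that the pinned versions are installed and CUDA-capable GPUs are visible.
import torch
import timm
import deepspeed

print("torch:", torch.__version__)              # expected 1.10.2+cu113
print("timm:", timm.__version__)                # expected 0.4.12
print("deepspeed:", deepspeed.__version__)      # expected 0.4.0
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())  # the example command below assumes 8 GPUs
```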
The CLIP-Base/16 model can be fine-tuned on ImageNet-1K using 8 A100-40GB GPUs:
```bash
MODEL=CLIP_B16
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet

echo $OUTPUT_DIR
mkdir -p $OUTPUT_DIR
cp $0 $OUTPUT_DIR

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model ${MODEL} --data_path $DATA_PATH \
    --input_size 224 \
    --finetune True \
    --num_workers 8 \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 256 --lr 6e-4 --update_freq 1 \
    --warmup_epochs 10 --epochs 50 \
    --layer_decay 0.6 \
    --drop_path 0 \
    --dist_eval --eval_all --no_save_ckpt \
    --enable_deepspeed \
    --clip_mean_and_std \
    --layer_scale_init_value 0 \
    --abs_pos_emb --disable_rel_pos_bias \
    --weight_decay 0.05 --mixup 0 --cutmix 0 \
    --nb_classes 1000 --model_prefix visual. \
    --model_ema --model_ema_decay 0.9998 \
    2>&1 | tee -a ${OUTPUT_DIR}/log.txt
```
- `--batch_size`: batch size per GPU. The effective batch size is `number of GPUs * --batch_size * --update_freq`, so in the example above it is 8 * 256 * 1 = 2048.
- `--lr`: base learning rate.
- `--layer_decay`: layer-wise learning rate decay. The LR of the i-th layer is `lr * layer_decay ** i` (see the sketch after this list).
- `--warmup_epochs`: learning rate warmup epochs.
- `--epochs`: total fine-tuning epochs.
- `--clip_mean_and_std`: use the CLIP normalization statistics instead of the ImageNet ones.
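The arithmetic behind these flags can be made concrete with the short sketch below. It is illustrative only and not code from this repo: the exact layer-indexing convention used by `run_class_finetuning.py` may differ, and the normalization constants are the standard ImageNet and OpenAI CLIP values.

```python
# Illustrative sketch only (not code from this repo): how --layer_decay,
# the effective batch size, and --clip_mean_and_std relate to the flags above.

base_lr = 6e-4      # --lr
layer_decay = 0.6   # --layer_decay
num_layers = 12     # ViT-Base/16 has 12 transformer blocks

# Per-layer-group LR multipliers, with i counted from the input side so that
# groups closer to the output keep a larger learning rate (the top group gets 1.0).
scales = [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]
per_layer_lr = [base_lr * s for s in scales]

# Effective batch size of the example command: 8 GPUs * 256 per GPU * update_freq 1.
effective_batch_size = 8 * 256 * 1  # = 2048

# --clip_mean_and_std swaps the ImageNet normalization statistics ...
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
# ... for the statistics CLIP was pre-trained with.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

print(f"effective batch size: {effective_batch_size}")
print(f"top-layer LR ~ {per_layer_lr[-1]:.2e}, lowest-layer LR ~ {per_layer_lr[0]:.2e}")
```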
See `scripts/` for more configurations.
This repository is modified from BEiT, built using the timm library, the DeiT repository and the CLIP repository. The CLIP model file is modified from DeCLIP.
If you use this code for your research, please cite our paper.
```
@article{dong2022ftclip,
  title={CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet},
  author={Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
  journal={arXiv preprint arXiv:2212.06138},
  year={2022}
}
```