# Detic model zoo

## Introduction

This file documents a collection of models reported in our paper. The training time was measured on Big Basin servers with 8 NVIDIA V100 GPUs & NVLink.

## How to Read the Tables

The "Name" column contains a link to the config file. To train a model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml

To evaluate a trained or pretrained model, run

```
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
```
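For example, to evaluate the Detic_C2_IN-L_R50_640_4x model from the open-vocabulary LVIS table below, a command along these lines should work (the config filename under `configs/` and the checkpoint filename under `models/` are assumptions based on the naming used elsewhere in this file; point MODEL.WEIGHTS at wherever you saved the downloaded checkpoint):

```
python train_net.py --num-gpus 8 --config-file configs/Detic_C2_IN-L_R50_640_4x.yaml --eval-only MODEL.WEIGHTS models/Detic_C2_IN-L_R50_640_4x.pth
```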

## Third-party ImageNet-21K Pretrained Models

Our paper uses ImageNet-21K pretrained models that are not part of Detectron2 (ResNet-50-21K from MIIL and SwinB-21K from Swin-Transformer). Before training, please download the models, place them under `DETIC_ROOT/models/`, and use this tool to convert their format.
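As a rough sketch of the expected layout (the checkpoint filenames below are placeholders, not the exact names of the released files; the conversion step uses the tool linked above):

```
mkdir -p models
# copy the downloaded ImageNet-21K checkpoints here, e.g.
#   models/resnet50_in21k.pth   (ResNet-50-21K from MIIL)
#   models/swin_base_in21k.pth  (SwinB-21K from Swin-Transformer)
# then run the conversion tool on each file before training
```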

## Open-vocabulary LVIS

| Name | Training time | mask mAP | mask mAP_novel | Download |
|------|---------------|----------|----------------|----------|
| Box-Supervised_C2_R50_640_4x | 17h | 30.2 | 16.4 | model |
| Detic_C2_IN-L_R50_640_4x | 22h | 32.4 | 24.9 | model |
| Detic_C2_CCimg_R50_640_4x | 22h | 31.0 | 19.8 | model |
| Detic_C2_CCcapimg_R50_640_4x | 22h | 31.0 | 21.3 | model |
| Box-Supervised_C2_SwinB_896_4x | 43h | 38.4 | 21.9 | model |
| Detic_C2_IN-L_SwinB_896_4x | 47h | 40.7 | 33.8 | model |

Note

- The open-vocabulary LVIS setup is LVIS without rare class annotations in training. We evaluate rare classes as novel classes in testing.

- The models with C2 are trained using our improved LVIS baseline (Appendix D of the paper), including the CenterNet2 detector, Federated Loss, large-scale jittering, etc.

- All models use CLIP embeddings as classifiers. This is why the box-supervised models have non-zero mAP on novel classes.

- The models with IN-L use the overlap classes between ImageNet-21K and LVIS as image-labeled data.

- The models with CC use Conceptual Captions. CCimg uses image labels extracted from the captions (using a naive text match) as image-labeled data. CCcapimg additionally uses the raw captions (Appendix C of the paper).

- The Detic models are finetuned from the corresponding Box-Supervised models above (indicated by MODEL.WEIGHTS in the config files). Please train or download the Box-Supervised models and place them under DETIC_ROOT/models/ before training the Detic models; a command sketch follows these notes.
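For example, assuming the ResNet-50 Box-Supervised checkpoint has been placed under `models/` (the filenames below are placeholders; the config file already points at the expected checkpoint via MODEL.WEIGHTS, so the command-line override is only needed if your local path differs):

```
python train_net.py --num-gpus 8 --config-file configs/Detic_C2_IN-L_R50_640_4x.yaml MODEL.WEIGHTS models/Box-Supervised_C2_R50_640_4x.pth
```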

## Standard LVIS

| Name | Training time | mask mAP | mask mAP_rare | Download |
|------|---------------|----------|---------------|----------|
| Box-Supervised_C2_R50_640_4x | 17h | 31.5 | 25.6 | model |
| Detic_C2_R50_640_4x | 22h | 33.2 | 29.7 | model |
| Box-Supervised_C2_SwinB_896_4x | 43h | 40.7 | 35.9 | model |
| Detic_C2_SwinB_896_4x | 47h | 41.7 | 41.7 | model |

| Name | Training time | box mAP | box mAP_rare | Download |
|------|---------------|---------|--------------|----------|
| Box-Supervised_DeformDETR_R50_2x | 31h | 31.7 | 21.4 | model |
| Detic_DeformDETR_R50_2x | 47h | 32.5 | 26.2 | model |

Note

- All Detic models use the overlap classes between ImageNet-21K and LVIS as image-labeled data.

- The models with C2 are trained using our improved LVIS baseline in the paper, including the CenterNet2 detector, Federated Loss, large-scale jittering, etc.

- The models with DeformDETR are Deformable DETR models. We train these models with Federated Loss.

## Open-vocabulary COCO

| Name | Training time | box mAP50 | box mAP50_novel | Download |
|------|---------------|-----------|-----------------|----------|
| BoxSup_CLIP_R50_1x | 12h | 39.3 | 1.3 | model |
| Detic_CLIP_R50_1x_image | 13h | 44.7 | 24.1 | model |
| Detic_CLIP_R50_1x_caption | 16h | 43.8 | 21.0 | model |
| Detic_CLIP_R50_1x_caption-image | 16h | 45.0 | 27.8 | model |

Note

- All models are trained with ResNet50-C4 without multi-scale augmentation. All models use CLIP embeddings as the classifier.

- We extract class names from COCO captions as image labels. Detic_CLIP_R50_1x_image uses the max-size loss; Detic_CLIP_R50_1x_caption directly uses the CLIP caption embedding within each mini-batch for classification; Detic_CLIP_R50_1x_caption-image uses both losses.

- We report box mAP50 under the "generalized" open-vocabulary setting.

## Cross-dataset evaluation

| Name | Training time | Objects365 box mAP | OpenImages box mAP50 | Download |
|------|---------------|--------------------|----------------------|----------|
| Box-Supervised_C2_SwinB_896_4x | 43h | 19.1 | 46.2 | model |
| Detic_C2_SwinB_896_4x | 47h | 21.2 | 53.0 | model |
| Detic_C2_SwinB_896_4x_IN-21K | 47h | 21.4 | 55.2 | model |
| Box-Supervised_C2_SwinB_896_4x+COCO | 43h | 19.7 | 46.4 | model |
| Detic_C2_SwinB_896_4x_IN-21K+COCO | 47h | 21.6 | 54.6 | model |

Note

- Box-Supervised_C2_SwinB_896_4x and Detic_C2_SwinB_896_4x are the same models as in the Standard LVIS section, but evaluated with the Objects365/OpenImages vocabulary (i.e., CLIP embeddings of the corresponding class names as the classifier). To run the evaluation on Objects365/OpenImages, run the command below; a single-dataset variant is sketched after these notes.

  ```
  python train_net.py --num-gpus 8 --config-file configs/Detic_C2_SwinB_896_4x.yaml --eval-only DATASETS.TEST "('oid_val_expanded','objects365_v2_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/oid_clip_a+cname.npy','datasets/metadata/o365_clip_a+cnamefix.npy',)" MODEL.TEST_NUM_CLASSES "(500,365)" MODEL.MASK_ON False
  ```
- Detic_C2_SwinB_896_4x_IN-21K trains on the full ImageNet-21K. We additionally use dynamic class sampling ("Modified Federated Loss" in Section 4.4) and a larger data sampling ratio for ImageNet images (1:16 instead of 1:4).

- Detic_C2_SwinB_896_4x_IN-21K+COCO is a model trained on combined LVIS and COCO data together with ImageNet-21K for better demo purposes. LVIS models do not detect persons well due to the federated annotation protocol of LVIS; LVIS+COCO models give better visual results.
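To evaluate only one of the two datasets, the cross-dataset command above can be restricted to a single test set with its matching classifier and class count, e.g. for Objects365 only (the IN-21K config filename is an assumption based on the model name in the table above):

```
python train_net.py --num-gpus 8 --config-file configs/Detic_C2_SwinB_896_4x_IN-21K.yaml --eval-only DATASETS.TEST "('objects365_v2_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/o365_clip_a+cnamefix.npy',)" MODEL.TEST_NUM_CLASSES "(365,)" MODEL.MASK_ON False
```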