We only use aliyun to maintain the model zoo since MMDetection V2.0. The model zoo of V1.x has been deprecated.
- All models were trained on `coco_2017_train` and tested on `coco_2017_val`.
- We use distributed training.
- All pytorch-style pretrained backbones on ImageNet are from PyTorch model zoo, caffe-style pretrained backbones are converted from the newly released model from detectron2.
- For fair comparison with other codebases, we report the GPU memory as the maximum value of `torch.cuda.max_memory_allocated()` for all 8 GPUs. Note that this value is usually less than what `nvidia-smi` shows.
- We report the inference time as the total time of network forwarding and post-processing, excluding the data loading time. Results are obtained with the script benchmark.py which computes the average time on 2000 images.
It is common to initialize from backbone models pre-trained on the ImageNet classification task. All pre-trained model links can be found at open_mmlab. According to `img_norm_cfg` and the source of the weights, we can divide the ImageNet pre-trained model weights into the following cases:
- TorchVision: Corresponding to torchvision weights, including ResNet50 and ResNet101. The `img_norm_cfg` is `dict(mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)`.
- Pycls: Corresponding to pycls weights, including RegNetX. The `img_norm_cfg` is `dict(mean=[103.530, 116.280, 123.675], std=[57.375, 57.12, 58.395], to_rgb=False)`.
- MSRA styles: Corresponding to MSRA weights, including ResNet50_Caffe and ResNet101_Caffe. The `img_norm_cfg` is `dict(mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)`.
- Caffe2 styles: Currently only contains ResNext101_32x8d. The `img_norm_cfg` is `dict(mean=[103.530, 116.280, 123.675], std=[57.375, 57.120, 58.395], to_rgb=False)`.
- Other styles: E.g. SSD, whose `img_norm_cfg` is `dict(mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True)`, and YOLOv3, whose `img_norm_cfg` is `dict(mean=[0, 0, 0], std=[255., 255., 255.], to_rgb=True)`.
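To make the role of `img_norm_cfg` more concrete, here is a minimal pure-Python sketch of per-pixel normalization. It assumes the image is loaded in BGR channel order (as OpenCV-based loaders do) and that `mean` and `std` are given in the channel order obtained after the optional `to_rgb` conversion. The helper name `normalize_pixel` is hypothetical and is not part of MMDetection.

```python
def normalize_pixel(bgr_pixel, mean, std, to_rgb):
    """Normalize a single pixel given in BGR order (illustrative helper only)."""
    # Optionally flip BGR -> RGB before applying the per-channel statistics.
    channels = list(reversed(bgr_pixel)) if to_rgb else list(bgr_pixel)
    return [(c - m) / s for c, m, s in zip(channels, mean, std)]


# Two of the configurations listed above.
torchvision_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
msra_cfg = dict(
    mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)

# A BGR pixel whose value equals the ImageNet mean color.
pixel_bgr = [103.53, 116.28, 123.675]

print(normalize_pixel(pixel_bgr, **torchvision_cfg))
print(normalize_pixel(pixel_bgr, **msra_cfg))
```

Both calls map the mean color to zero, which illustrates why the `mean` values appear reversed between the `to_rgb=True` and `to_rgb=False` cases: they describe the same statistics in different channel orders.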
The commonly used backbone models in MMDetection are listed in the table below:
model | source | link | description |
---|---|---|---|
ResNet50 | TorchVision | torchvision's ResNet-50 | From torchvision's ResNet-50. |
ResNet101 | TorchVision | torchvision's ResNet-101 | From torchvision's ResNet-101. |
RegNetX | Pycls | RegNetX_3.2gf, RegNetX_800mf, etc. | From pycls. |
ResNet50_Caffe | MSRA | MSRA's ResNet-50 | Converted copy of Detectron2's R-50.pkl model. The original weight comes from MSRA's original ResNet-50. |
ResNet101_Caffe | MSRA | MSRA's ResNet-101 | Converted copy of Detectron2's R-101.pkl model. The original weight comes from MSRA's original ResNet-101. |
ResNext101_32x8d | Caffe2 | Caffe2 ResNext101_32x8d | Converted copy of Detectron2's X-101-32x8d.pkl model. The ResNeXt-101-32x8d model trained with Caffe2 at FB. |
Please refer to RPN for details.
Please refer to Faster R-CNN for details.
Please refer to Mask R-CNN for details.
Please refer to Fast R-CNN for details.
Please refer to RetinaNet for details.
Please refer to Cascade R-CNN for details.
Please refer to HTC for details.
Please refer to SSD for details.
Please refer to Group Normalization for details.
Please refer to Weight Standardization for details.
Please refer to Deformable Convolutional Networks for details.
Please refer to CARAFE for details.
Please refer to Instaboost for details.
Please refer to Libra R-CNN for details.
Please refer to Guided Anchoring for details.
Please refer to FCOS for details.
Please refer to FoveaBox for details.
Please refer to RepPoints for details.
Please refer to FreeAnchor for details.
Please refer to Grid R-CNN for details.
Please refer to GHM for details.
Please refer to GCNet for details.
Please refer to HRNet for details.
Please refer to Mask Scoring R-CNN for details.
Please refer to Rethinking ImageNet Pre-training for details.
Please refer to NAS-FPN for details.
Please refer to ATSS for details.
Please refer to FSAF for details.
Please refer to RegNet for details.
Please refer to Res2Net for details.
Please refer to GRoIE for details.
Please refer to Dynamic R-CNN for details.
Please refer to PointRend for details.
Please refer to DetectoRS for details.
Please refer to Generalized Focal Loss for details.
Please refer to CornerNet for details.
Please refer to YOLOv3 for details.
Please refer to PAA for details.
Please refer to SABL for details.
Please refer to CentripetalNet for details.
Please refer to ResNeSt for details.
Please refer to DETR for details.
Please refer to Deformable DETR for details.
Please refer to AutoAssign for details.
Please refer to YOLOF for details.
Please refer to Seesaw Loss for details.
Please refer to CenterNet for details.
Please refer to YOLOX for details.
Please refer to PVT for details.
Please refer to SOLO for details.
We also benchmark some methods on PASCAL VOC, Cityscapes and WIDER FACE.
We also train Faster R-CNN and Mask R-CNN using ResNet-50 and RegNetX-3.2G with multi-scale training and longer schedules. For convenience, these models can serve as strong pre-trained models for downstream tasks.
We provide analyze_logs.py to get the average time per iteration in training. You can find examples in Log Analysis.
We compare the training speed of Mask R-CNN with some other popular frameworks (the data is copied from detectron2). For mmdetection, we benchmark with mask_rcnn_r50_caffe_fpn_poly_1x_coco_v1.py, which should have the same settings as mask_rcnn_R_50_FPN_noaug_1x.yaml of detectron2. We also provide the checkpoint and training log for reference. The throughput is computed as the average throughput in iterations 100-500 to skip GPU warmup time.
Implementation | Throughput (img/s) |
---|---|
Detectron2 | 62 |
MMDetection | 61 |
maskrcnn-benchmark | 53 |
tensorpack | 50 |
simpledet | 39 |
Detectron | 19 |
matterport/Mask_RCNN | 14 |
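The throughput figure described above can be sketched as follows. This is only an illustration under stated assumptions: it takes a list of per-iteration wall-clock times, treats iterations 100-500 as 1-based and inclusive, and divides total images by total time in that window. The function name `average_throughput` and the batch size of 16 are hypothetical and do not come from the benchmark itself.

```python
def average_throughput(iter_times, batch_size, start=100, end=500):
    """Images per second over iterations start..end (1-based, inclusive)."""
    # Drop the warm-up iterations before `start` and anything after `end`.
    window = iter_times[start - 1:end]
    if not window:
        raise ValueError("no iterations in the requested window")
    return batch_size * len(window) / sum(window)


# Synthetic log: 99 slow warm-up iterations, then steady 0.25 s/iter.
iter_times = [1.0] * 99 + [0.25] * 401
print(average_throughput(iter_times, batch_size=16))  # 64.0 img/s
```

Because the slow warm-up iterations fall outside the window, they do not affect the reported number.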
We provide benchmark.py to benchmark the inference latency.
The script benchmarks the model with 2000 images and calculates the average time, ignoring the first 5 iterations. You can change the output log interval (default: 50) by setting `LOG-INTERVAL`.
python tools/benchmark.py ${CONFIG} ${CHECKPOINT} [--log-interval $[LOG-INTERVAL]] [--fuse-conv-bn]
The latency of all models in our model zoo is benchmarked without setting `fuse-conv-bn`. You can get a lower latency by setting it.
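The averaging scheme described above (a fixed number of images, the first few iterations ignored, periodic logging) can be sketched in plain Python. This is not the code of benchmark.py; the function `benchmark` and its parameter names are hypothetical, and `run_once` stands in for one forward pass plus post-processing on a single image.

```python
import time


def benchmark(run_once, num_images=2000, num_warmup=5, log_interval=50):
    """Return the average seconds per image, skipping the first num_warmup runs."""
    if num_images <= num_warmup:
        raise ValueError("num_images must exceed num_warmup")
    elapsed = 0.0
    timed = 0
    for i in range(num_images):
        start = time.perf_counter()
        run_once(i)
        duration = time.perf_counter() - start
        if i < num_warmup:
            continue  # warm-up iterations are excluded from the average
        elapsed += duration
        timed += 1
        if timed % log_interval == 0:
            fps = timed / elapsed if elapsed > 0 else float("inf")
            print(f"Done image [{i + 1}/{num_images}], fps: {fps:.1f} img/s")
    return elapsed / timed


# Stand-in workload instead of a real detector.
avg = benchmark(lambda i: sum(range(1000)), num_images=100, log_interval=50)
print(f"average latency: {avg * 1000:.3f} ms/img")
```

When timing a real model on a GPU, the forward pass would also need explicit synchronization (for example `torch.cuda.synchronize()`) before each clock read, since CUDA kernels are launched asynchronously.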
We compare mmdetection with Detectron2 in terms of speed and performance. We use commit id 185c27e (30/4/2020) of Detectron2. For fair comparison, we install and run both frameworks on the same machine.
- 8 NVIDIA Tesla V100 (32G) GPUs
- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
- Python 3.7
- PyTorch 1.4
- CUDA 10.1
- CUDNN 7.6.03
- NCCL 2.4.08
Type | Lr schd | Detectron2 | mmdetection | Download |
---|---|---|---|---|
Faster R-CNN | 1x | 37.9 | 38.0 | model \| log |
Mask R-CNN | 1x | 38.6 & 35.2 | 38.8 & 35.4 | model \| log |
RetinaNet | 1x | 36.5 | 37.0 | model \| log |
The training speed is measured in s/iter. The lower, the better.
Type | Detectron2 | mmdetection |
---|---|---|
Faster R-CNN | 0.210 | 0.216 |
Mask R-CNN | 0.261 | 0.265 |
RetinaNet | 0.200 | 0.205 |
The inference speed is measured in fps (img/s) on a single GPU. The higher, the better. To be consistent with Detectron2, we report the pure inference speed (without the time of data loading). For Mask R-CNN, we exclude the time of RLE encoding in post-processing. We also include the officially reported speeds in parentheses, which are slightly higher than the results tested on our server due to hardware differences.
Type | Detectron2 | mmdetection |
---|---|---|
Faster R-CNN | 25.6 (26.3) | 22.2 |
Mask R-CNN | 22.5 (23.3) | 19.6 |
RetinaNet | 17.8 (18.2) | 20.6 |
Type | Detectron2 | mmdetection |
---|---|---|
Faster R-CNN | 3.0 | 3.8 |
Mask R-CNN | 3.4 | 3.9 |
RetinaNet | 3.9 | 3.4 |