# Masked-attention Mask Transformer for Universal Image Segmentation
Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
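To make the masked-attention idea concrete, below is a minimal PyTorch sketch of a single masked cross-attention step: attention from each object query is restricted to the pixel locations inside the mask that query predicted in the previous decoder layer. The 0.5 threshold and the fallback to full attention for empty masks follow the paper's description; the function name, tensor layout, and other details here are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(queries, keys, values, mask_logits, num_heads=8):
    """Sketch of Mask2Former-style masked attention (illustrative, not the reference code).

    queries:     (B, N, C)  - one embedding per object query
    keys/values: (B, HW, C) - flattened per-pixel image features
    mask_logits: (B, N, HW) - mask predictions from the previous decoder layer
    """
    B, N, C = queries.shape
    scale = (C // num_heads) ** -0.5

    # Split into heads: (B, heads, N, C/heads) and (B, heads, HW, C/heads).
    q = queries.view(B, N, num_heads, -1).transpose(1, 2)
    k = keys.view(B, -1, num_heads, C // num_heads).transpose(1, 2)
    v = values.view(B, -1, num_heads, C // num_heads).transpose(1, 2)

    # Attention logits between every query and every pixel location.
    attn = (q @ k.transpose(-2, -1)) * scale            # (B, heads, N, HW)

    # Masked attention: pixels outside the thresholded mask from the previous
    # layer are blocked, so cross-attention stays within the predicted region.
    attn_mask = mask_logits.sigmoid() < 0.5              # (B, N, HW), True = blocked
    # If a query's predicted mask is empty, fall back to full attention to avoid NaNs.
    attn_mask[attn_mask.all(dim=-1)] = False
    attn = attn.masked_fill(attn_mask[:, None], float("-inf"))

    out = F.softmax(attn, dim=-1) @ v                    # (B, heads, N, C/heads)
    return out.transpose(1, 2).reshape(B, N, C)
```

In the decoder, this operation takes the place of standard cross-attention in each layer, with the mask predictions refined layer by layer.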
- Mask2Former requires MMDetection to be installed first:
pip install "mmdet>=3.0.0rc4"
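Once MMDetection is installed, the configs listed in the tables below can be used with the standard MMSegmentation 1.x Python API. The snippet below is a minimal sketch; the config and checkpoint paths are placeholders, so substitute the files from the `config` and `model` links in the tables.

```python
from mmseg.apis import inference_model, init_model

# Placeholder paths: replace with the config and checkpoint linked in the tables below.
config_file = 'configs/mask2former/mask2former_r50_8xb2-90k_cityscapes-512x1024.py'  # illustrative name
checkpoint_file = 'mask2former_r50_cityscapes.pth'  # hypothetical local checkpoint

# Build the Mask2Former model and run single-image inference.
model = init_model(config_file, checkpoint_file, device='cuda:0')
result = inference_model(model, 'demo.png')  # SegDataSample holding the predicted segmentation map
```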
## Cityscapes

| Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU (ms+flip) | config | download |
| ------ | -------- | --------- | ------- | -------- | -------------- | ------ | ---- | -------------- | ------ | -------- |
| Mask2Former | R-50-D32 | 512x1024 | 90000 | 5.67 | 9.17 | A100 | 80.44 | - | config | model \| log |
| Mask2Former | R-101-D32 | 512x1024 | 90000 | 6.81 | 7.11 | A100 | 80.80 | - | config | model \| log |
| Mask2Former | Swin-T | 512x1024 | 90000 | 6.36 | 7.18 | A100 | 81.71 | - | config | model \| log |
| Mask2Former | Swin-S | 512x1024 | 90000 | 8.09 | 5.57 | A100 | 82.57 | - | config | model \| log |
| Mask2Former | Swin-B (in22k) | 512x1024 | 90000 | 10.89 | 4.32 | A100 | 83.52 | - | config | model \| log |
| Mask2Former | Swin-L (in22k) | 512x1024 | 90000 | 15.83 | 2.86 | A100 | 83.65 | - | config | model \| log |
## ADE20K

| Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU (ms+flip) | config | download |
| ------ | -------- | --------- | ------- | -------- | -------------- | ------ | ---- | -------------- | ------ | -------- |
| Mask2Former | R-50-D32 | 512x512 | 160000 | 3.31 | 26.59 | A100 | 47.87 | - | config | model \| log |
| Mask2Former | R-101-D32 | 512x512 | 160000 | 4.09 | 22.97 | A100 | 48.60 | - | config | model \| log |
| Mask2Former | Swin-T | 512x512 | 160000 | 3.826 | 23.82 | A100 | 48.66 | - | config | model \| log |
| Mask2Former | Swin-S | 512x512 | 160000 | 3.74 | 19.69 | A100 | 51.24 | - | config | model \| log |
| Mask2Former | Swin-B | 640x640 | 160000 | 5.66 | 12.48 | A100 | 52.44 | - | config | model \| log |
| Mask2Former | Swin-B (in22k) | 640x640 | 160000 | 5.66 | 12.43 | A100 | 53.90 | - | config | model \| log |
| Mask2Former | Swin-L (in22k) | 640x640 | 160000 | 8.86 | 8.81 | A100 | 56.01 | - | config | model \| log |
Note:
- All Mask2Former experiments are run on 8 A100 GPUs with 2 samples per GPU.
- As mentioned in the official repository, Mask2Former results are not very stable. The Mask2Former (Swin-S) result on ADE20K reported in the table is the median of 5 training runs, following the authors' suggestion.
- The ResNet backbones used in Mask2Former models are the standard `ResNet` rather than `ResNetV1c`.
- Test-time augmentation is not supported in MMSegmentation 1.x yet; "ms+flip" results will be added as soon as possible.
## Citation

@inproceedings{cheng2021mask2former,
  title={Masked-attention Mask Transformer for Universal Image Segmentation},
  author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
  booktitle={CVPR},
  year={2022}
}

@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}