Mask2Former

Masked-attention Mask Transformer for Universal Image Segmentation

Introduction

Official Repo

Code Snippet

Abstract

Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
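
Below is a minimal, illustrative sketch of the masked-attention idea described in the abstract, written as plain PyTorch. It is not the official implementation; the function name, tensor shapes and the 0.5 threshold are assumptions for illustration only.

```python
# Illustrative sketch of masked attention: each query's cross-attention is
# restricted to the region predicted by the previous decoder layer's mask.
import torch
import torch.nn.functional as F


def masked_cross_attention(queries, keys, values, mask_logits, threshold=0.5):
    """queries: (N, C); keys, values: (HW, C); mask_logits: (N, HW) from the previous layer."""
    # Standard scaled dot-product attention logits between queries and pixel features.
    attn_logits = queries @ keys.T / keys.shape[-1] ** 0.5      # (N, HW)
    # Block attention outside each query's predicted foreground region.
    blocked = mask_logits.sigmoid() < threshold                  # (N, HW), True = masked out
    # If a predicted mask is empty, fall back to full attention to avoid NaNs.
    blocked[blocked.all(dim=-1)] = False
    attn = F.softmax(attn_logits.masked_fill(blocked, float("-inf")), dim=-1)
    return attn @ values                                         # (N, C) updated query features
```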

Usage

  • Mask2Former requires MMDetection, so install it first (a minimal inference sketch follows the install command):
pip install "mmdet>=3.0.0rc4"
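
As a hedged example, inference with a released checkpoint could look like the following. `init_model` and `inference_model` are MMSegmentation 1.x APIs; the config and checkpoint paths are illustrative placeholders for the files linked in the tables below.

```python
# Minimal inference sketch; assumes mmsegmentation 1.x and mmdet (>=3.0.0rc4) are installed.
from mmseg.apis import init_model, inference_model

config_file = 'configs/mask2former/mask2former_r50_8xb2-90k_cityscapes-512x1024.py'  # placeholder
checkpoint_file = 'mask2former_r50_cityscapes.pth'                                    # placeholder

model = init_model(config_file, checkpoint_file, device='cuda:0')
result = inference_model(model, 'demo.png')  # SegDataSample holding the predicted segmentation map
```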

Results and models

Cityscapes

| Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
| ----------- | -------------- | --------- | ------- | -------- | -------------- | ------ | ----- | ------------- | ------ | ------------ |
| Mask2Former | R-50-D32 | 512x1024 | 90000 | 5.67 | 9.17 | A100 | 80.44 | - | config | model \| log |
| Mask2Former | R-101-D32 | 512x1024 | 90000 | 6.81 | 7.11 | A100 | 80.80 | - | config | model \| log |
| Mask2Former | Swin-T | 512x1024 | 90000 | 6.36 | 7.18 | A100 | 81.71 | - | config | model \| log |
| Mask2Former | Swin-S | 512x1024 | 90000 | 8.09 | 5.57 | A100 | 82.57 | - | config | model \| log |
| Mask2Former | Swin-B (in22k) | 512x1024 | 90000 | 10.89 | 4.32 | A100 | 83.52 | - | config | model \| log |
| Mask2Former | Swin-L (in22k) | 512x1024 | 90000 | 15.83 | 2.86 | A100 | 83.65 | - | config | model \| log |

ADE20K

| Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
| ----------- | -------------- | --------- | ------- | -------- | -------------- | ------ | ----- | ------------- | ------ | ------------ |
| Mask2Former | R-50-D32 | 512x512 | 160000 | 3.31 | 26.59 | A100 | 47.87 | - | config | model \| log |
| Mask2Former | R-101-D32 | 512x512 | 160000 | 4.09 | 22.97 | A100 | 48.60 | - | config | model \| log |
| Mask2Former | Swin-T | 512x512 | 160000 | 3826 | 23.82 | A100 | 48.66 | - | config | model \| log |
| Mask2Former | Swin-S | 512x512 | 160000 | 3.74 | 19.69 | A100 | 51.24 | - | config | model \| log |
| Mask2Former | Swin-B | 640x640 | 160000 | 5.66 | 12.48 | A100 | 52.44 | - | config | model \| log |
| Mask2Former | Swin-B (in22k) | 640x640 | 160000 | 5.66 | 12.43 | A100 | 53.90 | - | config | model \| log |
| Mask2Former | Swin-L (in22k) | 640x640 | 160000 | 8.86 | 8.81 | A100 | 56.01 | - | config | model \| log |

Note:

  • All Mask2Former experiments are run on 8 A100 GPUs with 2 samples per GPU (see the training sketch after this list).
  • As mentioned in the official repo, Mask2Former results are relatively unstable; the Mask2Former (Swin-S) result on the ADE20K dataset in the table above is the median of 5 training runs, following the authors' suggestion.
  • The ResNet backbones used in the Mask2Former models are standard ResNet rather than ResNetV1c.
  • Test-time augmentation is not yet supported in MMSegmentation 1.x; "ms+flip" results will be added as soon as possible.
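
For reference, here is a hedged sketch of launching training programmatically from one of these configs with MMEngine's `Runner`. The reported results were obtained with the 8-GPU distributed launcher; this single-process sketch only shows how a config from this folder is turned into a training run, and the config path is an illustrative placeholder.

```python
# Hedged training sketch; assumes mmsegmentation 1.x (with mmdet) is installed.
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile('configs/mask2former/mask2former_swin-s_8xb2-160k_ade20k-512x512.py')  # placeholder
cfg.work_dir = './work_dirs/mask2former'  # where checkpoints and logs are written

runner = Runner.from_cfg(cfg)
runner.train()
```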

Citation

@inproceedings{cheng2021mask2former,
  title={Masked-attention Mask Transformer for Universal Image Segmentation},
  author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
  booktitle={CVPR},
  year={2022}
}
@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}