Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes. Models are available at this https URL.
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
---|---|---|---|---|---|---|---|---|---|---|
UPerNet | R-50 | 512x1024 | 40000 | 6.4 | 4.25 | V100 | 77.10 | 78.37 | config | model | log |
UPerNet | R-101 | 512x1024 | 40000 | 7.4 | 3.79 | V100 | 78.69 | 80.11 | config | model | log |
UPerNet | R-50 | 769x769 | 40000 | 7.2 | 1.76 | V100 | 77.98 | 79.70 | config | model | log |
UPerNet | R-101 | 769x769 | 40000 | 8.4 | 1.56 | V100 | 79.03 | 80.77 | config | model | log |
UPerNet | R-50 | 512x1024 | 80000 | - | - | V100 | 78.19 | 79.19 | config | model | log |
UPerNet | R-101 | 512x1024 | 80000 | - | - | V100 | 79.40 | 80.46 | config | model | log |
UPerNet | R-50 | 769x769 | 80000 | - | - | V100 | 79.39 | 80.92 | config | model | log |
UPerNet | R-101 | 769x769 | 80000 | - | - | V100 | 80.10 | 81.49 | config | model | log |
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
---|---|---|---|---|---|---|---|---|---|---|
UPerNet | R-50 | 512x512 | 80000 | 8.1 | 23.40 | V100 | 40.70 | 41.81 | config | model | log |
UPerNet | R-101 | 512x512 | 80000 | 9.1 | 20.34 | V100 | 42.91 | 43.96 | config | model | log |
UPerNet | R-50 | 512x512 | 160000 | - | - | V100 | 42.05 | 42.78 | config | model | log |
UPerNet | R-101 | 512x512 | 160000 | - | - | V100 | 43.82 | 44.85 | config | model | log |
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
---|---|---|---|---|---|---|---|---|---|---|
UPerNet | R-50 | 512x512 | 20000 | 6.4 | 23.17 | V100 | 74.82 | 76.35 | config | model | log |
UPerNet | R-101 | 512x512 | 20000 | 7.5 | 19.98 | V100 | 77.10 | 78.29 | config | model | log |
UPerNet | R-50 | 512x512 | 40000 | - | - | V100 | 75.92 | 77.44 | config | model | log |
UPerNet | R-101 | 512x512 | 40000 | - | - | V100 | 77.43 | 78.56 | config | model | log |
@inproceedings{xiao2018unified,
title={Unified perceptual parsing for scene understanding},
author={Xiao, Tete and Liu, Yingcheng and Zhou, Bolei and Jiang, Yuning and Sun, Jian},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
pages={418--434},
year={2018}
}