
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

Junyi Li¹* · Junfeng Wu¹* · Weizhi Zhao¹ · Song Bai² · Xiang Bai¹†

¹Huazhong University of Science and Technology   ²Bytedance Inc.

*Equal Contribution  †Corresponding Author

Paper PDF · Project Page

Highlights:

  • PartGLEE is accepted by ECCV2024!
  • PartGLEE is a part-level foundation model for locating and identifying both objects and parts in images.
  • PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in open-world scenarios.
  • PartGLEE achieves SOTA performance across various part-level tasks and obtains competitive results on traditional object-level tasks.

We will release the following for PartGLEE:

  • [ ] Demo Code

  • [√] Model Zoo

  • [√] Comprehensive User Guide

  • [√] Training Code and Scripts

  • [√] Evaluation Code and Scripts

Getting started

  1. Installation: Please refer to INSTALL.md for more details.
  2. Data preparation: Please refer to DATA.md for more details.
  3. Training: Please refer to TRAIN.md for more details.
  4. Testing: Please refer to TEST.md for more details.
  5. Model zoo: Please refer to MODEL_ZOO.md for more details.

Introduction

We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. Through a unified framework, PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in open-world scenarios. Specifically, we propose a Q-Former to construct the hierarchical relationship between objects and parts, parsing every object into its corresponding semantic parts.

(Figure: data visualization)

PartGLEE comprises an image encoder, a Q-Former, two independent decoders, and a text encoder. The Q-Former establishes the hierarchical relationship between objects and parts: a set of parsing queries is initialized in the Q-Former to interact with each object query, parsing objects into their corresponding parts. The Q-Former thus functions as a decomposer, extracting and representing parts from object queries. Hence, by training jointly on extensive object-level datasets and limited hierarchical datasets that contain object-part correspondences, our Q-Former acquires strong generalization ability, parsing any novel object into its constituent parts.

(Figure: PartGLEE pipeline)
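To make the parsing mechanism concrete, below is a minimal PyTorch sketch of the Q-Former idea: a set of learnable parsing queries cross-attends to each object query and is decoded into part-level queries. All names, dimensions, layer counts, and the number of parsing queries here are illustrative assumptions, not the released configuration:

import torch
import torch.nn as nn

class PartQFormer(nn.Module):
    """Illustrative sketch (not the official implementation): learnable
    parsing queries cross-attend to each object query, decomposing it
    into part-level queries."""

    def __init__(self, dim=256, num_parsing_queries=10, num_layers=3, num_heads=8):
        super().__init__()
        # One shared set of parsing queries, reused for every object query.
        self.parsing_queries = nn.Embedding(num_parsing_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, object_queries):
        # object_queries: (batch, num_objects, dim) from the object-level decoder.
        B, N, D = object_queries.shape
        # Parse each object independently: every object query becomes the
        # (length-1) memory that its own copy of the parsing queries attends to.
        memory = object_queries.reshape(B * N, 1, D)
        parts = self.parsing_queries.weight.unsqueeze(0).repeat(B * N, 1, 1)
        part_queries = self.decoder(tgt=parts, memory=memory)
        return part_queries.reshape(B, N, -1, D)  # (batch, objects, parts, dim)

# Example: 2 images with 5 object queries each yield 10 part queries per object.
qformer = PartQFormer()
part_queries = qformer(torch.randn(2, 5, 256))  # -> (2, 5, 10, 256)

In the full model, these part-level queries would then be consumed by a decoder independent of the object-level one, in line with the two-decoder design described above.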

Dataset Unification

To facilitate training our Q-Former, we augment the original part-level datasets with object-level annotations. Specifically, we add object-level annotations to Pascal Part, PartImageNet, Pascal-Part-116, and ADE-Part-234 to establish the hierarchical correspondence between objects and parts. We further introduce a subset of the open-world instance segmentation dataset SA-1B and augment it into a hierarchical dataset, which further improves the generalization capability of our model.

(Figures: hierarchical dataset annotations; augmented SA-1B examples)
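As a hedged illustration of this unification step (the paper's exact procedure may differ), the sketch below synthesizes an object-level annotation from a part-only dataset by taking the union of one instance's part masks and deriving the enclosing box from it:

import numpy as np

def object_from_parts(part_masks: np.ndarray):
    """Synthesize an object-level annotation from part-level ones.
    part_masks: (num_parts, H, W) boolean masks of one instance's parts;
    assumes at least one foreground pixel. Illustrative only."""
    obj_mask = np.any(part_masks, axis=0)  # union of all part masks
    ys, xs = np.nonzero(obj_mask)          # foreground pixel coordinates
    box_xyxy = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    return obj_mask, box_xyxy

Pairing each synthesized object with its source parts yields the kind of object-part correspondences the Q-Former is trained on.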

Results

Hierarchical Cognitive Performance

(Figure: results on hierarchical tasks)

Traditional Object-level Tasks

(Figure: results on traditional object-level tasks)

Generalization Performance

Cross-Category Generalization Performance

(Figure: cross-category generalization results)

Cross-Dataset Generalization Performance

(Figure: cross-dataset generalization results)

Visualization Results

Comparison with SAM

(Figure: qualitative comparison with SAM)

Visualization of Generalization Capability

(Figure: visualization of generalization capability)

Citing PartGLEE

@article{li2024partglee,
  title={PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects},
  author={Li, Junyi and Wu, Junfeng and Zhao, Weizhi and Bai, Song and Bai, Xiang},
  journal={arXiv preprint arXiv:2407.16696},
  year={2024}
}

Acknowledgments

  • Thanks to GLEE for the implementation of multi-dataset training and data processing.

  • Thanks to MaskDINO for providing a powerful detector and segmenter.