Support LVIS chunked evaluation and image chunked inference of GLIP #11136

Merged (5 commits) on Nov 9, 2023
23 changes: 22 additions & 1 deletion configs/glip/README.md
@@ -56,7 +56,7 @@ model.save_pretrained("your path/bert-base-uncased")
tokenizer.save_pretrained("your path/bert-base-uncased")
```

## Results and Models
## COCO Results and Models

| Model | Zero-shot or Finetune | COCO mAP | Official COCO mAP | Pre-Train Data | Config | Download |
| :--------: | :-------------------: | :------: | ----------------: | :------------------------: | :---------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
@@ -78,3 +78,24 @@ Note:
3. Taking the GLIP-T(A) model as an example, we trained it twice with the official code and obtained fine-tuned mAPs of 52.5 and 52.6. The mAP in our reproduction is therefore higher than the official result, mainly because we modified the `weight_decay` parameter.
4. Our experiments show that training for 24 epochs leads to overfitting, so we report the best-performing checkpoint. If you train on a custom dataset, it is advisable to reduce the number of epochs and save the best-performing checkpoint (see the sketch below).
5. Because the official fine-tuning hyperparameters for the GLIP-L model have not been released, we have not yet reproduced the official accuracy. We found that overfitting can also occur here, so custom data augmentation and model tweaks may be necessary. Given the high cost of training, we have not investigated this further for now.
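
As a concrete illustration of notes 3 and 4, the sketch below shows one way to shorten the schedule and keep the best checkpoint when fine-tuning on a custom dataset. It is a minimal sketch only: the base config name, epoch count, and `weight_decay` value are placeholders, not the settings behind the reported results, and dataset settings are omitted.

```python
# Sketch only: fine-tuning overrides for a custom dataset.
# `your_glip_finetune_config.py` is a placeholder for an existing GLIP
# fine-tuning config; dataset settings are omitted here.
_base_ = './your_glip_finetune_config.py'

# Shorter schedule to reduce overfitting (note 4); validate every epoch.
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)

# Save every epoch and keep the checkpoint with the best validation metric.
default_hooks = dict(
    checkpoint=dict(type='CheckpointHook', interval=1, save_best='auto'))

# Adjusted weight decay (note 3); 0.05 is an arbitrary example value.
optim_wrapper = dict(optimizer=dict(weight_decay=0.05))
```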

## LVIS Results

| Model | Official | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data | Config | Download |
| :--------: | :------: | :---------: | :---------: | :---------: | :--------: | :--------: | :--------: | :--------: | :-------: | :------------------------: | :---------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: |
| GLIP-T (A) | ✔ | | | | | | | | | O365 | [config](lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth) |
| GLIP-T (A) | | 12.1 | 15.5 | 25.8 | 20.2 | 6.2 | 10.9 | 22.8 | 14.7 | O365 | [config](lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth) |
| GLIP-T (B) | ✔ | | | | | | | | | O365 | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_b_mmdet-6dfbd102.pth) |
| GLIP-T (B) | | 8.6 | 13.9 | 26.0 | 19.3 | 4.6 | 9.8 | 22.6 | 13.9 | O365 | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_b_mmdet-6dfbd102.pth) |
| GLIP-T (C) | ✔ | 14.3 | 19.4 | 31.1 | 24.6 | | | | | O365,GoldG | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_c_mmdet-2fc427dd.pth) |
| GLIP-T (C) | | 14.4 | 19.8 | 31.9 | 25.2 | 8.3 | 13.2 | 28.1 | 18.2 | O365,GoldG | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_c_mmdet-2fc427dd.pth) |
| GLIP-T | ✔ | | | | | | | | | O365,GoldG,CC3M,SBU | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_mmdet-c24ce662.pth) |
| GLIP-T | | 18.1 | 21.2 | 33.1 | 26.7 | 10.8 | 14.7 | 29.0 | 19.6 | O365,GoldG,CC3M,SBU | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_mmdet-c24ce662.pth) |
| GLIP-L | ✔ | 29.2 | 34.9 | 42.1 | 37.9 | | | | | FourODs,GoldG,CC3M+12M,SBU | [config](lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_l_mmdet-abfe026b.pth) |
| GLIP-L | | 27.9 | 33.7 | 39.7 | 36.1 | | | | | FourODs,GoldG,CC3M+12M,SBU | [config](lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_l_mmdet-abfe026b.pth) |

Note:

1. The above are zero-shot evaluation results, obtained with chunked inference over the LVIS categories (see the sketch below).
2. The evaluation metric is LVIS fixed AP (`LVISFixedAPMetric`). For details, please refer to [Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details](https://arxiv.org/pdf/2102.01066.pdf).
3. We found that our results on the small models are better than the official ones but lower on the large model, mainly because our post-processing is not fully aligned with the official GLIP implementation.
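
LVIS has 1203 categories, so the full text prompt is too long to process in a single pass, which is why the configs below set `chunked_size` in `test_cfg`. The snippet is a minimal sketch of the chunking idea only; the actual splitting and merging of per-chunk detections happens inside the model during inference.

```python
from mmdet.evaluation import get_classes

# All 1203 LVIS category names, as used to build the text prompt.
class_names = get_classes('lvis')

# Split the categories into chunks of `chunked_size`; 40 matches
# `test_cfg.chunked_size` in the LVIS configs below. Each chunk is run
# through the text branch separately and the detections are merged.
chunked_size = 40
chunks = [class_names[i:i + chunked_size]
          for i in range(0, len(class_names), chunked_size)]

print(len(class_names), len(chunks))  # 1203 categories -> 31 chunks
```
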
12 changes: 12 additions & 0 deletions configs/glip/lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py
@@ -0,0 +1,12 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py'

model = dict(
    backbone=dict(
        embed_dims=192,
        depths=[2, 2, 18, 2],
        num_heads=[6, 12, 24, 48],
        window_size=12,
        drop_path_rate=0.4,
    ),
    neck=dict(in_channels=[384, 768, 1536]),
    bbox_head=dict(early_fuse=True, num_dyhead_blocks=8))
12 changes: 12 additions & 0 deletions configs/glip/lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,12 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py'

model = dict(
    backbone=dict(
        embed_dims=192,
        depths=[2, 2, 18, 2],
        num_heads=[6, 12, 24, 48],
        window_size=12,
        drop_path_rate=0.4,
    ),
    neck=dict(in_channels=[384, 768, 1536]),
    bbox_head=dict(early_fuse=True, num_dyhead_blocks=8))
24 changes: 24 additions & 0 deletions configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py
@@ -0,0 +1,24 @@
_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'

model = dict(test_cfg=dict(
    max_per_img=300,
    chunked_size=40,  # process the LVIS categories in chunks of 40
))

dataset_type = 'LVISV1Dataset'
data_root = 'data/coco/'

val_dataloader = dict(
    dataset=dict(
        data_root=data_root,
        type=dataset_type,
        ann_file='annotations/lvis_od_val.json',
        data_prefix=dict(img='')))
test_dataloader = val_dataloader

# note: the LVIS evaluation API requires numpy < 1.24.0
val_evaluator = dict(
    _delete_=True,
    type='LVISFixedAPMetric',
    ann_file=data_root + 'annotations/lvis_od_val.json')
test_evaluator = val_evaluator
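
For reference, this new config can be run end-to-end from Python with mmengine's `Runner` (or equivalently with `tools/test.py`). A minimal sketch only: the work directory is a placeholder, and the checkpoint URL is the GLIP-T (A) one from the README table above.

```python
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile(
    'configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py')
# GLIP-T (A) checkpoint from the README table above.
cfg.load_from = ('https://download.openmmlab.com/mmdetection/v3.0/glip/'
                 'glip_tiny_a_mmdet-b3654169.pth')
cfg.work_dir = 'work_dirs/glip_lvis_zeroshot'  # placeholder output directory

runner = Runner.from_cfg(cfg)
runner.test()  # chunked inference + LVIS fixed AP evaluation
```
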
25 changes: 25 additions & 0 deletions configs/glip/lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,25 @@
_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'

model = dict(test_cfg=dict(
    max_per_img=300,
    chunked_size=40,  # process the LVIS categories in chunks of 40
))

dataset_type = 'LVISV1Dataset'
data_root = 'data/coco/'

val_dataloader = dict(
    dataset=dict(
        data_root=data_root,
        type=dataset_type,
        ann_file='annotations/lvis_v1_minival_inserted_image_name.json',
        data_prefix=dict(img='')))
test_dataloader = val_dataloader

# note: the LVIS evaluation API requires numpy < 1.24.0
val_evaluator = dict(
    _delete_=True,
    type='LVISFixedAPMetric',
    ann_file=data_root +
    'annotations/lvis_v1_minival_inserted_image_name.json')
test_evaluator = val_evaluator
3 changes: 3 additions & 0 deletions configs/glip/lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py
@@ -0,0 +1,3 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py'

model = dict(bbox_head=dict(early_fuse=True))
3 changes: 3 additions & 0 deletions configs/glip/lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_mini-lvis.py
@@ -0,0 +1,3 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py'

model = dict(bbox_head=dict(early_fuse=True))
37 changes: 35 additions & 2 deletions demo/image_demo.py
@@ -28,6 +28,16 @@
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts 'There are a lot of cars here.'

python demo/image_demo.py demo/demo.jpg \
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts '$: coco'

python demo/image_demo.py demo/demo.jpg \
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts '$: lvis' --pred-score-thr 0.7 \
--palette random --chunked-size 80


Visualize prediction results::

python demo/image_demo.py demo/demo.jpg rtmdet-ins-s --show
@@ -41,6 +51,7 @@
from mmengine.logging import print_log

from mmdet.apis import DetInferencer
from mmdet.evaluation import get_classes


def parse_args():
@@ -60,7 +71,12 @@ def parse_args():
        type=str,
        default='outputs',
        help='Output directory of images or prediction results.')
    parser.add_argument('--texts', help='text prompt')
    # If the input is of the form `$: xxx`, the prompt is built from the
    # class names of that dataset. Supported values: $: coco, $: voc,
    # $: cityscapes, $: lvis, $: imagenet_det.
    # See `mmdet/evaluation/functional/class_names.py` for details.
    parser.add_argument(
        '--texts', help='text prompt, such as "bench . car ." or "$: coco"')
    parser.add_argument(
        '--device', default='cuda:0', help='Device used for inference')
    parser.add_argument(
@@ -91,14 +107,21 @@ def parse_args():
        default='none',
        choices=['coco', 'voc', 'citys', 'random', 'none'],
        help='Color palette used for visualization')
    # only for GLIP
    # only for GLIP and Grounding DINO
    parser.add_argument(
        '--custom-entities',
        '-c',
        action='store_true',
        help='Whether to customize entity names. If so, the input text '
        'should be in the "cls_name1 . cls_name2 . cls_name3 ." format')
    parser.add_argument(
        '--chunked-size',
        '-s',
        type=int,
        default=-1,
        help='If the number of categories is very large, you can set this '
        'parameter to split the text prompt into chunks and run prediction '
        'chunk by chunk; -1 means no chunking.')

    call_args = vars(parser.parse_args())

@@ -111,6 +134,12 @@
        call_args['weights'] = call_args['model']
        call_args['model'] = None

    if call_args['texts'] is not None:
        if call_args['texts'].startswith('$:'):
            dataset_name = call_args['texts'][3:].strip()
            class_names = get_classes(dataset_name)
            call_args['texts'] = [tuple(class_names)]

    init_kws = ['model', 'weights', 'device', 'palette']
    init_args = {}
    for init_kw in init_kws:
@@ -125,6 +154,10 @@ def main():
    # may consume too much memory if your input folder has a lot of images.
    # This will be optimized later.
    inferencer = DetInferencer(**init_args)

    chunked_size = call_args.pop('chunked_size')
    inferencer.model.test_cfg.chunked_size = chunked_size

    inferencer(**call_args)

    if call_args['out_dir'] != '' and not (call_args['no_save_vis']
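
For reference, the `$:`-style prompts and chunked inference added above can also be driven directly through `DetInferencer`. This is a minimal sketch mirroring what the updated demo script does; the score threshold and output directory are arbitrary choices.

```python
from mmdet.apis import DetInferencer
from mmdet.evaluation import get_classes

# Same model alias as in the demo commands above.
inferencer = DetInferencer('glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365',
                           device='cuda:0')

# Equivalent of `--texts '$: lvis' --chunked-size 80`.
inferencer.model.test_cfg.chunked_size = 80
inferencer('demo/demo.jpg',
           texts=[tuple(get_classes('lvis'))],
           pred_score_thr=0.7,
           out_dir='outputs')
```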