
How to make pruner to support FPN like structure? #79

Closed
twmht opened this issue Feb 11, 2022 · 23 comments · Fixed by #126
Assignees
Labels
bug Something isn't working

Comments

@twmht
Contributor

twmht commented Feb 11, 2022

I am trying to prune a model from mmdet (https://github.com/open-mmlab/mmdetection/blob/master/configs/atss/atss_r50_fpn_1x_coco.py).

But it throws an exception when forwarding through the FPN.

[screenshot of the exception traceback]

Any idea?

By the way, I think it would be better to let users configure a whole block (such as the neck and bbox_head) as a group that shares one mask, since these blocks are always complicated and the parsers are hard to modify to handle such cases.

@pppppM pppppM self-assigned this Feb 16, 2022
@pppppM
Collaborator

pppppM commented Feb 16, 2022

Could you upload the pruner config?

@pppppM pppppM added the bug Something isn't working label Feb 17, 2022
@pppppM
Collaborator

pppppM commented Feb 17, 2022

I'm very sorry for the inconvenience.
There is a bug in the pruner's trace mechanism: the shared head only traces its first parent module (FPN 0), and the other parent modules (FPN 1, FPN 2, ...) are not traced.
I will fix it as soon as possible.
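
For reference, a minimal sketch (toy modules, not mmdet code) of the structure that trips the tracer, i.e. one head shared by several FPN levels:

import torch
import torch.nn as nn

class TinyFPNModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Three FPN-level convs feeding one shared head, as in RetinaNet/ATSS.
        self.fpn_convs = nn.ModuleList(nn.Conv2d(16, 8, 1) for _ in range(3))
        self.head = nn.Conv2d(8, 4, 3, padding=1)  # shared across all levels

    def forward(self, feats):
        # The shared head runs once per level; a tracer that records only the
        # first call (FPN 0) never links FPN 1 and FPN 2 to the head's input
        # channels, so their pruning masks go out of sync.
        return [self.head(conv(f)) for conv, f in zip(self.fpn_convs, feats)]

model = TinyFPNModel()
outs = model([torch.randn(1, 16, 32, 32) for _ in range(3)])
print([tuple(o.shape) for o in outs])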

@twmht
Contributor Author

twmht commented Feb 17, 2022

@pppppM

This is what I was concerned about.

By the way, I think it would be better to let users configure a whole block (such as the neck and bbox_head) as a group that shares one mask, since these blocks are always complicated and the parsers are hard to modify to handle such cases.

I have done this by passing a prebuilt channel space (in txt format) to my reimplemented AutoSlim.

It is hard for a parser to handle every network architecture; the same problem exists in NNI (https://nni.readthedocs.io/en/stable/Compression/ModelSpeedup.html#limitations).

The channel space can be generated by NNI or MMRazor, saved as a text file, and then modified by users if the channel dependencies are not built correctly.
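
For illustration, a minimal sketch (hypothetical file format and layer names) of such an editable channel space, with one shared-mask group per line:

from pathlib import Path

def save_channel_space(groups, path):
    # One group per line; all layers in a group must share one pruning mask.
    Path(path).write_text('\n'.join(','.join(g) for g in groups) + '\n')

def load_channel_space(path):
    return [line.split(',') for line in Path(path).read_text().splitlines() if line]

groups = [
    ['neck.lateral_convs.0.conv', 'bbox_head.cls_convs.0.conv'],
    ['backbone.layer1.0.conv1'],
]
save_channel_space(groups, 'channel_space.txt')
# Users can edit the file by hand to merge or split groups.
assert load_channel_space('channel_space.txt') == groups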

What is your opinion?

@pppppM
Collaborator

pppppM commented Feb 17, 2022

Sounds great!
Are you interested in making a PR? We can discuss further.

@twmht
Contributor Author

twmht commented Feb 17, 2022

@pppppM

Sure.

@pppppM
Collaborator

pppppM commented Feb 17, 2022

Before the open-source release, most popular models could be handled correctly, such as ResNet, MobileNet, RetinaNet, YOLOX, etc. Probably something went wrong when we refactored the code.
We do need a configurable mechanism to handle the models that cannot be parsed correctly.
I'm very excited to develop this feature with you. Looking forward to your PR.

@pppppM pppppM linked a pull request Apr 1, 2022 that will close this issue
@pppppM pppppM removed a link to a pull request Apr 1, 2022
@pppppM pppppM linked a pull request Apr 1, 2022 that will close this issue
@HIT-cwh
Collaborator

HIT-cwh commented Apr 15, 2022

Hi! This bug has been fixed in PR #126.

@Bing1002

it's the same as https://github.com/open-mmlab/mmrazor/blob/master/configs/pruning/autoslim/autoslim_mbv2_supernet_8xb256_in1k.py#L41, except the model is changed to https://github.com/open-mmlab/mmdetection/blob/master/configs/atss/atss_r50_fpn_1x_coco.py

Hi, can you please upload the prune config file? I used the approach you referred to but still got errors. Did you successfully run AutoSlim on an object detection task? Thanks.

@twmht
Contributor Author

twmht commented Apr 16, 2022

@Bing1002

I have not tried the latest mmrazor. Did you?

@Bing1002

Bing1002 commented Apr 16, 2022 via email

@HIT-cwh
Collaborator

HIT-cwh commented Apr 16, 2022

I'm very sorry for the inconvenience.
Pruning models with GroupNorm is not supported at present, and GroupNorm is the default normalization in ATSSHead. We will fix this as soon as possible.
Models such as RetinaNet and YOLOX can be pruned with our code. The following code can be used:

# Imports needed for this snippet to run standalone: ConfigDict from mmcv,
# and build_algorithm, assumed here to come from mmrazor's model builder.
from mmcv import ConfigDict
from mmrazor.models import build_algorithm

model = dict(
    type='mmdet.RetinaNet',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_input',
        num_outs=5),
    bbox_head=dict(
        type='RetinaHead',
        num_classes=80,
        in_channels=256,
        stacked_convs=4,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            octave_base_scale=4,
            scales_per_octave=3,
            ratios=[0.5, 1.0, 2.0],
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[.0, .0, .0, .0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    # model training and testing settings
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.4,
            min_pos_iou=0,
            ignore_iof_thr=-1),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=100))

algorithm_cfg = ConfigDict(
    type='AutoSlim',
    architecture=dict(type='MMDetArchitecture', model=model),
    pruner=dict(
        type='RatioPruner',
        ratios=(2 / 12, 3 / 12, 4 / 12, 5 / 12, 6 / 12, 7 / 12, 8 / 12, 9 / 12,
                10 / 12, 11 / 12, 1.0)),
    retraining=False,
    bn_training_mode=True,
    input_shape=None)

algorithm = build_algorithm(algorithm_cfg)
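
As an aside, a possible workaround for the GroupNorm limitation (an untested sketch; ATSSHead defaults to GN with 32 groups) is to override the head's normalization in a config that inherits the ATSS base:

_base_ = ['./atss_r50_fpn_1x_coco.py']
# Untested sketch: swap ATSSHead's default GroupNorm for BatchNorm, since the
# pruner does not support GN yet. The nested dict is merged over the base config.
model = dict(bbox_head=dict(norm_cfg=dict(type='BN', requires_grad=True)))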

@Bing1002

Hi, thanks for your reply. I tried this config but it still failed.

Here is the config:

model = dict(
    type='mmdet.RetinaNet',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_input',
        num_outs=5),
    bbox_head=dict(
        type='RetinaHead',
        num_classes=80,
        in_channels=256,
        stacked_convs=4,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            octave_base_scale=4,
            scales_per_octave=3,
            ratios=[0.5, 1.0, 2.0],
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[.0, .0, .0, .0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    # model training and testing settings
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.4,
            min_pos_iou=0,
            ignore_iof_thr=-1),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=100))

dataset_type = 'CocoDataset'
data_root = '/mnt/data/coco_demo/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=2,
    train=dict(
        type='CocoDataset',
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
work_dir = './work_dirs/retinanet_r50_fpn_1x_coco'
auto_resume = False
gpu_ids = range(0, 1)


algorithm = dict(
    type='AutoSlim',
    architecture=dict(type='MMDetArchitecture', model=model),
    pruner=dict(
        type='RatioPruner',
        ratios=(2 / 12, 3 / 12, 4 / 12, 5 / 12, 6 / 12, 7 / 12, 8 / 12, 9 / 12,
                10 / 12, 11 / 12, 1.0)),
    retraining=False,
    bn_training_mode=True,
    input_shape=None)

And the error is the same as before:

2022-04-16 09:49:55,736 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2022-04-16 09:49:55,736 - mmdet - INFO - Checkpoints will be saved to /home/local/york.lan/bing.zha/code/mmrazor/work_dirs/retinanet_r50_fpn_1x_coco by HardDiskBackend.
Traceback (most recent call last):
  File "/home/local/york.lan/bing.zha/code/mmrazor/tools/mmdet/train_mmdet.py", line 210, in <module>
    main()
  File "/home/local/york.lan/bing.zha/code/mmrazor/tools/mmdet/train_mmdet.py", line 199, in main
    train_mmdet_model(
  File "/home/local/york.lan/bing.zha/code/mmrazor/mmrazor/apis/mmdet/train.py", line 206, in train_mmdet_model
    runner.run(data_loader, cfg.workflow)
  File "/home/local/york.lan/bing.zha/code/mmcv_1.4.6/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/local/york.lan/bing.zha/code/mmcv_1.4.6/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/local/york.lan/bing.zha/code/mmcv_1.4.6/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/local/york.lan/bing.zha/code/mmcv_1.4.6/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/local/york.lan/bing.zha/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/local/york.lan/bing.zha/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved variables after calling backward.

Looking forward to your response. Thanks.

@twmht
Contributor Author

twmht commented Apr 16, 2022

You may try to set optimizer_config to None.

@Bing1002

You may try to set optimizer_config to None.

After changing that part, now I can run pruning. Could you please explain why that setting matters?

@twmht
Contributor Author

twmht commented Apr 16, 2022

AutoSlim calls optimizer.step() itself, not through the mmcv hook. Setting optimizer_config to None means the mmcv OptimizerHook (which also calls loss.backward(), as shown in your traceback) is not registered, so backward and step are not run twice.
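
Concretely, the change in the config above is:

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
# Do not register mmcv's OptimizerHook; AutoSlim's train_step already runs
# loss.backward() and optimizer.step() itself.
optimizer_config = None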

@Bing1002

Thanks. But then the returned loss becomes NaN.

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5000/5000, 15.2 task/s, elapsed: 329s, ETA:     0s2022-04-16 14:02:58,786 - mmdet - INFO - Evaluating bbox...
Loading and preparing results...
2022-04-16 14:02:58,787 - mmdet - ERROR - The testing results of the whole dataset is empty.
2022-04-16 14:02:58,816 - mmdet - INFO - Exp name: autoslim_retinanet.py
2022-04-16 14:02:58,841 - mmdet - INFO - Epoch(val) [4][5000]
2022-04-16 14:04:45,274 - mmdet - INFO - Epoch [5][50/1239]     lr: 1.000e-02, eta: 5:31:29, time: 2.128, data_time: 0.051, memory: 9424, max_model.loss_cls: nan, max_model.loss_bbox: nan, min_model.loss_cls: nan, min_model.loss_bbox: nan, prune_model1.loss_cls: nan, prune_model1.loss_bbox: nan, prune_model2.loss_cls: nan, prune_model2.loss_bbox: nan, loss: nan
2022-04-16 14:06:28,567 - mmdet - INFO - Epoch [5][100/1239]    lr: 1.000e-02, eta: 5:29:53, time: 2.066, data_time: 0.007, memory: 9424, max_model.loss_cls: nan, max_model.loss_bbox: nan, min_model.loss_cls: nan, min_model.loss_bbox: nan, prune_model1.loss_cls: nan, prune_model1.loss_bbox: nan, prune_model2.loss_cls: nan, prune_model2.loss_bbox: nan, loss: nan
2022-04-16 14:08:13,967 - mmdet - INFO - Epoch [5][150/1239]    lr: 1.000e-02, eta: 5:28:21, time: 2.108, data_time: 0.007, memory: 9424, max_model.loss_cls: nan, max_model.loss_bbox: nan, min_model.loss_cls: nan, min_model.loss_bbox: nan, prune_model1.loss_cls: nan, prune_model1.loss_bbox: nan, prune_model2.loss_cls: nan, prune_model2.loss_bbox: nan, loss: nan
2022-04-16 14:09:58,547 - mmdet - INFO - Epoch [5][200/1239]    lr: 1.000e-02, eta: 5:26:47, time: 2.092, data_time: 0.007, memory: 9424, max_model.loss_cls: nan, max_model.loss_bbox: nan, min_model.loss_cls: nan, min_model.loss_bbox: nan, prune_model1.loss_cls: nan, prune_model1.loss_bbox: nan, prune_model2.loss_cls: nan, prune_model2.loss_bbox: nan, loss: nan
2022-04-16 14:11:42,629 - mmdet - INFO - Epoch [5][250/1239]    lr: 1.000e-02, eta: 5:25:11, time: 2.082, data_time: 0.007, memory: 9424, max_model.loss_cls: nan, max_model.loss_bbox: nan, min_model.loss_cls: nan, min_model.loss_bbox: nan, prune_model1.loss_cls: nan, prune_model1.loss_bbox: nan, prune_model2.loss_cls: nan, prune_model2.loss_bbox: nan, loss: nan
2022-04-16 14:13:25,492 - mmdet - INFO - Epoch [5][300/1239]    lr: 1.000e-02, eta: 5:23:34, time: 2.057, data_time: 0.007, memory: 9424, max_model.loss_cls: nan, max_model.loss_bbox: nan, min_model.loss_cls: nan, min_model.loss_bbox: nan, prune_model1.loss_cls: nan, prune_model1.loss_bbox: nan, prune_model2.loss_cls: nan, prune_model2.loss_bbox: nan, loss: nan
2022-04-16 14:15:09,866 - mmdet - INFO - Epoch [5][350/1239]    lr: 1.000e-02, eta: 5:21:59, time: 2.087, data_time: 0.007, memory: 9424, max_model.loss_cls: nan, max_model.loss_bbox: nan, min_model.loss_cls: nan, min_model.loss_bbox: nan, prune_model1.loss_cls: nan, prune_model1.loss_bbox: nan, prune_model2.loss_cls: nan, prune_model2.loss_bbox: nan, loss: nan

Do you have any idea about it? Thanks a lot!

@HIT-cwh
Collaborator

HIT-cwh commented Apr 17, 2022

We have not verified whether AutoSlim works on object detection. Maybe you can try pruning MobileNet V2 first to check whether the problem is in the code or in AutoSlim itself.

@HIT-cwh
Collaborator

HIT-cwh commented Apr 19, 2022

Thanks. But then the returned loss becomes NaN. [NaN training log quoted above] Do you have any idea about it? Thanks a lot!

Do you detach the teacher's output in the loss function, such as here?
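
For reference, a minimal sketch (illustrative names, not mmrazor's exact code) of detaching the teacher's output in a soft-target loss:

import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, tau=1.0):
    # detach() stops gradients from flowing back into the teacher's graph, so
    # the student's loss cannot backpropagate through the teacher a second time.
    soft_targets = F.softmax(teacher_logits.detach() / tau, dim=1)
    log_probs = F.log_softmax(student_logits / tau, dim=1)
    return F.kl_div(log_probs, soft_targets, reduction='batchmean') * tau ** 2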

@twmht
Contributor Author

twmht commented Apr 19, 2022

@HIT-cwh

He did not use distillation.

@HIT-cwh
Collaborator

HIT-cwh commented Apr 19, 2022

@HIT-cwh

He did not use distillation.

My bad.
Due to a lack of manpower, progress on transferring AutoSlim to other tasks has not been satisfactory, and I'm very sorry for the inconvenience.
We are reproducing BigNAS; if it goes well, we will release a BigNAS example on semantic segmentation.

@twmht
Contributor Author

twmht commented Apr 19, 2022

@HIT-cwh

In fact, I have implemented my own AutoSlim. It is quite different from mmrazor, and its memory usage is much more efficient.

I use gradient clipping in object detection. Without distillation the results are satisfactory, but when applying distillation such as CWD, the results are bad. You may try gradient clipping if you get NaN at the beginning of training, as in the sketch below.

By the way, most anytime networks (like BigNAS) do not explain how they use distillation in object detection. I am exploring this, and I am looking forward to your experiments on it.
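
A minimal sketch of that gradient-clipping step (max_norm=35 mirrors a common mmdet setting; the model and values here are only illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_

model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, target = torch.randn(4, 8), torch.randn(4, 2)
loss = F.mse_loss(model(x), target)
loss.backward()
# Clip gradients before the optimizer step to avoid NaN early in training.
clip_grad_norm_(model.parameters(), max_norm=35, norm_type=2)
optimizer.step()
optimizer.zero_grad()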

@HIT-cwh
Collaborator

HIT-cwh commented Apr 19, 2022

In fact, I have implemented my own AutoSlim. It is quite different from mmrazor, and its memory usage is much more efficient. [full comment quoted above]

I would appreciate it if you could share how you save memory in your implementation, and we will improve our code based on that.

@pppppM pppppM closed this as completed Jul 1, 2022
humu789 pushed a commit to humu789/mmrazor that referenced this issue Feb 13, 2023