
Matrix Nets: A New Deep Architecture for Object Detection - mAP of 47.8@0.5...0.95 on MS COCO, #3772

Open
WongKinYiu opened this issue Aug 17, 2019 · 84 comments
Labels: ToDo RoadMap, want enhancement (Want to improve accuracy, speed or functionality)

Comments

@WongKinYiu (Collaborator)

Could this repo support a max-pooling layer with different x and y strides?
I would like to implement the state-of-the-art object detector.
Thanks.


@AlexeyAB added the want enhancement (Want to improve accuracy, speed or functionality) label on Aug 17, 2019
@AlexeyAB (Owner) commented Aug 17, 2019

https://arxiv.org/abs/1908.04646v2


xNets can be applied to any backbone, similar to FPNs.
...
We detect corners for objects of different sizes and aspect ratios using different matrix layers, and simplify the matching process by removing the embedding layer entirely and regressing the object centers directly.
We show that KP-xNet outperforms all existing single-shot detectors by achieving 47.8% mAP on the MS COCO benchmark.



xNets map objects with different sizes and aspect ratios into layers where the sizes and the aspect ratios of the objects within their layers are nearly uniform. Hence, xNets provide a scale and aspect ratio aware architecture. We leverage xNets to enhance key-points based object detection. Our architecture achieves mAP of 47.8 on MS COCO, which is higher than any other single-shot detector while using half the number of parameters and training 3x faster than the next best architecture.

@AlexeyAB changed the title from "Matrix Nets: A New Deep Architecture for Object Detection" to "Matrix Nets: A New Deep Architecture for Object Detection - mAP of 47.8@0.5...0.95 on MS COCO" on Aug 17, 2019
@AlexeyAB (Owner) commented Aug 17, 2019

@WongKinYiu

Could this repo support a max-pooling layer with different x and y strides?

Is this the only necessary feature for the implementation of the Matrix Net?
So we should have:

stride_x=
stride_y=

Or do you need something else?

@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello,
Yes, but it would be better if it could also support convolutional and average-pooling layers with stride_x and stride_y.
The paper seems to use convolutional layers for down-sampling.

@AlexeyAB AlexeyAB added the ToDo RoadMap label Aug 17, 2019
@AlexeyAB (Owner) commented Aug 27, 2019

@WongKinYiu Hi,

I added support for

[maxpool]
stride_x=2
stride_y=3
...

Try to build a network with it, and if it works well and improves accuracy, I will add stride_x & stride_y for the convolutional layer.
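For example, a minimal (untested) fragment where two asymmetric maxpools branch from the same base feature map to form two off-diagonal matrix layers - the layer offsets here are assumptions:

[maxpool]          # halve the width only -> a layer for wider objects
size=2
stride_x=2
stride_y=1

[route]            # route back to the base feature map two layers up
layers=-2

[maxpool]          # halve the height only -> a layer for taller objects
size=2
stride_x=1
stride_y=2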

@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello, thank you very very much.

@AlexeyAB (Owner)

It seems that MatrixNet (different strides) and TridentNet (different dilations) are very promising approaches for generalizing across different object sizes and aspect ratios.

@AlexeyAB (Owner) commented Sep 1, 2019

@WongKinYiu Hi,

Any progress?

I added stride_x=... stride_y=... for the convolutional layer. So you can try to make a MatrixNet.

[convolutional]
stride_x=2
stride_y=3
...

https://arxiv.org/pdf/1908.04646v2.pdf

2.1. Layer Generation
....
The upper triangular layers are obtained by applying a series of shared 3x3 convolutions with stride 1x2 on the diagonal layers. Similarly, the bottom left layers are obtained using shared 3x3 convolutions with stride 2x1. The parameters are shared across all down-sampling convolutions to minimize the number of new parameters.

@WongKinYiu
Do you understand:

  • Does it mean that some layers share their weights (as in TridentNet)?
  • Which layers share their weights with which others? (Only layers with the same aspect ratio, e.g. do all conv layers with w=4x, h=1x share the same weights?)


@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello, for the version with different-stride max-pooling layers, I am now training for 200k epochs.
It needs 7~9 more days to finish training.

@LukeAI commented Sep 1, 2019

@WongKinYiu Are you training a MatrixNet with the original resnext-101 backbone, or something else?

@AlexeyAB (Owner) commented Sep 1, 2019

@WongKinYiu Thanks, do you know about "shared 3x3 convolutions"? #3772 (comment)

@WongKinYiu (Collaborator, Author)

@AlexeyAB Yes, I think it is similar to TridentNet.
In my understanding, only layers with the same aspect ratio share the same weights.
But the max-pooling version has no parameters.

@WongKinYiu (Collaborator, Author)

@LukeAI Hello.
I do not use big models like resnet-50 or resnext-101...
I only train very small models.

@LukeAI commented Sep 2, 2019

which one? :) darknet-53 ?

@AlexeyAB (Owner) commented Sep 2, 2019

@WongKinYiu Thanks.

You can try to train 3 models:

  1. with [maxpool] stride_x=... stride_y=...

  2. with [convolutional] stride_x=... stride_y=... and share_index=... in some layers, to reuse the weights of the conv layer with that index (as in TridentNet)

  3. then with [convolutional] stride_x=... stride_y=... antialiasing=1 and share_index=... in some layers

And compare results.

I added antialiasing=1 parameter for [convolutional] layer: #3672
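A rough, untested fragment for items 2 and 3 (the filter count and the share_index target 85 are assumed values, not from the paper):

[convolutional]    # first width-down-sampling conv (say, layer 85)
batch_normalize=1
filters=256
size=3
stride_x=2
stride_y=1
antialiasing=1     # item 3 only; blur before down-sampling (#3672)
activation=leaky

...

[convolutional]    # a later width-down-sampling conv reusing layer 85's weights
batch_normalize=1
filters=256
size=3
stride_x=2
stride_y=1
antialiasing=1
share_index=85     # assumed index of the conv layer above
activation=leaky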

@WongKinYiu (Collaborator, Author)

@AlexeyAB Thank you for your advice. I will have free GPUs after two weeks.

@WongKinYiu (Collaborator, Author) commented Sep 7, 2019

@AlexeyAB

maxpool version:

Model        +anchors   BFLOPs   mAP@0.5   mAP@0.5:0.95
original     -          5.32     48.7      25.7
matrix net   no         5.27     49.1      25.4
matrix net   yes        5.32     48.6      26.3

@AlexeyAB (Owner) commented Sep 7, 2019

@WongKinYiu Thanks for the results!

  • What does anchors = no mean? Did you implement [yolo] layers without anchors?
  • Can you share the cfg file?

@WongKinYiu (Collaborator, Author) commented Sep 7, 2019

@AlexeyAB

It means without increasing the number of anchors.

I cannot access the cfg file right now, because the power supply in my office will be cut off tomorrow, so I have already turned off my computer.

@AlexeyAB (Owner) commented Sep 7, 2019

@WongKinYiu Does original mean the original MatrixNet from https://arxiv.org/abs/1908.04646v2 ?

@WongKinYiu (Collaborator, Author) commented Sep 7, 2019

@AlexeyAB

No, original means a yolov3-based model without the feature proposed by MatrixNet.

You can see the figure in #3772 (comment):
original means a model with three yolo layers (3+3+3 = 9 anchors),
while the model on the right-hand side has nine yolo layers with different aspect ratios (MatrixNet), each yolo layer predicting boxes for one anchor (1+1+1+1+1+1+1+1+1 = 9 anchors).
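For illustration, one such single-anchor [yolo] head might look like this (using the standard yolov3 COCO anchors only as an example; mask selects the single anchor this layer predicts):

[yolo]
mask=4             # this head predicts only anchor #4 of the 9
anchors=10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=80
num=9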

@WongKinYiu (Collaborator, Author) commented Sep 8, 2019

cfg files based on yolov3-tiny

yolov3-tiny_3l(15.778BFLOPs).cfg.txt
yolov3-tiny_3l_maxpoolmatrixnet(15.565BFLOPs).cfg.txt
yolov3-tiny_3l_maxpoolmatrixnet_addanchors(15.787BFLOPs).cfg.txt

For the conv version, just replace the maxpool layers with conv layers with shared weights.

@AlexeyAB (Owner) commented Sep 8, 2019

@WongKinYiu Thanks. What mAP@0.5 did you get for yolov3-tiny_3l_maxpoolmatrixnet_addanchors(15.787BFLOPs).cfg.txt ?

@AlexeyAB (Owner) commented Sep 8, 2019

@WongKinYiu

Did you come across an original MatrixNet backbone anywhere (not just yolov3/tiny with non-uniform strides)? If yes, can you share it?

@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello,

I have not come across an original MatrixNet backbone.

In the paper, they only show the concept of MatrixNet and apply it to CornerNet, without giving details.
But the concept can be used with YOLO.

@glenn-jocher

@WongKinYiu yes, I think you are right. The darknet training set (2014) has 117k images and tests on 5k images. The 2017 set, I believe, trains on 120k images and tests on 20k. So testing will take 4X longer on 2017, but I believe most of the training data is the same.

@glenn-jocher commented Nov 28, 2019

I've updated my mAP section, so the jump between YOLOv3 and YOLOv3-SPP is clearer. We are not at the level of ASFF yet, but it's encouraging to see that we are not too far away either.

https://github.com/ultralytics/yolov3#map

Model                    img-size   COCO mAP@0.5...0.95   COCO mAP@0.5
YOLOv3-tiny              320        14.0                  29.0
YOLOv3                   320        28.7                  51.5
YOLOv3-SPP               320        30.5                  52.3
YOLOv3-SPP ultralytics   320        35.2                  53.9
YOLOv3-tiny              416        16.0                  32.9
YOLOv3                   416        31.1                  55.3
YOLOv3-SPP               416        33.9                  56.8
YOLOv3-SPP ultralytics   416        38.8                  58.7
YOLOv3-tiny              608        16.6                  35.5
YOLOv3                   608        33.0                  57.9
YOLOv3-SPP               608        37.0                  60.6
YOLOv3-SPP ultralytics   608        40.4                  60.1

The main takeaway from the ASFF paper is Figure 1. It shows that while some of the new work coming out claims very high mAPs, it does so at the expense of inference time; in this sense I like the ASFF approach of adding to YOLOv3 to obtain results with minimal hits to FPS.

@WongKinYiu (Collaborator, Author)

@glenn-jocher yes,

There are too many tricks which are hard to implement in darknet...
Maybe I will test new methods in PyTorch and then feed them back to @AlexeyAB in the future.

@AlexeyAB (Owner) commented Nov 28, 2019

@WongKinYiu

There are too many tricks which are hard to implement in darknet...

What features do you mean?

I added several fixes, so ASFF and BiFPN (from EfficientDet) can be implemented there - I hope there are no bugs:

Instead of softmax, we use the fast normalized fusion from EfficientDet, which is more efficient than softmax (section 3.3, page 4): https://arxiv.org/pdf/1911.09070v1.pdf

Our ablation study shows this fast fusion approach has very similar learning behavior and accuracy as the softmax-based fusion, but runs up to 30% faster on GPUs (Table 5).
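In the paper's notation, the fast normalized fusion computes O = sum_i ( w_i / (eps + sum_j w_j) ) * I_i, where each weight w_i is kept non-negative by a ReLU - as I understand it, this is what activation=normalize_channels implements per channel, while activation=normalize_channels_softmax is the softmax-based variant.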

ASFF-like:

[route]
layers=22,33,44 # 3-layers which are already resized to the same WxHxC

[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels # ReLU is integrated to activation=normalize_channels

[route]
layers=-1
group_id=0
groups=3

[scale_channels]
from=22
scale_wh=1

[route]
layers=-3
group_id=1
groups=3

[scale_channels]
from=33
scale_wh=1


[route]
layers=-5
group_id=2
groups=3

[scale_channels]
from=44
scale_wh=1

[shortcut]
from=-3
activation=linear

[shortcut]
from=-6
activation=linear
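(To read this cfg: the 1x1 conv outputs 3 fusion weights per pixel and normalizes them across channels; each [route] with groups=3 / group_id=i selects one of the three weight maps; [scale_channels] scale_wh=1 multiplies the corresponding input feature map by its weight map; and the two [shortcut] layers sum the three weighted maps into the fused output.)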


@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello,

For example, AutoAugment, Fast AutoAugment, autograd, deformable conv, and multi-task training...

In my project, I need to do detection, segmentation, and tracking simultaneously.
link1, link2, link3
But modifying the data loader and building a multi-task detector-segmentor-tracker is difficult for me.

By the way, I will create an ASFF-like cfg in the coming days.

@AlexeyAB (Owner)

@WongKinYiu

Which of these tasks do you consider the most promising?

Yes, instance segmentation - YOLACT is still in the ToDo list: https://github.com/AlexeyAB/darknet/projects/6 Likewise a tracker: #3042

I will think about whether I can make a fast "Joint Detection and Embedding for fast multi-object tracking".

For now, the simplest way is just to add movement to the data augmentation, so we will be able to train conv-LSTM models on non-sequential datasets like MS COCO; it will increase the accuracy of detection on video, but will still require an additional tracker for track_ids.

The highest priorities were: #4264 #4346 #4382 #4203 #3830

Do you think Deformable Conv is better than Deformable Kernels? #4066

@WongKinYiu (Collaborator, Author)

@AlexeyAB

For me, "Joint Detection and Embedding for fast multi-object tracking".
For improving YOLOv3, ASFF and tricks used in the ASFF paper.

Oh, I treat deformable kernel as deformable conv v3.

@AlexeyAB (Owner)

@WongKinYiu

Did you try to compare CSPNet with ResNet101-deformable-conv-TridentNet (FPS / accuracy)? https://arxiv.org/abs/1901.01892v1 issue #3363

How slow are deformable convolutions?

They add only ~+1% AP, but maybe they are very slow?

@WongKinYiu (Collaborator, Author)

@AlexeyAB

I have not tried it.
If darknet supported deformable conv, I could train the model.

At ICCV, the authors of TridentNet showed a fast version of TridentNet.
They train on multiple branches and test on a single branch.
https://github.com/facebookresearch/detectron2/tree/master/projects/TridentNet

@AlexeyAB (Owner)

@WongKinYiu

They don't use deformable conv in the scale-aware Resnet101-TridentNet-Fast.

Resnet152-TridentNet is already implemented in Darknet: https://github.com/AlexeyAB/darknet/blob/master/cfg/resnet152_trident.cfg

But we should implement Resnet101-TridentNet rather than Resnet152: #3363

To implement the scale-aware Resnet101-TridentNet-Fast, we should train on multiple branches and test on a single branch, as you describe.

They use ResNet-101 (40.6% AP) rather than ResNet-101-Deformable (41.8% AP), because deformable conv is too slow for the Fast network.


@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello,

They released their long paper and code.


@AlexeyAB (Owner) commented Jan 15, 2020

@WongKinYiu Thanks!

MatrixNet achieves only 3-4 FPS, while CSPResNeXt-50-PANet-SPP-optimal achieves 30-40 FPS with ~ the same AP.

I added something like CenterNet, which uses 3 x [Gaussian_yolo] layers (left-top, right-bottom, center) for each final resolution: #3229 (comment)
You can try to use it with the MatrixNet structure.

But I'm not sure that there will be a good result.

@AlexeyAB (Owner)

@WongKinYiu Also, I fixed a bug with Tensor Cores, and it seems any groups value larger than groups=1 slows down inference speed, so CSPDarknet53 and CSPDarkNet53-PANet-SPP should be the best models even for 608x608 if Tensor Cores are used.

@WongKinYiu (Collaborator, Author)

@AlexeyAB Thanks a lot,

I am doing an ablation study of CSPResNeXt50-PANet-SPP-Optimal, so there are no large-RAM GPUs available at the moment.
I will test new methods using small models on a small dataset.

Yes, CSPDarknet53-PANet-SPP gets the best AP and FPS currently; I am developing new methods based on CSPDarknet53.

I find that bad detections usually appear on overlapped objects in my project.
If your focused task also has this kind of problem, do you think the Soft-IoU layer (cvpr19-code) could be combined with YOLOv3, since you already introduced the concept of CenterNet into YOLOv3?

It seems ASFF and Bi-FPN very easily get NaN even when combined with the original Darknet53.
I am trying to solve the problem.

@AlexeyAB (Owner)

@WongKinYiu

I find that bad detections usually appear on overlapped objects in my project.
If your focused task also has this kind of problem, do you think the Soft-IoU layer (cvpr19-code) could be combined with YOLOv3, since you already introduced the concept of CenterNet into YOLOv3?

I will think about it. I created an issue: #4701

Also, what do you think about Repulsion Loss: Detecting Pedestrians in a Crowd (CVPR 2018) - without NMS at all? https://arxiv.org/abs/1711.07752v2 and #3113

@WongKinYiu (Collaborator, Author)

@AlexeyAB

  • Yes, it gets NaN after 140k epochs, when the learning rate is 0.0005.

  • I do not train with DropBlock, but all of ASFF, RFB, and ASFF+RFB easily get NaN when using CIoU or GIoU loss.
    MSE loss seems more stable for ASFF and RFB.

  • Yes, training with DropBlock is still slower than without DropBlock.

  • Yes, it gets NaN when all of the other hyper-parameters are set the same as for csresnext50-panet-spp-optimal.
    Currently I am reducing the learning rate to 0.001 and retraining.

  • Thanks.

  • Yes, having no NMS at all would be the best solution, since NMS does not consider any semantic information.

@Kyuuki93

@WongKinYiu What normalizer did you use in the ASFF module? In my experiments, ASFF got NaN when using ReLU to normalize the weights, and changing to softmax resolved it; the reason could be something like the dead-neuron problem that happens with ReLU.

@AlexeyAB (Owner)

@Kyuuki93
@WongKinYiu Yes, you should use activation=normalize_channels_softmax instead of activation=normalize_channels in the conv layer before [scale_channels] scale_wh=1 in ASFF:

[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels_softmax
#activation=normalize_channels            # don't use

Yes, training with DropBlock is still slower than without DropBlock.

Is it about ~2x slower?

@AlexeyAB (Owner)

@WongKinYiu Also use batch_normalize=1 in the RFB-block

@AlexeyAB (Owner)

@WongKinYiu

Try to use max_delta=10 in [yolo] layers to avoid NaN.

  • Did you solve the NaN issue for ASFF and RFB?

  • I added a max_delta param for [yolo] and [Gaussian_yolo], so you can use max_delta=10 (preferably) or max_delta=100; it limits the delta to avoid NaN.

  • Also you can try scale_x_y=1.2 or higher; it may allow you to use a lower max_delta, but it can lead to lower AP (I don't know). A sketch of a [yolo] block with these params is shown below.
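A minimal sketch (the anchors and mask are just the usual yolov3 COCO values, used here only for illustration):

[yolo]
mask=6,7,8
anchors=10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=80
num=9
scale_x_y=1.2      # optional; may allow a lower max_delta
max_delta=10       # clip deltas to avoid NaN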

@WongKinYiu (Collaborator, Author)

@AlexeyAB Thanks a lot

NaN is not yet solved for the ASFF model (RFB without ASFF does not get NaN).
I am on a one-week vacation and will start training on 1/30.

@AlexeyAB (Owner)

@WongKinYiu

I added [convolutional] activation=normalize_channels_softmax_maxval 298805c

Try to train with
[convolutional] activation=normalize_channels_softmax_maxval
instead of
[convolutional] activation=normalize_channels_softmax
for ASFF in the 3 [conv] layers

@WongKinYiu (Collaborator, Author)

@AlexeyAB OK, thanks a lot.

@AlexeyAB (Owner)

@WongKinYiu

I just checked your cfg-files:

  • Download the latest Darknet

  • set batch_normalize=1 instead of batch_normalize=0 in RFB-block

  • add max_delta=10 for each [yolo] layer (the closer the value is to 0, the more stable the training)

  • Try both, either
    activation=normalize_channels_softmax_maxval or
    activation=normalize_channels_softmax (I don't know which is more stable)

@WongKinYiu (Collaborator, Author)

  • Download the latest Darknet

  • set batch_normalize=1 instead of batch_normalize=0 in RFB-block

  • add max_delta=10 for each [yolo] layer (the closer the value is to 0, the more stable the training)

  • activation=normalize_channels_softmax

Still not stable enough for ASFF and ASFF+RFB; now reducing the learning rate and retraining.

@AlexeyAB (Owner)

@WongKinYiu

Update your code at least to this fix: 2862b28

Still not stable enough for ASFF and ASFF+RFB;

At what number of iterations does it go to NaN?

Also try
activation=normalize_channels_softmax_maxval

@WongKinYiu (Collaborator, Author)

@AlexeyAB

I use https://github.com/AlexeyAB/darknet/tree/61499b27a4e24656a0f84bb83b92df95b0917f74 for training.

I have already cleaned up the results, but it got NaN at < 35k epochs with lr=0.001.
