
Matrix Nets: A New Deep Architecture for Object Detection - mAP of 47.8@0.5...0.95 on MS COCO, #3772

Open
WongKinYiu opened this issue Aug 17, 2019 · 84 comments
Labels: ToDo RoadMap, want enhancement (Want to improve accuracy, speed or functionality)

Comments

@WongKinYiu (Collaborator)

Could this repo support a max-pooling layer with different x and y strides?
I would like to implement the state-of-the-art object detector.
Thanks.


@AlexeyAB added the want enhancement (Want to improve accuracy, speed or functionality) label on Aug 17, 2019
@AlexeyAB (Owner) commented Aug 17, 2019

https://arxiv.org/abs/1908.04646v2


xNets can be applied to any backbone, similar to FPNs.
...
We detect corners for objects of different sizes and aspect ratios using different matrix layers, and simplify the matching process by removing the embedding layer entirely and regressing the object centers directly.
We show that KP-xNet outperforms all existing single-shot detectors by achieving 47.8% mAP on the MS COCO benchmark.



xNets map objects with different sizes and aspect ratios into layers where the sizes and the aspect ratios of the objects within their layers are nearly uniform. Hence, xNets provide a scale and aspect ratio aware architecture. We leverage xNets to enhance key-points based object detection. Our architecture achieves mAP of 47.8 on MS COCO, which is higher than any other single-shot detector while using half the number of parameters and training 3x faster than the next best architecture.

@AlexeyAB changed the title from "Matrix Nets: A New Deep Architecture for Object Detection" to "Matrix Nets: A New Deep Architecture for Object Detection - mAP of 47.8@0.5...0.95 on MS COCO" on Aug 17, 2019
@AlexeyAB (Owner) commented Aug 17, 2019

@WongKinYiu

Could this repo support a max-pooling layer with different x and y strides?

Is this the only necessary feature for the implementation of the Matrix Net?
So we should have:

stride_x=
stride_y=

Or do you need something else?

@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello,
Yes, but it would be better if it could also support convolutional and average-pooling layers with stride_x and stride_y.
The paper seems to use convolutional layers for down-sampling.

@AlexeyAB AlexeyAB added the ToDo RoadMap label Aug 17, 2019
@AlexeyAB (Owner) commented Aug 27, 2019

@WongKinYiu Hi,

I added support for

[maxpool]
stride_x=2
stride_y=3
...

Try to build a network with it, and if it works well and improves accuracy, I will add stride_x & stride_y for the convolutional layer.
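For example, a minimal (untested) fragment where two asymmetric maxpools branch from the same base feature map to form two off-diagonal matrix layers - the layer offsets here are assumptions:

[maxpool]          # halve the width only -> a layer for wider objects
size=2
stride_x=2
stride_y=1

[route]            # route back to the base feature map two layers up
layers=-2

[maxpool]          # halve the height only -> a layer for taller objects
size=2
stride_x=1
stride_y=2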

@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello, thank you very very much.

@AlexeyAB (Owner)

It seems that MatrixNet (different strides) and TridentNet (different dilations) are very promising approaches for generalizing across different object sizes and aspect ratios.

@AlexeyAB (Owner) commented Sep 1, 2019

@WongKinYiu Hi,

Any progress?

I added stride_x=... stride_y=... for the convolutional layer. So you can try to make a MatrixNet.

[convolutional]
stride_x=2
stride_y=3
...

https://arxiv.org/pdf/1908.04646v2.pdf

2.1. Layer Generation
....
The upper triangular layers are obtained by applying a series of shared 3x3 convolutions with stride 1x2 on the diagonal layers. Similarly, the bottom left layers are obtained using shared 3x3 convolutions with stride 2x1. The parameters are shared across all down-sampling convolutions to minimize the number of new parameters.

@WongKinYiu
Do you understand:

  • Does it mean that some layers share their weights (as in TridentNet)?
  • Which layers share their weights with which others? (Only layers with the same aspect ratio, e.g. do all conv layers with w=4x, h=1x share the same weights?)


@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello, for the version with different-stride max-pooling layers, I am now training for 200k epochs.
It needs 7~9 more days to finish training.

@LukeAI commented Sep 1, 2019

@WongKinYiu Are you training a MatrixNet with the original resnext-101 backbone, or something else?

@AlexeyAB (Owner) commented Sep 1, 2019

@WongKinYiu Thanks, do you know about "shared 3x3 convolutions"? #3772 (comment)

@WongKinYiu (Collaborator, Author)

@AlexeyAB Yes, I think it is similar to TridentNet.
In my understanding, only layers with the same aspect ratio share the same weights.
But the max-pooling version has no parameters.

@WongKinYiu (Collaborator, Author)

@LukeAI Hello.
I do not use big models like resnet-50 or resnext-101...
I only train very small models.

@LukeAI commented Sep 2, 2019

which one? :) darknet-53 ?

@AlexeyAB (Owner) commented Sep 2, 2019

@WongKinYiu Thanks.

You can try to train 3 models:

  1. with [maxpool] stride_x=... stride_y=...

  2. with [convolutional] stride_x=... stride_y=... and share_index=... in some layers, to reuse the weights of the conv layer with that index (as in TridentNet)

  3. then with [convolutional] stride_x=... stride_y=... antialiasing=1 and share_index=... in some layers

And compare results.

I added antialiasing=1 parameter for [convolutional] layer: #3672
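A rough, untested fragment for items 2 and 3 (the filter count and the share_index target 85 are assumed values, not from the paper):

[convolutional]    # first width-down-sampling conv (say, layer 85)
batch_normalize=1
filters=256
size=3
stride_x=2
stride_y=1
antialiasing=1     # item 3 only; blur before down-sampling (#3672)
activation=leaky

...

[convolutional]    # a later width-down-sampling conv reusing layer 85's weights
batch_normalize=1
filters=256
size=3
stride_x=2
stride_y=1
antialiasing=1
share_index=85     # assumed index of the conv layer above
activation=leaky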

@WongKinYiu (Collaborator, Author)

@AlexeyAB Thank you for your advice. I will have free GPUs after two weeks.

@WongKinYiu (Collaborator, Author) commented Sep 7, 2019

@AlexeyAB

maxpool version:

Model        +anchors   BFLOPs   mAP@0.5   mAP@0.5:0.95
original     -          5.32     48.7      25.7
matrix net   no         5.27     49.1      25.4
matrix net   yes        5.32     48.6      26.3

@AlexeyAB (Owner) commented Sep 7, 2019

@WongKinYiu Thanks for the results!

  • What does anchors = no mean? Did you implement [yolo] layers without anchors?
  • Can you share the cfg file?

@WongKinYiu (Collaborator, Author) commented Sep 7, 2019

@AlexeyAB

It means without increasing the number of anchors.

I cannot access the cfg file right now, because the power supply in my office will be cut off tomorrow, so I have already turned off my computer.

@AlexeyAB (Owner) commented Sep 7, 2019

@WongKinYiu Does original mean the original MatrixNet from https://arxiv.org/abs/1908.04646v2 ?

@WongKinYiu (Collaborator, Author) commented Sep 7, 2019

@AlexeyAB

No, original means a yolov3-based model without the feature proposed by MatrixNet.

You can see the figure in #3772 (comment):
original means a model with three yolo layers (3+3+3 = 9 anchors),
while the model on the right-hand side has nine yolo layers with different aspect ratios (MatrixNet), each yolo layer predicting boxes for one anchor (1+1+1+1+1+1+1+1+1 = 9 anchors).
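For illustration, one such single-anchor [yolo] head might look like this (using the standard yolov3 COCO anchors only as an example; mask selects the single anchor this layer predicts):

[yolo]
mask=4             # this head predicts only anchor #4 of the 9
anchors=10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=80
num=9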

@WongKinYiu (Collaborator, Author) commented Sep 8, 2019

cfg files based on yolov3-tiny

yolov3-tiny_3l(15.778BFLOPs).cfg.txt
yolov3-tiny_3l_maxpoolmatrixnet(15.565BFLOPs).cfg.txt
yolov3-tiny_3l_maxpoolmatrixnet_addanchors(15.787BFLOPs).cfg.txt

For the conv version, just replace the maxpool layers with conv layers with shared weights.

@AlexeyAB (Owner) commented Sep 8, 2019

@WongKinYiu Thanks. What mAP@0.5 did you get for yolov3-tiny_3l_maxpoolmatrixnet_addanchors(15.787BFLOPs).cfg.txt ?

@AlexeyAB (Owner) commented Sep 8, 2019

@WongKinYiu

Did you come across an original MatrixNet backbone anywhere (not just yolov3/tiny with non-uniform strides)? If yes, can you share it?

@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello,

I have not come across an original MatrixNet backbone.

In the paper, they only show the concept of MatrixNet and apply it to CornerNet, without giving details.
But the concept can be used with YOLO.

@glenn-jocher

@WongKinYiu yes, I think you are right. The darknet training set (2014) has 117k images and tests on 5k images. The 2017 set, I believe, trains on 120k images and tests on 20k. So testing will take 4X longer on 2017, but I believe most of the training data is the same.

@glenn-jocher commented Nov 28, 2019

I've updated my mAP section, so the jump between YOLOv3 and YOLOv3-SPP is clearer. We are not at the level of ASFF yet, but it's encouraging to see that we are not too far away either.

https://github.com/ultralytics/yolov3#map

Model                    img-size   COCO mAP@0.5...0.95   COCO mAP@0.5
YOLOv3-tiny              320        14.0                  29.0
YOLOv3                   320        28.7                  51.5
YOLOv3-SPP               320        30.5                  52.3
YOLOv3-SPP ultralytics   320        35.2                  53.9
YOLOv3-tiny              416        16.0                  32.9
YOLOv3                   416        31.1                  55.3
YOLOv3-SPP               416        33.9                  56.8
YOLOv3-SPP ultralytics   416        38.8                  58.7
YOLOv3-tiny              608        16.6                  35.5
YOLOv3                   608        33.0                  57.9
YOLOv3-SPP               608        37.0                  60.6
YOLOv3-SPP ultralytics   608        40.4                  60.1

The main takeaway from the ASFF paper is Figure 1. It shows that while some of the new work coming out claims very high mAPs, it does so at the expense of inference time; in this sense I like the ASFF approach of adding to YOLOv3 to obtain results with minimal hits to FPS.

@WongKinYiu (Collaborator, Author)

@glenn-jocher yes,

There are too many tricks which are hard to implement in darknet...
Maybe I will test new methods in PyTorch and then feed them back to @AlexeyAB in the future.

@AlexeyAB (Owner) commented Nov 28, 2019

@WongKinYiu

There are too many tricks which are hard to implement in darknet...

What features do you mean?

I added several fixes, so ASFF and BiFPN (from EfficientDet) can be implemented there - I hope there are no bugs:

Instead of softmax, we use the fast normalized fusion from EfficientDet, which is more efficient than softmax (section 3.3, page 4): https://arxiv.org/pdf/1911.09070v1.pdf

Our ablation study shows this fast fusion approach has very similar learning behavior and accuracy as the softmax-based fusion, but runs up to 30% faster on GPUs (Table 5).
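In the paper's notation, the fast normalized fusion computes O = sum_i ( w_i / (eps + sum_j w_j) ) * I_i, where each weight w_i is kept non-negative by a ReLU - as I understand it, this is what activation=normalize_channels implements per channel, while activation=normalize_channels_softmax is the softmax-based variant.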

ASFF-like:

[route]
layers=22,33,44 # 3-layers which are already resized to the same WxHxC

[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels # ReLU is integrated to activation=normalize_channels

[route]
layers=-1
group_id=0
groups=3

[scale_channels]
from=22
scale_wh=1

[route]
layers=-3
group_id=1
groups=3

[scale_channels]
from=33
scale_wh=1


[route]
layers=-5
group_id=2
groups=3

[scale_channels]
from=44
scale_wh=1

[shortcut]
from=-3
activation=linear

[shortcut]
from=-6
activation=linear
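(To read this cfg: the 1x1 conv outputs 3 fusion weights per pixel and normalizes them across channels; each [route] with groups=3 / group_id=i selects one of the three weight maps; [scale_channels] scale_wh=1 multiplies the corresponding input feature map by its weight map; and the two [shortcut] layers sum the three weighted maps into the fused output.)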


@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello,

For example, AutoAugment, Fast AutoAugment, autograd, deformable conv, and multi-task training...

In my project, I need to do detection, segmentation, and tracking simultaneously.
link1, link2, link3
But modifying the data loader and building a multi-task detector-segmentor-tracker is difficult for me.

By the way, I will create an ASFF-like cfg in the coming days.

@AlexeyAB (Owner)

@WongKinYiu

Which of these tasks do you consider the most promising?

Yes, instance segmentation - YOLACT is still in the ToDo list: https://github.com/AlexeyAB/darknet/projects/6 Likewise a tracker: #3042

I will think about whether I can make a fast "Joint Detection and Embedding for fast multi-object tracking".

For now, the simplest way is just to add movement to the data augmentation, so we will be able to train conv-LSTM models on non-sequential datasets like MS COCO; it will increase the accuracy of detection on video, but will still require an additional tracker for track_ids.

The highest priorities were: #4264 #4346 #4382 #4203 #3830

Do you think Deformable Conv is better than Deformable Kernels? #4066

@WongKinYiu (Collaborator, Author)

@AlexeyAB

For me, "Joint Detection and Embedding for fast multi-object tracking".
For improving YOLOv3, ASFF and tricks used in the ASFF paper.

Oh, I treat deformable kernel as deformable conv v3.

@AlexeyAB (Owner)

@WongKinYiu

Did you try to compare CSPNet with ResNet101-deformable-conv-TridentNet (FPS / accuracy)? https://arxiv.org/abs/1901.01892v1 issue #3363

How slow are deformable convolutions?

They add only ~+1% AP, but maybe they are very slow?

@WongKinYiu (Collaborator, Author)

@AlexeyAB

I have not tried it.
If darknet supported deformable conv, I could train the model.

At ICCV, the authors of TridentNet showed a fast version of TridentNet.
They train on multiple branches and test on a single branch.
https://github.com/facebookresearch/detectron2/tree/master/projects/TridentNet

@AlexeyAB (Owner)

@WongKinYiu

They don't use deformable conv in the scale-aware Resnet101-TridentNet-Fast.

Resnet152-TridentNet is already implemented in Darknet: https://github.com/AlexeyAB/darknet/blob/master/cfg/resnet152_trident.cfg

But we should implement Resnet101-TridentNet rather than Resnet152: #3363

To implement the scale-aware Resnet101-TridentNet-Fast, we should train on multiple branches and test on a single branch, as you describe.

They use ResNet-101 (40.6% AP) rather than ResNet-101-Deformable (41.8% AP), because deformable conv is too slow for the Fast network.


@WongKinYiu (Collaborator, Author)

@AlexeyAB Hello,

They released their long paper and code.


@AlexeyAB (Owner) commented Jan 15, 2020

@WongKinYiu Thanks!

MatrixNet achieves only 3-4 FPS, while CSPResNeXt-50-PANet-SPP-optimal achieves 30-40 FPS with ~ the same AP.

I added something like CenterNet, which uses 3 x [Gaussian_yolo] layers (left-top, right-bottom, center) for each final resolution: #3229 (comment)
You can try to use it with the MatrixNet structure.

But I'm not sure that there will be a good result.

@AlexeyAB (Owner)

@WongKinYiu Also, I fixed a bug with Tensor Cores, and it seems any groups value larger than groups=1 slows down inference speed, so CSPDarknet53 and CSPDarkNet53-PANet-SPP should be the best models even for 608x608 if Tensor Cores are used.

@WongKinYiu (Collaborator, Author)

@AlexeyAB Thanks a lot,

I am doing an ablation study of CSPResNeXt50-PANet-SPP-Optimal, so there are no large-RAM GPUs available at the moment.
I will test new methods using small models on a small dataset.

Yes, CSPDarknet53-PANet-SPP gets the best AP and FPS currently; I am developing new methods based on CSPDarknet53.

I find that bad detections usually appear on overlapped objects in my project.
If your focused task also has this kind of problem, do you think the Soft-IoU layer (cvpr19-code) could be combined with YOLOv3, since you already introduced the concept of CenterNet into YOLOv3?

It seems ASFF and Bi-FPN very easily get NaN even when combined with the original Darknet53.
I am trying to solve the problem.

@AlexeyAB (Owner)

@WongKinYiu

I find that bad detections usually appear on overlapped objects in my project.
If your focused task also has this kind of problem, do you think the Soft-IoU layer (cvpr19-code) could be combined with YOLOv3, since you already introduced the concept of CenterNet into YOLOv3?

I will think about it. I created an issue: #4701

Also, what do you think about Repulsion Loss: Detecting Pedestrians in a Crowd (CVPR 2018) - without NMS at all? https://arxiv.org/abs/1711.07752v2 and #3113

@WongKinYiu (Collaborator, Author)

@AlexeyAB

  • Yes, it gets NaN after 140k epochs, when the learning rate is 0.0005.

  • I do not train with DropBlock, but all of ASFF, RFB, and ASFF+RFB easily get NaN when using CIoU or GIoU loss.
    MSE loss seems more stable for ASFF and RFB.

  • Yes, training with DropBlock is still slower than without DropBlock.

  • Yes, it gets NaN when all of the other hyper-parameters are set the same as for csresnext50-panet-spp-optimal.
    Currently I am reducing the learning rate to 0.001 and retraining.

  • Thanks.

  • Yes, having no NMS at all would be the best solution, since NMS does not consider any semantic information.

@Kyuuki93

@WongKinYiu What normalizer did you use in the ASFF module? In my experiments, ASFF got NaN when using ReLU to normalize the weights, and changing to softmax resolved it; the reason could be something like the dead-neuron problem that happens with ReLU.

@AlexeyAB (Owner)

@Kyuuki93
@WongKinYiu Yes, you should use activation=normalize_channels_softmax instead of activation=normalize_channels in the conv layer before [scale_channels] scale_wh=1 in ASFF:

[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels_softmax
#activation=normalize_channels            # don't use

Yes, training with DropBlock is still slower than without DropBlock.

Is it about ~2x slower?

@AlexeyAB (Owner)

@WongKinYiu Also use batch_normalize=1 in the RFB-block

@AlexeyAB (Owner)

@WongKinYiu

Try to use max_delta=10 in [yolo] layers to avoid NaN.

  • Did you solve the NaN issue for ASFF and RFB?

  • I added a max_delta param for [yolo] and [Gaussian_yolo], so you can use max_delta=10 (preferably) or max_delta=100; it limits the delta to avoid NaN.

  • Also you can try scale_x_y=1.2 or higher; it may allow you to use a lower max_delta, but it can lead to lower AP (I don't know). A sketch of a [yolo] block with these params is shown below.
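A minimal sketch (the anchors and mask are just the usual yolov3 COCO values, used here only for illustration):

[yolo]
mask=6,7,8
anchors=10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=80
num=9
scale_x_y=1.2      # optional; may allow a lower max_delta
max_delta=10       # clip deltas to avoid NaN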

@WongKinYiu (Collaborator, Author)

@AlexeyAB Thanks a lot

NaN is not yet solved for the ASFF model (RFB without ASFF does not get NaN).
I am on a one-week vacation and will start training on 1/30.

@AlexeyAB (Owner)

@WongKinYiu

I added [convolutional] activation=normalize_channels_softmax_maxval 298805c

Try to train with
[convolutional] activation=normalize_channels_softmax_maxval
instead of
[convolutional] activation=normalize_channels_softmax
for ASFF in the 3 [conv] layers

@WongKinYiu (Collaborator, Author)

@AlexeyAB OK, thanks a lot.

@AlexeyAB (Owner)

@WongKinYiu

I just checked your cfg-files:

  • Download the latest Darknet

  • set batch_normalize=1 instead of batch_normalize=0 in RFB-block

  • add max_delta=10 for each [yolo] layer (the closer the value is to 0, the more stable the training)

  • Try both, either
    activation=normalize_channels_softmax_maxval or
    activation=normalize_channels_softmax (I don't know which is more stable)

@WongKinYiu (Collaborator, Author)

  • Download the latest Darknet

  • set batch_normalize=1 instead of batch_normalize=0 in RFB-block

  • add max_delta=10 for each [yolo] layer (the closer the value is to 0, the more stable the training)

  • activation=normalize_channels_softmax

Still not stable enough for ASFF and ASFF+RFB; now reducing the learning rate and retraining.

@AlexeyAB (Owner)

@WongKinYiu

Update your code at least to this fix: 2862b28

Still not stable enough for ASFF and ASFF+RFB;

At what number of iterations does it go to NaN?

Also try
activation=normalize_channels_softmax_maxval

@WongKinYiu (Collaborator, Author)

@AlexeyAB

I use https://github.com/AlexeyAB/darknet/tree/61499b27a4e24656a0f84bb83b92df95b0917f74 for training.

I have already cleaned up the results, but it got NaN at < 35k epochs with lr=0.001.
