
EfficientDet: Scalable and Efficient Object Detection - 51.0% mAP@0.5...0.95 COCO #4346

Open
AlexeyAB opened this issue Nov 21, 2019 · 164 comments
Labels
ToDo RoadMap

Comments

@AlexeyAB
Owner

EfficientDet: Scalable and Efficient Object Detection

First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion;



@AlexeyAB AlexeyAB added the ToDo RoadMap label Nov 21, 2019
@AlexeyAB AlexeyAB changed the title EfficientDet: Scalable and Efficient Object Detection EfficientDet: Scalable and Efficient Object Detection - 51.0% mAP@0.5...0.95 COCO Nov 21, 2019
@LukeAI

LukeAI commented Nov 21, 2019

Looks really promising! The GPU latencies given are very low but it uses efficientnet as the backbone - how could that be?

@isra60

isra60 commented Nov 22, 2019

So could this be implemented in this darknet repository? I'm a little confused.

@tianfengyijiu

How can I get the best mAP@50 so far? Can I use EfficientDet-D0 - D6? I used yolov3-voc.cfg to train my own dataset and got mAP@50 = 80 on my own test set. I just added three lines:
flip = 1
letter_box=1
mixup = 1
Thanks a lot! @AlexeyAB

@WongKinYiu
Collaborator

@AlexeyAB code released.

https://github.com/google/automl/tree/master/efficientdet

@glenn-jocher

@AlexeyAB @WongKinYiu guys I might have an interesting clue about increasing mAP.

EfficientDet has 5 output levels (P3-P7) compared to 3 (P3-P5) for yolov3, but the extra 2 are for larger objects, not smaller ones. In the past I've added a 4th layer to yolov3, with the same or slightly worse results, but this was for smaller objects.

On the same topic, I recently added test-time augmentation to my repo ultralytics/yolov3#931, which increased mAP from 42.4 to 44.7. I tested many different options, and settled on 2 winners: a left-right flip, and a 0.70 scale image. The highest mAP increase was coming from the larger objects. I think the 0.70 scale made these large objects smaller so they could fit in the P5 layer (whereas maybe before they would have needed to be in the P6 layer which doesn't exist).
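
For reference, a minimal sketch of that kind of test-time augmentation (original pass + left-right flip + a 0.70-scale pass, merged before NMS). The model call and the [x1, y1, x2, y2, conf, cls] output layout are assumptions for illustration, not the actual ultralytics code:

import torch
import torch.nn.functional as F

def tta_inference(model, img, scale=0.70):
    """Run the model on the original image, a left-right flip, and a down-scaled
    copy, then merge all candidate boxes before NMS. img is a (1,3,H,W) tensor and
    model is assumed to return (N,6) rows of [x1, y1, x2, y2, conf, cls]."""
    h, w = img.shape[2:]
    preds = []

    # 1) original image
    preds.append(model(img))

    # 2) left-right flip: flip the input, then un-mirror the x coordinates
    p = model(torch.flip(img, dims=[3]))
    p[:, [0, 2]] = w - p[:, [2, 0]]          # x1 = w - x2, x2 = w - x1
    preds.append(p)

    # 3) reduced scale: helps large objects fit the coarsest output layer
    img_s = F.interpolate(img, scale_factor=scale, mode='bilinear', align_corners=False)
    p = model(img_s)
    p[:, :4] /= scale                        # map boxes back to the original resolution
    preds.append(p)

    return torch.cat(preds, dim=0)           # feed this into NMS afterwards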

So my proposal is (since the darknet53-bifpn cfg did not help), to simply add P6 and P7 outputs to yolov3-spp.cfg and test this out (with the same anchors redistributed among the layers I suppose). What do you think?

@WongKinYiu
Collaborator

@glenn-jocher Hello,

Yes, from my previous analysis I think we need at least one more scale (P6).
I have already been training a P3-P6 model for several days.
But there is also an issue I ignored at first: the input size may need to be a multiple of 64 instead of 32.

I saw your modification yesterday, and I have already integrated YOLOv3-SPP into mmdetection.
Two different tricks are used in ultralytics and mmdetection:
In ultralytics: train from scratch + prebias.
In mmdetection: start from a pretrained model, but only the BatchNorm layers will be updated.
I would like to examine the performance of these two tricks.
Also, I will apply the tricks from ATSS to YOLOv3-SPP if it does not require modifying much code.

By the way, CSPDarkNet53 has also been integrated with TTFNet (anchor-free object detection), MSRCNN (instance segmentation), and JDE (simultaneous detection and tracking).

OK, I am training YOLOv3-SPP with almost the same settings as CSPResNeXt50-PANet-SPP (optimal) using ultralytics. I think it can be the baseline for your new model which integrates P3-P7.

@AlexeyAB
Owner Author

@glenn-jocher @WongKinYiu

Yes, it seems that https://arxiv.org/abs/1909.00700v3 and https://github.com/ZJULearning/ttfnet (Training-Time-Friendly Network for Real-Time Object Detection) are a good direction.

Meanwhile, I think we should move on to:

  • use triplet loss to train in ~10 seconds per object
  • use LSTM to achieve +10-30% AP for detection on video while training on static images

On the same topic, I recently added test-time augmentation to my repo ultralytics/yolov3#931, which increased mAP from 42.4 to 44.7. I tested many different options, and settled on 2 winners: a left-right flip, and a 0.70 scale image.

Yes, the same result as for CenterNet: they achieve 40.3% AP / 14 FPS, with Flip they achieve 42.2% AP / 7.8 FPS, and with Multi-scale they achieve 45.1% AP / 1.4 FPS. https://github.com/xingyizhou/CenterNet#object-detection-on-coco-validation
But then it is effectively no longer a one-stage (single-pass) detector and the model is not real-time anymore.
Meanwhile, I am thinking about flip-invariant/rotation-invariant/scale-invariant weights that significantly increase accuracy with only a small drop in FPS: #4495 (comment)

The highest mAP increase was coming from the larger objects. I think the 0.70 scale made these large objects smaller so they could fit in the P5 layer (whereas maybe before they would have needed to be in the P6 layer which doesn't exist).

Maybe yes.

  • Maybe we should just add P6.
  • But maybe we should also increase the network resolution and weights size for optimal AP/FPS, as stated in the EfficientNet/EfficientDet papers.

EfficientDet has 5 output levels (P3-P7) compared to 3 (P3-P5) for yolov3, but the extra 2 are for larger objects, not smaller ones. In the past I've added a 4th layer to yolov3, with the same or slightly worse results, but this was for smaller objects.

Yes, it is because they increased the input network resolution from 512x512 for D0 (where P3-P5 cover big objects) to 1536x1536 for D7 (where P3-P5 cover small objects), so we should add P6-P7 for big objects. The receptive field NxN of P5 doesn't depend on the network resolution and stays the same in pixels (see the yolo cfg files below), so that NxN is big relative to a 512x512 input but small relative to a 1536x1536 input.

So maybe we should:

  • increase network resolution width=896 height=896
  • add P6 yolo-layer with higher receptive field, so we have 4 [yolo]-layers
  • increase ~1.35x weights size (filters= for each conv-layer)
  • use 12 or 15 anchors for the 4 yolo-layers (a clustering sketch follows this list)
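
As a side note, a minimal sketch of how 12 anchors could be re-clustered for 4 output scales by IoU k-means over the ground-truth box sizes (the usual YOLO anchor recipe; the data loading is a placeholder, and this is not Darknet's calc_anchors code):

import numpy as np

def kmeans_anchors(wh, k=12, iters=300):
    """Cluster (width, height) pairs (in pixels at network resolution) into k anchors,
    using 1 - IoU as the distance, which is the standard YOLO anchor recipe."""
    def iou(wh, centers):                       # wh: (N,2), centers: (k,2)
        inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centers[None, :, 1])
        union = wh[:, None].prod(2) + centers[None, :].prod(2) - inter
        return inter / union                    # (N, k)

    centers = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = iou(wh, centers).argmax(1)     # nearest anchor by IoU
        new = np.array([wh[assign == i].mean(0) if (assign == i).any() else centers[i]
                        for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers[np.argsort(centers.prod(1))]  # sorted small -> large

# wh = ...  # (N,2) array of ground-truth box sizes at 896x896 -- placeholder
# anchors = kmeans_anchors(wh, k=12)  # 3 smallest -> P3, ..., 3 largest -> P6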

So my proposal is (since the darknet53-bifpn cfg did not help), to simply add P6 and P7 outputs to yolov3-spp.cfg and test this out (with the same anchors redistributed among the layers I suppose). What do you think?

Yes, try 4 ways:

  1. yolov3-spp + P6
  2. yolov3-spp + P6 + network resolution 896x896
  3. yolov3-bifpn + P6-P7
  4. yolov3-bifpn + P6-P7 + network resolution 896x896

I added receptive field calculation - usage: [net] show_receptive_field=1 in cfg-file:

  • yolov3-tiny.cfg
    1. yolo-layer 13x13: 318x318
    2. yolo-layer 26x26: 286x286

  • yolov3.cfg
    1. yolo-layer 13x13: 917x917
    2. yolo-layer 26x26: 949x949
    3. yolo-layer 52x52: 965x965

  • yolov3-spp.cfg
    1. yolo-layer 13x13: 1301x1301
    2. yolo-layer 26x26: 1333x1333
    3. yolo-layer 52x52: 1349x1349

While the input network size is just 608x608.

  • So for [net] width=608 height=608 the 1st yolo-layer 13x13: 1301x1301 is for big objects

  • But for [net] width=3200 height=3200 the 1st yolo-layer 13x13: 1301x1301 is for small objects

@glenn-jocher

glenn-jocher commented Mar 29, 2020

@WongKinYiu ah great! Lots of integrations going on. I have not looked at ATSS yet, I will check it out. TTFNet looks refreshingly simple.

Yes, that's a good chart you have there. How do you calculate the receptive field exactly? I saw EfficientDet updated their anchor ratios to (1.0, 1.0), (1.4, 0.7), (0.7, 1.4). I'm not sure exactly how these work. Do you think these create anchors based on multiplying grid cells or the receptive field?

Yes, a P6 layer requires image sizes that are a multiple of 64, and a P7 layer would require a multiple of 128, but it's not a huge problem.

@WongKinYiu
Collaborator

There is a 3x3 convolutional layer just before the prediction layer, so I simply multiply the grid size by 3.

@WongKinYiu
Collaborator

@AlexeyAB

From the JDE paper, they provide results for different embedding losses.
Unfortunately, they only released code for the cross-entropy loss.

There are also some issues to be solved. For example, it only supports single-class tracking, and different anchors at the same scale share the same embedded feature.

The code is mainly based on ultralytics, so I think it can be a starting point for developing a triplet-loss based tracker: https://github.com/Zhongdao/Towards-Realtime-MOT

@glenn-jocher

glenn-jocher commented Mar 31, 2020

@WongKinYiu so you simply take a 3x3 grid as the receptive field. Ok.

Do you think it might be beneficial to have the anchors be fixed in units of gridspace instead of image space? Maybe this is what EfficientDet is doing with their (1,1), (1.4, 0.7), (0.7, 1.4) anchor multiples (I don't know what they do with these multiples).

Right now the anchors are fixed/defined in imagespace (pixels) rather than grid space, so the same anchor would take up varying gridpoints depending on the output layer (if it was applied to different layers).

What do you think of the idea of defining the anchors as (1,1), (1.4, 0.7), (0.7, 1.4) local gridpoints, and then maybe testing out, say, 2x and 3x multiples of that?

In other news, I implemented my yolov3-spp-p6 and I'm training it now. I trimmed some of the convolutions to keep the size manageable; it's 81M params now and trains about 25% slower than normal. Early mAP was lower, but it seems to be crossing yolov3-spp and going higher at around 50 epochs. I'll keep my fingers crossed.


@WongKinYiu
Collaborator

WongKinYiu commented Mar 31, 2020

@glenn-jocher

From my previous analysis, I think {0.7, 1.4} comes from the IoU >= 0.5 requirement:
sqrt(0.5) ≈ 0.7 and sqrt(2) ≈ 1.4, and since sqrt(0.5)*sqrt(2) = 1, the ratios (0.7, 1.4), (1.4, 0.7), and (1, 1) all have almost the same area.
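
A quick numeric check of that reasoning with a tiny hypothetical helper (same-centered boxes of unit area):

def iou_same_center(a, b):
    """IoU of two boxes (w, h) that share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

print(iou_same_center((1.0, 1.0), (1.4, 0.7)))  # ~0.55, just above the 0.5 matching threshold
print(iou_same_center((1.4, 0.7), (0.7, 1.4)))  # ~0.33, so the three ratios cover distinct shapes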

@glenn-jocher

glenn-jocher commented Mar 31, 2020

@WongKinYiu ah yes that makes sense! Also (1.4, 0.7) IOU is about 0.55, close to 0.5. From your earlier plots though it looks like the current anchors correspond much better to about 3x3 gridpoints than 1x1 gridpoints.

At 512x512, the P3 grid is 64x64, P4 is 32x32, P5 is 16x16, and P6 is 8x8. If we had a P7 that would be 4x4, and 3 gridpoints at that scale would take up almost the entire image (which sounds about right). At the smaller scale though the P3 stride is 8, and we currently have anchors about that size (smallest is 10x13 pixels).

I'm worried my existing anchors are causing tension in the P6 model, as the GIoU loss is higher than normal. I simply spread out the 12 anchors I was using for yolov4.cfg (which has P3-P5, 4 at each level) to yolov4-p6 (which has P3-P6, 3 anchors at each level).

@glenn-jocher

glenn-jocher commented Apr 3, 2020

@AlexeyAB @WongKinYiu ok, my P6 experiment was tracking worse than yolov3-spp after about 150 epochs so I cancelled it. I'm not sure why exactly.

If I look at the yolov3-spp receptive field, at P5, stride 32, the largest anchor is (373,326), or 10 grids, which would be 3X the receptive field according to @WongKinYiu

P6 has stride 64, so only 1.5X receptive field for the largest anchor, yet overall mAP is worse. I did trim some convolution operations to keep the parameter count reasonable, so this could be the cause. Back to the drawing board I guess. @WongKinYiu how did your P6 experiment go?

@WongKinYiu
Collaborator

WongKinYiu commented Apr 4, 2020

@glenn-jocher

Currently at 140k iterations; it needs several weeks to finish training.

For yolov3-spp, the receptive field becomes very large because the SPP module is added
(a 13x13 max-pool kernel on a stride-32 feature map spans (32 * 13)x(32 * 13) = 416x416 input pixels).

@AlexeyAB
Owner Author

AlexeyAB commented Apr 4, 2020

@WongKinYiu @glenn-jocher

(13x13 max conv -> (32 * 13)x(32 * 13)) = 416x416 receptive field)

Also you should take into account that conv3x3 stride=1 increases receptive field too, not only conv3x3 stride=2.

You can see receptive field in the Darknet by using:

[net]
show_receptive_field=1
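
For reference, a minimal sketch of the usual receptive-field recursion (the general formula, not the actual Darknet implementation): each layer adds (kernel - 1) * jump, where jump is the product of the strides so far, so 3x3 stride=1 convs keep enlarging the field too.

def receptive_field(layers):
    """layers: list of (kernel, stride) tuples in forward order.
    Returns (receptive field in input pixels, total stride)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # stride-1 layers still enlarge the field
        jump *= s              # jump = input pixels per step at this depth
    return rf, jump

# e.g. a 3x3 stride-2 downsample followed by two 3x3 stride-1 convs
print(receptive_field([(3, 2), (3, 1), (3, 1)]))  # -> (11, 2)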

@WongKinYiu
Collaborator

WongKinYiu commented Apr 4, 2020

@glenn-jocher

Could you provide your cfg file and training command?
I will modify it and train it with ultralytics.
(I get an error in test.py if I add a P6 yolo layer to the cfg.)

By the way, do you train/val on coco2014 or coco2017?

@glenn-jocher

@WongKinYiu yes here is the p6 cfg with 12 anchors, and a modified version of yolov3-spp called yolov4 that has the same 12 anchors, which trains to slightly above yolov3-spp (+0.1mAP).

I had to add a lot of convolutions to p6, so it has 81M params. I doubled the width of the stem convolutions (which use few params), but reduced the width of the largest head convolutions (i.e. 1024 -> 640 channels). Overall the result was slightly negative though, so you may want to adjust the cfg.

python3 train.py --data coco2014.data --img-size 416 608 --epochs 300 --batch 16 --accum 4 --weights '' --device 0 --cfg yolov4-81M-p6.cfg --name p6 --multi

yolov4-81M-p6.cfg.txt

@AlexeyAB
Owner Author

AlexeyAB commented Apr 4, 2020

@glenn-jocher Try to train and test this model with network resolution 832x832 (with random shapes).
Also, why didn't you use an SPP block?

@glenn-jocher

@AlexeyAB yes maybe I should put the SPP block back in on the P6 layer, and return the dn53 stem convolutions to their original sizes.

When I changed dn53 I saw that there were 8, 8 and 4 blocks in the last 3 downsamples. For p6 I changed this to 8, 8, 8 and 8 (no spp). Maybe I should update to 8, 8, 8, 4+spp, which would more closely mimic yolov3-spp.

@WongKinYiu
Collaborator

@glenn-jocher

I started training yolov3-spp and yolov3-spp-p6.
The loss of yolov3-spp-p6 is very large at the 1st epoch compared to yolov3-spp.

@AlexeyAB
Owner Author

@WongKinYiu

do semi-supervised learning on yolov4-tiny with labeled dataset and pseudo-labeled dataset.

Is this just regular training with ./darknet detector train ... on the old+new labeled datasets, or is there something else?

@WongKinYiu
Collaborator

@AlexeyAB

Yes, currently I just use ./darknet detector train ....
The development of un/semi/weakly-supervised learning methods is still in progress.

@LukeAI

LukeAI commented May 16, 2020

@glenn-jocher

Something like that is integrated with CVAT
https://www.youtube.com/watch?v=U3MYDhESHo4&feature=youtu.be

It seems to only support tensorflow models and is oriented towards labelling video frames (interpolation); I haven't actually tried it myself yet. I want to do something along these lines soon but haven't quite decided what approach to take. I would prefer to script the interpolation myself so that I'm not restricted to tensorflow and have more flexibility over the process. I do like CVAT though, because it has a REST API, and the fact that it runs in a web browser makes it easy to manage centrally and distribute batch jobs amongst many people.

@LukeAI

LukeAI commented May 16, 2020

@glenn-jocher So we can use a cheap ~$0.8 preemptible GCP VM only for hyperparameter search for the first 10 epochs, but not for training.
While for training we should use a ~$2.0 regular GCP VM without --resume.

Have you ever tried direct GPU rig renting like on vast.ai ? https://vast.ai/console/create/
seems much cheaper than GCP for the on-demand / uninterruptible instances.

@AlexeyAB
Owner Author

@LukeAI

Something like that is integrated with CVAT
https://www.youtube.com/watch?v=U3MYDhESHo4&feature=youtu.be

Is it something like the o button in Yolo_mark, which tracks objects across a sequence of video frames during labeling by using optical-flow tracking: https://github.com/AlexeyAB/Yolo_mark

Have you ever tried direct GPU rig renting like on vast.ai ? https://vast.ai/console/create/

Is it an aggregator of different clouds: Vast, GCP, AWS, Paperspace, ...?
Only $0.6 per hour for a Tesla V100, compared with $2.6 on GCP and $3.0 on AWS.

@LukeAI

LukeAI commented May 16, 2020

It's individual people with big rigs of GPUs (probably mostly people who used to use them for mining cryptocurrencies, which is now not profitable on GPUs). The site vets them and requires certain standards for hardware, uptime, internet speed etc. and charges a commission for connecting them to buyers.

Not sure exactly how the CVAT interpolation mode works; it requires a tensorflow model for detection. I don't know if it also tries to leverage optical flow or some other tracking method.

Maybe the best automated video labeler would be some dedicated, integrated video object-detector / tracker? I know some video object detectors exist that try to leverage recent frames to refine predictions for the current frame, but maybe better results could be had with a CNN that also used future frames? So you manually label every 10th frame, train a model, and try to interpolate the rest?

@WongKinYiu
Collaborator

WongKinYiu commented May 16, 2020

@LukeAI @AlexeyAB

I also use CVAT.
CVAT's interpolation mode just uses linear interpolation between two keyframe annotations.
Their provided auto-labeling tools are for detection and segmentation and use tensorflow models.
It is easy enough to replace them yourself if you are familiar with opencv.
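
For illustration, a minimal sketch of that kind of keyframe interpolation (the [x1, y1, x2, y2] box format is an assumption; this is not CVAT's code):

def interpolate_box(box_a, box_b, frame_a, frame_b, frame):
    """Linearly interpolate a box between two keyframe annotations.
    Boxes are [x1, y1, x2, y2]; frame_a < frame < frame_b."""
    t = (frame - frame_a) / (frame_b - frame_a)
    return [a + t * (b - a) for a, b in zip(box_a, box_b)]

# annotate frames 0 and 10 by hand, fill in frames 1..9 automatically
print(interpolate_box([100, 100, 200, 200], [160, 100, 260, 220], 0, 10, 5))
# -> [130.0, 100.0, 230.0, 210.0]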

@LukeAI

LukeAI commented May 16, 2020

ok thanks, good to know.

@LukeAI

LukeAI commented May 16, 2020

* Pseudo-labeling and Reinforcement (hard-example-mining) labeling were added to Darknet more than a year ago

  * https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1125-L1132
  * https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1647-L1670

How can I use hard-example mining? Does it return a list of images with only low-confidence detections?

@AlexeyAB
Owner Author

AlexeyAB commented May 16, 2020

@LukeAI

How can I use hard-example mining? Does it return a list of images with only low-confidence detections?

Un-comment these lines: https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1125-L1132
Then it returns only images with fp+fn > 0 when you run ./darknet detector map ...

If someone needs it, I can add a flag to enable this feature without changing the source code.

@glenn-jocher

glenn-jocher commented May 16, 2020

Ah, so refining coco labels and then getting mAP improvement is not a publishable result you are saying, even though you start from the same dataset and don't add any images to it? I'd be a little surprised if that were the case, as we augment the images themselves currently, which is also modifying the original dataset.

In any case, if it provided repeatably improved results on custom datasets, it would be adding value to the product your company might provide in an automl type solution. I suppose the best implementation would be to allow this functionality with a simple argument during training, i.e. --refine_labels.

@AlexeyAB the label refinement you have creates new labels, but are these fed back into the training set during training, or is that a manual step one would do?

@glenn-jocher

@LukeAI about video labelling, I've never heard of cvat, but I've done a lot of object tracking, both visual and radar with kalman filters, and yes, if you can detect an object in one video frame, you can track it using a KLT tracker easily for a few more frames before drift becomes an issue.

So in this sense you could use yolov3/4 to detect an object in a video frame, then a KLT tracker to update your boxes in the remaining 29 frames, and repeat.
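
For illustration, a minimal sketch of that detect-then-track idea using OpenCV's pyramidal Lucas-Kanade optical flow (a sketch only; the box format and the simple averaging of point shifts are assumptions):

import cv2
import numpy as np

def propagate_box(prev_gray, next_gray, box):
    """Push one detection box from the previous frame to the next frame by
    tracking corner features inside it with pyramidal Lucas-Kanade optical flow.
    box is [x1, y1, x2, y2] in pixels; returns the shifted box."""
    x1, y1, x2, y2 = map(int, box)
    mask = np.zeros_like(prev_gray)
    mask[y1:y2, x1:x2] = 255                      # only track points inside the box
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50, qualityLevel=0.01,
                                  minDistance=5, mask=mask)
    if pts is None:
        return box                                # nothing to track, keep the old box
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1
    if not good.any():
        return box
    dx, dy = (new_pts[good] - pts[good]).reshape(-1, 2).mean(0)  # a median would be more robust
    return [x1 + dx, y1 + dy, x2 + dx, y2 + dy]

# usage: detect on frame 0 with yolov3/4, then call propagate_box() frame-to-frame
# on grayscale frames, and re-detect every ~30 frames before drift accumulates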

Google has a demo app that's supposed to do this called ODT, which is pretty bad in real life, and Swift also has track object requests (you supply the initial box):
https://developer.apple.com/documentation/vision/vntrackobjectrequest

I haven't used vast.ai, but yes the prices are much better!

@glenn-jocher

@AlexeyAB @WongKinYiu also guys, are you sure that using refined labels for training is not fair? The Noisy Student paper uses 300M unlabelled images (!), so they are going out and finding new images no one else has, which is a step beyond what I was proposing, which uses the same images and no more.

I would tend to view it as fair, because you are starting with the same dataset and labels as everyone else, and not adding new images. I suppose the rules of the game depend on the game you are playing...

@WongKinYiu
Collaborator

@glenn-jocher

For MS COCO, there is unlabeled data provided for developing semi-supervised learning approaches.
If you use that data, you should compare your method to those semi-supervised learning methods.

Usually, we split experiments into two kinds: with extra training data and without.
By the way, as I remember, Noisy Student used 2048 TPUs and trained for 3.5 days.

@glenn-jocher

glenn-jocher commented May 16, 2020

@WongKinYiu ah, so you are saying that refining the labels would count as 'extra training data' and would then be classed together with methods that use unlabelled data. In that case then yes, you are competing with a much wider range of possible competitors.

Perhaps an alternate method then, leaving the labels alone, would be to update the objectness loss to account for the unequal probability of human-labelled FP (false positive) and FN (false negative) mistakes. I'd suspect a human labeller causes many more FNs than FPs, so we may want to apply an a priori distribution to the obj loss to reflect this.

In practice this might be accomplished by reducing the FP losses, i.e. I think multiplying them by a gain of 0.9 for example would imply that human labelers are only labelling 90% of the objects correctly. Does this make sense? It might be a more universal way to account for human labeling errors without having to check the 'extra training data' box.

EDIT: I realized I've just described class label smoothing. The only differences are that I proposed it for objectness and I proposed a positive-negative imbalance. Unfortunately I tried smoothing objectness before with poor results, and came to the conclusion that it is best applied only to classification loss, if at all.
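
For the record, a minimal sketch of the asymmetric objectness weighting described above (the tensor layout is an assumption for illustration, not the ultralytics loss code):

import torch
import torch.nn as nn

def objectness_loss(obj_logits, obj_targets, fp_gain=0.9):
    """BCE objectness loss where the background term (target == 0) is scaled by
    fp_gain < 1, so a confident prediction on an unlabeled region is penalized a
    little less -- the prior being that human labelers miss some objects (FN > FP)."""
    bce = nn.BCEWithLogitsLoss(reduction='none')
    per_anchor = bce(obj_logits, obj_targets)
    weight = torch.where(obj_targets > 0,
                         torch.ones_like(per_anchor),           # positive anchors: full weight
                         torch.full_like(per_anchor, fp_gain))  # background: down-weighted
    return (weight * per_anchor).mean()

# obj_logits / obj_targets would be the flattened raw objectness outputs and 0/1
# targets over all anchors -- this layout is hypothetical, not the actual tensors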

@AlexeyAB
Owner Author

@glenn-jocher @WongKinYiu

are you sure that using refined labels for training is not fair?

No.

  1. We can't use additional images/labels: for a fair comparison we can't use additional non-MSCOCO pseudo-labeled datasets.

  2. But our algorithm/network can change images/labels by itself, as long as it does so without explicit a priori knowledge that only a person could provide.

We can try to re-label MSCOCO by using pseudo-labeling, but it seems that the problem is only with the person labels in MSCOCO.
If we try to refine the labels of MSCOCO, then the person labels will be improved, but the labels of all other 79 classes will be degraded.

Since our goal is to find or create the best model, rather than to win some challenge by using tricks, we can use this model for a real product, e.g. auto-labeling as you suggested.

@AlexeyAB the label refinement you have creates new labels, but are these fed back into the training set during training, or is that a manual step one would do?

This is a manual step currently.

@LukeAI

LukeAI commented May 17, 2020

How can I use hard-example mining? Does it return a list of images with only low-confidence detections?

Un-comment these lines: https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1125-L1132
Then it returns only images with fp+fn > 0 when you run ./darknet detector map ...

So you would run the detector like darknet detector map obj.data yolov4.cfg yolov4_weights.weights using a valid list of already labelled images and it would return the list of all images with false positives or false negatives?

And the purpose would be to understand blind spots in your detector so you know what areas of improvement are needed?

If there were a CLI flag for that, I would use it. I do something similar with a python script already, so it wouldn't be a priority for me, but my approach feels a bit hacky.

@AlexeyAB
Owner Author

So you would run the detector like darknet detector map obj.data yolov4.cfg yolov4_weights.weights using a valid list of already labelled images and it would return the list of all images with false positives or false negatives?

Yes, set valid=train.txt in the obj.data file, then run ./darknet detector map ..., then set train=reinforcement.txt and run training again.
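
Roughly the kind of Python script mentioned above could look like this (the detections-JSON and ground-truth-count layouts are hypothetical, not Darknet outputs):

import json

def hard_examples(detections_json, gt_counts, conf_thresh=0.25):
    """Return image paths that look like hard examples: the number of confident
    detections does not match the number of ground-truth boxes (a rough fp+fn proxy).
    detections_json maps image path -> list of {'conf': ...} dicts, and gt_counts maps
    image path -> number of labeled boxes; both layouts are assumptions."""
    with open(detections_json) as f:
        dets = json.load(f)
    hard = []
    for img, n_gt in gt_counts.items():
        n_conf = sum(1 for d in dets.get(img, []) if d['conf'] >= conf_thresh)
        if n_conf != n_gt:
            hard.append(img)
    return hard

# with open('reinforcement.txt', 'w') as f:
#     f.write('\n'.join(hard_examples('detections.json', gt_counts)))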

@glenn-jocher

glenn-jocher commented May 20, 2020

@WongKinYiu I've been studying cd53s-yo.cfg. The new bottleneck strategy used there seems to be a good improvement over darknet53. Compared to yolov3-spp, cd53s-yolov3 has more layers and convolutions, but more importantly a 20-25% reduction in parameters and FLOPS, 10% reduction in inference time, and roughly similar mAP (or potentially slightly better).

If I'm understanding correctly the primary difference is that each bottleneck has a 1.0 expansion factor between the first and second conv in cd53s instead of a 0.5 factor, but also that there are leading and trailing convolutions (into the series of bottlenecks and exiting it) that reduce the channel count by half going in, and then double it back to normal going out (via concat with an additional residual). So taken as a unit, one of these cd53s series of bottlenecks can be a drop-in replacement for the more traditional darknet 53 series of bottlenecks.

This is all very good! This naturally leads me to wonder though if the expansion back to the original channel count is still necessary. For example, the first series of bottlenecks is a series of 2 bottlenecks with 128ch going in, then 64x64 convolutions for two bottlenecks, then additional convs to bring the channel count back to 128, before it gets passed to the same 3x3 stride 2 convolution to downsample it.

So my main question is, if we experimented with simply using 64ch throughout this series, and doing the same with the other bottlenecks, do you think the performance would be reduced significantly?

My second question is did you experiment with trying to reinvest the FLOPS/parameter 'savings' back into the network? For example perhaps you could increase the depth or width of the cd53s backbone by adding a few convolutions or increasing their channel count to bring it back up to 60M parameters, and thus capture a better performance increase compared to yolov3-spp?

@glenn-jocher

glenn-jocher commented May 20, 2020

I've written a pytorch module that should be able to reproduce the bottleneck series in cd53s-yo.cfg:

class BottleneckSeriesCSP(nn.Module):
    def __init__(self, c1, c2, n=2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(BottleneckSeriesCSP, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 1, 1)
        self.cv4 = Conv(c2, c2, 1, 1)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])

    def forward(self, x):
        y1 = self.cv2(x)
        y2 = self.cv3(self.m(self.cv1(x)))
        return self.cv4(torch.cat((y1, y2), dim=1))

This uses instances of the Bottleneck() class, which is the normal bottleneck used in darknet53 that I'm using for my new repo.

class Bottleneck(nn.Module):
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

Which in turn uses the Conv() class, which is simply a Conv2d() + bn + leaky sequence. To create the first bottleneck series in cd53s-yo.cfg, for example, you would create an instance of the module as BottleneckSeriesCSP(c1=128, c2=128, n=2). I've scanned my notes here so you can see which conv is which. It's basically 4 convolutions plus a series of normal bottlenecks. Does that sound right?
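
For completeness, the Conv() block described above could look roughly like this (a sketch matching the "Conv2d + bn + leaky" description; the exact repo implementation may differ):

import torch.nn as nn

class Conv(nn.Module):
    # Conv2d + BatchNorm2d + LeakyReLU, as described above
    def __init__(self, c1, c2, k=1, s=1, g=1):  # ch_in, ch_out, kernel, stride, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, groups=g, bias=False)  # k//2 pads odd kernels
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))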


@WongKinYiu
Collaborator

WongKinYiu commented May 20, 2020

@glenn-jocher

yes, you are right.

Darknet stage:

            x = down_layer(x) # can be included in darknet_layer
            x = darknet_layer(x) # with bottleneck
            x = tran_layer(x) # can be included in darknet_layer

CSPDarknet stage

            x = down_layer(x)
            x1, x2 = x.chunk(2, dim=1)
            x2 = darknet_layer(x2) # without bottleneck
            x = torch.cat([x1,x2], 1)
            x = tran_layer(x)

My English is not good enough to understand long paragraphs in real time; I will give you feedback as soon as possible. #4346 (comment) #4346 (comment)

@WongKinYiu
Collaborator

WongKinYiu commented May 21, 2020

@glenn-jocher

640x640 5k-set:
YOLOv3 (ultralytics): 43.1% AP
YOLOv3 (same setting as below): 43.6% AP
CD53s-YOLOv3(leaky, ultralytics): 43.7% AP
CD53s-YOLOv4(leaky, ultralytics): 44.5% AP

so the comparison would be as follows:

python test.py --cfg yolov3-spp.cfg --weights best_yolov3-spp.pt --img 608 --iou 0.7

Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Speed: 11.8/2.3/14.1 ms inference/NMS/total per 608x608 image at batch-size 16
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
python test.py  --cfg cd53s-yo.cfg --weights best_cd53s-yo.pt --img 608 --iou 0.7

Model Summary: 273 layers, 4.901e+07 parameters, 4.901e+07 gradients
Speed: 10.6/2.4/13.0 ms inference/NMS/total per 608x608 image at batch-size 16
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.437
python test.py --cfg cd53s.cfg --weights best_cd53s.pt --img 608 --iou 0.7 

Model Summary: 315 layers, 6.43421e+07 parameters, 6.43421e+07 gradients
Speed: 11.8/2.2/14.0 ms inference/NMS/total per 608x608 image at batch-size 16
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.445

CD53s-YOLOv3 gets comparable AP to D53-YOLOv3, but it is lighter and faster.
CD53s-YOLOv4 gets comparable FPS to D53-YOLOv3, but it has higher AP.

@glenn-jocher

@WongKinYiu awesome! Ok I'm sold on the changes :). I will need to study the yolov4 head, it looks like it had quite a big impact on the results. Do you have a simple block diagram of what the new head does?

I'm training some new models with BottleneckSeriesCSP() modules. Do you use these new bottlenecks in the backbone only, or do you also use them in the head?

@glenn-jocher

@WongKinYiu about these new results, do you want to submit a PR for https://github.com/ultralytics/yolov3/blob/master/README.md? Else I could add these new results myself if you send me your training command. I will also upload the corresponding *.cfg files so they are all available.

@WongKinYiu
Collaborator

@glenn-jocher Hello,

I use
python train.py --img 448 768 512 --weights '' --cfg xxx.cfg --data coco2014.data --name xxx
for training.

Here are all of the cfg/weights files; you can add them to your repo.

Currently I only use BottleneckSeriesCSP in the backbone; I will design a new head soon.

I am on a business trip; I will give you feedback about #4346 (comment) #4346 (comment) #4346 (comment) soon.

@WongKinYiu
Collaborator

@glenn-jocher

I found there is a bug when testing with 608x608:
https://github.com/ultralytics/yolov3/blob/master/utils/datasets.py#L321
It forces 640x640 for testing, so the 608x608 results above were actually run at 640x640.

@glenn-jocher

glenn-jocher commented May 23, 2020

@WongKinYiu ah yes, this is an interesting point. test.py with rectangular inference will round to the nearest 64 size now, so you are correct that if you pass --img 608 it will actually run inference at 640. Do you get the same test results at --img 640 and --img 608?

EDIT: The purpose here was to prepare testing for a P6 model, which is now never used. A more robust approach would be to save the model strides as an attribute and then set the rounding to the max stride.

EDIT2: In the new repo I have code which handles this for train.py, but not in test.py:

    # Image sizes
    gs = int(max(model.stride))  # grid size (max stride)
    if any(x % gs != 0 for x in opt.img_size):
        print('WARNING: --img-size %g,%g must be multiple of %s max stride %g' % (*opt.img_size, opt.cfg, gs))
    imgsz, imgsz_test = [make_divisible(x, gs) for x in opt.img_size]  # image sizes (train, test)

EDIT3: I suppose a similar error check should be run on all 3 main files (train, test, detect.py)
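
A minimal sketch of that check as a standalone helper (treating make_divisible as round-up-to-multiple, which matches the 608 -> 640 behaviour described above; the exact repo helper is an assumption):

import math

def make_divisible(x, divisor):
    # round x UP to the nearest multiple of divisor (so 608 with stride 64 -> 640)
    return math.ceil(x / divisor) * divisor

def check_img_size(img_size, max_stride=32):
    """Warn and adjust if img_size is not a multiple of the model's max stride,
    the same guard applied in train.py above (a sketch, not the repo function)."""
    new_size = make_divisible(img_size, max_stride)
    if new_size != img_size:
        print('WARNING: --img-size %g adjusted to %g (multiple of max stride %g)'
              % (img_size, new_size, max_stride))
    return new_size

print(check_img_size(608, 64))  # -> 640, which explains the 608 vs 640 confusion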

@WongKinYiu
Collaborator

Hmm... test results at --img 640 and --img 608 are different; I will check the code.

@WongKinYiu
Collaborator

WongKinYiu commented May 28, 2020

@glenn-jocher

5k-set:
YOLOv3 (ultralytics): 43.1% AP
YOLOv3 (same setting as below): 43.6% AP
CD53s-YOLOv3(leaky, ultralytics): 43.7% AP
CD53s-YOLOv3(mish, ultralytics): 44.3% AP
CD53s-YOLOv4(leaky, ultralytics): 44.5% AP
CD53s-YOLOv4(mish, ultralytics): 45.0% AP (~YOLOv4)

python test.py --cfg yolov3-spp.cfg --weights best_yolov3-spp.pt --img 608 --iou 0.7
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
python test.py  --cfg cd53s-yo.cfg --weights best_cd53s-yo.pt --img 608 --iou 0.7
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.437
python test.py  --cfg cd53s-yo-csptb.cfg --weights best_cd53s-yo-csptb.pt --img 608 --iou 0.7
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.439
python test.py  --cfg cd53s-yo-mish.cfg --weights best_cd53s-yo-mish.pt --img 608 --iou 0.7
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.443
python test.py --cfg cd53s.cfg --weights best_cd53s.pt --img 608 --iou 0.7 
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.445
python test.py --cfg cd53s-mish.cfg --weights best_cd53s-mish.pt --img 608 --iou 0.7 
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.450
python test.py --cfg cd53s-cspt.cfg --weights best_cd53s-cspt.pt --img 608 --iou 0.7 
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.451
python test.py --cfg cd53s-csptb.cfg --weights best_cd53s-csptb.pt --img 608 --iou 0.7 
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.450

@WongKinYiu
Collaborator

WongKinYiu commented Jun 1, 2020

@glenn-jocher

Do you get a better result after replacing wh = torch.exp(p[:, 2:4]) * anchor_wh with y[..., 2:4] = (y[..., 2:4].sigmoid() * 2) ** 2 * self.anchor_grid[i]?

And I think your multi-scale training has a bug: images will be resized to 640x640 no matter what the input size is (or maybe it works together with the new scale hyper-parameter?).

            if True:
                imgsz = random.randrange(640, 640 + gs) // gs * gs
                sf = imgsz / max(imgs.shape[2:])  # scale factor
                if sf != 1:
                    ns = [math.ceil(x * sf / gs) * gs for x in imgs.shape[2:]]  # new shape (stretched to gs-multiple)
                    imgs = F.interpolate(imgs, size=ns, mode='bilinear', align_corners=False)

And test.py --img-size 736 actually uses 768 due to the bug in dataset.py.
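
For context, a small standalone comparison of the two width/height decodings being discussed (p_wh stands for the raw network outputs; the names and example values are illustrative):

import torch

def decode_wh_exp(p_wh, anchor_wh):
    # original YOLOv3-style decoding: unbounded, a large raw output explodes the box
    return torch.exp(p_wh) * anchor_wh

def decode_wh_sigmoid(p_wh, anchor_wh):
    # sigmoid-based decoding: bounded to (0, 4) * anchor, so gradients stay finite
    return (torch.sigmoid(p_wh) * 2) ** 2 * anchor_wh

p_wh = torch.tensor([0.0, 2.0, 8.0])      # raw network outputs
anchor = torch.tensor([1.0, 1.0, 1.0])
print(decode_wh_exp(p_wh, anchor))        # ≈ [1.0, 7.39, 2981.0]  -- unbounded
print(decode_wh_sigmoid(p_wh, anchor))    # ≈ [1.0, 3.10, 4.00]    -- capped at 4x the anchor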
