Want to figure out critical algorithm of Detect layer #471

Closed
TaoXieSZ opened this issue Jul 21, 2020 · 49 comments
Assignees
Labels
question Further information is requested Stale Stale and schedule for closing soon

Comments

@TaoXieSZ

❔Question

Hi,
I want to understand the intuition behind bbox detection.
In yolov3, the output can be written as:
image
image

So, for yolov5,
I looked into the source code:

yolov5/models/yolo.py

Lines 21 to 38 in 1e95337

def forward(self, x):
    # x = x.copy()  # for profiling
    z = []  # inference output
    self.training |= self.export
    for i in range(self.nl):
        bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
        x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

        if not self.training:  # inference
            if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

            y = x[i].sigmoid()
            y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy
            y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
            z.append(y.view(bs, -1, self.no))

    return x if self.training else (torch.cat(z, 1), x)

And tried to formalize it:
image
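
The posted image is not preserved in this copy of the thread. Reading only the inference branch of the forward() code quoted above, the decoding can plausibly be written as below. This is a reconstruction from the code, not the original figure, and the symbol names are assumptions borrowed from the usual YOLOv3 notation: t are the raw outputs, c_x and c_y the grid cell indices from self.grid, s the layer stride, and p_w, p_h the anchor size stored in self.anchor_grid.

```latex
\begin{aligned}
b_x &= \bigl(2\,\sigma(t_x) - 0.5 + c_x\bigr)\, s \\
b_y &= \bigl(2\,\sigma(t_y) - 0.5 + c_y\bigr)\, s \\
b_w &= p_w \,\bigl(2\,\sigma(t_w)\bigr)^{2} \\
b_h &= p_h \,\bigl(2\,\sigma(t_h)\bigr)^{2}
\end{aligned}
```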

Am I right?

@TaoXieSZ TaoXieSZ added the question Further information is requested label Jul 21, 2020
@glenn-jocher
Member

@ChristopherSTAN yes this looks correct! Typically this would be written as 2sigma() rather than sigma() x 2 though.

@TaoXieSZ
Author

@glenn-jocher Awesome!
How did you come up with this way of getting the prediction? It is so brilliant.
image

@glenn-jocher
Member

glenn-jocher commented Jul 21, 2020

The original yolo/darknet box equations have a serious flaw. Width and Height are completely unbounded as they are simply out=exp(in), which is dangerous, as it can lead to runaway gradients, instabilities, NaN losses and ultimately a complete loss of training. yolov3 suffers from this problem as well as yolov4.

For yolov5 I made sure to patch this error by sigmoiding all model outputs, while also ensuring that the centerpoint remained unchanged 1=fcn(0), so nominal zero outputs from the model would cause the nominal anchor size to be used. The current eqn constrains anchor multiples from a minimum of 0 to a maximum of 4, and the anchor-target matching has also been updated to be width-height multiple based, with a nominal upper threshold hyperparameter of 4.0.
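
To make the contrast concrete, here is a minimal scalar sketch comparing the two width/height mappings described above. It uses plain Python instead of torch so it stays self-contained, and the function names are hypothetical, chosen only for this illustration.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic sigmoid for a scalar input."""
    return 1.0 / (1.0 + math.exp(-t))


def wh_multiple_exp(t: float) -> float:
    """yolov3/darknet-style anchor multiple: exp(t), unbounded above."""
    return math.exp(t)


def wh_multiple_sigmoid(t: float) -> float:
    """yolov5-style anchor multiple: (2 * sigmoid(t)) ** 2, bounded to (0, 4)."""
    return (2.0 * sigmoid(t)) ** 2


# Both mappings return 1 (the nominal anchor size) for a zero input,
# but only the exponential one grows without bound as t increases.
for t in (-10.0, 0.0, 10.0):
    print(t, wh_multiple_exp(t), wh_multiple_sigmoid(t))
```

Multiplying either result by the anchor width or height gives the predicted box size; the sigmoid form can never exceed 4 times the anchor.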

The original thread is ultralytics/yolov3#168
image

@glenn-jocher
Member

glenn-jocher commented Jul 21, 2020

@ChristopherSTAN BTW, you mentioned you were experimenting with lowering hyp['anchor_t'] (the anchor-multiple threshold, 4.0 by default), paired with an increase in anchor count. This is an interesting approach, but I just realized it would make sense to take this a step further and modify the actual wh function as well to reduce the range from 0-4 to 0-2; otherwise half of your output space is unused, which is a bad design decision, as your neuron outputs may lose up to half of their precision capability.

You can accomplish this by modifying the exponent in the equation to 1.0, which is mathematically equivalent to removing it altogether:

y[..., 2:4] = (y[..., 2:4] * 2) ** 1.0 * self.anchor_grid[i]  # wh
# which is equivalent to:
y[..., 2:4] = y[..., 2:4] * 2 * self.anchor_grid[i]  # wh

This change would need to occur in two places: 1) Detect() module, 2) compute_loss() box calculation:

yolov5/utils/utils.py

Lines 472 to 475 in 1e95337

# GIoU
pxy = ps[:, :2].sigmoid() * 2. - 0.5
pwh = (ps[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
pbox = torch.cat((pxy, pwh), 1).to(device) # predicted box

@TaoXieSZ
Author

TaoXieSZ commented Jul 21, 2020

@glenn-jocher I am afraid I have not considered so much LOL. Maybe you are talking about another DL pro.

(Apparently I am not, for now.)

But I will try!
Thanks for your explanation.

@TaoXieSZ
Author

@glenn-jocher I followed your idea and set hyp['anchor_t'] = 3.0. Will it work?

@glenn-jocher
Member

@ChristopherSTAN don't worry, the idea is pretty simple. A neuron can control outputs in a certain range defined by the above equations, default being 0-4. If you reduce the hyperparameter that controls the matching threshold to 2.0, then boxes are only matched to anchors that are less than 2x the anchor size and greater than 1/2x the anchor size. So if an anchor size is 10 pixels, then that neuron can match labels between 5-20 pixels in size, but it can output a box shape from 0-40 pixels in size. So it is wasting 5/8 of its output span. It has to fit all of its output between 5-20, which by definition gives it less fine control for tiny corrections, which will reduce mAP.

So for best results, you want the neuron to have output authority over the entire training space you want it to predict. Even with the default settings, I see I am wasting a bit of training space. With default settings, the 10 pixel anchor neuron is matched to label sizes between 2.5 and 40 pixels but can output sizes between 0 and 40, so I am currently wasting about 6% of the output space.
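
The arithmetic behind the 5/8 and 6% figures can be sketched as follows. The helper name is hypothetical, and the calculation assumes, as the comment above does, that "output span" is measured linearly in box size from 0 up to the maximum anchor multiple.

```python
def wasted_fraction(anchor_t: float, exponent: float = 2.0) -> float:
    """Fraction of the wh output span that no matched label can occupy.

    The wh equation (2 * sigmoid(t)) ** exponent spans anchor multiples
    0 .. 2 ** exponent, while anchor-target matching only keeps labels
    whose size is between 1 / anchor_t and anchor_t times the anchor.
    """
    max_multiple = 2.0 ** exponent
    lowest_matched = 1.0 / anchor_t
    highest_matched = min(anchor_t, max_multiple)
    return 1.0 - (highest_matched - lowest_matched) / max_multiple


print(wasted_fraction(2.0))  # threshold 2.0, default exponent: 0.625 (5/8)
print(wasted_fraction(4.0))  # default threshold 4.0: 0.0625 (about 6%)
```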

@glenn-jocher
Member

glenn-jocher commented Jul 21, 2020

@glenn-jocher I followed your idea and set hyp['anchor_t'] = 3.0. Will it work?

Yes, any value will work here, you just need to experiment with what produces the best mAP. If you lower these values though then it would also make sense to adjust the wh equations. For a 3.0 limit you might adjust the equation to this to fully capture the output space:
y[..., 2:4] = (y[..., 2:4] * 2) ** 1.6 * self.anchor_grid[i] # wh
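
One way to see where 1.6 comes from: the largest anchor multiple the equation can produce is 2 ** exponent, so matching it to a threshold anchor_t means taking exponent = log2(anchor_t). A small sketch, with a hypothetical helper name:

```python
import math


def exponent_for_threshold(anchor_t: float) -> float:
    """Exponent e such that (2 * sigmoid(t)) ** e tops out at anchor_t."""
    return math.log2(anchor_t)


for anchor_t in (2.0, 3.0, 4.0):
    e = exponent_for_threshold(anchor_t)
    print(anchor_t, round(e, 3), 2.0 ** e)
```

For anchor_t = 3.0 this gives roughly 1.585, which rounds to the 1.6 suggested above; 2.0 and 4.0 give the exponents 1.0 and 2.0 already discussed in this thread.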

@TaoXieSZ
Author

@glenn-jocher For now I am wondering whether I can adjust it to perform well on my datasets, where there are lots of overlapping and medium-sized objects:
image

Can I assume that decreasing this parameter (2, 1.73, ...) also limits the size of the output bounding boxes?

@glenn-jocher
Member

You should look at your labels.png to see your size distribution. Yes, changing the exponent in the box equations from 2.0 to 1.6 will limit your output space from 0-4 to 0-3. This would presumably be paired with an increase in anchor count, otherwise recall would suffer.

@TaoXieSZ
Author

@glenn-jocher Here:
image

@TaoXieSZ
Author

And here's another dataset:
image

@glenn-jocher
Member

@ChristopherSTAN yes these look pretty typical. You have some very large class imbalances as well. Or wait, it looks like your bar chart is plotted incorrectly, as there are 15 bins but it only goes up to 13. Looks like a plotting bug.

TODO: Fix labels.png bar chart.

@glenn-jocher glenn-jocher added the TODO High priority items label Jul 21, 2020
@glenn-jocher glenn-jocher self-assigned this Jul 21, 2020
glenn-jocher added a commit that referenced this issue Jul 21, 2020
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@glenn-jocher glenn-jocher removed the TODO High priority items label Jul 21, 2020
@glenn-jocher
Member

Pushed a commit 4ffd977 for improved plotting. No bug found in current plotting.

@TaoXieSZ
Author

Hi, dear Glenn,

I think it is a good time for your team to formularize, paperize your work, and SHOCK the world. It is really interesting to read your code.

@glenn-jocher
Member

@ChristopherSTAN haha, yes we do need to produce a publication, but we are still exploring design changes. Hopefully around the end of year we can send something to arxiv.

@glenn-jocher
Member

glenn-jocher commented Jul 23, 2020

@ChristopherSTAN I have an idea, you could try modifying the L24 activation function in the Conv() layer from LeakyReLU(0.1) to Swish() or Mish() to see if this helps wheat training. I've never tried this, but it may be possible to still start from pretrained weights when you do this:

yolov5/models/common.py

Lines 18 to 31 in 5e970d4

class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1, inplace=True) if act else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        return self.act(self.conv(x))

EDIT: You'll have to reduce your batch size, as these will consume much more GPU RAM when training.
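
For reference, the three activations mentioned here can be written as scalar functions. This is only an illustrative sketch of the standard definitions in plain Python, not code from the repo; in a model you would use tensor modules instead (recent PyTorch releases provide nn.SiLU for Swish and nn.Mish).

```python
import math


def leaky_relu(x: float, slope: float = 0.1) -> float:
    """LeakyReLU: identity for x >= 0, small linear slope for x < 0."""
    return x if x >= 0 else slope * x


def swish(x: float) -> float:
    """Swish (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def mish(x: float) -> float:
    """Mish: x * tanh(softplus(x)), with softplus(x) = log(1 + exp(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


for x in (-2.0, 0.0, 2.0):
    print(x, leaky_relu(x), swish(x), mish(x))
```

Unlike LeakyReLU, Swish and Mish are smooth, and neither can be applied in place in the same way, which is consistent with the higher memory use noted above.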

@TaoXieSZ
Author

TaoXieSZ commented Jul 23, 2020

@glenn-jocher Interesting!
I will try later. Now I am considering using the coco dataset to increase my training data by extracting intersecting classes. I think it will be a great trick to improve performance on custom datasets. If it works, I will open a PR to see if you are interested.

Edit: I plan to upload some scripts. I am not sure how to name this operation. Maybe we can name it "Enriching data" or something else.

BTW, I am using EfficientDet in the Wheat competition.
But I am using yolov5 on two different datasets.

@TaoXieSZ
Author

@glenn-jocher That's my way:
image
Here I have a small dataset with 3600 images. But by extracting data from coco, we can have more than 30K. I am curious how much it will help.

@TaoXieSZ
Author

@glenn-jocher Now I understand your feeling when training on COCO.
I just use yolov5m and nearly 40K training images, it takes me 35min to run an epoch....

@glenn-jocher
Member

@ChristopherSTAN intersecting classes, that's a good term. Yes this would be very useful. OpenImages V5/6 have a lot of intersecting classes with coco.

Yes, COCO can be very slow to train on unfortunately.

@dlawrences
Contributor

@glenn-jocher That's my way:
image
Here I have a small dataset with 3600 images. But by extracting data from coco, we can have more than 30K. I am curious how much it will help.

I would point out that this is not something you want to do in the long run, depending on the actual images of your own dataset. The COCO dataset may help the model to generalise on the objects, but usually the test dataset and the real world in which you are going to use your trained model will have their own specifics around:

  • point of view from where the pictures/videos have been taken
  • lighting
  • overall environment conditions

For the problems I am solving, I have also used the COCO dataset for the specific classes I am training. However, I also decrease the number of COCO images in my dataset once I have a new batch of real images annotated. And, obviously, one thing you need to make sure of is that there are no COCO images in your val/test set if these are not in accordance with your actual real scenarios. This can screw up your model evaluation pretty badly.

@TaoXieSZ
Author

@dlawrences Thanks for your suggestions! It is my first time adding COCO images to my train set. And I have a similar view on the test set to yours: I do not add extra images to the val set, because I still want the test set and dev set to have the same distribution.

Thanks again!

@TaoXieSZ
Author

@glenn-jocher I plan to try what this pro said: ultralytics/yolov3#1098 (comment)

Try Leaky ReLU first and then Mish.

@TaoXieSZ
Author

@glenn-jocher Bravo! I first trained yolov5x on a mixed dataset (30K of COCO + 3K of a small dataset) for nearly 50 epochs, then trained for 150 epochs on only the 3K dataset. It took me from 0.67 to 0.71 mAP on the test set!

@glenn-jocher
Member

@ChristopherSTAN oh, that's a big jump! What was the increase due to? The COCO pretraining? Mish/Swish was also mentioned above, or perhaps you used your intersecting classes idea?

@TaoXieSZ
Author

I don't see much improvement with Mish/Swish.
The story is that:
You know I am only using Colab, and my notebook got disconnected a few days ago. So I was angry and resumed training without the coco dataset, and observed a great improvement on the val set.

Then I thought about it a little: because data is so important to deep learning models, the external data improved the modeling ability of the model. When training on the original custom dataset afterwards, we can see a great improvement.

Notably, this is a single model on fold 0, but it outperforms my ensemble of models on 5 folds.

With this observation, I will keep the pretrained model and resume from it with k-fold, then ensemble the results.

So, at last, thanks a lot for your great repo and hard work on the COCO dataset.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@abhiagwl4262

@ChristopherSTAN @glenn-jocher Can anyone explain why setting the target confidence = 1.0 hurts the accuracy? And why does the equation below give better accuracy?

tobj[b, a, gj, gi] = (1.0 - model.gr) + model.gr * giou.detach().clamp(0).type(tobj.dtype) # giou ratio

@glenn-jocher
Member

@abhiagwl4262 you may want to experiment both ways. The current implementation sets object confidence to observed iou.
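
To illustrate the two options being compared, here is a scalar sketch of the tobj expression quoted above. The function name is hypothetical; gr stands in for model.gr and iou for the detached GIoU value of a matched prediction.

```python
def objectness_target(iou: float, gr: float = 1.0) -> float:
    """Blend between a constant target of 1.0 and the observed IoU.

    gr = 0.0 reproduces a fixed target confidence of 1.0;
    gr = 1.0 uses the IoU itself, clamped at 0, as the target.
    """
    return (1.0 - gr) + gr * max(iou, 0.0)


for iou in (-0.2, 0.5, 0.9):
    print(iou, objectness_target(iou, gr=0.0), objectness_target(iou, gr=1.0))
```

With gr = 1.0, a poorly localized box receives a lower objectness target than a well localized one, so confidence scores carry information about box quality.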

@abhiagwl4262

@glenn-jocher I actually experimented and found that setting the target_confidence to 1.0 gives me a significant drop in accuracy. Do you have any intuition behind this observation?

@glenn-jocher
Member

@abhiagwl4262 the intention with the current implementation is to assist NMS in reducing lower quality boxes.

@abhiagwl4262

@glenn-jocher The predicted box can be (0-4.0) times the anchor. You basically have an upper bound of 4.0 and a lower bound of 1/4.0 on the anchor-GT ratio. Why are you applying a lower bound? What is the significance of that?

@glenn-jocher
Member

@abhiagwl4262 the matching algorithm is attempting to match targets with suitable anchors. The matches should be neither too large, nor too small, so we use upper and lower bounds on the ratio to achieve this. Without the lower bounds, all anchors would match with small objects (we only want the small anchors to match with small objects).
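
A minimal sketch of such a two-sided ratio test is shown below. It is an assumption-labeled illustration in plain Python, not the repo's tensor code; the function name is hypothetical, and anchor_t plays the role of hyp['anchor_t'].

```python
def anchor_matches(target_wh, anchor_wh, anchor_t=4.0):
    """Return True if the target is within anchor_t x of the anchor in both dims.

    For each dimension the ratio r = target / anchor is compared in both
    directions via max(r, 1 / r), which enforces the upper bound (anchor_t)
    and the lower bound (1 / anchor_t) at the same time.
    """
    ratios = [t / a for t, a in zip(target_wh, anchor_wh)]
    worst = max(max(r, 1.0 / r) for r in ratios)
    return worst < anchor_t


small_anchor, large_anchor = (10.0, 10.0), (100.0, 100.0)
tiny_object = (12.0, 8.0)
print(anchor_matches(tiny_object, small_anchor))  # True
print(anchor_matches(tiny_object, large_anchor))  # False: below the lower bound
```

Dropping the lower bound would let the large anchor match the tiny object as well, which is the situation described above.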

@abhiagwl4262

@glenn-jocher Can you give a rough idea of how you chose the values for the yolo layer loss balancing as [4.0 - small object layer, 1.0 - medium object layer, 0.4 - large object layer]?

@glenn-jocher
Member

empirical results

@violet17

violet17 commented Jan 11, 2021

The original yolo/darknet box equations have a serious flaw. Width and Height are completely unbounded as they are simply out=exp(in), which is dangerous, as it can lead to runaway gradients, instabilities, NaN losses and ultimately a complete loss of training. yolov3 suffers from this problem as well as yolov4.

For yolov5 I made sure to patch this error by sigmoiding all model outputs, while also ensuring that the centerpoint remained unchanged 1=fcn(0), so nominal zero outputs from the model would cause the nominal anchor size to be used. The current eqn constrains anchor multiples from a minimum of 0 to a maximum of 4, and the anchor-target matching has also been updated to be width-height multiple based, with a nominal upper threshold hyperparameter of 4.0.

Thanks for the great repo!
May I ask some basic questions about the width-height method?

  1. Why multiply σ(x) by 2 and subtract 0.5 for the x,y coords of the bounding box? Should 2σ(x)-0.5 be in the range [-0.5,0.5]?

  2. Why multiply σ(x) by 2 for the width,height of the bounding box?

  3. Why should the centerpoint be 1, i.e. f(0)=1?

Thanks!

@glenn-jocher
Member

@violet17 the equation and intersection points were chosen for stability and for its suitability in replacing the unstable yolov3/v4 wh method.

@joelcma

joelcma commented Oct 24, 2021

@glenn-jocher But what is the purpose of offsetting by -0.5? If you are expanding the output space from 0-1 to 0-2 and you offset by -0.5 the mid-point will be equal to 0.5 and not 0. So given x = 0 you get 0.5 output. Wouldn't it be more logical to offset by -1 to get 0 when given 0?

With the current formulation, if the network predicts t_x = 0 and the cell offset is 0.5, then the output will be 1, while intuitively it seems like it should perhaps be 0.5, the 0 value of the cell. Perhaps I misunderstand?

@glenn-jocher
Member

@joelcma you want a reference input to create a reference output for stability and ease of training. The average input (due to batchnorm) will be zero, and the average object will be in the middle (i.e. at 0.5) of a grid cell. We expand the output space to allow for predictions near 0 and 1 without stressing the sigmoid inputs to extremes.
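
A small numeric sketch of this point, with hypothetical helper names: the expanded offset 2 * sigmoid(t) - 0.5 maps t = 0 to the cell centre 0.5, and it reaches offsets close to the cell edges with much smaller inputs than a plain sigmoid would need.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def logit(p: float) -> float:
    """Inverse of sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def xy_offset(t: float) -> float:
    """yolov5-style in-cell offset, ranging over (-0.5, 1.5)."""
    return 2.0 * sigmoid(t) - 0.5


def input_needed_plain(offset: float) -> float:
    """Raw input t for which sigmoid(t) equals the offset."""
    return logit(offset)


def input_needed_expanded(offset: float) -> float:
    """Raw input t for which 2 * sigmoid(t) - 0.5 equals the offset."""
    return logit((offset + 0.5) / 2.0)


print(xy_offset(0.0))  # 0.5, the middle of the grid cell
for offset in (0.5, 0.95, 0.99):
    print(offset, input_needed_plain(offset), input_needed_expanded(offset))
```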

@joelcma

joelcma commented Oct 25, 2021

@glenn-jocher Thank you for taking the time to answer! And sorry because I have another question :D So what is the benefit of using a sigmoid over a bounded relu in this case?

@glenn-jocher
Member

@joelcma the benefit of any model architecture selection would be driven by empirical results, i.e. 'it works better'.

@dhiman10

I tried exporting with !python export.py --weights /content/drive/MyDrive/best.pt --include "coreml"

Does anyone know how I can convert it correctly and get the bounding boxes, scores, and other outputs?

@glenn-jocher
Member

@dhiman10 after running export.py, you can get the bounding boxes, scores, etc. by using the CoreML framework to load the exported model and perform inference with it. You can refer to the CoreML documentation or examples for guidance on how to do this.

7 participants