
Training on Private Dataset #72

Closed
ofekp opened this issue Aug 23, 2020 · 25 comments

@ofekp

ofekp commented Aug 23, 2020

Thank you for sharing this code with us.

Can I train on my own dataset with only 11 classes?
I am overriding config.num_classes with 11 and config.image_size with 512, but I get very bad results; it is almost as if the model is not even aware of the image.

I made sure to pass the boxes as yxyx in the dataset and also made sure the classes start from 1, as I believe is required by how fast_collate works.

Would appreciate your kind help, thank you.
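For reference, the box-order conversion being described can be sketched like this (a minimal example; the helper name is mine, and the assumption is that the source boxes are [x1, y1, x2, y2] while effdet's anchor code wants [y1, x1, y2, x2]):

```python
import torch

def xyxy_to_yxyx(boxes: torch.Tensor) -> torch.Tensor:
    # Swap the [x1, y1, x2, y2] columns into [y1, x1, y2, x2].
    return boxes[:, [1, 0, 3, 2]]

boxes = torch.tensor([[10.0, 20.0, 110.0, 220.0]])  # xyxy
print(xyxy_to_yxyx(boxes).tolist())  # [[20.0, 10.0, 220.0, 110.0]]
```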

@pasandrei

I successfully trained this network on a custom dataset and had a similar problem. In my case, the main issue was that the image height and width in the annotation file were wrong.

@ofekp
Author

ofekp commented Aug 24, 2020

Thanks for replying.
The annotations that I provide are in yxyx format. I've read through Alex's example kernel and he seems to do the same; was I wrong?
I am also plotting the images and bounding boxes straight from the dataset I've created, and they appear in the correct spots.

I have now reached a point where the model is finally aware of the image (after I used Alex's network setup), but it seems to place far too many boxes on the person in the image (100, since that is the limit in the code). The scores do not seem to drop below the .001 threshold fast enough (also hard-coded in the effdet repo, not sure why).
So even after 60 epochs on 50 images I get 100 boxes on the person, with very few outside it. I expected it to learn to place fewer boxes after 60 epochs. Faster R-CNN did much better after just 10 epochs on the same data, and I am perplexed; I surely did something wrong. Important to add: I am trying to detect clothing items in images of people.

Edit: elaborating on yxyx:
As far as I can tell, the method batch_label_anchors in anchors.py expects a BoxList, i.e. boxes in yxyx format, and I made sure that is what it gets.

Need help figuring this out please.

@pasandrei

What annotation format are you using: COCO or PASCAL VOC?
Also, are you using the training code provided by this repo, or the one in Alex's notebook?

reached a step now that the model is finally aware of the image (after I used Alex'es network setup) but the model seems to be placing so many boxes on the person in the image (100 since this is the limit in the code)

This can be normal if you are not using NMS. It is known that to get a better score on the COCO mAP metric it is often better to predict more boxes (probably because the metric effectively treats false negatives as worse than false positives). The easiest fix would be to increase the confidence threshold and/or, depending on your specific problem, to use NMS.
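Both suggestions can be sketched together like this (an illustrative plain-torch version, not effdet's actual implementation; in practice torchvision.ops.nms does the NMS step, and all names here are mine):

```python
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Pairwise IoU between two sets of xyxy boxes: a is [N, 4], b is [M, 4].
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])  # intersection top-left
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def filter_detections(boxes, scores, score_thresh=0.1, iou_thresh=0.5):
    # 1) Drop low-confidence detections.
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    # 2) Greedy NMS: repeatedly keep the highest-scoring box and drop
    #    everything overlapping it by more than iou_thresh.
    order = scores.argsort(descending=True)
    kept = []
    while order.numel() > 0:
        i = order[0]
        kept.append(i.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thresh]
    return boxes[kept], scores[kept]
```

Raising `score_thresh` is the quick knob being discussed in this thread; `iou_thresh` controls how aggressively near-duplicate boxes are merged.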

@ofekp
Author

ofekp commented Aug 24, 2020

I am using the training code from torchvision, which makes use of the COCO utils.
The dataset is 512x512 images, normalized, with labels starting from 1 (no class assigned to background) and boxes in yxyx.
I use fast_collate to feed this data to the model, which is based on Alex's implementation. I am not near the code but will attach it here in a few hours.

I was under the impression that the model already has NMS built in; I think I saw it somewhere.
How can I try what you suggested, please?

  1. Increase the confidence threshold (is that the .001?)
  2. Use NMS

Thanks a lot kind sir!

Here's what I want vs. what I get from the model:

[screenshots: desired result vs. model output]

Another example:
[screenshots: desired result vs. model output]

@pasandrei

Increase the confidence threshold (is that the .001?)
Yes. I recommend setting it to at least .1, but it can vary depending on multiple factors.
If there are still a lot of bboxes, increase the threshold; if some objects are left with no boxes at all, decrease it.

@ofekp
Author

ofekp commented Aug 24, 2020

  • Is it "standard procedure" to change this .001 value?
  • I am really starting to think I did something wrong.
  • I found that nms is being called here.
  • Is it OK that my images are normalized?
  • I also saw Inference with custom image shape  #40 (comment) from Ross and decided to adjust my data so the image is aligned to the upper-left corner.
  • After 40 epochs on 50 images with batch size 2, lr 0.001, AdamW:
    [screenshot of predictions]
    and the scores are:
tensor([0.2032, 0.1991, 0.1856, 0.1484, 0.1351, 0.1144, 0.1144, 0.1002, 0.0892,
        0.0879, 0.0866, 0.0849, 0.0829, 0.0807, 0.0724, 0.0713, 0.0662, 0.0657,
        0.0645, 0.0621, 0.0611, 0.0594, 0.0593, 0.0586, 0.0572, 0.0571, 0.0559,
        0.0553, 0.0551, 0.0540, 0.0535, 0.0524, 0.0522, 0.0509, 0.0507, 0.0500,
        0.0499, 0.0497, 0.0497, 0.0495, 0.0491, 0.0487, 0.0485, 0.0482, 0.0474,
        0.0465, 0.0461, 0.0456, 0.0456, 0.0454, 0.0453, 0.0446, 0.0445, 0.0444,
        0.0443, 0.0441, 0.0441, 0.0432, 0.0431, 0.0431, 0.0428, 0.0428, 0.0427,
        0.0427, 0.0424, 0.0422, 0.0420, 0.0419, 0.0416, 0.0414, 0.0414, 0.0414,
        0.0411, 0.0408, 0.0407, 0.0405, 0.0402, 0.0401, 0.0397, 0.0395, 0.0395,
        0.0394, 0.0391, 0.0391, 0.0389, 0.0389, 0.0386, 0.0385, 0.0383, 0.0382,
        0.0382, 0.0382, 0.0381, 0.0380, 0.0378, 0.0378, 0.0378, 0.0377, 0.0375,
        0.0374], device='cuda:0')

@pasandrei

After 40 epochs with 50 images with batch size of 2, lr 0.001, AdamW:

  1. 50 images is WAY too few. I would suggest using at least a few thousand.
  2. 40 epochs is too few for EfficientDet.
  3. Because of the way BatchNorm works, a batch size of 2 is also too small. I think 32 is the sweet spot.
  4. You could also increase the learning rate to something like 0.01.

Just as a reference, I trained EfficientDet-Lite0 with ~12000 images, 300 epochs, a batch size of 40, and lr ~0.05.
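The batch-size/learning-rate interplay in the points above is often handled with the linear scaling rule, which can be sketched as (the reference numbers are just the ones quoted in this thread, not anything canonical):

```python
def scale_lr(base_lr: float, base_batch: int, batch: int) -> float:
    # Linear scaling rule: keep lr / batch_size roughly constant.
    return base_lr * batch / base_batch

# e.g. if lr ~0.05 works at batch 40, a batch of 8 suggests roughly 0.01:
print(scale_lr(0.05, 40, 8))
```

This is a heuristic, not a guarantee; warmup epochs (as used in this repo's training scripts) matter more as the scaled lr grows.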

@ofekp
Author

ofekp commented Aug 24, 2020

  1. I know; I have more images, I just wanted to see whether the model could overfit a small subset of the data. That is a good indication that the model works, though it does not guarantee a good model. Since I thought I had an issue in the setup, I needed a small dataset.
  2. Copy that, but how did Faster R-CNN do better after just 10 epochs? I know, small dataset, so it probably means nothing.
  3. For now I only have 8 GB on my GPU; I use tf_efficientdet_d5, and already with a batch of 3 I get a CUDA out-of-memory error. When I start training in Colab I will have 15 GB, so I will increase the batch size.
  4. I tried, and got some images without any predictions, which caused issues in the COCO eval method. Is this normal? I am guessing no prediction should be a valid model output.

Do you reckon that when I increase the dataset, I will need to change .001 back to normal?

Thank you so much for the help, I appreciate it. I will keep trying to train with what you told me and will play with the .001 value too. Will post some results soon, hopefully good ones 🤞

@pasandrei

pasandrei commented Aug 24, 2020

  1. Trying to overfit can seem like a good idea; however, I had some problems in the past trying to overfit MobileNetV2 on a small portion of COCO (64-128 images).
  2. I experienced something similar with SSD and YOLOv3: the mAP would grow logarithmically. For this network, however, I observed that the mAP increases somewhat sigmoidally (if that's a word).
  3. You definitely won't need d5. Try to go with a model for which you can fit at least 16 images in a batch.
  4. Increasing the threshold will most likely decrease the mAP. However, if you plot the predictions, you will definitely find that a higher threshold yields better results (visually).

Do you recon when I increase the data set, I will need to change .001 back to normal?

When you evaluate with the COCO mAP metric, you should leave it at .001. But while visually inspecting the output, you should increase it by a few orders of magnitude.

Thank you so much for the help, I appreciate it

I'm glad to help! Good luck!

@ofekp
Author

ofekp commented Aug 25, 2020

Me again :)

  • Since I also train a segmentation network based on the output of the detection network, I've set the threshold to 0.1 for both training and visualizing the results, which is contrary to what you suggested. Can you please explain why it is beneficial to leave it at 0.001 for training and only increase it for visualization?
  • BTW, I actually made it a parameter of the model method, which let me quickly check the visualization for a few values. Why is this not a parameter learned by the network?
  • Took your advice and used d1 instead of d5 with a batch of 4, which is what I could fit on my 8 GB GPU.
  • Increased the lr to 0.01 as you suggested.
  • I also used gradient accumulation of 2, so you could say my batch size is effectively 8.
  • I've had some success with this setup on my own data, which I am pretty happy about (the first line for every image is the ground truth); see the images below.
  • Thank you so much for helping me; having someone to talk to about this is greatly beneficial.

[screenshots: four example predictions, ground truth vs. model output]
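The gradient-accumulation trick mentioned above can be sketched on a toy model; with a mean loss and equal-sized micro-batches, dividing each loss by accum_steps makes the accumulated gradient match a single full-batch backward pass (all names here are illustrative, not the repo's training code):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
accum_steps = 2

# Accumulate gradients over two equal micro-batches; .backward() adds
# into .grad, so the halves sum to the full-batch gradient.
model.zero_grad()
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    loss = torch.nn.functional.mse_loss(model(xb), yb) / accum_steps
    loss.backward()
accum_grad = model.weight.grad.clone()

# Reference: one backward pass over the full batch.
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()
print(torch.allclose(accum_grad, full_grad, atol=1e-6))  # True
```

In a real loop you would call optimizer.step() and optimizer.zero_grad() once every accum_steps micro-batches. Note Ross's caveat below: this does not help BatchNorm, whose statistics still see only the small micro-batch.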

@sadransh

Nice discussion.

I've got a question about training on my own dataset. I have about 12K images, and I am trying to train with

./distributed_train.sh 3 ./data --model efficientdet_d1 -b 15 --amp --lr .1 --sync-bn --opt fusedmomentum --warmup-epochs 5 --lr-noise 0.4 0.9 --model-ema --model-ema-decay 0.99995

Is it a good starting point? If I use a bigger batch size it won't fit on my GPUs.

After 26 epochs the AP is around 0.00037. Is this normal? Isn't it taking too long?

@rwightman
Owner

Woah, quite the discussion here... few points

Batch size is pretty important for training these models, especially since I don't use any BN freezing by default (you're welcome to implement). I don't know what the lower limit is, but it seems to become pretty unstable below a global batch (across all distributed nodes) of 8. Gradient accumulation doesn't really solve the issue because a big part of it is the BN stability, which is why I recommend keeping sync-bn on if you're doing distributed training.

These models do take time to train properly, you won't be happy looking at the results too early.

You can align your images however you want, just be sure to match training augmentation pipeline with your eval pipeline. I based this on the practice in the official impl so the pretrained weights would work properly.

For the last comment, with EMA enabled you won't see eval much above 0 for a while (it depends on the number of steps per epoch and the effective window size of your EMA decay). I think with my defaults it was around 20 epochs before it starts looking like it's not broken. But that is for COCO and the number of steps per epoch you get with that dataset and my default batch sizes. You need to adjust for your setup. With 12K images, you probably won't see anything happen for > 120 epochs.
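A rough back-of-the-envelope for the EMA point: an exponential moving average with decay d averages over roughly 1/(1-d) optimizer steps (a standard approximation, applied here to the numbers from this thread, assuming the 3 GPUs x batch 15 setup from the command above):

```python
decay = 0.99995
effective_window = 1 / (1 - decay)   # ~20000 optimizer steps
steps_per_epoch = 12000 // 45        # 12K images / global batch of 45
epochs_until_ema_tracks = effective_window / steps_per_epoch
print(round(effective_window), steps_per_epoch, round(epochs_until_ema_tracks))
```

So with a small dataset the EMA weights lag the raw weights for many tens of epochs, which is consistent with the near-zero early AP reported above.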

@ofekp
Author

ofekp commented Aug 28, 2020

Thank you for your reply, clarifications and suggestions @rwightman

I'd love to try to implement the BN freezing if it can be done in < 5 days, time constraints and all...
I am a beginner in deep learning, so I am not quite sure yet what it means. I will try to read up on it; do you have any good resources to direct me to?

BTW our network is built like so, since we try to do both instance segmentation in conjunction with detection, as evident from the images we attached above:

                            +---> MaskRCNN
                            |
EfficientNet +--> BiFPN +--->
                            |
                            +---> EfficientDet

Where MaskRCNN actually makes use of the EfficientDet network output.

During training we saw that increasing the 0.001 value here (to .1 😬) was very beneficial for getting rid of all the "redundant" boxes (seen in this #72 (comment)) as early as epoch 40 (with 1K images and a batch size of 4). We were wondering: why is the value so low? Or, better phrased, how was this value picked, and is there a way for it to be learned by the network instead?

Thank you again Ross.

@rwightman
Owner

@ofekp I think I just ran through some validations and picked a value that slightly reduced the number of detections passed to eval (to speed it up a little) but didn't reduce the AP/AR scores... values between .05 and 0 are fairly common for evaluation; I think the original doesn't bother filtering at all. You should use a much higher value for visualization (typically .1-.4).

@ofekp
Author

ofekp commented Aug 29, 2020

@rwightman, sorry it is long 🙏

.001

Thanks Ross. Since I have MaskRCNN using the output from effdet during the learning process, setting the value to 0.1 was good for training as well as for visualization, since MaskRCNN performed much better when it was given the actually correct bounding boxes. I am not sure this is the right thing to do; I only recently started tweaking things, since I just got it working.

Batch normalization freeze - possible code addition

After reading these:

I think something of this sort should work for freezing the BN weights:

def set_bn_eval(m):
    # Freeze a single BatchNorm2d module: stop training its affine
    # parameters and stop updating its running statistics.
    classname = m.__class__.__name__
    if "BatchNorm2d" in classname:
        m.weight.requires_grad = False
        m.bias.requires_grad = False
        m.eval()  # eval mode stops running_mean/running_var updates

def freeze_bn(model):
    model.apply(set_bn_eval)

Then call:

freeze_bn(model)

If I understand correctly, you want to freeze those weights at their pre-trained values so that we have better statistics when training, i.e. we freeze them at the beginning of training and never "unfreeze" them again. Did I get that right?

Batch normalization freeze - validation of freeze_bn method

I checked and this seems to work fine. The weights don't change. Checked using this, if anyone is interested:

# THIS CODE IS ONLY FOR VALIDATION
wl1 = []
bl1 = []
wl2 = []
bl2 = []

def set_bn_eval1(m):
    classname = m.__class__.__name__
    if "BatchNorm2d" in classname:
        wl1.append(m.weight.clone())
        bl1.append(m.bias.clone())

def get_bn_weights1(model):
    model.apply(set_bn_eval1)
    
def set_bn_eval2(m):
    classname = m.__class__.__name__
    if "BatchNorm2d" in classname:
        wl2.append(m.weight.clone())
        bl2.append(m.bias.clone())

def get_bn_weights2(model):
    model.apply(set_bn_eval2)

Then I called this before starting to train:

# THIS CODE IS ONLY FOR VALIDATION
get_bn_weights1(model)

And called this after every epoch:

# THIS CODE IS ONLY FOR VALIDATION
wl2.clear()  # reset before re-capturing, so the lists stay aligned
bl2.clear()
get_bn_weights2(model)  # capture the current BN weights first
print("asserting bn")
print(torch.all(torch.eq(wl1[0], wl2[0])))
for w1, w2 in zip(wl1, wl2):
    assert torch.all(torch.eq(w1, w2))
for b1, b2 in zip(bl1, bl2):
    assert torch.all(torch.eq(b1, b2))
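A more compact check of the same invariant on a toy model (a sketch of mine, not repo code; note the gotcha that model.train() re-enables BN statistics updates, so the freeze has to be re-applied after every train() call):

```python
import torch
import torch.nn as nn

def freeze_bn(model: nn.Module) -> None:
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()  # stop running_mean/running_var updates
            m.weight.requires_grad_(False)
            m.bias.requires_grad_(False)

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
model.train()
freeze_bn(model)  # must come after train(), which re-enables BN updates

# Snapshot the BN running statistics, run a "training" forward pass,
# and verify nothing changed.
before = {k: v.clone() for k, v in model.state_dict().items() if "running" in k}
_ = model(torch.randn(4, 3, 16, 16))
after = model.state_dict()
print(all(torch.equal(before[k], after[k]) for k in before))  # True
```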

Batch normalization is turned off for resampling

Question: In the effdet structure, why did you turn off Batch Normalization by default for the resampling?

Effnet backbone layers

In the paper they use layers 3-7, while in the effdet implementation you use 2-4. I would love to know what the considerations were; I think I am overlooking some detail.

Should I pre-train on Clothing-1M instead of ImageNet

Since I am training on clothes, wouldn't I want something pre-trained on a dataset like Clothing1M to get better BN weights? Would that be a good thing to do? I would have to train on Clothing1M, which has label noise, and I am not sure whether I should train only EfficientNet or all of EffDet (if it should be all of EffDet, we could train with our own data and not even use Clothing1M).

Decrease in precision

I am also experiencing a decrease in precision the more I train:

Epoch 10:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.018
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.038
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.015

Epoch 20:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.012
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.026
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.009

Epoch 40:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.007
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.015
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.006

Is it normal?

@sadransh

sadransh commented Aug 29, 2020

./distributed_train.sh 4 ./data --model tf_efficientdet_d2 -b 8 --amp --lr .1 --sync-bn --opt fusedmomentum --warmup-epochs 5 --lr-noise 0.4 0.9 --model-ema --model-ema-decay 0.99995


Train: 125 [   0/344 (  0%)]  Loss:  0.597892 (0.5979)  Time: 2.282s,   14.02/s  (2.282s,   14.02/s)  LR: 8.976e-02  Data:0.897 (0.897)

The loss on the validation set is about 3.

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.004

I am not sure what the problem with my training is.
My dataset contains 12k images, 800 by 800. I used the same annotations with detectron and got good results.

Any suggestions? I think I am missing something important, but I am not sure what it is.

UPDATE: It seems that in the previous run my LR, considering the batch size, was too high.

@ofekp
Author

ofekp commented Aug 30, 2020

@sadransh Is this epoch 125? Though I never got to 125, this seems very low. I updated my comment (which has become a full-blown story by now) with some AP results on 1K images, so you can compare against those. I am also experiencing weird stuff, so take it with a grain of salt. It seems like maybe you are feeding it wrong data; did you try to visualize your data to make sure it is OK? I know I had issues with the encoding of the bounding boxes in my dataset and had to convert xyxy to yxyx for it to work better, just as an example. Anyway, just trying to help; you should probably wait for someone more knowledgeable to answer.

@sadransh

@ofekp At first I thought it was a problem with my dataset. However, if you provide a standard COCO annotation to the code, with this set to True, here, the loader correctly prepares the data for training.
I think my previous training params were incorrect. I decreased the LR in another try, and after ~200 epochs I reached results similar to detectron's Faster R-CNN, which shows the training is correct.

@Naivepro1990

@ofekp

Did you have success training on the custom data? I would like to get your code and train on my own dataset.

@ofekp
Author

ofekp commented Aug 31, 2020

@Naivepro1990 Some success, though while the loss is going down, my precision is not improving. Trying to take Ross's advice and train > 120 epochs now; we'll see.
I suggest you first take a look (did you already?) at https://www.kaggle.com/shonenkov/training-efficientdet (taken from the README), since it is simpler than my code. I am also doing segmentation with another network from torchvision.

@ylmzkaan

ylmzkaan commented Sep 11, 2020

First of all, I am really thankful for this repository.

@ofekp I am also using Alex's kernel as a reference, but could you share your inference script? It would be extremely helpful.

I also have a couple of questions.

1- By confidence threshold, do you mean discarding bboxes with a probability less than 0.1?
2- For inference, do you use DetBenchPredict or DetBenchTrain?
3- Is it a must to resize all input images to a common size? I assume it is, so I resize the largest edge of the image to 512 pixels and pad the image to make it 512x512 without changing the aspect ratio. Is this good practice?
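For question 3, that resize-then-pad approach (padding toward the bottom/right so the image stays aligned to the upper-left corner, as Ross suggested earlier in the thread) can be sketched as pure arithmetic (the helper name is mine):

```python
def letterbox_params(h: int, w: int, target: int = 512):
    # Scale so the longest edge becomes `target`, then pad the
    # bottom/right so the result is exactly target x target.
    scale = target / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    pad_bottom, pad_right = target - new_h, target - new_w
    return scale, (new_h, new_w), (pad_bottom, pad_right)

print(letterbox_params(600, 800))  # (0.64, (384, 512), (128, 0))
```

The same `scale` must then be applied to the ground-truth boxes; since the padding is bottom/right only, no box offset is needed.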

@lucasjinreal

@ofekp Cascading Mask R-CNN's mask head onto a single-stage detector does not seem like good practice, since the bboxes are too numerous and not as high-quality as a two-stage detector's. Instead, if you try a Yolact++ style approach, you can get a very nice and easy-to-deploy instance segmentation model.

@ofekp
Author

ofekp commented Sep 29, 2020

Thanks @rwightman,
We got 33% AP for segmentation and 42% AP for bbox with just 10K of the images (out of 40K) with the d2 model, slightly surpassing the AP we got with the original Faster R-CNN model.
We used a box threshold of 0.3 (rather than 0.001), so far fewer boxes are processed by the mask head.
We might try Yolact in a future project, but this one is coming to an end and we have to wrap it up.
I also wrote the code for freezing the batch norm weights; it is in a previous comment I made, and I can add it to this repo if you want.
Let me know,
Thanks again.

@lucasjinreal

@ofekp How did you get your model precision to normal eventually? How many epochs did you train?

@ofekp
Author

ofekp commented Oct 10, 2020

@jinfagang @ylmzkaan
We recently completed our work, you can see everything here:
https://github.com/ofekp/imat
All the information you want is found either in the code or in our paper (mind you, it is not a published paper, just a course project paper).
If you have any questions after that, I will be happy to answer everything I can.
If you like what we did, I'd love it if you could leave a star on the repo; it would help us a lot.
Cheers.
