Training on Private Dataset #72
I successfully trained this network on a custom dataset and had a similar problem. For me, the main issue was that the image height and width in the annotation file were wrong.
Thanks for replying. I have now reached a point where the model is finally aware of the image (after I used Alex's network setup), but it seems to be placing very many boxes on the person in the image (100, since that is the limit in the code). It seems the scores are not decreasing below the .001 threshold fast enough (also hard-coded in the effdet repo, not sure why). Edit: Elaborating on yxyx: need help figuring this out please.
What annotation format are you using: COCO or PASCAL VOC?
This can be normal if you are not using NMS. It is known that to get a better score on the COCO mAP metric it's better to predict more boxes (probably because the metric is skewed toward treating False Negatives as worse than False Positives). The easiest thing to do would be to increase the confidence threshold and/or, depending on your specific problem, use NMS.
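For example, with torchvision's NMS op (a sketch; the box layout, scores, and thresholds here are illustrative and depend on your pipeline):

```python
import torch
from torchvision.ops import nms

# Dummy detections: (N, 4) boxes in xyxy order and (N,) scores.
boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],
                      [60., 60., 90., 90.]])
scores = torch.tensor([0.9, 0.6, 0.002])

keep = nms(boxes, scores, iou_threshold=0.5)  # drop overlapping duplicates
boxes, scores = boxes[keep], scores[keep]

mask = scores > 0.3                           # and/or raise the confidence cut-off
boxes, scores = boxes[mask], scores[mask]
```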
I am using the training code from torchvision, which makes use of the COCO utils. I was under the impression that the model already has NMS in it; I think I saw it somewhere.
Thanks a lot, kind sir! Here's what I want vs. what I get from the model:
Just as a reference, I trained EfficientDet-Lite0 with ~12,000 images, 300 epochs, a batch size of 40, and an LR of ~0.05.
Do you reckon that when I increase the dataset, I will need to change .001 back to normal? Thank you so much for the help, I appreciate it. I will keep trying to train with what you told me and will play with the .001 value too. Will share some results soon, hopefully good ones 🤞
When you evaluate with the COCO mAP metric, you should leave it at .001. But when visually inspecting the output, you should increase it by a few orders of magnitude.
I'm glad to help! Good luck!
Me again :)
Nice discussion. I've got a question about training on my own dataset. I have about 12K images, and I am trying to train with `./distributed_train.sh 3 ./data --model efficientdet_d1 -b 15 --amp --lr .1 --sync-bn --opt fusedmomentum --warmup-epochs 5 --lr-noise 0.4 0.9 --model-ema --model-ema-decay 0.99995`. Is this a good starting point? If I use a bigger batch size it won't fit on my GPUs. After 26 epochs the AP is around 0.00037. Is this normal? Isn't it taking too long?
Woah, quite the discussion here... a few points:

- Batch size is pretty important for training these models, especially since I don't use any BN freezing by default (you're welcome to implement it). I don't know what the lower limit is, but it seems to become pretty unstable below a global batch (across all distributed nodes) of 8. Gradient accumulation doesn't really solve the issue because a big part of it is BN stability, which is why I recommend keeping sync-bn on if you're doing distributed training.
- These models do take time to train properly; you won't be happy looking at the results too early.
- You can align your images however you want, just be sure to match your training augmentation pipeline with your eval pipeline. I based this on the practice in the official impl so the pretrained weights would work properly.
- For the last comment: with EMA enabled, you won't see eval much above 0 for a while (it depends on the number of steps per epoch and the effective window size of your EMA decay). I think with my defaults it was around 20 epochs before it starts looking like it's not broken. But that is for COCO and the number of steps per epoch you get with that dataset and my default batch sizes. You need to adjust for your setup. With 12K images, you probably won't see anything happen for > 120 epochs.
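To put a rough number on that EMA window point (a back-of-the-envelope sketch; the 12K-image, batch-size-40 figures are borrowed from earlier in the thread):

```python
# The effective averaging window of an EMA is roughly 1 / (1 - decay).
decay = 0.99995
window_steps = 1 / (1 - decay)         # ~20,000 optimizer steps

steps_per_epoch = 12_000 // 40         # ~300 steps with 12K images, batch 40
print(window_steps / steps_per_epoch)  # ~67 epochs before the EMA weights catch up
```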
Thank you for your reply, clarifications, and suggestions @rwightman. I'd love to try to implement the BN freeze if it can be done in < 5 days, time constraints and all... BTW, our network is built like so, since we try to do both instance segmentation in conjunction with detection, as evident from the images we attached above:
Where MaskRCNN actually makes use of the EfficientDet network output. During training we saw that increasing the value of 0.001 here (to .1 😬) was very beneficial for getting rid of all the "redundant" boxes (seen in this #72 (comment)) as early as epoch 40 (with 1K images and a batch size of 4). We were wondering: why is the value so low? Or, to phrase it better, how was this value picked? And is there a way for it to actually be learned by the network instead? Thank you again Ross.
@ofekp I think I just ran through some validations and picked a value that slightly reduced the number of detections passed to eval (to speed things up a little) but didn't reduce the AP/AR scores... values between .05 and 0 are fairly common for evaluation; I think the original doesn't bother filtering at all. You should use a much higher value for visualization (typically .1-.4).
@rwightman, sorry it is long 🙏

**.001**

Thanks Ross. Since I have MaskRCNN using the output from effdet during the learning process, setting the value to 0.1 was good for the training as well as for the visualization, since MaskRCNN performed much better when it was given the actual correct bounding boxes. I am not sure this is the right thing to do; I only just started tweaking things recently since I got it working.

**Batch normalization freeze - possible code addition**

After reading these:
I think something of this sort should work for freezing the BN weights:
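A minimal sketch of what I mean, assuming standard `torch.nn` BatchNorm modules (details may differ from what I actually ran):

```python
import torch

def freeze_bn(model):
    # Put every BatchNorm layer in eval mode (freezes the running
    # mean/var) and stop gradients to the affine parameters so the
    # weights stay at their pre-trained values.
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.eval()
            if module.affine:
                module.weight.requires_grad = False
                module.bias.requires_grad = False
```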
Then call:
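Assuming `model` holds the network (variable name is mine):

```python
freeze_bn(model)  # once, right after loading the pre-trained weights
```

One caveat: a subsequent `model.train()` flips BN layers back to training mode, so if the running stats should stay frozen too, `freeze_bn` needs to be re-applied after every `model.train()` call.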
If I understand correctly, you want to freeze those weights to their pre-trained values so that we have better statistics when training? Meaning we freeze them at the beginning of training and never "unfreeze" them again. Did I get this right?

**Batch normalization freeze - validation of freeze_bn method**

I checked and this seems to work fine. The weights don't change. Checked using this, if anyone is interested:
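A sketch of the check (the helper name `bn_weight_sum` is mine; the original snippet may have differed):

```python
import torch

def bn_weight_sum(model):
    # Collapse all BN affine parameters into one scalar; if this value
    # is identical before and after training, the weights did not move.
    total = 0.0
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            if module.affine:
                total += module.weight.detach().abs().sum().item()
                total += module.bias.detach().abs().sum().item()
    return total
```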
Then I called this before starting to train:
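Assuming the `bn_weight_sum` helper above:

```python
bn_sum_start = bn_weight_sum(model)
```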
And called this after every epoch:
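Comparing against the value recorded at the start:

```python
assert bn_weight_sum(model) == bn_sum_start, "BN weights changed!"
```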
**Batch normalization is turned off for resampling**

Question: in the effdet structure, why did you turn off Batch Normalization by default for the resampling?

**Effnet backbone layers**

In the paper they use layers 3-7, while in the effdet implementation you use 2-4. I would love to know what the considerations were in doing so; I think I am overlooking some detail.

**Should I pre-train on Clothing1M instead of ImageNet?**

Since I am training on clothes, wouldn't I want something pre-trained on something like Clothing1M to get better BN weights? Would that be a good thing to do? I would have to train on Clothing1M, which has label noise in it, and I am not sure whether I should train only EfficientNet or all of effdet (if it should be all of effdet, we could train with our own data and not even use Clothing1M).

**Decrease in precision**

I am also experiencing a decrease in precision the more I train:

Epoch 10:
Epoch 20:
Epoch 40:
Is this normal?
I am not sure what the problem is with my training. Any suggestions? I think I am missing something important, but I'm not sure what it is. UPDATE: It seems that in my previous run, my LR was too high given my batch size.
@sadransh Is this at epoch 125? Though I never got to 125, this seems very low. I updated my comment (which has become a full-blown story by now) with some AP results on 1K images so you can compare against those. I am also experiencing weird stuff, so take it with a grain of salt. This seems like maybe you are feeding it wrong data; did you try to visualize your data to make sure it is OK? I know I had issues with the encoding of the bounding boxes in my dataset and had to convert xyxy to yxyx for it to work better, just as an example. Anyway, just trying to help; you'd probably better wait for someone more knowledgeable to answer.
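For reference, the conversion I mean is just a column swap (a sketch, assuming an `(N, 4)` array of boxes in xyxy order):

```python
import numpy as np

def xyxy_to_yxyx(boxes):
    # [x_min, y_min, x_max, y_max] -> [y_min, x_min, y_max, x_max]
    boxes = np.asarray(boxes)
    return boxes[:, [1, 0, 3, 2]]
```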
@ofekp At first I thought it was a problem with my dataset. However, if you provide a standard COCO annotation to the code, then with this set to True, the loader here correctly prepares the data for training.
Did you have any success training on custom data? I would like to get your code and train on my own dataset.
@Naivepro1990 Some success, though while I am seeing the loss go down, my precision is not improving. Trying to take Ross's advice and train > 120 epochs now. We'll see.
First of all, I am really thankful for this repository. @ofekp I am also taking Alex's kernel as a reference, but can you share your inference script? It would be extremely beneficial if you did. I also have a couple of questions. 1- By confidence threshold, do you mean discarding bboxes with probability less than 0.1?
@ofekp Cascading MaskRCNN's mask head behind a single-stage detector doesn't seem like good practice, since the bboxes are too numerous and not as high quality as a two-stage detector produces. Instead, if you try a YOLACT++-style approach, you can get a very nice and very easy to deploy instance segmentation model.
Thanks @rwightman,
@ofekp How did you eventually get your model precision to a normal level? How many epochs did you train?
@jinfagang @ylmzkaan
Thank you for sharing this code with us.
Can I train on my own dataset with only 11 classes?
I am overriding `config.num_classes` with 11 and overriding `config.image_size` with 512, but I get very bad results; it is almost as if the model is not even aware of the image... I made sure to put the boxes in as yxyx in the dataset and also made sure the classes start from 1, as I think is needed due to how `fast_collate` works.
Would appreciate your kind help, thank you.
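For reference, the overrides look roughly like this (a sketch; config keys and bench signatures may differ between effdet versions, and `tf_efficientdet_d0` is just an example model name):

```python
from effdet import get_efficientdet_config, EfficientDet, DetBenchTrain

config = get_efficientdet_config('tf_efficientdet_d0')
config.num_classes = 11   # custom dataset with 11 classes
config.image_size = 512   # note: newer effdet versions expect an (H, W) tuple

model = EfficientDet(config, pretrained_backbone=True)
bench = DetBenchTrain(model, config)  # wraps the model with loss/anchors for training
```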