EfficientDet: Scalable and Efficient Object Detection - 51.0% mAP@0.5...0.95 COCO #4346
Comments
Looks really promising! The GPU latencies given are very low, but it uses EfficientNet as the backbone - how could that be? |
So this could be implemented in this darknet repository??? I'm a little confused. |
What is the best way to get the highest mAP@50 so far? Can I use EfficientDet-D0 - D6? I use yolov3-voc.cfg to train my own dataset and get mAP@50 = 80 on my own test set. I just add three lines: |
@AlexeyAB @WongKinYiu guys I might have an interesting clue about increasing mAP. Efficientdet has 5 outputs (P3-P7) compared to 3 (P3-P5) for yolov3, but these extra 2 are for larger objects, not smaller objects. In the past I've added a 4th layer to yolov3, with the same or slightly worse results, but this was for smaller objects. On the same topic, I recently added test-time augmentation to my repo ultralytics/yolov3#931, which increased mAP from 42.4 to 44.7. I tested many different options, and settled on 2 winners: a left-right flip, and a 0.70 scale image. The highest mAP increase was coming from the larger objects. I think the 0.70 scale made these large objects smaller so they could fit in the P5 layer (whereas maybe before they would have needed to be in the P6 layer which doesn't exist). So my proposal is (since the darknet53-bifpn cfg did not help), to simply add P6 and P7 outputs to yolov3-spp.cfg and test this out (with the same anchors redistributed among the layers I suppose). What do you think? |
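For illustration, a minimal sketch of the kind of test-time augmentation described above (a left-right flip plus a 0.70-scale pass, merged before NMS); the model call signature and the [x1, y1, x2, y2, conf, cls] box layout are assumptions, not the actual ultralytics implementation:

import torch
import torch.nn.functional as F

def tta_inference(model, img, scale=0.70):
    """Hypothetical test-time augmentation: original + LR-flip + down-scaled pass.
    Assumes model(img) returns an (n, 6) tensor of [x1, y1, x2, y2, conf, cls]
    in pixel coordinates; boxes from each augmented pass are mapped back and
    concatenated so a single NMS step can merge them afterwards."""
    _, _, h, w = img.shape
    preds = [model(img)]  # original image

    # left-right flip: run the flipped image, then mirror the x coordinates back
    p = model(torch.flip(img, dims=[3]))
    p[:, [0, 2]] = w - p[:, [2, 0]]  # new_x1 = w - old_x2, new_x2 = w - old_x1
    preds.append(p)

    # reduced-scale pass: large objects shrink enough to fit the smaller-stride heads
    img_s = F.interpolate(img, scale_factor=scale, mode='bilinear', align_corners=False)
    p = model(img_s)
    p[:, :4] /= scale  # map boxes back to the original resolution
    preds.append(p)

    return torch.cat(preds, dim=0)  # feed this to NMS

The merged predictions are then passed through a single NMS step, which is where the duplicate boxes from the three passes get resolved.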
@glenn-jocher Hello, yes, from my previous analysis I think we need at least one more scale (P6). I saw your modification yesterday, and I have already integrated YOLOv3-SPP into mmdetection. By the way, CSPDarkNet53 has also been integrated with TTFNet (anchor-free object detection), MSRCNN (instance segmentation), and JDE (simultaneous detection and tracking). OK, I am training YOLOv3-SPP with almost the same settings as CSPResNeXt50-PANet-SPP (optimal) using ultralytics. I think it can be the baseline of your new model which integrates P3-P7. |
Yes, it seems that https://arxiv.org/abs/1909.00700v3 and https://github.com/ZJULearning/ttfnet (Training-Time-Friendly Network for Real-Time Object Detection) are a good direction. Still, I think we must move on:
Yes, it is the same result as for CenterNet: they achieve 40.3% AP / 14 FPS; with Flip they achieve 42.2% AP / 7.8 FPS; and with Multi-scale they achieve 45.1% AP / 1.4 FPS. https://github.com/xingyizhou/CenterNet#object-detection-on-coco-validation
Maybe yes.
Yes, it is because they increased the input network resolution from 512x512 for D0 (where P3-P5 are for big objects) to 1536x1536 for D7 (where P3-P5 are for small objects), so we should add P6-P7 for big objects. The receptive field NxN of P5 doesn't depend on the network resolution and remains the same in pixels (look below for the yolo cfg files), so NxN for 512x512 is big, while the same NxN for 1536x1536 is small. So maybe we should:
Yes, try 4 ways:
I added receptive field calculation - usage:
While the input network size is just 608x608.
|
@WongKinYiu ah great! Lots of integrations going on. I have not looked at ATSS yet, I will check it out. TTFNet looks refreshingly simple. Yes, that's a good chart you have there. How do you calculate the receptive field exactly? I saw EfficientDet updated their anchor ratios to (1.0, 1.0), (1.4, 0.7), (0.7, 1.4). I'm not sure exactly how these work. Do you think they create anchors based on multiplying grid cells or the receptive field? Yes, a P6 layer requires 64-multiple size images, and a P7 layer would require 128-multiple size images, but it's not a huge problem. |
There is a 3x3 convolutional layer just before the prediction layer, so I simply multiply the size of the grid by 3. |
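A small sketch of that rule of thumb, combined with the resolution argument made earlier (the strides are the standard YOLO output strides; everything else is illustrative):

# Rough receptive-field estimate per output layer, following the "3x3 conv just
# before the prediction layer, so ~3 grid cells" rule of thumb discussed above.
# It also shows why the same pixel receptive field covers a much smaller fraction
# of a 1536x1536 input than of a 512x512 one.

strides = {'P3': 8, 'P4': 16, 'P5': 32, 'P6': 64, 'P7': 128}

for name, stride in strides.items():
    rf = 3 * stride  # ~3 grid cells of context feed each prediction
    for img_size in (512, 1536):
        frac = rf / img_size  # fraction of the input covered by that receptive field
        print(f'{name}: stride {stride:>3}, ~{rf:>3}px receptive field '
              f'= {frac:.2%} of a {img_size}x{img_size} input')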
From the JDE paper, they provide results for different embedding losses. Also, there are some issues to be solved: for example, it only supports single-class tracking, and different anchors in the same scale share the same embedded feature. The code is mainly based on ultralytics, so I think it can be a starting point for developing a triplet-loss based tracker. https://github.com/Zhongdao/Towards-Realtime-MOT |
@WongKinYiu so you simply take a 3x3 grid as the receptive field. Ok. Do you think it might be beneficial to have the anchors fixed in units of grid space instead of image space? Maybe this is what EfficientDet is doing with their (1, 1), (1.4, 0.7), (0.7, 1.4) anchor multiples (I don't know what they do with these multiples). Right now the anchors are fixed/defined in image space (pixels) rather than grid space, so the same anchor would take up a varying number of gridpoints depending on the output layer (if it were applied to different layers). What do you think of the idea of defining the anchors as (1, 1), (1.4, 0.7), (0.7, 1.4) local gridpoints, and then maybe testing out, say, a 2x and 3x multiple of that? In other news, I implemented my yolov3-spp-p6 and I'm training it now. I trimmed some of the convolutions to keep the size manageable; it's 81M params now and training about 25% slower than normal. Early mAP was lower, but it seems to be crossing yolov3-spp and going higher at around 50 epochs. I'll keep my fingers crossed. |
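As a sketch of the grid-space idea (the (1,1)/(1.4,0.7)/(0.7,1.4) multiples and the 2x/3x scaling come from the comment above; the conversion is only an illustration, not how EfficientDet or darknet actually generate anchors):

# Illustrative conversion of grid-space anchor shapes to pixel anchors per level.
# Each (w, h) multiple is expressed in local grid cells, so the pixel size scales
# with the stride of the output layer it is attached to.

grid_space_anchors = [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]  # aspect shapes in grid cells
strides = {'P3': 8, 'P4': 16, 'P5': 32, 'P6': 64}
scale_multiples = (1, 2, 3)  # the 1x/2x/3x multiples proposed above

for level, stride in strides.items():
    for m in scale_multiples:
        pixel_anchors = [(round(w * m * stride), round(h * m * stride))
                         for w, h in grid_space_anchors]
        print(level, f'{m}x', pixel_anchors)

The point is that the same three shapes then automatically grow with the stride of each output layer, instead of being fixed in pixels.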
@WongKinYiu ah yes that makes sense! Also (1.4, 0.7) IOU is about 0.55, close to 0.5. From your earlier plots though it looks like the current anchors correspond much better to about 3x3 gridpoints than 1x1 gridpoints. At 512x512, the P3 grid is 64x64, P4 is 32x32, P5 is 16x16, and P6 is 8x8. If we had a P7 that would be 4x4, and 3 gridpoints at that scale would take up almost the entire image (which sounds about right). At the smaller scale though the P3 stride is 8, and we currently have anchors about that size (smallest is 10x13 pixels). I'm worried my existing anchors are causing tension in the P6 model, as the GIoU loss is higher than normal. I simply spread out the 12 anchors I was using for yolov4.cfg (which has P3-P5, 4 at each level) to yolov4-p6 (which has P3-P6, 3 anchors at each level). |
@AlexeyAB @WongKinYiu ok, my P6 experiment was tracking worse than yolov3-spp after about 150 epochs, so I cancelled it. I'm not sure why exactly. If I look at the yolov3-spp receptive field at P5, stride 32, the largest anchor is (373,326), or about 10 grid cells, which would be 3x the receptive field according to @WongKinYiu. P6 has stride 64, so only 1.5x the receptive field for the largest anchor, yet overall mAP is worse. I did trim some convolution operations to keep the parameter count reasonable, so this could be the cause. Back to the drawing board, I guess. @WongKinYiu how did your P6 experiment go? |
Also you should take into account that
You can see the receptive field in Darknet by using:
|
Could you provide your cfg file and training command? By the way, do you train/val on coco2014 or on coco2017? |
@WongKinYiu yes here is the p6 cfg with 12 anchors, and a modified version of yolov3-spp called yolov4 that has the same 12 anchors, which trains to slightly above yolov3-spp (+0.1mAP). I had to add a lot of convolutions to p6, so it has 81M params. I doubled the width of the stem convolutions (which use few params), but reduced the width of the largest head convolutions (i.e. 1024 -> 640 channels). Overall the result was slightly negative though, so you may want to adjust the cfg.
|
@glenn-jocher Try to train and test this model with network resolution 832x832 (with random shapes). |
@AlexeyAB yes maybe I should put the SPP block back in on the P6 layer, and return the dn53 stem convolutions to their original sizes. When I changed dn53 I saw that there were 8, 8 and 4 blocks in the last 3 downsamples. For p6 I changed this to 8, 8, 8 and 8 (no spp). Maybe I should update to 8, 8, 8, 4+spp, which would more closely mimic yolov3-spp. |
Started training yolov3-spp and yolov3-spp-p6. |
Is this just a regular training |
Yes, currently I just use |
Something like that is integrated with CVAT. It seems to only support tensorflow models and is oriented towards labelling video frames (interpolation); I haven't actually tried it myself yet. I want to do something along these lines soon but haven't quite decided what approach to take. I would prefer to maybe script the interpolation myself so that I'm not restricted to tensorflow and I have more flexibility over the process. I do like CVAT though, because it has a REST API, and the fact that it runs in a web browser makes it easy to manage centrally and distribute batch jobs amongst many people. |
Have you ever tried direct GPU rig renting like on vast.ai ? https://vast.ai/console/create/ |
Is it something like
Is it an aggregator of different clouds: Vast, GCP, AWS, Paperspace, ...?
It's individual people with big rigs of GPUs (probably mostly people who used to use them for mining cryptocurrencies, which is now not profitable on GPUs). The site vets them and requires certain standards for hardware, uptime, internet speed, etc., and charges a commission for connecting them to buyers. I'm not sure exactly how the cvat interpolation mode works; it requires a tensorflow model for detection. I don't know if it also tries to leverage optical flow or some other tracking method. Maybe the best automated video labeler would be a dedicated, integrated video object-detector / tracker? I know some video object detectors exist that try to leverage recent frames to refine predictions for the current frame, but maybe better results could be had by a CNN that also used future frames? So you manually label every 10th frame, train a model and try to interpolate the rest? |
ok thanks, good to know. |
How can I use hard-example mining? So it returns a list of images with only low-confidence detections? |
Un-comment these lines: https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1125-L1132 If someone needs it, I can add a flag for enabling this feature, without changing the source code. |
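A rough post-processing sketch of the same hard-example idea done outside darknet (the JSON layout, field names, and threshold are assumptions, not what detector.c actually writes):

import json

def find_hard_examples(results_json, conf_thresh=0.25):
    """Hypothetical hard-example mining pass: given per-image detections
    (assumed here as {image_path: [{'conf': float, ...}, ...]}), return the
    images where no detection exceeds conf_thresh, i.e. likely blind spots."""
    with open(results_json) as f:
        detections = json.load(f)

    hard = []
    for image_path, dets in detections.items():
        best = max((d['conf'] for d in dets), default=0.0)
        if best < conf_thresh:
            hard.append((best, image_path))

    # weakest images first, so they can be reviewed or re-labelled first
    return [p for _, p in sorted(hard)]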
Ah, so you are saying that refining coco labels and then getting a mAP improvement is not a publishable result, even though you start from the same dataset and don't add any images to it? I'd be a little surprised if that were the case, as we currently augment the images themselves, which also modifies the original dataset. In any case, if it provided repeatably improved results on custom datasets, it would add value to the product your company might provide in an automl-type solution. I suppose the best implementation would be to allow this functionality with a simple argument during training, i.e. @AlexeyAB the label refinement you have creates new labels, but are these fed back into the training set during training, or is that a manual step one would do? |
@LukeAI about video labelling, I've never heard of cvat, but I've done a lot of object tracking, both visual and radar, with Kalman filters, and yes, if you can detect an object in one video frame, you can track it using a KLT tracker easily for a few more frames before drift becomes an issue. So in this sense you could use yolov3/4 to detect an object in a video frame, then a KLT tracker to update your boxes in the remaining 29 frames, and repeat. Google has a demo app that's supposed to do this called ODT, which is pretty bad in real life, and Swift also has track-object requests (you supply the initial box): I haven't used vast.ai, but yes, the prices are much better! |
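A minimal sketch of the detect-then-track idea using sparse Lucas-Kanade optical flow in OpenCV (the box format and the median-shift update are assumptions; a real labeler would also re-detect every N frames and drop drifting tracks):

import cv2
import numpy as np

def propagate_box(prev_gray, next_gray, box):
    """Hypothetical detect-then-track step: given a box (x1, y1, x2, y2) detected
    in the previous frame, shift it into the next frame using sparse LK optical
    flow on corner features found inside the box."""
    x1, y1, x2, y2 = map(int, box)
    roi = prev_gray[y1:y2, x1:x2]
    pts = cv2.goodFeaturesToTrack(roi, maxCorners=50, qualityLevel=0.01, minDistance=3)
    if pts is None:
        return box  # nothing trackable inside the box; keep it where it was
    pts = pts + np.array([[x1, y1]], dtype=np.float32)  # ROI coords -> frame coords

    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good_old = pts[status.flatten() == 1]
    good_new = nxt[status.flatten() == 1]
    if len(good_new) == 0:
        return box

    dx, dy = np.median(good_new - good_old, axis=0).flatten()[:2]
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)  # translated box for the next frame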
@AlexeyAB @WongKinYiu also guys, are you sure that using refined labels for training is not fair? The noisy student paper uses 300M unlabelled images (!), so they are going out and finding new images no one else has, which is a step beyond what I was proposing, which uses the same images and no more. I would tend to view it as fair, because you are starting with the same dataset and labels as everyone else, and not adding new images. I suppose the rules of the game depend on the game you are playing... |
@WongKinYiu ah, so you are saying that refining the labels would count as 'extra training data', and then be classed together with methods which use unlabelled data. In this case then, yes, you are competing with a much wider range of possible competitors. Perhaps an alternate method then, leaving the labels alone, would be to update the objectness loss to account for the unequal probability of human-labelled FP (false positive) and FN (false negative) mistakes. I'd suspect a human labeller would cause many more FNs than FPs, so we may want to apply an a-priori distribution to the obj loss to reflect this. In practice this might be accomplished by reducing the FP losses, i.e. I think multiplying them by a gain of 0.9, for example, would imply that human labelers are only labelling 90% of the objects correctly. Does this make sense? It might be a more universal way to account for human labeling errors without having to check the 'extra training data' box. EDIT: I realized I've just described class label smoothing. The only differences are that I proposed it for objectness and I proposed a positive-negative imbalance. Unfortunately I tried smoothing objectness before with poor results, and came to the conclusion that it is best applied only to classification loss, if at all. |
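A toy sketch of the proposed gain (the 0.9 value comes from the comment above; this is not the actual ultralytics objectness loss, just an illustration of down-weighting the background term):

import torch
import torch.nn as nn

def objectness_loss(pred_obj, target_obj, fp_gain=0.9):
    """Illustrative objectness BCE where the 'no object' positions (which may
    contain unlabelled real objects, i.e. human FNs) are down-weighted by
    fp_gain, so confident detections on 'background' are punished slightly less.
    target_obj is assumed to be a 0/1 float tensor the same shape as pred_obj."""
    bce = nn.BCEWithLogitsLoss(reduction='none')
    loss = bce(pred_obj, target_obj)
    weight = torch.where(target_obj > 0,
                         torch.ones_like(loss),
                         fp_gain * torch.ones_like(loss))
    return (weight * loss).mean()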
No.
If we try to refine the labels of MSCOCO, then the person-labels will be improved, but the labels of all other 79 classes will be degraded. Our goal is to find or create the best model rather than to win some challenge by using tricks, so we can use this model for a real product, e.g. auto-labeling as you suggested.
This is a manual step currently. |
So you would run the detector like
And the purpose would be to understand blind spots in your detector, so you know what areas of improvement are needed? If there were a CLI flag for that, I would use it. I do something similar with a python script already, so it wouldn't be a priority for me, but my approach feels a bit hacky. |
Yes, set |
@WongKinYiu I've been studying cd53s-yo.cfg. The new bottleneck strategy used there seems to be a good improvement over darknet53. Compared to yolov3-spp, cd53s-yolov3 has more layers and convolutions, but more importantly a 20-25% reduction in parameters and FLOPS, 10% reduction in inference time, and roughly similar mAP (or potentially slightly better). If I'm understanding correctly the primary difference is that each bottleneck has a 1.0 expansion factor between the first and second conv in cd53s instead of a 0.5 factor, but also that there are leading and trailing convolutions (into the series of bottlenecks and exiting it) that reduce the channel count by half going in, and then double it back to normal going out (via concat with an additional residual). So taken as a unit, one of these cd53s series of bottlenecks can be a drop-in replacement for the more traditional darknet 53 series of bottlenecks. This is all very good! This naturally leads me to wonder though if the expansion back to the original channel count is still necessary. For example, the first series of bottlenecks is a series of 2 bottlenecks with 128ch going in, then 64x64 convolutions for two bottlenecks, then additional convs to bring the channel count back to 128, before it gets passed to the same 3x3 stride 2 convolution to downsample it. So my main question is, if we experimented with simply using 64ch throughout this series, and doing the same with the other bottlenecks, do you think the performance would be reduced significantly? My second question is did you experiment with trying to reinvest the FLOPS/parameter 'savings' back into the network? For example perhaps you could increase the depth or width of the cd53s backbone by adding a few convolutions or increasing their channel count to bring it back up to 60M parameters, and thus capture a better performance increase compared to yolov3-spp? |
I've written a pytorch module that should be able to reproduce the bottleneck series in cd53s-yo.cfg:

class BottleneckSeriesCSP(nn.Module):
    def __init__(self, c1, c2, n=2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(BottleneckSeriesCSP, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 1, 1)
        self.cv4 = Conv(c2, c2, 1, 1)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])

    def forward(self, x):
        y1 = self.cv2(x)
        y2 = self.cv3(self.m(self.cv1(x)))
        return self.cv4(torch.cat((y1, y2), dim=1))

This uses instances of the Bottleneck class:

class Bottleneck(nn.Module):
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

Which in turn uses the |
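For completeness, a minimal stand-in for the Conv block referenced above, assuming the usual ultralytics-style Conv2d + BatchNorm + activation wrapper (the real repo's activation and argument order may differ), plus a quick shape check that relies on the BottleneckSeriesCSP and Bottleneck classes defined above:

import torch
import torch.nn as nn

class Conv(nn.Module):
    # Assumed Conv block: Conv2d + BatchNorm2d + LeakyReLU, matching the
    # (ch_in, ch_out, kernel, stride, groups) call pattern used above.
    def __init__(self, c1, c2, k=1, s=1, g=1):
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, padding=k // 2, groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Quick sanity check: the CSP series should preserve spatial size and output c2 channels.
if __name__ == '__main__':
    m = BottleneckSeriesCSP(128, 128, n=2)
    x = torch.zeros(1, 128, 64, 64)
    print(m(x).shape)  # expected: torch.Size([1, 128, 64, 64])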
Yes, you are right. Darknet stage:
CSPDarknet stage:
My English is not good enough to understand long paragraphs in real time; I will give you feedback as soon as possible. #4346 (comment) #4346 (comment) |
640x640 5k-set: so the comparison would be as follows:
CD53s-YOLOv3 gets comparable AP to D53-YOLOv3, but it is lighter and faster. |
@WongKinYiu awesome! Ok, I'm sold on the changes :). I will need to study the yolov4 head; it looks like it had quite a big impact on the results. Do you have a simple block diagram of what the new head does? I'm training some new models with BottleneckSeriesCSP() modules. Do you use these new bottlenecks in the backbone only, or do you also use them in the head? |
@WongKinYiu about these new results, do you want to submit a PR for https://github.com/ultralytics/yolov3/blob/master/README.md? Else I could add these new results myself if you send me your training command. I will also upload the corresponding *.cfg files so they are all available. |
@glenn-jocher Hello, I use
Here are all of the cfg/weights; you could add them to your repo. Currently I only use BottleneckSeriesCSP in the backbone and will design a new head soon. I am on a business trip; I will give you feedback about #4346 (comment) #4346 (comment) #4346 (comment) soon. |
I find there is a bug when testing with 608x608. |
@WongKinYiu ah yes, this is an interesting point. test.py with rectangular inference will round to the nearest 64 size now, so you are correct that if you pass --img 608 it will actually run inference at 640. Do you get the same test results at --img 640 and --img 608? EDIT: The purpose here was to prepare testing for a P6 model, which is now never used. A more robust approach would be to save the model strides as an attribute and then set the rounding to the max stride. EDIT2: In the new repo I have code which handles this for train.py, but not in test.py:

# Image sizes
gs = int(max(model.stride))  # grid size (max stride)
if any(x % gs != 0 for x in opt.img_size):
    print('WARNING: --img-size %g,%g must be multiple of %s max stride %g' % (*opt.img_size, opt.cfg, gs))
imgsz, imgsz_test = [make_divisible(x, gs) for x in opt.img_size]  # image sizes (train, test)

EDIT3: I suppose a similar error check should be run on all 3 main files (train, test, detect.py) |
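Here make_divisible is assumed to be the usual rounding helper; a minimal version consistent with the 608-to-640 behaviour described above would be:

import math

def make_divisible(x, divisor):
    # Assumed behaviour: round x up to the nearest multiple of divisor,
    # so e.g. make_divisible(608, 64) -> 640 and make_divisible(640, 64) -> 640.
    return math.ceil(x / divisor) * divisor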
hmm... test results at |
5k-set:

python test.py --cfg yolov3-spp.cfg --weights best_yolov3-spp.pt --img 608 --iou 0.7
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.436

python test.py --cfg cd53s-yo.cfg --weights best_cd53s-yo.pt --img 608 --iou 0.7
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.437

python test.py --cfg cd53s-yo-csptb.cfg --weights best_cd53s-yo-csptb.pt --img 608 --iou 0.7
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.439

python test.py --cfg cd53s-yo-mish.cfg --weights best_cd53s-yo-mish.pt --img 608 --iou 0.7
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.443

python test.py --cfg cd53s.cfg --weights best_cd53s.pt --img 608 --iou 0.7
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.445

python test.py --cfg cd53s-mish.cfg --weights best_cd53s-mish.pt --img 608 --iou 0.7
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.450

python test.py --cfg cd53s-cspt.cfg --weights best_cd53s-cspt.pt --img 608 --iou 0.7
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.451

python test.py --cfg cd53s-csptb.cfg --weights best_cd53s-csptb.pt --img 608 --iou 0.7
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.450 |
Do you get a better result after replacing
And I think your multi-scale training has a bug: images will be resized to 640x640 no matter what the input size is. (Or maybe it works with the new scale hyper-parameter?)

if True:
    imgsz = random.randrange(640, 640 + gs) // gs * gs
    sf = imgsz / max(imgs.shape[2:])  # scale factor
    if sf != 1:
        ns = [math.ceil(x * sf / gs) * gs for x in imgs.shape[2:]]  # new shape (stretched to gs-multiple)
        imgs = F.interpolate(imgs, size=ns, mode='bilinear', align_corners=False)

and |