Question: simultaneously train yolo for object detection and tracking using auxiliary layers and cosine similarity #6004
This is in progress: Conv-LSTM + Contrastive loss (embeddings for cosine similarity) + YOLO
So you are not crazy, this is the most obvious way to do Detection & Tracking on Video :) Experimental YOLOv4-tiny-contrastive model that is trained on MSCOCO:
Do you need the LSTM bit? Are you trying to do re-identification inside the network? I was thinking of doing re-identification as post-processing, using the feature set of each detected object with those of the previous frame. Essentially doing DeepSORT, but recycling the features calculated by the detection network and letting the Hungarian algorithm do the re-identification by maximising similarity.
I was thinking the features could be added to the list of YOLO features. So instead of having features
Similarly to normal YOLO logic, the cell responsible for the detection is also responsible for having the correct set of features, somehow...
Actually, that might only make sense for training end-to-end detection + tracking. I was hoping to just fork off some features and use them as input to another 'tracking' network with a very minimal set of layers, which wouldn't require retraining the detection network.
Use the same features to track that were used to detect. This makes so much sense. Very excited that this is being worked on. Alexey, thanks for your excellent work and dedication to this technology.
I personally think that a detection network tries to generalize the characteristics of a class, since all people have to have similar features. On the other hand, tracking requires contrastiveness: a person's features need to be as far as possible from everyone else's. Training a network to serve those conflicting targets might be unrealistic. That's why we haven't found such a network.
Yeah. I think that's why you might need to attach an auxiliary network that uses some of the features output by the early inner layers of the detection network, trained separately using similarity loss or contrastive loss. Basically you end up with DeepSORT, but instead of using a completely separate VGG16 network, you're using X% of the detection network and (100-X)% of a new auxiliary network. It feels slightly more efficient and faster to recycle features. I like DeepSORT because all the re-id is undertaken by the Hungarian algorithm and all you need is a good metric, and in this case, some good tracking features to evaluate the metric on. The only problem with DeepSORT is that it is super slow. So recycling features from a detection network seems like a good way to solve that. But it would be interesting to see if @AlexeyAB's idea of a fully blown LSTM is better and more accurate.
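To make the recycling idea concrete, here is a minimal C sketch of that frame-to-frame matching step. It is illustrative only: greedy nearest-neighbour assignment stands in for the Hungarian algorithm to keep it short, and sim[i][j] is assumed to hold the cosine similarity between current detection i and previous-frame track j, computed from embeddings pulled out of the detection network.

```c
#define MAX_DETS 64   /* illustrative cap on detections per frame */

/* assign[i] = index of the matched previous-frame track, or -1 to start a
 * new track. A match must beat sim_thresh; each track is used at most once. */
void greedy_match(const float sim[MAX_DETS][MAX_DETS], int n_cur, int n_prev,
                  float sim_thresh, int assign[MAX_DETS])
{
    int used[MAX_DETS] = {0};
    for (int i = 0; i < n_cur; ++i) {
        int best = -1;
        float best_sim = sim_thresh;          /* a match must beat the threshold */
        for (int j = 0; j < n_prev; ++j) {
            if (!used[j] && sim[i][j] > best_sim) {
                best_sim = sim[i][j];
                best = j;
            }
        }
        assign[i] = best;
        if (best >= 0) used[best] = 1;        /* each track matched at most once */
    }
}
```

A full DeepSORT-style matcher would replace the greedy loop with the Hungarian algorithm over the same similarity matrix.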
Yes, these 2 tasks (detection and re-identification) partially contradict each other. Therefore, what is expected is a slight decrease in accuracy on images, but a large increase in accuracy on video. But perhaps the task of re-identification will make the network remember more details of objects, which theoretically could improve detection even on images for large networks (networks with high capacity).
You make a good point. Maybe re-id using LSTM will make the detection network pay special attention to object features. If you're right, then maybe you need to write a paper with the title "LSTM is all the attention you need" :)
This is done; we just need to improve it and test it.
Awesome! Have you posted some weights?
Not yet.
There is a proof-of-concept cfg-file you can try to train. Train as usual, without pre-trained weights or with https://drive.google.com/file/d/18v36esoXCh-PsOKwyP2GWrpYDptDY8Zf/view?usp=sharing This model will count your objects when you run it on video. You can play with the parameters for detection (after training):
Thanks @AlexeyAB
I re-uploaded the contrastive detection model yolov3-tiny_contrastive.cfg.txt with [contrastive] cls_normalizer=1.0 - use this one.
Currently a very simple and inaccurate model is used. Simple test on a small dataset:
This is good news.
What could be the problem?
Warning: in txt-labels class_id=-2147483648 >= classes=1 in cfg-file. In txt-labels class_id should be [from 0 to 0]
truth.x = 0.000000, truth.y = -nan, truth.w = 0.000000, truth.h = 0.000000, class_id = -2147483648
Wrong label: truth.x = -0.000000, truth.y = 0.000000, truth.w = -0.000000, truth.h = 0.000000
Standard labeling, 1 class (0): 0 0.241 0.854 0.149 0.285
With other cfg-files there is no problem.
Look at the bad.list and bad_label.list files. Try to download the latest Darknet version and recompile; I added a minor fix. What dataset do you use, and how many images?
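If you want to scan your txt-labels yourself, a minimal standalone checker might look like the sketch below (illustrative only, not part of darknet; check_label_file is a made-up name). It flags exactly the corruption in the warning above: a class_id outside [0, classes-1] or box fields outside (0, 1].

```c
#include <stdio.h>

/* Returns the number of bad lines found in one YOLO txt-label file,
 * or -1 if the file cannot be opened. Each line is "class x y w h". */
int check_label_file(const char *path, int classes)
{
    FILE *f = fopen(path, "r");
    if (!f) { fprintf(stderr, "cannot open %s\n", path); return -1; }

    int id, bad = 0;
    float x, y, w, h;
    while (fscanf(f, "%d %f %f %f %f", &id, &x, &y, &w, &h) == 5) {
        if (id < 0 || id >= classes) {
            printf("%s: class_id=%d outside [0, %d]\n", path, id, classes - 1);
            bad++;
        }
        if (x <= 0.f || x > 1.f || y <= 0.f || y > 1.f ||
            w <= 0.f || w > 1.f || h <= 0.f || h > 1.f) {
            printf("%s: bad box %f %f %f %f\n", path, x, y, w, h);
            bad++;
        }
    }
    fclose(f);
    return bad;
}
```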
Thanks. Yes, that helped. I need to find and track cars. I use COCO-17 + my own data (~15000 photos).
Yes, it will work. But it will work much better when the yolov4-tiny-contrastive.cfg and full yolov4-contrastive.cfg models are implemented.
Are you training these models on MOT datasets? I'm not quite sure how the contrastive loss works on COCO if no two objects are related. Doesn't that mean you only ever have negative samples? Do you need tracked objects to get positive samples? This is probably a stupid question.
I see, so the key here is augmentation.
If the contrastive loss depends highly on augmentation, should there be other augmentation transformations like stretching, warping, etc.?
What is warping? It will use:
It depends on how much the same object can differ in two frames:
Also, it depends on the model:
So we should use strong data augmentation only for big
Hello @MsWik ... Did you get contrastive to work? Can you share your test results? I'm also wondering how to add a contrastive layer to tkDNN / cuDNN.
Experimental YOLOv4-tiny-contrastive model that is trained on MSCOCO:
That's really cool. Great FPS and solid tracking at this stage. Frozen off-screen detections were picked up on re-entry. COCO will, as always, tell me a scooter is a flower pot, but that's not the MOT's fault. This is awesome, thanks for sharing!
I added some hyper-parameters:
This is just a proof of concept, so tracking is very poor.
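The list of hyper-parameters above was lost in extraction. As a loudly hypothetical sketch of what a [yolo] head with tracking knobs might look like, based only on the parameters named later in this thread (the layer used for the embeddings, the number of detections for tracking, the track history size): the exact names and default values below are assumptions, so check the posted cfg-files.

```
[yolo]
# ... usual detection parameters ...
# Assumed names, for illustration only -- verify against the posted cfg:
embedding_layer=-3      # which layer's features feed the embedding
sim_thresh=0.8          # minimum cosine similarity to continue a track
track_history_size=5    # how many past frames a track remembers
dets_for_track=1        # detections required before an object is tracked
```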
Hi @AlexeyAB! If we have video at 1-5 FPS for tracking fast-moving cars/birds, or we want to track one object across two cameras, then the same car can look very different in two frames/cameras, so we need to use strong data augmentation during training: https://caffe.berkeleyvision.org/tutorial/layers/contrastiveloss.html
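For reference, the loss on that Caffe page is the margin-based contrastive loss (Hadsell et al.):

$$L = \frac{1}{2N}\sum_{n=1}^{N}\left[\, y_n\, d_n^2 + (1-y_n)\,\max(\mathrm{margin}-d_n,\,0)^2 \right], \qquad d_n = \lVert a_n - b_n \rVert_2$$

where $y_n = 1$ for a positive pair (two views of the same object) and $y_n = 0$ for a negative pair; strong augmentation is what makes the positive pairs hard enough to learn from.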
Hi, @AlexeyAB
@AlexeyAB Hi, can we use the tracking in OpenCV? Thanks
@zpmmehrdad Yes you can, it's called Optiflow and it works quite well.
Do you have a pre-trained model for yolov4-tiny-contrastive?
@Goru1890 The model and .cfg are given in this same issue. Check the last few comments.
For anyone looking to train the contrastive model and avoid a segfault: I've been able to train for 7k+ iterations with no issues after initializing the variables in the for loops in softmax_layer.c starting on line 502. The "#pragma omp parallel for" caused the variables to go out of bounds, making the z_index larger than z. Specifically, I defined int b3, n3, h3, and w3 in their respective for statements, then changed the corresponding variables in the loop. (Lines 502 to 511 in e2a1287.)
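An illustrative C sketch of that fix (not the exact darknet source; the function and buffer names here are made up): under "#pragma omp parallel for", the outermost loop counter is privatized automatically, but inner counters declared outside their for statements stay shared across threads, so they race and the computed z_index can run past the end of z. Declaring each counter inside its own for statement makes it thread-private.

```c
#include <omp.h>

/* Sketch of the described fix; zero_buffer_parallel and its arguments are
 * illustrative, not darknet's actual softmax_layer.c code. */
void zero_buffer_parallel(float *z, int batch, int n, int h, int w)
{
    /* Broken variant: "int b3, n3, h3, w3;" declared here would leave the
     * inner counters shared by all threads, so n3/h3/w3 race and z_index
     * goes out of bounds. */
    #pragma omp parallel for
    for (int b3 = 0; b3 < batch; ++b3)           /* b3: private per thread */
        for (int n3 = 0; n3 < n; ++n3)           /* n3, h3, w3 now private */
            for (int h3 = 0; h3 < h; ++h3)       /* because each is declared */
                for (int w3 = 0; w3 < w; ++w3) { /* inside its own for */
                    int z_index = ((b3 * n + n3) * h + h3) * w + w3;
                    z[z_index] = 0.f;            /* index stays in bounds */
                }
}
```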
@AlexeyAB could you point me in the right direction for implementing the contrastive layer for more than one yolo head?
I'm trying to train "yolov4-tiny_contrastive.cfg" on my own dataset and can't get past the first iteration (it just stops, without any specific error). Can someone help me please? I changed the number of classes to 4 and calculated the number of filters in the layer above as (4+5)×9 = 81. Do I need to change something else?
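For reference, the rule applied in that calculation: the [convolutional] layer immediately before a [yolo] head needs filters = (classes + 5) × (number of anchors in that head's mask). A sketch with the values from the question, in standard darknet cfg syntax (surrounding layers omitted):

```
[convolutional]
size=1
stride=1
pad=1
filters=81              # (4 + 5) * 9 anchors = 81
activation=linear

[yolo]
mask = 0,1,2,3,4,5,6,7,8
classes=4
# anchors = ... (9 anchor pairs)
```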
https://yadi.sk/i/WpZQhZYtUd7A4g
https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu
https://disk.yandex.by/i/kWlOMt9r8xEYMA
I can confirm that this solves the segfault problem, but I couldn't reconstruct your solution from the description. I removed the omp pragma completely for now and it works without problems. Could you please share the code, or even create a PR for that?
Below is a screenshot of the changes I made to fix the segfault. If you have any other issues let me know!
@AdamCuellar So if it solves the issue (training goes well, and the trained model detects correctly), you can create a pull request for these changes.
@AdamCuellar - have you had a chance to see how well the tracking performs?
@AlexeyAB I haven't had a chance to train it on a good dataset. Should I train on COCO to verify it works well enough? If so, I'll train and report the results + make a pull request. @rbgreenway I trained on a small private dataset and the tracking performed well, but due to the limited data the training/testing sets were similar, so I don't want to make any definitive statements. If you, or anyone, know of a good dataset to test this on, I can train and post results.
I added a fix: f5007cd. I think we should define
Yes, I agree, that's much better. I've started training on COCO. I will also start training later today on the small driving dataset you've used for the YOLO w/ LSTM, so I can report back with some results.
@AdamCuellar Thanks, Adam. I'm very interested in how the tracking performs. I won't have time for a couple weeks, but I'll certainly train it on a dataset that I have access to...particularly if you report some good results. Thanks again.
@AlexeyAB Edit:
@AlexeyAB Looks like the changes you made work. For the small driving dataset you've used before, I trained for 20k iterations with no issues. Here is the loss chart: [loss chart image] On COCO, it's gotten to 17k+ iterations with no issues. It will probably take a really long time, though, as I only had 1 GPU available for training.
@AlexeyAB @AdamCuellar Can we use this same concept for different models such as yolov4x-mish, yolov4-csp, and swish? We could train an object detection model, and if we want to do object tracking we could simply change parameters inside the cfg - we could add both parameters, like: (this is just my thought process; it would be easier to use.) I saw there are only two contrastive cfg-files available (yolov3 and yolov4) for object tracking; I'm looking for a mish/csp/swish model.
@akashAD98 You can technically use the contrastive layer for any model, you just need to change the model to have 1 yolo head and use the features prior to the yolo head as input to the contrastive layer. If you read through this thread, you'll get a better understanding of how to do this.
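As a rough illustration of that layout (a sketch only: cls_normalizer is quoted earlier in this thread, but the other [contrastive] parameters, the layer ordering, and the values here are assumptions - treat the posted cfg-files as ground truth):

```
[convolutional]         # last conv before the head; its features feed both
filters=81              # the detector and the contrastive embedding
activation=linear

[contrastive]           # contrastive/embedding loss on the pre-head features
classes=4
cls_normalizer=1.0

[yolo]                  # the single detection head
mask = 0,1,2,3,4,5,6,7,8
classes=4
```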
I can see how this may seem like it would be easier to use; however, I think it would make things more obscure. An object_detection parameter is a bit redundant, since the YOLO networks were designed for object detection. An object_tracking parameter wouldn't provide enough information for tracking, such as which layer is used for the contrastive loss, the number of detections for tracking, the track history size, etc.
@AdamCuellar For training a custom model, should we use this command, along with the few changes in the cfg explained by AlexeyAB? And will it apply the tracker to all classes present in the dataset?
Can you fork some of the intermediate outputs of the backbone network into a few auxiliary layers which can be trained, using a cosine similarity loss function (or something similar), to output features suitable for tracking objects? Maybe this could be a two-stage training process, or it could be done end to end.
I've been working on some object tracking using detections. Kalman filter solutions don't always work, and it's the same story with optical flow methods. The best I've seen so far is to use cross-correlation techniques (like the correlation tracker in the dlib library), but they are quite slow and scale with the number of detections. Methods like DeepSORT use a separate network to extract features which can be used to compare objects. But the YOLO network is already calculating features. Surely you only need to fork off some of those features, maybe pass them through a few additional conv layers, into a cosine similarity loss function, and voila.
Am I crazy or does this sound sensible?
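For concreteness, a minimal C sketch of the kind of loss being proposed (illustrative only; the names are made up, and this is not darknet's actual [contrastive] implementation): a cosine-embedding loss that pulls feature vectors of the same object toward similarity 1 and pushes different objects below a margin.

```c
#include <math.h>

/* Illustrative cosine-embedding loss over one pair of feature vectors
 * forked from the detector. same_object = 1 for a positive pair. */
float cosine_embedding_loss(const float *a, const float *b, int dim,
                            int same_object, float margin)
{
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (int i = 0; i < dim; ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    float cos_sim = dot / (sqrtf(na) * sqrtf(nb) + 1e-12f);

    if (same_object)
        return 1.f - cos_sim;                 /* pull positives together */
    float viol = cos_sim - margin;            /* push negatives apart */
    return viol > 0.f ? viol : 0.f;
}
```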