
Question: simultaneously train yolo for object detection and tracking using auxiliary layers and cosine similarity #6004

Open
pfeatherstone opened this issue Jun 18, 2020 · 75 comments
Labels
ToDo RoadMap

Comments

@pfeatherstone

Can you fork some of the intermediate outputs of the backbone network into a few auxiliary layers that can be trained, using a cosine similarity loss function (or something similar), to output features suitable for tracking objects? Maybe this could be a two-stage training process, or it could be done end to end.

I've been working on object tracking using detections. Kalman filter solutions don't always work, and it's the same story with optical flow methods. The best I've seen so far is to use cross-correlation techniques (like the correlation tracker in the dlib library), but they are quite slow and scale with the number of detections. Methods like DeepSORT use a separate network to extract features that can be used to compare objects. But the YOLO network is already calculating features. Surely you only need to fork off some of those features, maybe pass them through a few additional conv layers into a cosine similarity loss function, and voilà.
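For illustration, here is a minimal PyTorch sketch of that idea. The layer sizes, the ROI-align pooling step, and all names are assumptions for illustration only, not anything from darknet: a small auxiliary head reads an intermediate backbone feature map, pools a region per detection, and emits an L2-normalized embedding that can be compared with cosine similarity.

# Hypothetical sketch of an auxiliary embedding head on top of backbone features.
# Channel counts, pooling size and the ROI-align step are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class EmbeddingHead(nn.Module):
    def __init__(self, in_channels=256, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, embed_dim, 1),
        )

    def forward(self, feature_map, boxes):
        # feature_map: (N, C, H, W) intermediate backbone output
        # boxes: list of (K_i, 4) tensors in feature-map coordinates
        crops = roi_align(feature_map, boxes, output_size=(7, 7))
        emb = self.conv(crops).mean(dim=(2, 3))   # global average pool -> (K, embed_dim)
        return F.normalize(emb, dim=1)            # unit-norm, ready for cosine similarity

Embeddings of the same object in consecutive frames should then have high cosine similarity, and embeddings of different objects low similarity.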

Am I crazy or does this sound sensible?

@AlexeyAB
Owner

AlexeyAB commented Jun 18, 2020

This is in progress: Conv-LSTM + Contrastive loss (Embeddings for cosine similarity) + Yolo

Am I crazy or does this sound sensible?

So you are not crazy, this is the most obvious way to do Detection & Tracking on Video )


Experimental YOLOv4-tiny-contrastive model that is trained on MSCOCO:

./darknet detector demo cfg/coco.data cfg/yolov4-tiny_contrastive.cfg yolov4-tiny_contrastive_last.weights test.avi -out_filename out_test.avi -ext_output

@pfeatherstone
Author

Do you need the LSTM bit? Are you trying to do re-identification inside the network? I was thinking of doing re-identification as a post-processing step, comparing the feature set of each detected object with those of the previous frame. Essentially doing DeepSORT, but recycling the features calculated by the detection network and letting the Hungarian algorithm do the re-identification by maximising similarity.

@pfeatherstone
Author

I was thinking the features could be added to the list of yolo outputs. So instead of having features
[tx,ty,tw,th,p,c0,...,c79], have [tx,ty,tw,th,p,c0,...,c79,f0,...,fN-1] at each yolo layer. The features aren't taken from the last conv layer the yolo layer saw; they are taken from earlier on.

@pfeatherstone
Author

Similarly to normal yolo logic, the cell responsible for the detection is also responsible for having the correct set of features, somehow...

@pfeatherstone
Author

Actually, that might only make sense for training end-to-end detection + tracking. I was hoping to just fork off some features and use them as input to another 'tracking' network with a very minimal set of layers, which wouldn't require retraining the detection network.

@rbgreenway

Using the same features to track that were used to detect makes so much sense. Very excited that this is being worked on. Alexey, thanks for your excellent work and dedication to this technology.

@gameliee

gameliee commented Jun 22, 2020

I personally think that a detection network tries to generalize the characteristics of a class: all people have to have similar features. Tracking, on the other hand, requires contrastiveness: a person's features need to be as far as possible from everyone else's. Training a network to serve those conflicting targets might be unrealistic. That's why we haven't seen such a network yet.

@pfeatherstone
Author

Yeah. I think that's why you might need to attach an auxiliary network, trained separately using a similarity or contrastive loss, that uses some of the features output by the early inner layers of the detection network. Basically you end up with DeepSORT, but instead of using a completely separate VGG16 network, you're using X% of the detection network and (100-X)% of a new auxiliary network. It feels slightly more efficient and faster to recycle features. I like DeepSORT because all the re-id is undertaken by the Hungarian algorithm and all you need is a good metric, and in this case some good tracking features to evaluate the metric on. The only problem with DeepSORT is that it is super slow, so recycling features from a detection network seems like a good way to solve that. But it would be interesting to see if @AlexeyAB's idea of a fully-blown LSTM is better and more accurate.
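As a concrete illustration of that DeepSORT-style assignment step, here is a minimal sketch, assuming L2-normalized embeddings for the previous and current frame (the function name and the gating threshold are made up for illustration):

# Hypothetical sketch: match current detections to previous ones by cosine similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev_emb, curr_emb, sim_thresh=0.8):
    # prev_emb: (M, D), curr_emb: (N, D), both with L2-normalized rows.
    sim = prev_emb @ curr_emb.T              # (M, N) cosine similarity matrix
    cost = 1.0 - sim                         # Hungarian algorithm minimizes cost
    rows, cols = linear_sum_assignment(cost)
    # Keep only assignments above the similarity threshold (gating).
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_thresh]

Unmatched current detections would start new tracks, and unmatched previous detections would eventually age out.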

@AlexeyAB
Owner

AlexeyAB commented Jun 22, 2020

Yes, these 2 tasks (detection and re-identification) partially contradict each other. Therefore, what is expected is a slight decrease in accuracy on images, but a large increase in accuracy on video.
I think we can regulate this through the normalization factor.

But perhaps the task of re-identification will make the network remember more details of objects, which theoretically can improve detection even on images for large networks (networks with high capacity).

@pfeatherstone
Author

You make a good point. Maybe re-id using an LSTM will make the detection network pay special attention to object features. If you're right, then maybe you need to write a paper with the title "LSTM is all the attention you need" :)

@AlexeyAB
Owner

AlexeyAB commented Jul 7, 2020

Conv-LSTM + Contrastive loss (Embeddings for cosine similarity) + Yolo

This is done; we just need to improve and test it.

@AlexeyAB AlexeyAB added the ToDo RoadMap label Jul 7, 2020
@pfeatherstone
Author

Awesome! Have you posted some weights?

@AlexeyAB
Owner

AlexeyAB commented Jul 7, 2020

Not yet.

@AlexeyAB
Owner

AlexeyAB commented Jul 7, 2020

Here is a proof-of-concept cfg-file; you can try to train Contrastive loss (Embeddings for cosine similarity) + Yolo (without Conv-LSTM) on any non-sequence dataset (MSCOCO, BDD, OpenImages, PascalVOC, ...):
yolov3-tiny_contrastive.cfg.txt

Train as usual, without pre-trained weights or with https://drive.google.com/file/d/18v36esoXCh-PsOKwyP2GWrpYDptDY8Zf/view?usp=sharing

This model will count your objects when you run it on video:
./darknet detector demo data/sobj.data yolov3-tiny_contrastive.cfg backup/yolov3-tiny_contrastive_last.weights video.avi -out_filename out_video.avi

You can play with these parameters for detection (after training); a rough sketch of how they might be used follows the block:

[yolo]
# for tracking
track_history_size = 5 - find similarity over the 5 previous frames
sim_thresh = 0.8 - similarity threshold to consider an object on two frames the same
dets_for_show = 2 - number of frames with this object before showing it
dets_for_track = 8 - number of frames with this object before tracking it
track_ciou_norm = 0.3 - how much CIoU is taken into account (0.0 to 1.0)
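For anyone trying to understand what these knobs do, below is a rough Python sketch of the kind of post-processing they describe. It is based only on the descriptions above, not on darknet's actual implementation; the names and data structures are made up, and the CIoU term (track_ciou_norm) is omitted.

# Illustrative only (not darknet's code): how the parameters above might be used.
# `tracks` is a list of lists of past L2-normalized embeddings (NumPy arrays);
# `emb` is the (D,) embedding of one new detection.
def assign_to_track(tracks, emb, sim_thresh=0.8, track_history_size=5):
    best_i, best_sim = -1, -1.0
    for i, hist in enumerate(tracks):
        # Similarity is checked against up to `track_history_size` previous frames.
        sim = max(float(h @ emb) for h in hist[-track_history_size:])
        if sim > best_sim:
            best_i, best_sim = i, sim
    if best_sim >= sim_thresh:       # same object as an existing track
        tracks[best_i].append(emb)
        return best_i
    tracks.append([emb])             # otherwise start a new track
    return len(tracks) - 1

# dets_for_show / dets_for_track would then be per-track counters: a track is only
# displayed after the object has been detected in 2 frames, and treated as
# confirmed/tracked after 8.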

@pfeatherstone
Author

Thanks @AlexeyAB

@AlexeyAB
Owner

AlexeyAB commented Jul 8, 2020

I re-uploaded the contrastive detection model yolov3-tiny_contrastive.cfg.txt with [contrastive] cls_normalizer=1.0 - use this one.
(Previously I uploaded it with [contrastive] cls_normalizer=0.0, so the contrastive loss was disabled.)

@AlexeyAB
Owner

AlexeyAB commented Jul 8, 2020

Currently a very simple and inaccurate model is used:
https://drive.google.com/file/d/1g18BgkIRbZGykHYxKvWgH1_UUTQnONwG/view?usp=sharing

Simple test on small dataset:

  • Contrastive loss doesn't degrade mAP of Detector
  • Contrastive loss improves accuracy of Tracking (so we can build DeepSort on top of Yolo with much higher accuracy than without Contrastive loss)
Charts for the three runs (image filenames in parentheses):
  • Contrastive loss enabled: [contrastive] cls_normalizer=1.0 (det_cl_fwonly_mi_b64)
  • Contrastive loss disabled: [contrastive] cls_normalizer=0.0 (det_cl_fwonly_mi_b64_cl-disabled)
  • Contrastive, flip and jitter enabled: [net] contrastive_jit_flip=1 (det_cl_fwonly_mi_b64_jitter)

@pfeatherstone
Author

This is good news

@MsWik

MsWik commented Jul 9, 2020

What could be the problem?

Warning: in txt-labels class_id=-2147483648 >= classes=1 in cfg-file. In txt-labels class_id should be [from 0 to 0]

truth.x = 0.000000, truth.y = -nan, truth.w = 0.000000, truth.h = 0.000000, class_id = -2147483648

Wrong label: truth.x = -0.000000, truth.y = 0.000000, truth.w = -0.000000, truth.h = 0.000000
Wrong label: truth.x = -nan, truth.y = -885731875545466011648.000000, truth.w = -0.000000, truth.h = 0.000000
Wrong label: truth.x = 0.000000, truth.y = 0.000000, truth.w = -0.000000, truth.h = 0.000000
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 38 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 3, class_loss = 0.750000, iou_loss = -nan, total_loss = -nan
Contrast accuracy = 0.000000 %
Error: N == 0 || temperature == 0 || vec_len == 0. N=1.000000, temperature=nan, vec_len=0.000000

Standard Labeling, 1 class (0)

0 0.241 0.854 0.149 0.285
0 0.518 0.529 0.126 0.24

With other cfg files there is no problem.

@AlexeyAB
Owner

AlexeyAB commented Jul 9, 2020

Look at bad.list and bad_label.list files.

Try to download the latest Darknet version and recompile; I added a minor fix.

What dataset do you use and how many images?

@MsWik

MsWik commented Jul 9, 2020

Thanks. Yes, that helped. I need to find and track cars. I use COCO-17 + my own data (~15000 photos).

@AlexeyAB
Owner

AlexeyAB commented Jul 9, 2020

Yes, it will work. But it will work much better once the yolov4-tiny-contrastive.cfg and full yolov4-contrastive.cfg models are implemented.
Currently a very simple and inaccurate model is used:
https://drive.google.com/file/d/1g18BgkIRbZGykHYxKvWgH1_UUTQnONwG/view?usp=sharing

@pfeatherstone
Author

Are you training these models on MOT datasets? I'm not quite sure how the contrastive loss works on COCO if no two objects are related. Doesn't that mean you only ever have negative samples? Do you need tracked objects to get positive samples? This is probably a stupid question.

@AlexeyAB
Owner

AlexeyAB commented Jul 10, 2020

  • Contrastive loss gets at least 1 positive by using 2 copies of the same image, augmented differently, so any dataset can be used for training (see the sketch below).

  • Contrastive loss can use many positives and negatives, unlike triplet loss.
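To make the positive/negative construction concrete, here is a minimal PyTorch sketch of a contrastive (NT-Xent-style) loss over per-object embeddings taken from two differently augmented copies of the same image. It only illustrates the idea; the names, the temperature, and the exact loss form are assumptions, not darknet's [contrastive] layer.

# Sketch only: contrastive loss over object embeddings from two augmented views.
# emb_a, emb_b: (K, D) L2-normalized embeddings of the same K objects under two
# different augmentations; row i in both tensors is the same physical object.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.1):
    emb = torch.cat([emb_a, emb_b], dim=0)      # (2K, D)
    sim = emb @ emb.T / temperature             # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))           # never match an embedding to itself
    K = emb_a.shape[0]
    # The positive for each row is the same object in the other view;
    # every other object (in either view) acts as a negative.
    pos = torch.cat([torch.arange(K, 2 * K), torch.arange(K)])
    return F.cross_entropy(sim, pos)

For tracking, the same objective pushes embeddings of the same object on different frames together and embeddings of different objects apart.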

@pfeatherstone
Author

I see, so the key here is augmentation

@pfeatherstone
Author

pfeatherstone commented Jul 10, 2020

If the contrastive loss depends highly on augmentation, should there be other augmentation transformations like stretching, warping, etc?

@AlexeyAB
Owner

AlexeyAB commented Jul 10, 2020

@pfeatherstone

other augmentation transformations like stretching, warping, etc?

What is the warping?

It will use:

  • scaling the image up/down
  • stretching
  • moving
  • flipping
[net]
contrastive_jit_flip=1

[yolo]
jitter=0.3

It depends on how much the same object can differ in two frames:

  • If we have video at 25-60 FPS for tracking slowly moving persons, then the same person will not be very different across two frames, so we don't need strong data augmentation

  • If we have video at 1-5 FPS for tracking fast-moving cars/birds, or we want to track one object across two cameras, then the same car can be very different across two frames/cameras, so we need strong data augmentation


Also it depends on the model:

  • Contrastive learning requires strong data augmentation
  • Tiny models require weak data augmentation for good AP

So we should use strong data augmentation only for the big yolov4-contrastive model rather than for yolov4-tiny-contrastive.

@javier-box

Hello @MsWik ... Did you get contrastive to work? Can you share your test result?

I'm also wondering how to add the contrastive layer to tkDNN / cuDNN.

@AlexeyAB
Owner

AlexeyAB commented Oct 19, 2020

Experimental YOLOv4-tiny-contrastive model that is trained on MSCOCO:

./darknet detector demo cfg/coco.data cfg/yolov4-tiny_contrastive.cfg yolov4-tiny_contrastive_last.weights test.avi -out_filename out_test.avi -ext_output

@haviduck

That's really cool. Great FPS and solid tracking at this stage. Frozen offscreen detections were picked up on reentry. COCO will, as always, tell me a scooter is a flower pot, but that's not the MOT's fault. This is awesome, thanks for sharing!

@AlexeyAB
Owner

I added some hyper-parameters:

This is just a proof of concept, so tracking is very poor.

@Hwijune

Hwijune commented Oct 22, 2020

Hi! @AlexeyAB


If we have video at 1-5 FPS for tracking fast-moving cars/birds, or we want to track one object across two cameras, then the same car can be very different across two frames/cameras, so we need strong data augmentation

Training:
pic1 (original image)  -> embedding -> normalize -> feat1
pic2 (augmented image) -> embedding -> normalize -> feat2
similarity(feat1, feat2)  # similarity between the two images

https://caffe.berkeleyvision.org/tutorial/layers/contrastiveloss.html
Can I apply it to the caffe layer?

What is the [local_avgpool] layer?

@syjeon121

Hi @AlexeyAB,
is this feature (tracking using contrastive loss) available in the C++ API, yolo_v2_class.hpp?

@sctrueew

@AlexeyAB Hi,

Can we use the tracking in OpenCV?

Thanks

@arnaud-nt2i

@zpmmehrdad Yes you can; it's called optical flow and it works quite well.
There are lots of other topics about this (you have to compile OpenCV with CUDA support, from source).

@Goru1890

Do you have a pre-trained model for yolov4-tiny-contrastive?

@MuhammadAsadJaved

@Goru1890 The model and .cfg are given in this same issue; check the last few comments.

@AdamCuellar

For anyone looking to train the contrastive model and avoid a segfault: I've been able to train for 7k+ iterations with no issues after initializing the variables in the for loops in softmax_layer.c starting on line 502. The "#pragma omp parallel for" caused the variables to go out of bounds, making the z_index larger than z. Specifically, I defined int b3, n3, h3, and w3 in the respective for statements, then changed the corresponding variables in the loop.

darknet/src/softmax_layer.c

Lines 502 to 511 in e2a1287

for (b = 0; b < l.batch; ++b) {
    for (n = 0; n < l.n; ++n) {
        for (h = 0; h < l.h; ++h) {
            for (w = 0; w < l.w; ++w)
            {
                const int z_index = b*l.n*l.h*l.w + n*l.h*l.w + h*l.w + w;
                const size_t step = l.batch*l.n*l.h*l.w;
                if (l.labels[z_index] < 0) continue;
                const int delta_index = b*l.embedding_size*l.n*l.h*l.w + n*l.embedding_size*l.h*l.w + h*l.w + w;

@AlexeyAB could you point me in the right direction for implementing the contrastive layer for more than one yolo head?

@slidespike

slidespike commented May 12, 2021

I'm trying to train "yolov4-tiny_contrastive.cfg" on my own dataset and can't get past the first iteration (it just stops without any specific error). Can someone help me please?

I changed the number of classes to 4 and calculated the number of filters in the layer above to be (4+5)x9 = 81.

Do I need to change something else?

@MsWik

MsWik commented May 12, 2021 via email

@lukastruemper

For anyone looking to train the contrastive model and avoid a segfault: I've been able to train for 7k+ iterations with no issues after initializing the variables in the for loops in softmax_layer.c starting on line 502. The "#pragma omp parallel for" caused the variables to go out of bounds, making the z_index larger than z. Specifically, I defined int b3, n3, h3, and w3 in the respective for statements, then changed the corresponding variables in the loop.

darknet/src/softmax_layer.c

Lines 502 to 511 in e2a1287

for (b = 0; b < l.batch; ++b) {
    for (n = 0; n < l.n; ++n) {
        for (h = 0; h < l.h; ++h) {
            for (w = 0; w < l.w; ++w)
            {
                const int z_index = b*l.n*l.h*l.w + n*l.h*l.w + h*l.w + w;
                const size_t step = l.batch*l.n*l.h*l.w;
                if (l.labels[z_index] < 0) continue;
                const int delta_index = b*l.embedding_size*l.n*l.h*l.w + n*l.embedding_size*l.h*l.w + h*l.w + w;

@AlexeyAB could you point me in the right direction for implementing the contrastive layer for more than one yolo head?

@AdamCuellar

I can confirm that this solves the segfault problem, but I couldn't reconstruct your solution from the description. I removed the omp pragma completely for now and it works without problems. Could you please share the code, or even create a PR for it?

@AdamCuellar

@lukastruemper

Below is a screenshot of the changes I made to fix the segfault. If you have any other issues let me know!

[screenshot: modified for-loops in softmax_layer.c]

@AlexeyAB
Owner

@AdamCuellar So if it solves the issue (training goes well and the trained model detects correctly), you can create a pull request for these changes.

@rbgreenway

@AdamCuellar - have you had a chance to see how well the tracking performs?

@AdamCuellar

@AdamCuellar So if it solves the issue (training goes well and the trained model detects correctly), you can create a pull request for these changes.

@AlexeyAB I haven't had a chance to train it on a good dataset. Should I train on COCO to verify it works well enough? If so, I'll train and report the results + make a pull request.

@rbgreenway I trained on a small private dataset and the tracking performed well, but due to the limited data the training/testing sets were similar, so I don't want to make any definitive statement. If you, or anyone, know of a good dataset to test this on, then I can train and post results.

@AlexeyAB
Owner

@AdamCuellar

I added a fix: f5007cd
Please try to test it with your small dataset or the large COCO dataset.

I think we should define the for-loop variables (nd, hd, wd) inside the for statements for the inner loops, and outside for the OpenMP for-loop variable nb: https://docs.microsoft.com/en-us/cpp/parallel/openmp/reference/openmp-directives?view=msvc-160#for-openmp

@AdamCuellar

@AlexeyAB

Yes I agree, that's much better. I've started training on COCO. I will also start training later today on the small driving dataset you've used for the YOLO w/ LSTM so I can report back with some results.

@rbgreenway

@AdamCuellar Thanks, Adam. I'm very interested in how the tracking performs. I won't have time for a couple weeks, but I'll certainly train it on a dataset that I have access to...particularly if you report some good results. Thanks again.

@AdamCuellar

AdamCuellar commented Jul 28, 2021

@AlexeyAB Sadly, both experiments ended with a segfault. On the small driving set it got to almost 14k iterations and on COCO it got to about 5k iterations. I think declaring the iterator variable inside the parallelized for-loop might be necessary when there are nested for-loops. I will modify the code and try again to see if it still segfaults.

Edit:
Actually, I just realized I made a mistake so I need to test again. Please disregard.

@AdamCuellar

@AlexeyAB Looks like the changes you made work. For the small driving dataset you've used before, I trained for 20k iterations with no issues. Here is the loss chart:

[loss chart]

On COCO, it's gotten to 17k+ iterations with no issues. It will probably take a really long time though as I only had 1 gpu available to train.

@akashAD98

akashAD98 commented Sep 6, 2021

@AlexeyAB @AdamCuellar Can we use this same concept for different models such as yolov4x-mish, yolov4-csp, and swish?
I want to do both tasks: first it detects objects, and then we can add a tracker on top of the detected objects.

We could train an object detection model, and if we want to do object tracking we could simply change parameters inside the cfg, adding both parameters like:
object_detection=1 "it will do only object detection"
object_tracking=1 "it will do object tracking"

(This is just my thought process; it would be easier to use.)

I saw there are only two contrastive cfgs available (yolov3, yolov4) for object tracking. I'm looking for a mish/csp/swish model.
I don't know if it's possible to do tracking with those models; please let us know. Thank you so much.

@AdamCuellar

@AlexeyAB @AdamCuellar Can we use this same concept for different models such as yolov4x-mish, yolov4-csp, and swish?
I want to do both tasks: first it detects objects, and then we can add a tracker on top of the detected objects.

@akashAD98 You can technically use the contrastive layer for any model, you just need to change the model to have 1 yolo head and use the features prior to the yolo head as input to the contrastive layer. If you read through this thread, you'll get a better understanding of how to do this.

We could train an object detection model, and if we want to do object tracking we could simply change parameters inside the cfg, adding both parameters like:
object_detection=1 "it will do only object detection"
object_tracking=1 "it will do object tracking"

(This is just my thought process; it would be easier to use.)

I can see how this may seem like it would be easier to use; however, I think it would make things more obscure. An object_detection parameter is a bit redundant since the yolo networks were designed for object detection. An object_tracking parameter wouldn't provide enough information for tracking such as which layer is used for the contrastive loss, the number of detections for tracking, track history size, etc.

@akashAD98

akashAD98 commented Sep 30, 2021

@AdamCuellar For training a custom model, should we use this command, plus the few changes in the cfg that AlexeyAB explained?
!./darknet detector train cfg/coco.data cfg/yolov4-tiny_contrastive.cfg yolov4-tiny_contrastive_last.weights

And will it apply the tracker to all classes present in the dataset?
