
Conv Bottleneck-LSTM gives +3-5 AP and very cheap ~+1% BFLOPS #5774

AlexeyAB opened this issue May 28, 2020 · 19 comments
Labels: Feature-request


AlexeyAB commented May 28, 2020

  • paper: https://arxiv.org/abs/1711.06368

  • +5 AP for the small model and +2.8 AP for the big model

  • Implement a conv Bottleneck-LSTM, which gives +3-5 AP at a very cheap ~+1% BFLOPS cost (a sketch of the cell follows below)
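
For reference, a minimal sketch of the Bottleneck-LSTM cell as I read the paper (simplified: plain convolutions instead of the paper's depthwise-separable ones; this is only an illustration, not darknet's implementation):

# Minimal Bottleneck-LSTM cell sketch (PyTorch), simplified from arXiv:1711.06368.
import torch
import torch.nn as nn

class BottleneckLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch):
        super().__init__()
        # "bottleneck" conv: compress [x_t, h_{t-1}] down to hidden_ch channels
        self.bottleneck = nn.Conv2d(in_ch + hidden_ch, hidden_ch, 3, padding=1)
        # a single conv computes all four gates from the cheap bottlenecked tensor b_t
        self.gates = nn.Conv2d(hidden_ch, 4 * hidden_ch, 3, padding=1)
        self.act = nn.ReLU6()  # activation (the thread below discusses whether this is tanh or ReLU6)

    def forward(self, x, h, c):
        b = self.act(self.bottleneck(torch.cat([x, h], dim=1)))
        i, f, o, g = torch.chunk(self.gates(b), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * self.act(g)
        h = torch.sigmoid(o) * self.act(c)
        return b, h, c  # b_t can also be fed forward to subsequent layers

The gate convolutions operate on the reduced number of channels, which is what keeps the extra cost low.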

Non-bottleneck conv-LSTM looks like this:
(figure: detailed architecture of the peephole LSTM)




i-chaochen commented May 28, 2020

Cool!

Actually, there is a follow-up work based on this paper, which I mentioned at the beginning of #3114:

Looking Fast and Slow: Memory-Guided Mobile Video Object Detection
https://arxiv.org/pdf/1903.10172.pdf

This one directly uses a ConvLSTM to replace the Bottleneck-LSTM, which works better than the original Bottleneck-LSTM.

Moreover, to further improve the LSTM, it makes 3 modifications to the original Bottleneck-LSTM (see the sketch after this list):

  1. add a skip connection;
  2. divide the LSTM state into groups and use grouped convolutions to process each one separately;
  3. concatenate channel-wise to get c_t, h_t and M_t.
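
A rough, illustrative sketch of these three modifications as I understand them (PyTorch; names and shapes are mine, not the authors' code):

# Illustrative sketch of the grouped-state LSTM update from arXiv:1903.10172.
import torch
import torch.nn as nn

class GroupedBottleneckLSTM(nn.Module):
    def __init__(self, in_ch, state_ch, groups=4):
        super().__init__()
        assert state_ch % groups == 0
        self.groups, self.gc = groups, state_ch // groups
        # (2) one small gate-conv per group; each processes its own slice of the state
        self.gate_convs = nn.ModuleList(
            [nn.Conv2d(in_ch + self.gc, 4 * self.gc, 3, padding=1) for _ in range(groups)])

    def forward(self, x, h, c):
        hs, cs = torch.chunk(h, self.groups, 1), torch.chunk(c, self.groups, 1)
        new_h, new_c = [], []
        for conv, hg, cg in zip(self.gate_convs, hs, cs):
            i, f, o, g = torch.chunk(conv(torch.cat([x, hg], 1)), 4, 1)
            cg = torch.sigmoid(f) * cg + torch.sigmoid(i) * torch.tanh(g)
            new_c.append(cg)
            new_h.append(torch.sigmoid(o) * torch.tanh(cg))
        # (3) channel-wise concatenation of the per-group slices -> c_t, h_t
        c_t, h_t = torch.cat(new_c, 1), torch.cat(new_h, 1)
        # (1) skip connection: the output feature map also carries the raw input x
        m_t = torch.cat([x, h_t], 1)
        return c_t, h_t, m_t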


AlexeyAB commented:

What do you think about DETR (transformer for object detection)?

But for object detection on video (to predict objects in the next frame) rather than for object detection on MS COCO, since Transformers use attention functions instead of a recurrent neural network to predict what comes next in a sequence.

When applied to object detection, a Transformer can cut out steps in building a model, such as the need to create spatial anchors and customized layers.

DETR achieves results comparable to Faster R-CNN, an object detection model created primarily by Microsoft Research that has earned nearly 10,000 citations since it was introduced in 2015, according to arXiv.


It isn't SOTA, though.



i-chaochen commented May 29, 2020

> What do you think about DETR (transformer for object detection)? But for object detection on video (to predict objects in the next frame) rather than for object detection on MS COCO.
>
> It isn't SOTA, though.

Yes, I think in general all RNNs/LSTMs can be replaced by Transformers, and Transformers should always outperform RNNs/LSTMs.

My consideration is that the input of this seq2seq network is a sequence, which is designed for NLP, but in the video scenario the input is only one frame. (I am not 100% sure what the input of the LSTM is in darknet's implementation.)

Ideally, a sequence of frames would be perfect, but that cannot be the case in a real-time scenario (you can't have the next frame until you are actually in the next second). Since the input is simpler than an NLP sequence, I am not sure how much benefit the model can get from multi-head attention (Transformers). It seems a bit of an overkill.

Also, it could cost more GPU memory and might not be trainable on 1 GPU.


i-chaochen commented May 31, 2020

Also, according to the original DETR paper, it seems not very good at small-object detection. I think they probably tried FPN to solve this but still failed, perhaps because of not enough GPU memory or something else?

AlexeyAB commented:

I think we should use a Transformer+Memory to predict intermediate features for the current frame of video, based on several memorized previous frames.
Then we should mix (concatenate + conv) the features predicted by the Transformer for the current frame with the features extracted from the current frame.


Also, I don't understand what activation they actually use between C and H: is it tanh, ReLU, nothing...?
And what do they call b, is it the blue conv?



The default Conv-LSTM uses tanh, as described on page 4 of their paper: https://arxiv.org/pdf/1711.06368v2.pdf


i-chaochen commented May 31, 2020

> Also, I don't understand what activation they actually use between C and H: is it tanh, ReLU, nothing...? And what do they call b, is it the blue conv?
>
> The default Conv-LSTM uses tanh, as described on page 4 of their paper.

They use ReLU6, as you can see in the paper and source code:

https://github.com/tensorflow/models/blob/master/research/lstm_object_detection/lstm/lstm_cells.py

https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/models/lstm_ssd_mobilenet_v1_feature_extractor.py#L106

AlexeyAB commented:

Thanks, yes, they use ReLU for c too, although they did not draw it in the picture.

https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L559


AlexeyAB commented:

activation=tf.tanh is the default value for BottleneckConvLSTMCell, as for a regular LSTM: https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L44

But the parameter is explicitly overridden with activation=tf.nn.relu6 when lstm_cells.BottleneckConvLSTMCell() is called: https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/models/lstm_ssd_mobilenet_v1_feature_extractor.py#L106
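
To make the effect of that parameter concrete, here is a tiny plain-Python illustration (not the TensorFlow code itself) of what activation controls: it is applied to the candidate value and between the cell state C and the hidden output H:

# Plain-Python illustration of what the `activation` argument controls.
import math

def sigmoid(v): return 1.0 / (1.0 + math.exp(-v))
def tanh(v):    return math.tanh(v)
def relu6(v):   return min(max(v, 0.0), 6.0)

def lstm_step(i, f, o, g, c_prev, activation=tanh):
    """i, f, o, g are pre-activation gate values for one unit."""
    c = sigmoid(f) * c_prev + sigmoid(i) * activation(g)  # new cell state
    h = sigmoid(o) * activation(c)                        # activation between C and H
    return c, h

# default behaviour (tanh) vs. the override used in lstm_object_detection (relu6)
print(lstm_step(0.5, 0.5, 0.5, 2.0, 1.0))
print(lstm_step(0.5, 0.5, 0.5, 2.0, 1.0, activation=relu6))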


AlexeyAB commented Jun 1, 2020

I implemented BottleneckConvLSTM

[conv_lstm]
batch_normalize=1
size=3
pad=1
output=64
groups=2
peephole=0
bottleneck=1
#shortcut=1
time_normalizer=1.0
lstm_activation=tanh
activation=leaky

Just note that shortcut=1 uses a partial n/2-channel residual connection (per-element addition) instead of concatenation for the skip connection.

lstm_activation=tanh is used instead of ReLU.

And I think groups=1 or 2 should be used on GPU, not higher than 2.

yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt




AlexeyAB commented Jun 3, 2020

@i-chaochen In my experiments, shortcut=1, which uses a partial n/2-channel residual connection (per-element addition), degrades accuracy very much.

So don't use it until I change it to concatenation.

Also, I don't know what the optimal value of time_normalizer is (0.5?).

i-chaochen commented:

> In my experiments, shortcut=1 (partial n/2-channel residual connection) degrades accuracy very much. So don't use it until I change it to concatenation.
>
> Also, I don't know what the optimal value of time_normalizer is.

Thanks for your update and sharing!

May I ask what time_normalizer in the conv_lstm is used for?


AlexeyAB commented Jun 3, 2020

time_normalizer is a coefficient for the deltas of backpropagation through time in the LSTM (see the sketch below):

  • higher time_normalizer - it learns time-dependencies more
  • lower time_normalizer - it learns spatial-dependencies more
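
A conceptual sketch of how I understand that coefficient (this is not darknet's actual code; where exactly the multiplication happens inside conv_lstm is an assumption):

# Conceptual sketch only: time_normalizer scales the delta that is carried
# backwards through the recurrent state during backpropagation through time.
def bptt_deltas(output_deltas, time_normalizer=1.0):
    """output_deltas[t] is the delta arriving at the LSTM output at step t."""
    spatial_deltas = []   # per-step deltas used to update the conv weights
    carried = 0.0         # delta carried backwards through h_{t-1} / c_{t-1}
    for d in reversed(output_deltas):
        total = d + carried                # local delta + temporal contribution
        spatial_deltas.append(total)
        carried = time_normalizer * total  # higher value -> stronger time learning
    return list(reversed(spatial_deltas))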

qingchunlizhi commented:

When I train yolov4-tiny with Conv Bottleneck-LSTM, it always shows CUDA out of memory on a Tesla V100, no matter what batch size I set:
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 10.741
avg_outputs = 226932
Allocate additional workspace_size = 26.22 MB
yolov4-tiny-lstm
2 : compute_capability = 700, cudnn_half = 1, GPU: Tesla V100-SXM2-16GB
net.optimized_memory = 0
mini_batch = 1024, batch = 1024, time_steps = 16, train = 1
layer filters size/strd(dil) input output
0 Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: /home/darknet0808/src/dark_cuda.c : () : line: 373 : build time: Aug 8 2020 - 14:55:17

CUDA Error: out of memory
CUDA Error: out of memory: File exists

This is my cfg:
yolov4-lstm.cfg.txt

Please help me, thank you!


AlexeyAB commented Aug 9, 2020

mini_batch = time_steps * batch / subdivisions

So set time_steps = 4 or 3
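
For illustration (the cfg values here are only examples, not taken from the attached file):

# mini_batch = time_steps * batch / subdivisions
def mini_batch(batch, subdivisions, time_steps):
    return time_steps * batch // subdivisions

print(mini_batch(batch=64, subdivisions=64, time_steps=16))  # 16 frames held in memory per mini-batch
print(mini_batch(batch=64, subdivisions=64, time_steps=4))   # 4 -> roughly 4x less activation memory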

qingchunlizhi commented:

Thanks, it works!
Could you tell me the meaning of "track=1" and "time_steps=16"?


AlexeyAB commented Aug 9, 2020

time_steps=4 - the number of sequential frames taken from a video.
track=1 - it will use sequential frames instead of random frames.
Read: https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-%5Bnet%5D-section
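
For example, a minimal sketch of the relevant [net]-section lines (example values only, following the same cfg conventions as the file attached above):

[net]
batch=64
subdivisions=64
# number of sequential frames per training sequence
time_steps=4
# load sequential frames from the same video instead of random frames
track=1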

qingchunlizhi commented:

Thank you very much, this is great!


HaolyShiit commented Aug 12, 2020

@AlexeyAB
I trained https://github.com/AlexeyAB/darknet/files/4746552/yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt with my own dataset, but it doesn't seem to work well.

The loss drops as normal (I can't upload the loss chart successfully).
The results on the validation dataset are very bad.

Some information follows:

3768: 0.925613, 0.354083 avg loss, 0.000913 rate, 3.341086 seconds, 482304 images, 5.769756 hours left
sequential_subdivisions = 8, sequence = 1
Loaded: 0.000037 seconds
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 33 Avg (IOU: 0.734949, GIOU: 0.724360), Class: 0.998445, Obj: 0.961654, No Obj: 0.000375, .5R: 1.000000, .75R: 0.437500, count: 48, class_loss = 0.061803, iou_loss = 0.556009, total_loss = 0.617812
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 37 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000702, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.786307, iou_loss = 0.000000, total_loss = 0.786307
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 41 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000301, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000442, iou_loss = 0.000000, total_loss = 0.000442
total_bbox = 968006, rewritten_bbox = 0.000000 %
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 33 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000169, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000073, iou_loss = 0.000000, total_loss = 0.000073
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 37 Avg (IOU: 0.859079, GIOU: 0.857110), Class: 0.998864, Obj: 0.994741, No Obj: 0.000915, .5R: 1.000000, .75R: 1.000000, count: 32, class_loss = 0.009228, iou_loss = 0.052743, total_loss = 0.061971
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 41 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000301, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000442, iou_loss = 0.000000, total_loss = 0.000442
total_bbox = 968038, rewritten_bbox = 0.000000 %

Is this normal? How can I improve it?
