
Conv Bottleneck-LSTM gives +3-5 AP and very cheap ~+1% BFLOPS #5774

AlexeyAB opened this issue May 28, 2020 · 19 comments
Labels: Feature-request


AlexeyAB commented May 28, 2020

  • paper: https://arxiv.org/abs/1711.06368

  • +5 AP for the small model and +2.8 AP for the big model

  • Implement a conv Bottleneck-LSTM, which gives +3-5 AP at a very cheap ~+1% BFLOPS cost (a sketch of the cell follows below)
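
For reference, a minimal sketch of the Bottleneck-LSTM cell as I read the paper (simplified: plain convolutions instead of the paper's depthwise-separable ones; this is only an illustration, not darknet's implementation):

# Minimal Bottleneck-LSTM cell sketch (PyTorch), simplified from arXiv:1711.06368.
import torch
import torch.nn as nn

class BottleneckLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch):
        super().__init__()
        # "bottleneck" conv: compress [x_t, h_{t-1}] down to hidden_ch channels
        self.bottleneck = nn.Conv2d(in_ch + hidden_ch, hidden_ch, 3, padding=1)
        # a single conv computes all four gates from the cheap bottlenecked tensor b_t
        self.gates = nn.Conv2d(hidden_ch, 4 * hidden_ch, 3, padding=1)
        self.act = nn.ReLU6()  # activation (the thread below discusses whether this is tanh or ReLU6)

    def forward(self, x, h, c):
        b = self.act(self.bottleneck(torch.cat([x, h], dim=1)))
        i, f, o, g = torch.chunk(self.gates(b), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * self.act(g)
        h = torch.sigmoid(o) * self.act(c)
        return b, h, c  # b_t can also be fed forward to subsequent layers

The gate convolutions operate on the reduced number of channels, which is what keeps the extra cost low.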

Non-bottleneck conv-LSTM looks like this:
(figure: detailed architecture of the peephole LSTM)




i-chaochen commented May 28, 2020

Cool!

Actually, there is a follow-up work based on this paper, which I mentioned at the beginning of #3114:

Looking Fast and Slow: Memory-Guided Mobile Video Object Detection
https://arxiv.org/pdf/1903.10172.pdf

This one directly uses a ConvLSTM to replace the Bottleneck-LSTM, which works better than the original Bottleneck-LSTM.

Moreover, to further improve the LSTM, it makes 3 modifications to the original Bottleneck-LSTM (see the sketch after this list):

  1. add a skip connection;
  2. divide the LSTM state into groups and use grouped convolutions to process each one separately;
  3. concatenate channel-wise to get c_t, h_t and M_t.
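
A rough, illustrative sketch of these three modifications as I understand them (PyTorch; names and shapes are mine, not the authors' code):

# Illustrative sketch of the grouped-state LSTM update from arXiv:1903.10172.
import torch
import torch.nn as nn

class GroupedBottleneckLSTM(nn.Module):
    def __init__(self, in_ch, state_ch, groups=4):
        super().__init__()
        assert state_ch % groups == 0
        self.groups, self.gc = groups, state_ch // groups
        # (2) one small gate-conv per group; each processes its own slice of the state
        self.gate_convs = nn.ModuleList(
            [nn.Conv2d(in_ch + self.gc, 4 * self.gc, 3, padding=1) for _ in range(groups)])

    def forward(self, x, h, c):
        hs, cs = torch.chunk(h, self.groups, 1), torch.chunk(c, self.groups, 1)
        new_h, new_c = [], []
        for conv, hg, cg in zip(self.gate_convs, hs, cs):
            i, f, o, g = torch.chunk(conv(torch.cat([x, hg], 1)), 4, 1)
            cg = torch.sigmoid(f) * cg + torch.sigmoid(i) * torch.tanh(g)
            new_c.append(cg)
            new_h.append(torch.sigmoid(o) * torch.tanh(cg))
        # (3) channel-wise concatenation of the per-group slices -> c_t, h_t
        c_t, h_t = torch.cat(new_c, 1), torch.cat(new_h, 1)
        # (1) skip connection: the output feature map also carries the raw input x
        m_t = torch.cat([x, h_t], 1)
        return c_t, h_t, m_t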


AlexeyAB commented:

What do you think about DETR (transformer for object detection)?

But for object detection on video (to predict objects in the next frame) rather than for object detection on MS COCO, since Transformers use attention functions instead of a recurrent neural network to predict what comes next in a sequence.

When applied to object detection, a Transformer can cut out steps in building a model, such as the need to create spatial anchors and customized layers.

DETR achieves results comparable to Faster R-CNN, an object detection model created primarily by Microsoft Research that has earned nearly 10,000 citations since it was introduced in 2015, according to arXiv.


It isn't SOTA, though.



i-chaochen commented May 29, 2020

> What do you think about DETR (transformer for object detection)? But for object detection on video (to predict objects in the next frame) rather than for object detection on MS COCO.
>
> It isn't SOTA, though.

Yes, I think in general all RNNs/LSTMs can be replaced by Transformers, and Transformers should always outperform RNNs/LSTMs.

My consideration is that the input of this seq2seq network is a sequence, which is designed for NLP, but in the video scenario the input is only one frame. (I am not 100% sure what the input of the LSTM is in darknet's implementation.)

Ideally, a sequence of frames would be perfect, but that cannot be the case in a real-time scenario (you can't have the next frame until you are actually in the next second). Since the input is simpler than an NLP sequence, I am not sure how much benefit the model can get from multi-head attention (Transformers). It seems a bit of an overkill.

Also, it could cost more GPU memory and might not be trainable on 1 GPU.


i-chaochen commented May 31, 2020

Also, according to the original DETR paper, it seems not very good at small-object detection. I think they probably tried FPN to solve this but still failed, perhaps because of not enough GPU memory or something else?

AlexeyAB commented:

I think we should use a Transformer+Memory to predict intermediate features for the current frame of video, based on several memorized previous frames.
Then we should mix (concatenate + conv) the features predicted by the Transformer for the current frame with the features extracted from the current frame.


Also, I don't understand what activation they actually use between C and H: is it tanh, ReLU, nothing...?
And what do they call b, is it the blue conv?



The default Conv-LSTM uses tanh, as described on page 4 of their paper: https://arxiv.org/pdf/1711.06368v2.pdf


i-chaochen commented May 31, 2020

> Also, I don't understand what activation they actually use between C and H: is it tanh, ReLU, nothing...? And what do they call b, is it the blue conv?
>
> The default Conv-LSTM uses tanh, as described on page 4 of their paper.

They use ReLU6, as you can see in the paper and source code:

https://github.com/tensorflow/models/blob/master/research/lstm_object_detection/lstm/lstm_cells.py

https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/models/lstm_ssd_mobilenet_v1_feature_extractor.py#L106

AlexeyAB commented:

Thanks, yes, they use ReLU for c too, although they did not draw it in the picture.

https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L559


AlexeyAB commented:

activation=tf.tanh is the default value for BottleneckConvLSTMCell, as for a regular LSTM: https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L44

But the parameter is explicitly overridden with activation=tf.nn.relu6 when lstm_cells.BottleneckConvLSTMCell() is called: https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/models/lstm_ssd_mobilenet_v1_feature_extractor.py#L106
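
To make the effect of that parameter concrete, here is a tiny plain-Python illustration (not the TensorFlow code itself) of what activation controls: it is applied to the candidate value and between the cell state C and the hidden output H:

# Plain-Python illustration of what the `activation` argument controls.
import math

def sigmoid(v): return 1.0 / (1.0 + math.exp(-v))
def tanh(v):    return math.tanh(v)
def relu6(v):   return min(max(v, 0.0), 6.0)

def lstm_step(i, f, o, g, c_prev, activation=tanh):
    """i, f, o, g are pre-activation gate values for one unit."""
    c = sigmoid(f) * c_prev + sigmoid(i) * activation(g)  # new cell state
    h = sigmoid(o) * activation(c)                        # activation between C and H
    return c, h

# default behaviour (tanh) vs. the override used in lstm_object_detection (relu6)
print(lstm_step(0.5, 0.5, 0.5, 2.0, 1.0))
print(lstm_step(0.5, 0.5, 0.5, 2.0, 1.0, activation=relu6))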


AlexeyAB commented Jun 1, 2020

I implemented BottleneckConvLSTM

[conv_lstm]
batch_normalize=1
size=3
pad=1
output=64
groups=2
peephole=0
bottleneck=1
#shortcut=1
time_normalizer=1.0
lstm_activation=tanh
activation=leaky

Just note that shortcut=1 uses a partial n/2-channel residual connection (per-element addition) instead of concatenation for the skip connection.

lstm_activation=tanh is used instead of ReLU.

And I think groups=1 or 2 should be used on GPU, not higher than 2.

yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt




AlexeyAB commented Jun 3, 2020

@i-chaochen In my experiments, shortcut=1, which uses a partial n/2-channel residual connection (per-element addition), degrades accuracy very much.

So don't use it until I change it to concatenation.

Also, I don't know what the optimal value of time_normalizer is (0.5?).

i-chaochen commented:

> In my experiments, shortcut=1 (partial n/2-channel residual connection) degrades accuracy very much. So don't use it until I change it to concatenation.
>
> Also, I don't know what the optimal value of time_normalizer is.

Thanks for your update and sharing!

May I ask what time_normalizer in the conv_lstm is used for?


AlexeyAB commented Jun 3, 2020

time_normalizer is a coefficient for the deltas of backpropagation through time in the LSTM (see the sketch below):

  • higher time_normalizer - it learns time-dependencies more
  • lower time_normalizer - it learns spatial-dependencies more
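
A conceptual sketch of how I understand that coefficient (this is not darknet's actual code; where exactly the multiplication happens inside conv_lstm is an assumption):

# Conceptual sketch only: time_normalizer scales the delta that is carried
# backwards through the recurrent state during backpropagation through time.
def bptt_deltas(output_deltas, time_normalizer=1.0):
    """output_deltas[t] is the delta arriving at the LSTM output at step t."""
    spatial_deltas = []   # per-step deltas used to update the conv weights
    carried = 0.0         # delta carried backwards through h_{t-1} / c_{t-1}
    for d in reversed(output_deltas):
        total = d + carried                # local delta + temporal contribution
        spatial_deltas.append(total)
        carried = time_normalizer * total  # higher value -> stronger time learning
    return list(reversed(spatial_deltas))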

qingchunlizhi commented:

When I train yolov4-tiny with Conv Bottleneck-LSTM, it always shows CUDA out of memory on a Tesla V100, no matter what batch size I set:
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 10.741
avg_outputs = 226932
Allocate additional workspace_size = 26.22 MB
yolov4-tiny-lstm
2 : compute_capability = 700, cudnn_half = 1, GPU: Tesla V100-SXM2-16GB
net.optimized_memory = 0
mini_batch = 1024, batch = 1024, time_steps = 16, train = 1
layer filters size/strd(dil) input output
0 Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: /home/darknet0808/src/dark_cuda.c : () : line: 373 : build time: Aug 8 2020 - 14:55:17

CUDA Error: out of memory
CUDA Error: out of memory: File exists

This is my cfg:
yolov4-lstm.cfg.txt

Please help me, thank you!


AlexeyAB commented Aug 9, 2020

mini_batch = time_steps * batch / subdivisions

So set time_steps = 4 or 3
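
For illustration (the cfg values here are only examples, not taken from the attached file):

# mini_batch = time_steps * batch / subdivisions
def mini_batch(batch, subdivisions, time_steps):
    return time_steps * batch // subdivisions

print(mini_batch(batch=64, subdivisions=64, time_steps=16))  # 16 frames held in memory per mini-batch
print(mini_batch(batch=64, subdivisions=64, time_steps=4))   # 4 -> roughly 4x less activation memory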

qingchunlizhi commented:

Thanks, it works!
Could you tell me the meaning of "track=1" and "time_steps=16"?


AlexeyAB commented Aug 9, 2020

time_steps=4 - the number of sequential frames taken from a video.
track=1 - it will use sequential frames instead of random frames.
Read: https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-%5Bnet%5D-section
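
For example, a minimal sketch of the relevant [net]-section lines (example values only, following the same cfg conventions as the file attached above):

[net]
batch=64
subdivisions=64
# number of sequential frames per training sequence
time_steps=4
# load sequential frames from the same video instead of random frames
track=1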

qingchunlizhi commented:

Thank you very much, this is great!


HaolyShiit commented Aug 12, 2020

@AlexeyAB
I trained https://github.com/AlexeyAB/darknet/files/4746552/yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt with my own dataset, but it doesn't seem to work well.

The loss drops as normal (I can't upload the loss chart successfully).
The results on the validation dataset are very bad.

Some information follows:

3768: 0.925613, 0.354083 avg loss, 0.000913 rate, 3.341086 seconds, 482304 images, 5.769756 hours left
sequential_subdivisions = 8, sequence = 1
Loaded: 0.000037 seconds
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 33 Avg (IOU: 0.734949, GIOU: 0.724360), Class: 0.998445, Obj: 0.961654, No Obj: 0.000375, .5R: 1.000000, .75R: 0.437500, count: 48, class_loss = 0.061803, iou_loss = 0.556009, total_loss = 0.617812
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 37 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000702, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.786307, iou_loss = 0.000000, total_loss = 0.786307
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 41 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000301, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000442, iou_loss = 0.000000, total_loss = 0.000442
total_bbox = 968006, rewritten_bbox = 0.000000 %
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 33 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000169, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000073, iou_loss = 0.000000, total_loss = 0.000073
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 37 Avg (IOU: 0.859079, GIOU: 0.857110), Class: 0.998864, Obj: 0.994741, No Obj: 0.000915, .5R: 1.000000, .75R: 1.000000, count: 32, class_loss = 0.009228, iou_loss = 0.052743, total_loss = 0.061971
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 41 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000301, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000442, iou_loss = 0.000000, total_loss = 0.000442
total_bbox = 968038, rewritten_bbox = 0.000000 %

Is this normal? How can I improve it?
