Conv Bottleneck-LSTM gives +3-5 AP and very cheap ~+1% BFLOPS #5774
Cool! Actually, there is a follow-up work based on this paper, which I mentioned at the beginning of #3114: Looking Fast and Slow: Memory-Guided Mobile Video Object Detection (Mason Liu et al.). It directly uses a ConvLSTM to replace the Bottleneck-LSTM, which works better than the original Bottleneck-LSTM. Moreover, to further improve the LSTM, it makes three modifications to the original Bottleneck-LSTM:
What do you think about DETR (a Transformer for object detection)? But for object detection on video (to predict objects in the next frame) rather than for object detection on MSCOCO, since Transformers use attention functions instead of a recurrent neural network to predict what comes next in a sequence.
It isn't SOTA:
Yes, I think that in general all RNNs/LSTMs can be replaced by Transformers, and Transformers should always outperform RNNs/LSTMs. My consideration is that the input of this seq2seq network is a sequence, which is designed for NLP, but in the video scenario the input is only one frame. (I am not 100% sure what the input of the LSTM is in darknet's implementation.) Ideally, a sequence of frames would be perfect, but that is not possible in the real-time scenario (you can't have the next frame until you are actually at the next moment in time). Since the input is simpler than an NLP sequence, I am not sure how much benefit the model can get from multi-head attention (Transformers); it seems a bit of an overkill. Also, it could cost more GPU memory, and it may not be trainable on 1 GPU.
Also, according to the original DETR paper, it seems not very good at small-object detection. I think they probably tried FPN to solve this but still failed, perhaps due to not enough GPU memory or something else?
I think we should use a Transformer+Memory to predict intermediate features for the current frame of video, based on several memorized previous frames. Also, I don't understand what activation they actually use between C and H: is it tanh, ReLU, nothing ...? The default Conv-LSTM uses tanh, as described on page 4 of their paper: https://arxiv.org/pdf/1711.06368v2.pdf
They use ReLU6, as you can see in the paper and source code.
Thanks, yes, they use ReLU for
No, I think they use tanh for
But there the parameter is explicitly overwritten |
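For reference, this is how I read the Bottleneck-LSTM gating in the paper (a sketch of the equations, not darknet's exact implementation); $\phi$ is the activation under discussion (ReLU6 in the paper's formulation, tanh by default in darknet's [conv_lstm] below), $\sigma$ is the sigmoid, $*$ is convolution and $\circ$ is the element-wise product:

\begin{align}
b_t &= \phi(W_b * [x_t, h_{t-1}]) \\
i_t &= \sigma(W_i * b_t) \\
f_t &= \sigma(W_f * b_t) \\
o_t &= \sigma(W_o * b_t) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \phi(W_c * b_t) \\
h_t &= o_t \circ \phi(c_t)
\end{align}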
I implemented BottleneckConvLSTM:

[conv_lstm]
batch_normalize=1
size=3
pad=1
output=64
groups=2
peephole=0
bottleneck=1
#shortcut=1
time_normalizer=1.0
lstm_activation=tanh
activation=leaky

Just lstm_activation=tanh instead of ReLU. And I think on GPU groups=1 or 2 should be used, not higher than 2. yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt
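Below is a minimal sketch, assuming PyTorch (not darknet's actual C implementation), of what such a [conv_lstm] layer with bottleneck=1 and peephole=0 roughly computes; the class name, the mapping of groups onto grouped convolutions, and the 128-channel input are illustrative assumptions, not taken from darknet:

import torch
import torch.nn as nn

class BottleneckConvLSTMCell(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, groups=1,
                 lstm_activation=torch.tanh):
        super().__init__()
        pad = kernel_size // 2
        self.act = lstm_activation  # tanh here (lstm_activation=tanh); the paper uses ReLU6
        # Bottleneck conv: fuses the input and the previous hidden state into one feature map
        self.bottleneck = nn.Conv2d(in_channels + out_channels, out_channels,
                                    kernel_size, padding=pad, groups=groups)
        # One conv producing all four gates (i, f, o, g) from the bottleneck features
        self.gates = nn.Conv2d(out_channels, 4 * out_channels,
                               kernel_size, padding=pad, groups=groups)

    def forward(self, x, state):
        h, c = state
        b = self.act(self.bottleneck(torch.cat([x, h], dim=1)))   # bottleneck=1
        i, f, o, g = torch.chunk(self.gates(b), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * self.act(g)   # no peephole terms (peephole=0)
        h = o * self.act(c)           # the activation between C and H discussed above
        return h, (h, c)

# Toy usage: one 64-channel cell (output=64, groups=2) over a 5-frame sequence
cell = BottleneckConvLSTMCell(in_channels=128, out_channels=64, groups=2)
h = torch.zeros(1, 64, 13, 13)
c = torch.zeros(1, 64, 13, 13)
for t in range(5):
    frame_features = torch.randn(1, 128, 13, 13)   # stand-in for backbone features
    out, (h, c) = cell(frame_features, (h, c))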
@i-chaochen In my experiments So don't use it until I change it to concatenation. Also I don't know what is the optimal value of
Thanks for your update and sharing! May I ask what is
When I train yolov4-tiny with Conv Bottleneck-LSTM, it always shows CUDA out of memory on a Tesla V100 no matter what batch size I set: CUDA Error: out of memory. This is my cfg: Please help me, thank you!
So set
thanks, it works! |
Thank you very much, this is great!
@AlexeyAB The loss drops as normal (I can't upload the loss chart successfully). Some information follows:
Is it normal? How can I improve it?
paper: https://arxiv.org/abs/1711.06368
+5 AP for the small model, and +2.8 AP for the big model
Implement conv Bottleneck-LSTM, which gives +3-5 AP and is very cheap: ~+1% BFLOPS
Non-bottleneck conv-LSTM looks like:
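A sketch of the standard (non-bottleneck) ConvLSTM cell, written out from Shi et al.'s formulation with the peephole terms omitted (my reading, not necessarily darknet's exact implementation):

\begin{align}
i_t &= \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c) \\
h_t &= o_t \circ \tanh(c_t)
\end{align}

The Bottleneck-LSTM variant discussed above replaces the per-gate input/hidden convolutions with a single shared bottleneck convolution, which is why it is much cheaper than this non-bottleneck form.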