TSM: Temporal Shift Module for Efficient Video Understanding
The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D-CNN-based methods can achieve good performance but are computationally intensive, making them expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNNs while maintaining a 2D CNN's complexity. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extend TSM to the online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranked first on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves low latencies of 13 ms and 35 ms for online video recognition.
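The temporal shift described above is simple to express in code. The following is a minimal PyTorch sketch of the bidirectional shift; the function name is ours and the 1/8 shift proportion is the paper's default, so this illustrates the idea rather than reproducing this repository's exact implementation:

```python
import torch


def temporal_shift(x: torch.Tensor, num_segments: int, shift_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along the temporal dimension.

    Args:
        x: features of shape (N * num_segments, C, H, W).
        num_segments: number of frames sampled per video.
        shift_div: 1/shift_div of the channels are shifted each way.
    """
    nt, c, h, w = x.size()
    n = nt // num_segments
    x = x.view(n, num_segments, c, h, w)
    fold = c // shift_div

    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift towards the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift towards the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels stay put
    return out.view(nt, c, h, w)
```

Because the shift only moves data, it adds zero FLOPs and zero parameters; in the paper it is placed inside the residual branch of each block so that spatial feature learning is not degraded.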
Kinetics-400

frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1x1x8 | 224x224 | 8 | ResNet50 | ImageNet | 73.18 | 90.56 | 8 clips x 10 crop | 32.88G | 23.87M | config | ckpt | log |
1x1x8 | 224x224 | 8 | ResNet50 | ImageNet | 73.22 | 90.22 | 8 clips x 10 crop | 32.88G | 23.87M | config | ckpt | log |
1x1x16 | 224x224 | 8 | ResNet50 | ImageNet | 75.12 | 91.55 | 16 clips x 10 crop | 65.75G | 23.87M | config | ckpt | log |
1x1x8 (dense) | 224x224 | 8 | ResNet50 | ImageNet | 73.38 | 90.78 | 8 clips x 10 crop | 32.88G | 23.87M | config | ckpt | log |
1x1x8 | 224x224 | 8 | ResNet50 (NonLocalDotProduct) | ImageNet | 74.49 | 91.15 | 8 clips x 10 crop | 61.30G | 31.68M | config | ckpt | log |
1x1x8 | 224x224 | 8 | ResNet50 (NonLocalGauss) | ImageNet | 73.66 | 90.99 | 8 clips x 10 crop | 59.06G | 28.00M | config | ckpt | log |
1x1x8 | 224x224 | 8 | ResNet50 (NonLocalEmbedGauss) | ImageNet | 74.34 | 91.23 | 8 clips x 10 crop | 61.30G | 31.68M | config | ckpt | log |
1x1x8 | 224x224 | 8 | MobileNetV2 | ImageNet | 68.71 | 88.32 | 8 clips x 3 crop | 3.269G | 2.736M | config | ckpt | log |
1x1x16 | 224x224 | 8 | MobileOne-S4 | ImageNet | 74.38 | 91.71 | 16 clips x 10 crop | 48.65G | 13.72M | config | ckpt | log |
Something-something V2

frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1x1x8 | 224x224 | 8 | ResNet50 | ImageNet | 62.72 | 87.70 | 8 clips x 3 crop | 32.88G | 23.87M | config | ckpt | log |
1x1x16 | 224x224 | 8 | ResNet50 | ImageNet | 64.16 | 88.61 | 16 clips x 3 crop | 65.75G | 23.87M | config | ckpt | log |
1x1x8 | 224x224 | 8 | ResNet101 | ImageNet | 63.70 | 88.28 | 8 clips x 3 crop | 62.66G | 42.86M | config | ckpt | log |
- The gpus column indicates the number of GPUs we used to get the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set `--auto-scale-lr` when calling `tools/train.py`; this parameter auto-scales the learning rate according to the actual batch size and the original batch size (see the example after this list).
- The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at Kinetics400-Validation. The corresponding data list (each line is of the format 'video_id, num_frames, label_index') and the label map are also available.
- The MobileOne backbone supports reparameterization during inference. You can use the provided reparameterize tool to convert the checkpoint and switch to the deploy config file.
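For instance, to enable automatic learning-rate scaling, append the flag to the training command shown further below; the flag is the only addition:

python tools/train.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py \
    --auto-scale-lr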
For more details on data preparation, you can refer to Kinetics400.
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train the TSM model on the Kinetics-400 dataset with a deterministic option.
python tools/train.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py \
--seed=0 --deterministic
For more details, you can refer to the Training part in the Training and Test Tutorial.
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test the TSM model on the Kinetics-400 dataset and dump the result to a pkl file.
python tools/test.py configs/recognition/tsm/tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_kinetics400-rgb.py \
checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
For more details, you can refer to the Test part in the Training and Test Tutorial.
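To inspect the dumped predictions, you can load the pkl file with Python's pickle module. The sketch below assumes the dump is a pickled list with one per-sample prediction entry; the exact field names may vary across versions:

```python
# A minimal sketch for inspecting the dumped test results.
# Assumption: result.pkl holds a pickled list with one entry per test sample.
import pickle

with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

print(f'number of samples: {len(results)}')
print(results[0])  # inspect the available fields of the first prediction
```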
@inproceedings{lin2019tsm,
title={TSM: Temporal Shift Module for Efficient Video Understanding},
author={Lin, Ji and Gan, Chuang and Han, Song},
booktitle={Proceedings of the IEEE International Conference on Computer Vision},
year={2019}
}
@inproceedings{Nonlocal2018,
  author={Wang, Xiaolong and Girshick, Ross and Gupta, Abhinav and He, Kaiming},
  title={Non-local Neural Networks},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2018}
}