This repository is the official implementation of Side4Video, which significantly reduces the training memory cost for action recognition and text-video retrieval tasks.
- Feb 28, 2024. We release our code for Action Recognition and Text-Video Retrieval.
- Nov 28, 2023. We release our paper on arXiv.
For training and testing our model, please refer to the Recognition and Retrieval folders.
Our best model achieves an accuracy of 67.3% and 74.6% on Something-Something V1 and V2, and 88.6% on Kinetics-400, as well as a Recall@1 of 52.3% on MSR-VTT, 56.1% on MSVD, and 68.8% on VATEX.
If you find this repository useful, please star 🌟 this repo and cite 🖇️ our paper.
@article{yao2023side4video,
title={Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning},
author={Yao, Huanjin and Wu, Wenhao and Li, Zhiheng},
journal={arXiv preprint arXiv:2311.15769},
year={2023}
}
Our implementation is mainly based on the following codebases. We are sincerely grateful for their work.
- Text4Vis: Revisiting Classifier: Transferring Vision-Language Models for Video Recognition.
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.
If you have any questions about this repository, please file an issue or contact Huanjin Yao or Wenhao Wu.