This repo holds the code and models for the temporal action localization framework presented at ICIP 2019.
Exploring Feature Representation and Training strategies in Temporal Action Localization. Tingting Xie, Xiaoshan Yang, Tianzhu Zhang, Changsheng Xu, Ioannis Patras. ICIP 2019.
If you find this work helpful for your research, please cite:
@article{xie2019exploring,
title={Exploring Feature Representation and Training strategies in Temporal Action Localization},
author={Xie, Tingting and Yang, Xiaoshan and Zhang, Tianzhu and Xu, Changsheng and Patras, Ioannis},
journal={arXiv preprint arXiv:1905.10608},
year={2019}
}
In this paper, unit-level two-stream features are used on the THUMOS-14 dataset. The RGB features can be downloaded here: val set, test set; the denseflow features can be downloaded here: val set, test set. Note that the val set is used for training, as the train set of THUMOS-14 does not contain untrimmed videos.
Training and testing are implemented in TensorFlow for ease of use. The following software is required to run the code:
- Python 3
- TensorFlow 1.14
A GPU is required to run this code. One GPU with 3~4 GB of memory is usually enough for smooth training.
Then clone this repo with git.
git clone git@github.com:June01/icip19-tad.git
Note: Before running the code, please remember to change the path of the features (named by self.prefix) in config.py.
The test action proposals are provided in ./props/test_proposals_from_TURN.txt. If you want to generate your own proposals, please go to the TURN repository. In this paper we also report the performance for different Average Number (AN) of proposals; these proposal files are also provided in ./props/.
In the original paper, we train the network with the following command.
python main.py --pool_level=k --fusion_type=fusion_type
k is the granularity used to divide each proposal into units; we use k=5 by default. fusion_type specifies how the two-stream features are handled, such as RGB, Flow, or early fusion. For late fusion, please see the postprocessing step.
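As a rough illustration of what pool_level=k means, the sketch below splits a proposal into k equal temporal parts and mean-pools the unit-level features that fall into each part. The function name `pool_proposal`, the use of mean pooling, and the feature layout (one row per unit-level feature) are assumptions made for this example and are not taken from main.py.

```python
import math

def pool_proposal(features, start, end, k):
    """Mean-pool rows [start, end) of `features` into k parts.

    features: list of equal-length float lists, one row per unit-level feature.
    start, end: row indices of the proposal; assumed to lie inside `features`.
    Returns a list of k pooled feature vectors.
    """
    # k + 1 evenly spaced boundaries between start and end
    bounds = [start + (end - start) * i / k for i in range(k + 1)]
    pooled = []
    for i in range(k):
        lo = int(math.floor(bounds[i]))
        # make sure every part covers at least one row
        hi = max(int(math.ceil(bounds[i + 1])), lo + 1)
        rows = features[lo:hi]
        dim = len(rows[0])
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# Toy input: 10 unit-level features of dimension 2
features = [[float(2 * i), float(2 * i + 1)] for i in range(10)]
parts = pool_proposal(features, 0, 10, 5)
```

With k=1 the whole proposal collapses into a single vector, while larger k keeps more of the temporal structure inside the proposal.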
Note: All the results in the paper were reported with THUMOS14 evaluation 2014. However, there is another version, THUMOS14 evaluation 2015, which is not clearly indicated on the website even though it was apparently released years ago. (We have worked out the differences between the two evaluation codes; please file an issue if you need an explanation.) Based on the new evaluation metric, we made some changes to training, and you can train your own model with the following command. The corresponding results can be found in the next section.
python main.py --pool_level=k --fusion_type=fusion_type --dropout=True --opm_type='adam_wd' --l1_loss=True
We provide the pretrained reference models in TensorFlow ckpt format, which can be downloaded here. The results corresponding to each model can be found here.
First, you need to get the detection scores for all proposals by running:
python main.py --pool_level=k --fusion_type=fusion_type --mode=test --cas_step=3 --test_model_path=MODEL_PATH
Then, the result pickle file PKL_FILE will be saved in ./eval/test_results/. It can be used to compute the predicted class of each proposal and the corresponding offsets:
python gen_prop_outputs.py PKL_FILE_1 PKL_FILE_2 T
For RGB, flow and early fusion results, PKL_FILE_1 and PKL_FILE_2 should be set to the same file; for late fusion, PKL_FILE_1 should be the RGB pkl file and PKL_FILE_2 the flow pkl file. This step produces FUSION_PKL_FILE. Note: T=1 corresponds to the baseline method in the paper and T=3 to the improved version.
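To make the difference between early and late fusion concrete, here is a minimal sketch. It assumes early fusion concatenates the RGB and flow feature vectors before the network, and late fusion takes a weighted average of the per-class scores of the two streams. The function names, the `rgb_weight` parameter, and the dictionary layout are hypothetical; the actual logic and file format used by gen_prop_outputs.py may differ.

```python
def early_fuse(rgb_feat, flow_feat):
    """Early fusion (assumed): concatenate the two feature vectors."""
    return list(rgb_feat) + list(flow_feat)

def late_fuse(rgb_scores, flow_scores, rgb_weight=0.5):
    """Late fusion (assumed): weighted average of per-class scores.

    rgb_scores, flow_scores: dicts mapping a proposal id to a list of
    class scores; both dicts are assumed to share the same keys.
    """
    fused = {}
    for prop_id, rgb in rgb_scores.items():
        flow = flow_scores[prop_id]
        fused[prop_id] = [
            rgb_weight * r + (1.0 - rgb_weight) * f for r, f in zip(rgb, flow)
        ]
    return fused

rgb = {"video_1_prop_0": [0.2, 0.8]}
flow = {"video_1_prop_0": [0.6, 0.4]}
fused = late_fuse(rgb, flow)
```

Under this assumption, passing the same scores for both streams leaves them unchanged, which would be consistent with setting PKL_FILE_1 and PKL_FILE_2 to the same file for single-stream and early fusion results.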
Finally, NMS is used to suppress redundant proposals. The final list of predicted actions will be saved in ./eval/after_postprocessing/.
python postproc.py FUSION_PKL_FILE 0.5
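For readers unfamiliar with NMS on temporal segments, the following is a minimal greedy sketch. It assumes each detection is a (start, end, score) tuple and that the 0.5 argument above is the overlap threshold; the function name `temporal_nms` is hypothetical and the code is not copied from postproc.py.

```python
def temporal_nms(detections, threshold=0.5):
    """Greedy NMS over (start, end, score) tuples.

    Keeps the highest-scoring detection, then drops every remaining
    detection whose temporal overlap with a kept one exceeds `threshold`.
    """
    def overlap(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(overlap(det, k) <= threshold for k in kept):
            kept.append(det)
    return kept

dets = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]
result = temporal_nms(dets, threshold=0.5)
```

In this toy input the second detection overlaps the first by about 0.82 and is suppressed, while the third does not overlap anything and is kept.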
The mAP@0.5 performance of the baseline model we provide is 44.85% under evaluation method 2014. Based on evaluation method 2015, we also report some important results below, which are comparable with the state-of-the-art 36.9%.
| mAP@IoU (%) | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
---------------------------------------------------------------
| STPP(L=3) | 52.08 | 45.11 | 35.32 | 23.62 | 11.61 |
| BSP(2/4/2) | 51.17 | 43.92 | 34.59 | 22.02 | 10.94 |
| Ours(k=1) | 46.69 | 40.48 | 31.23 | 19.95 | 9.78 |
| Ours(k=2) | 50.20 | 43.67 | 34.31 | 23.77 | 10.83 |
| Ours(k=5) | 51.66 | 46.56 | 36.83 | 25.39 | 12.69 |
| Ours(k=10) | 52.49 | 46.58 | 37.37 | 24.54 | 12.43 |
---------------------------------------------------------------

| mAP@IoU (%) | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
---------------------------------------------------------------
| RGB | 39.07 | 33.67 | 23.55 | 13.15 | 5.70 |
| Flow | 47.12 | 42.05 | 33.80 | 22.89 | 12.13 |
| Early Fusion | 51.66 | 46.56 | 36.83 | 25.39 | 12.69 |
| Late Fusion | 49.77 | 44.45 | 34.98 | 21.33 | 10.36 |
---------------------------------------------------------------
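The column headers in the tables are temporal IoU thresholds. As a small worked example of what they mean, the sketch below computes the temporal IoU between one predicted segment and one ground-truth segment and lists the thresholds at which the prediction would count as a match. The function names are hypothetical, and the official THUMOS14 evaluation code applies further rules (for example, how duplicate detections are handled) that are not reproduced here.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def matched_thresholds(pred, gt, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return the thresholds at which `pred` overlaps `gt` enough to match."""
    iou = temporal_iou(pred, gt)
    return [t for t in thresholds if iou >= t]

# Prediction 10s-20s against ground truth 12s-22s: IoU = 8 / 12
hits = matched_thresholds((10.0, 20.0), (12.0, 22.0))
```

Here the prediction matches at thresholds 0.3 through 0.6 but not at 0.7, which is why mAP drops as the threshold increases from left to right in the tables.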
- Anet-2016: The two-stream feature extractor used in this paper.
- CBR: The base network this work builds on.
- TURN-TAP: The source of the first-stage proposals.
For any questions, please file an issue or contact
Tingting Xie: t.xie@qmul.ac.uk
Also, I would like to thank Yu-le Li and Christos Tzelepis for their valuable suggestions and discussions on both this project and the paper.