This repository contains the implementation of the model presented in the following papers:
Spatial Temporal Transformer Network for Skeleton-Based Action Recognition, Chiara Plizzari, Marco Cannici, Matteo Matteucci, ArXiv
Spatial Temporal Transformer Network for Skeleton-Based Action Recognition, Chiara Plizzari, Marco Cannici, Matteo Matteucci, Pattern Recognition. ICPR International Workshops and Challenges, 2021, Proceedings
Skeleton-based action recognition via spatial and temporal transformer networks, Chiara Plizzari, Marco Cannici, Matteo Matteucci, Computer Vision and Image Understanding, Volumes 208-209, 2021, 103219, ISSN 1077-3142, CVIU
The heatmaps are 25 x 25 matrices, where each row and each column represents a body joint. An element in position (i, j) represents the correlation between joint i and joint j, resulting from self-attention.
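As a rough illustration of how such a matrix arises, the sketch below computes scaled dot-product self-attention weights over 25 joints in plain Python. The random joint features, the feature size of 8, the omission of learned query/key projections, and the function name attention_map are all assumptions made for illustration; this is not the repository's implementation.

```python
import math
import random


def attention_map(features):
    """Return an N x N matrix of self-attention weights for N joint feature vectors.

    Query and key projections are omitted (treated as identity) to keep the sketch short.
    """
    d = len(features[0])
    weights = []
    for q in features:
        # Scaled dot product between joint i (query) and every joint j (key).
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        # Numerically stable softmax over the row, so each row sums to 1.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


random.seed(0)
# 25 joints with 8 hypothetical features each (random placeholders, not real skeleton data).
joints = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(25)]
heatmap = attention_map(joints)
print(len(heatmap), len(heatmap[0]))  # 25 25
```

Element heatmap[i][j] is the weight that joint i assigns to joint j, which is the quantity a heatmap cell visualizes.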
- Python3
- PyTorch
- All the libraries in requirements.txt
python3 main.py
Training:
Set in /config/st_gcn/nturgbd/train.yaml:
- Training: True
Testing:
Set in /config/st_gcn/nturgbd/train.yaml:
- Training: False
We performed our experiments on three datasets: NTU-RGB+D 60, NTU-RGB+D 120 and Kinetics.
The data can be downloaded from their website. You need to download the 3D skeletons only (5.8G for NTU-60 + 4.5G for NTU-120). Once downloaded, use the following to generate joint data for NTU-60:
python3 ntu_gendata.py
If you want to generate the data and preprocess them in a single step, use instead:
python3 preprocess.py
In order to generate bones, you need to run:
python3 ntu_gen_bones.py
The joint information and bone information can be merged through:
python3 ntu_merge_joint_bones.py
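The idea behind the bone and merge steps can be sketched as follows. This is a minimal illustration that assumes the common convention of defining a bone as the difference between a joint and its parent joint; the scripts above may differ in detail. The function names, the parent list and the toy 3-joint skeleton are hypothetical.

```python
def joints_to_bones(joints, parents):
    """Compute one bone vector per joint as (joint - parent joint).

    joints: list of (x, y, z) tuples.
    parents: list of parent indices; a root joint points to itself,
             so its bone is the zero vector.
    """
    bones = []
    for (x, y, z), p in zip(joints, parents):
        px, py, pz = joints[p]
        bones.append((x - px, y - py, z - pz))
    return bones


def merge_joints_bones(joints, bones):
    """Concatenate joint and bone coordinates, giving 6 channels per joint."""
    return [j + b for j, b in zip(joints, bones)]


# Toy 3-joint chain (hypothetical, not the NTU skeleton): joint 0 is the root.
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 1.5, 0.0)]
parents = [0, 0, 1]
bones = joints_to_bones(joints, parents)
merged = merge_joints_bones(joints, bones)
print(merged[2])  # (0.5, 1.5, 0.0, 0.5, 0.5, 0.0)
```

After merging, each joint carries 3 joint coordinates followed by 3 bone coordinates.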
For NTU-120, the samples are divided between training and testing in a different way. Thus, you need to run:
python3 ntu120_gendata.py
If you want to generate the data and preprocess them in a single step, use instead:
python3 preprocess_120.py
Kinetics is a dataset for video action recognition, consisting of raw video data only. The corresponding skeletons are extracted using OpenPose and are available for download from Google Drive (7.5G). From the raw skeletons, generate the dataset by running:
python3 kinetics_gendata.py
Spatial Transformer implementation corresponds to ST-TR/code/st_gcn/net/spatial_transformer.py.
Set in /config/st_gcn/nturgbd/train.yaml:
attention: True
tcn_attention: False
only_attention: True
all_layers: False
to run the spatial transformer stream (S-TR-stream).
Temporal Transformer implementation corresponds to ST-TR/code/st_gcn/net/temporal_transformer.py.
Set in /config/st_gcn/nturgbd/train.yaml:
attention: False
tcn_attention: True
only_attention: True
all_layers: False
to run the temporal transformer stream (T-TR-stream).
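The two flag combinations above can be summarized side by side. The sketch below only restates the values listed for each stream; the dictionary layout and the helper names are hypothetical, and where these keys sit inside train.yaml is not shown here.

```python
# Flag values for the two streams, copied from the settings listed above.
STREAM_FLAGS = {
    "S-TR": {"attention": True, "tcn_attention": False,
             "only_attention": True, "all_layers": False},
    "T-TR": {"attention": False, "tcn_attention": True,
             "only_attention": True, "all_layers": False},
}


def flags_for(stream):
    """Return a copy of the flag dictionary for the requested stream name."""
    if stream not in STREAM_FLAGS:
        raise ValueError(f"unknown stream: {stream!r}")
    return dict(STREAM_FLAGS[stream])


def differing_flags(stream_a, stream_b):
    """Return the sorted flag names whose values differ between two streams."""
    a, b = flags_for(stream_a), flags_for(stream_b)
    return sorted(k for k in a if a[k] != b[k])


print(differing_flags("S-TR", "T-TR"))  # ['attention', 'tcn_attention']
```

Only attention and tcn_attention change between the two streams; only_attention and all_layers stay the same.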
The scores resulting from the S-TR stream and the T-TR stream are combined to produce the final ST-TR score by:
python3 ensemble.py
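A minimal sketch of a two-stream score combination is given below. It assumes that the per-class scores of the two streams are summed (with an optional weight alpha) and that the class with the highest combined score is predicted; the exact rule used by ensemble.py may differ. The function name, the alpha parameter and the example scores are hypothetical.

```python
def ensemble(scores_a, scores_b, alpha=1.0):
    """Combine two per-class score lists by weighted sum.

    Returns (combined_scores, predicted_class_index).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same number of classes")
    combined = [a + alpha * b for a, b in zip(scores_a, scores_b)]
    # Index of the highest combined score.
    predicted = max(range(len(combined)), key=combined.__getitem__)
    return combined, predicted


# Hypothetical scores for one sample over three classes.
s_tr = [0.1, 0.6, 0.3]  # spatial stream
t_tr = [0.2, 0.3, 0.5]  # temporal stream
combined, label = ensemble(s_tr, t_tr)
print(label)  # 1
```

In this toy case the temporal stream alone would pick class 2, while the combined score picks class 1.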
In order to run T-TR-agcn and ST-TR-agcn configurations, please set agcn: True.
Set in /config/st_gcn/nturgbd/train.yaml:
- only_attention: False, to use ST-TR as an augmentation procedure to ST-GCN (refer to Sec. V(E) "Effect of Augmenting Convolution with Self-Attention")
- all_layers: True, to apply ST-TR on all layers, otherwise it will be applied from the 4th layer on (refer to Sec. V(D) "Effect of Applying Self-Attention to Feature Extraction")
- Set both attention: True and tcn_attention: True to combine both SSA and TSA on a unique stream (refer to Sec. V(F) "Effect of combining SSA and TSA on one stream")
- more_channels: True, to assign to each head more channels than dk/Nh
- n: used if more_channels is set to True, in order to assign to each head dk*num/Nh channels
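The effect of more_channels and n on the per-head channel count can be worked through with a small sketch. It assumes that num in dk*num/Nh refers to the n option, that dk is divisible by Nh, and that integer division is used; the function name and the example values are hypothetical.

```python
def channels_per_head(dk, Nh, more_channels=False, n=1):
    """Return the number of channels assigned to each attention head.

    Default: dk / Nh. With more_channels set: dk * n / Nh.
    """
    if Nh <= 0 or dk % Nh != 0:
        raise ValueError("dk must be a positive multiple of Nh")
    if more_channels:
        return dk * n // Nh
    return dk // Nh


print(channels_per_head(64, 8))                           # 8
print(channels_per_head(64, 8, more_channels=True, n=4))  # 32
```

With dk = 64 and Nh = 8, each head gets 8 channels by default and 32 channels when more_channels is enabled with n = 4.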
To set the block dimensions of the windowed version of Temporal Transformer, use dim_block1, dim_block2 and dim_block3, which set the block dimension where the output channels are equal to 64, 128 and 256, respectively.
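To give an intuition for what a block dimension controls, the sketch below splits a sequence of frames into consecutive, non-overlapping blocks of a given size. This is an assumption about what "windowed" means here (attention restricted to frames within the same block); how the repository builds its windows is not described above. The function name and the frame count of 300 are hypothetical.

```python
def split_into_blocks(num_frames, dim_block):
    """Return (start, end) frame index ranges of non-overlapping blocks.

    The last block is shorter when num_frames is not a multiple of dim_block.
    """
    if num_frames < 0 or dim_block <= 0:
        raise ValueError("num_frames must be >= 0 and dim_block must be > 0")
    return [(start, min(start + dim_block, num_frames))
            for start in range(0, num_frames, dim_block)]


print(split_into_blocks(300, 100))  # [(0, 100), (100, 200), (200, 300)]
```

Under this assumption, a smaller block dimension means each frame attends to fewer neighbouring frames, which reduces the size of each attention matrix.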
Set in /config/st_gcn/nturgbd/train.yaml:
- channels: 6, because on the channels dimension we have both the coordinates of joints (3) and the coordinates of bones (3)
- double_channel: True, since in this configuration we also doubled the channels in each layer
Please note that I have attached pre-trained models of the configurations presented in the paper in the checkpoint_ST-TR folder. The *bones*.pth configurations correspond to the models trained with joint+bones information, while the others are trained with joints only.
Please cite one of the following papers if you use this code in your research:
@article{plizzari2021skeleton,
title={Skeleton-based action recognition via spatial and temporal transformer networks},
author={Plizzari, Chiara and Cannici, Marco and Matteucci, Matteo},
journal={Computer Vision and Image Understanding},
volume={208},
pages={103219},
year={2021},
publisher={Elsevier}
}
@inproceedings{plizzari2021spatial,
title={Spatial temporal transformer network for skeleton-based action recognition},
author={Plizzari, Chiara and Cannici, Marco and Matteucci, Matteo},
booktitle={Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10--15, 2021, Proceedings, Part III},
pages={694--701},
year={2021},
organization={Springer}
}
If you have any questions, do not hesitate to contact me at chiara.plizzari@mail.polimi.it. I will be glad to clarify your doubts!
Note: we include LICENSE, LICENSE_1 and LICENSE_2 in this repository since part of the code has been derived respectively from https://github.com/yysijie/st-gcn, https://github.com/leaderj1001/Attention-Augmented-Conv2d and https://github.com/kenziyuliu/Unofficial-DGNN-PyTorch/blob/master/README.md