To create a training or evaluation set for action recognition, the ground truth start and end positions of actions in videos need to be annotated. We looked into various tools for this, and the tool we liked most (by far) is the VGG Image Annotator (VIA), written by the VGG group at Oxford.
We will now provide a few tips and steps on how to use the VIA tool. A fully functioning live demo of the tool can be found here.
Screenshot of VIA Tool
How to use the tool for action recognition:
- Step 1: Download the zip file from the link here.
- Step 2: Unzip the tool and open via_video_annotator.html to start the annotation tool. Note: support in some browsers does not seem fully stable; we found Chrome to work best.
- Step 3: Import the video file(s) from your local machine or from a URL using the respective import buttons.
- Step 4: Create a new attribute for action annotation via the attribute editor. Select Temporal Segment in Video or Audio as the Anchor. To see the created attribute, open the attribute editor again.
- Step 5: Update the Timeline List with the actions you want to track. Separate different actions by numbering them, e.g. "1. eat, 2. drink" creates two separate tracks for eat and drink. Click update to see the updated tracks.
- Step 6: Click on a track to add segment annotations for a certain action. Press `a` to add a temporal segment at the current time and `Shift + a` to update the edge of the temporal segment to the current time.
- Step 7: Export the annotations via the export button. Select Only Temporal Segments as CSV if you only have temporal segment annotations.
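For downstream processing it can be handy to load the exported temporal segments in Python. Below is a minimal parsing sketch; the column layout (a JSON-encoded file list, a JSON-encoded [start, end] pair, and JSON-encoded label metadata) is an assumption based on a typical VIA 3 export, so adjust the indices to match your CSV.

```python
import csv
import json

def read_via_segments(csv_path):
    """Parse a VIA "Only Temporal Segments as CSV" export.

    Assumes each data row stores a JSON-encoded file list, a JSON-encoded
    [start, end] pair (in seconds), and a JSON-encoded metadata dict holding
    the action label -- adjust the column indices to match your export.
    """
    segments = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if not row or row[0].startswith("#"):   # skip comment/header lines
                continue
            files = json.loads(row[1])              # e.g. ["video1.mp4"]
            coords = json.loads(row[3])             # e.g. [2.92, 4.68]
            if len(coords) != 2:                    # not a temporal segment
                continue
            label = next(iter(json.loads(row[5]).values()))  # e.g. "eat"
            segments.append((files[0], label, coords[0], coords[1]))
    return segments

# Example: print every annotated action with its start/end time.
for video, label, start, end in read_via_segments("annotations.csv"):
    print(f"{video}: {label} [{start:.2f}s - {end:.2f}s]")
```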
The VIA tool outputs annotations as a CSV file. Often, however, we need each annotated action written out as its own clip in a separate file. These clips can then serve as training examples for action recognition models. We provide some scripts to aid in the construction of such datasets:
- video_conversion.py - Conversion of the video clips to a format which the VIA tool knows how to read.
- clip_extraction.py - Extraction of each annotated action as a separate clip. Optionally, "negative" clips can be generated, in which no action-of-interest occurs. Negative clips can be extracted in two ways: either all contiguous, non-overlapping negative clips are extracted, or a specified number of negative examples is randomly sampled. This behaviour is controlled by the `contiguous` flag. The script outputs clips into directories specific to each class and generates a label file that maps each filename to the clip's class label.
- split_examples.py - Splits the generated example clips into training and evaluation sets. Optionally, a negative candidate set and negative test set can be generated for hard negative mining.
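To give an idea of what clip extraction involves, here is a simplified sketch that cuts each annotated segment into its own file with ffmpeg and writes a label file. It is not the clip_extraction.py script itself (negative-clip generation and train/eval splitting are omitted), and it assumes ffmpeg is available on the PATH and that segments come from a parser like the one shown above.

```python
import csv
import subprocess
from pathlib import Path

def extract_clips(segments, video_dir, out_dir, label_file="labels.csv"):
    """Cut each (video, label, start, end) segment into its own clip.

    Clips are written into one sub-directory per class; a label file maps
    each clip filename to its class label.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / label_file, "w", newline="") as f:
        writer = csv.writer(f)
        for i, (video, label, start, end) in enumerate(segments):
            clip_dir = out_dir / label
            clip_dir.mkdir(exist_ok=True)
            clip_path = clip_dir / f"{Path(video).stem}_{i:04d}.mp4"
            # Seek to the segment start and re-encode only its duration.
            subprocess.run(
                ["ffmpeg", "-y", "-ss", str(start), "-i",
                 str(Path(video_dir) / video), "-t", str(end - start),
                 str(clip_path)],
                check=True,
            )
            writer.writerow([clip_path.name, label])

# Example usage with the parser from above (paths are placeholders):
# extract_clips(read_via_segments("annotations.csv"), "videos/", "clips/")
```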
Below is a list of alternative UIs for annotating actions; however, in our opinion, the VIA tool is by far the best performer. We distinguish between:
- Fixed-length clip annotation: where the UI splits the video into fixed-length clips, and the user then annotates the clips.
- Segment annotation: where the user annotates the exact start and end position of each action directly. This is more time-consuming than fixed-length clip annotation, but comes with higher localization accuracy.
See also the HACS Dataset web page for some examples showing these two types of annotations.
| Tool Name | Annotation Type | Pros | Cons | Open Source? |
|---|---|---|---|---|
| MuViLab | Fixed-length clip annotation | | | Open source on GitHub |
| VIA (VGG Image Annotator) | Segment annotation | | | Open source on GitLab |
| ANVIL | Segment annotation | | | Not open source, but free to download |
| Action Annotation Tool | Segment annotation | | | Open source on GitHub |
- Deep Learning for Videos: A 2018 Guide to Action Recognition.
- Zhao, H., et al. "HACS: Human action clips and segments dataset for recognition and temporal localization." arXiv preprint arXiv:1712.09374 (2019).
- Kay, Will, et al. "The kinetics human action video dataset." arXiv preprint arXiv:1705.06950 (2017).
- Abhishek Dutta and Andrew Zisserman. 2019. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3343031.3350535.