A curated list of vision-language model based video action recognition resources, inspired by awesome-computer-vision.
- Datasets for VLM pretraining and video action recognition
- SOTA Results On Different Datasets And Task Settings
- Related Survey
- Pretrained Vision-Language Model
- Adaptation From Image-Language Model To Video Model
- VLM-Based Few-Shot Video Action Recognition
- Video Dataset Overview from Antoine Miech
- HACS
- Moments in Time, paper
- AVA, paper, [INRIA web] for missing videos
- Kinetics, paper, download toolkit
- OOPS - A dataset of unintentional action, paper
- COIN - a large-scale dataset for comprehensive instructional video analysis, paper
- YouTube-8M, technical report
- YouTube-BB, technical report
- DALY - Daily Action Localization in YouTube videos. Note: Weakly supervised action detection dataset. Annotations consist of the start and end time of each action, with one bounding box per action per video.
- 20BN-JESTER, 20BN-SOMETHING-SOMETHING
- ActivityNet. Note: They provide a download script and evaluation code here.
- Charades
- Charades-Ego, paper - Dataset of aligned first-person and third-person videos
- EPIC-Kitchens, paper - First-person videos recorded in kitchens. Note: they provide download scripts and a Python library here
- Sports-1M - Large scale action recognition dataset.
- THUMOS14 Note: It overlaps with the UCF-101 dataset.
- THUMOS15 Note: It overlaps with the UCF-101 dataset.
- HOLLYWOOD2: Spatio-Temporal annotations
- UCF-101, with annotations provided by THUMOS-14, a corrupted annotation list, UCF-101 corrected annotations, and different version annotations. There are also some pre-computed spatiotemporal action detection results.
- UCF-50.
- UCF-Sports. Note: the train/test split link on the official website is broken; you can download it from here instead.
- HMDB
- J-HMDB
- LIRIS-HARL
- KTH
- MSR Action Note: It overlaps with the KTH dataset.
- Sports Videos in the Wild
- NTU RGB+D
- Mixamo Mocap Dataset
- UWA3D Multiview Activity II Dataset
- Northwestern-UCLA Dataset
- SYSU 3D Human-Object Interaction Dataset
- MEVA (Multiview Extended Video with Activities) Dataset
- Panda-70M
🔨TODO: Collect SOTA results and merge them into one table
- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
- Vision-Language Models for Vision Tasks: A Survey
- Vision Transformers for Action Recognition: A Survey
- Human Action Recognition and Prediction: A Survey
- Deep Video Understanding with Video-Language Model
- Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
- A Comprehensive Study of Deep Video Action Recognition
- [CLIP] Learning Transferable Visual Models From Natural Language Supervision [code] (see the zero-shot sketch below)
- [ALIGN] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- [Florence] Florence: A New Foundation Model for Computer Vision
- [UniCL] Unified Contrastive Learning in Image-Text-Label Space [code]
🔨TODO: collect video-language pre-trained models and attach links here
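To make the zero-shot use of these pretrained models concrete, here is a minimal sketch of frame-level CLIP inference for video classification: sample a few frames, encode them, mean-pool the frame embeddings, and compare against text embeddings of prompted class names. The video path, label set, prompt template, and 8-frame sampling below are illustrative assumptions, not the recipe of any specific paper.

```python
# Minimal zero-shot video classification sketch with CLIP (illustrative only).
# Assumes the official `clip` package and `decord` (listed under tools below) are installed,
# and a placeholder video file and label set.
import torch
import clip
from PIL import Image
from decord import VideoReader, cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["archery", "playing guitar", "riding a bike"]            # hypothetical label set
text = clip.tokenize([f"a video of a person {c}" for c in classes]).to(device)

vr = VideoReader("example.mp4", ctx=cpu(0))                          # hypothetical input video
indices = torch.linspace(0, len(vr) - 1, steps=8).long().tolist()    # uniformly sample 8 frames
frames = torch.stack(
    [preprocess(Image.fromarray(vr[i].asnumpy())) for i in indices]
).to(device)

with torch.no_grad():
    frame_feats = model.encode_image(frames)                         # (8, D) frame embeddings
    video_feat = frame_feats.mean(dim=0, keepdim=True)               # mean-pool over time
    text_feats = model.encode_text(text)                             # (3, D) class embeddings
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    probs = (100.0 * video_feat @ text_feats.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```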
- AIM: Adapting Image Models for Efficient Video Action Recognition [code] (see the generic adapter sketch after this list)
- Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [code]
- Dual-path Adaptation from Image to Video Transformers [code]
- Frozen CLIP Models are Efficient Video Learners [code]
- Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding [code]
- Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning [code]
- ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning [code]
- Video Action Recognition with Attentive Semantic Units
- What Can Simple Arithmetic Operations Do for Temporal Modeling? [code]
- Prompting Visual-Language Models for Efficient Video Understanding [code]
- Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [code]
- Fine-tuned CLIP Models are Efficient Video Learners [code]
- ActionCLIP: A New Paradigm for Video Action Recognition [code]
- Expanding Language-Image Pretrained Models for General Video Recognition [code]
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [code]
- Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [code]
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [code]
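Several of the papers above (e.g. AIM and ST-Adapter) share the same parameter-efficient idea: keep the pretrained image-language backbone frozen and train only small modules inserted into it. Below is a generic bottleneck-adapter sketch of that idea; the module shape, bottleneck width, and zero initialization are illustrative assumptions and do not reproduce any specific paper's architecture.

```python
# Generic bottleneck-adapter sketch (illustrative only; not the exact module of AIM,
# ST-Adapter, or any other paper above).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project -> GELU -> up-project, added residually to the wrapped block's output."""
    def __init__(self, dim: int, bottleneck: int = 64):   # bottleneck width is an assumption
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)                     # zero-init keeps pretrained behavior at start
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a frozen pretrained block; only the adapter's parameters are trainable."""
    def __init__(self, frozen_block: nn.Module, dim: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False                        # backbone stays frozen
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))

# Toy usage: a stand-in transformer layer over tokens from 8 frames of a ViT-B/16-like model.
layer = AdaptedBlock(nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), dim=768)
tokens = torch.randn(2, 8 * 197, 768)                      # (batch, frames * patches, dim)
print(layer(tokens).shape)
```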
- CLIP-guided Prototype Modulating for Few-shot Action Recognition [code] (see the prototype-classification sketch after this list)
- D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
- Few-shot Action Recognition with Captioning Foundation Models
- GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph [code]
- Knowledge Prompting for Few-shot Action Recognition
- Multimodal Adaptation of CLIP for Few-Shot Action Recognition
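A common building block in this few-shot setting is prototype-based classification over pretrained video/text features: average the support-set embeddings of each class into a prototype and assign each query to the nearest prototype by cosine similarity. The sketch below illustrates only this generic baseline, with random features standing in for real CLIP video embeddings; it is not the method of any specific paper above.

```python
# Generic prototype-based few-shot classification sketch (illustrative only).
import torch
import torch.nn.functional as F

def prototype_classify(support_feats, support_labels, query_feats, num_classes):
    """support_feats: (N, D), support_labels: (N,), query_feats: (Q, D) -> (Q,) predictions."""
    support_feats = F.normalize(support_feats, dim=-1)
    query_feats = F.normalize(query_feats, dim=-1)
    prototypes = torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                     # (num_classes, D) class prototypes
    prototypes = F.normalize(prototypes, dim=-1)
    logits = query_feats @ prototypes.T                    # cosine similarities
    return logits.argmax(dim=-1)

# Toy 5-way 1-shot episode with random features standing in for video embeddings.
support = torch.randn(5, 512)
labels = torch.arange(5)
queries = torch.randn(10, 512)
print(prototype_classify(support, labels, queries, num_classes=5))
```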
- [3D ResNet PyTorch]
- [PyTorch Video Research]
- [M-PACT: Michigan Platform for Activity Classification in Tensorflow]
- [Inflated models on PyTorch]
- [I3D models transferred from Tensorflow to PyTorch]
- [A Two Stream Baseline on Kinetics dataset]
- [MMAction]
- [MMAction2]
- [PySlowFast]
- [Decord] Efficient video reader for Python (see the loading sketch after this list)
- [I3D models converted from Tensorflow to Core ML]
- [Extract frames and optical flow from videos, #docker]
- [NVIDIA-DALI, video loading pipelines]
- [NVIDIA optical-flow SDK]
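For reference, a minimal Decord loading sketch (illustrative; the file path and 16-frame sampling are placeholders): set the bridge so frames come back as torch tensors and use `get_batch` to read many frames in one call.

```python
# Minimal Decord usage sketch (illustrative only). Assumes `pip install decord` and a local
# video file; the torch bridge makes get_batch return torch tensors directly.
import decord
from decord import VideoReader, cpu

decord.bridge.set_bridge("torch")

vr = VideoReader("example.mp4", ctx=cpu(0))                # hypothetical video file
print(len(vr), vr.get_avg_fps())                           # frame count and average FPS

step = max(1, len(vr) // 16)
indices = list(range(0, len(vr), step))[:16]               # ~16 roughly uniform frame indices
frames = vr.get_batch(indices)                             # (T, H, W, C) uint8 torch tensor
frames = frames.permute(0, 3, 1, 2).float() / 255.0        # (T, C, H, W) in [0, 1] for a model
print(frames.shape)
```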
- What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment - P. Parmar and B. T. Morris, CVPR2019.
- PathTrack: Fast Trajectory Annotation with Path Supervision - S. Manen et al., ICCV2017.
- CortexNet: a Generic Network Family for Robust Visual Temporal Representations - A. Canziani and E. Culurciello, arXiv2017. [code] [project web]
- Slicing Convolutional Neural Network for Crowd Video Understanding - J. Shao et al., CVPR2016. [code]
- Two-Stream (RGB and Flow) pretrained model weights
License