A curated list of vision-language model based video action recognition resources, inspired by awesome-computer-vision.
- Datasets for VLM pretraining and video action recognition
- SOTA Results On Different Datasets And Task Settings
- Related Survey
- Pretrained Vision-Language Model
- Adaptation From Image-Language Model To Video Model
- VLM-Based Few-Shot Video Action Recognition
- Video Dataset Overview from Antoine Miech
- HACS
- Moments in Time, paper
- AVA, paper, [INRIA web] for missing videos
- Kinetics, paper, download toolkit
- OOPS - A dataset of unintentional action, paper
- COIN - a large-scale dataset for comprehensive instructional video analysis, paper
- YouTube-8M, technical report
- YouTube-BB, technical report
- DALY - Daily Action Localization in YouTube videos. Note: Weakly supervised action detection dataset. Annotations consist of the start and end time of each action, with one bounding box per action per video.
- 20BN-JESTER, 20BN-SOMETHING-SOMETHING
- ActivityNet. Note: They provide a download script and evaluation code here.
- Charades
- Charades-Ego, paper - Dataset of aligned first-person and third-person videos
- EPIC-Kitchens, paper - First-person videos recorded in kitchens. Note: they provide download scripts and a Python library here
- Sports-1M - Large scale action recognition dataset.
- THUMOS14 Note: It overlaps with the UCF-101 dataset.
- THUMOS15 Note: It overlaps with the UCF-101 dataset.
- HOLLYWOOD2: Spatio-Temporal annotations
- UCF-101, with annotations provided by THUMOS-14, a corrupted annotation list, UCF-101 corrected annotations, and different version annotations. There are also some pre-computed spatiotemporal action detection results.
- UCF-50.
- UCF-Sports. Note: the train/test split link on the official website is broken; you can download it from here instead.
- HMDB
- J-HMDB
- LIRIS-HARL
- KTH
- MSR Action Note: It overlaps with the KTH dataset.
- Sports Videos in the Wild
- NTU RGB+D
- Mixamo Mocap Dataset
- UWA3D Multiview Activity II Dataset
- Northwestern-UCLA Dataset
- SYSU 3D Human-Object Interaction Dataset
- MEVA (Multiview Extended Video with Activities) Dataset
- Panda-70M
🔨TODO: Collect SOTA results and merge them into one table
- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
- Vision-Language Models for Vision Tasks: A Survey
- Vision Transformers for Action Recognition: A Survey
- Human Action Recognition and Prediction: A Survey
- Deep Video Understanding with Video-Language Model
- Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
- A Comprehensive Study of Deep Video Action Recognition
- [CLIP] Learning Transferable Visual Models From Natural Language Supervision [code] (see the zero-shot sketch below)
- [ALIGN] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- [Florence] Florence: A New Foundation Model for Computer Vision
- [UniCL] Unified Contrastive Learning in Image-Text-Label Space [code]
🔨TODO: collect video-language pre-trained models and attach links here
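To make the zero-shot use of these pretrained models concrete, here is a minimal sketch of frame-level CLIP inference for video classification: sample a few frames, encode them, mean-pool the frame embeddings, and compare against text embeddings of prompted class names. The video path, label set, prompt template, and 8-frame sampling below are illustrative assumptions, not the recipe of any specific paper.

```python
# Minimal zero-shot video classification sketch with CLIP (illustrative only).
# Assumes the official `clip` package and `decord` (listed under tools below) are installed,
# and a placeholder video file and label set.
import torch
import clip
from PIL import Image
from decord import VideoReader, cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["archery", "playing guitar", "riding a bike"]            # hypothetical label set
text = clip.tokenize([f"a video of a person {c}" for c in classes]).to(device)

vr = VideoReader("example.mp4", ctx=cpu(0))                          # hypothetical input video
indices = torch.linspace(0, len(vr) - 1, steps=8).long().tolist()    # uniformly sample 8 frames
frames = torch.stack(
    [preprocess(Image.fromarray(vr[i].asnumpy())) for i in indices]
).to(device)

with torch.no_grad():
    frame_feats = model.encode_image(frames)                         # (8, D) frame embeddings
    video_feat = frame_feats.mean(dim=0, keepdim=True)               # mean-pool over time
    text_feats = model.encode_text(text)                             # (3, D) class embeddings
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    probs = (100.0 * video_feat @ text_feats.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```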
- AIM: Adapting Image Models for Efficient Video Action Recognition [code] (see the generic adapter sketch after this list)
- Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [code]
- Dual-path Adaptation from Image to Video Transformers [code]
- Frozen CLIP Models are Efficient Video Learners [code]
- Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding [code]
- Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning [code]
- ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning [code]
- Video Action Recognition with Attentive Semantic Units
- What Can Simple Arithmetic Operations Do for Temporal Modeling? [code]
- Prompting Visual-Language Models for Efficient Video Understanding [code]
- Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [code]
- Fine-tuned CLIP Models are Efficient Video Learners [code]
- ActionCLIP: A New Paradigm for Video Action Recognition [code]
- Expanding Language-Image Pretrained Models for General Video Recognition [code]
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [code]
- Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [code]
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [code]
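Several of the papers above (e.g. AIM and ST-Adapter) share the same parameter-efficient idea: keep the pretrained image-language backbone frozen and train only small modules inserted into it. Below is a generic bottleneck-adapter sketch of that idea; the module shape, bottleneck width, and zero initialization are illustrative assumptions and do not reproduce any specific paper's architecture.

```python
# Generic bottleneck-adapter sketch (illustrative only; not the exact module of AIM,
# ST-Adapter, or any other paper above).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project -> GELU -> up-project, added residually to the wrapped block's output."""
    def __init__(self, dim: int, bottleneck: int = 64):   # bottleneck width is an assumption
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)                     # zero-init keeps pretrained behavior at start
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a frozen pretrained block; only the adapter's parameters are trainable."""
    def __init__(self, frozen_block: nn.Module, dim: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False                        # backbone stays frozen
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))

# Toy usage: a stand-in transformer layer over tokens from 8 frames of a ViT-B/16-like model.
layer = AdaptedBlock(nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), dim=768)
tokens = torch.randn(2, 8 * 197, 768)                      # (batch, frames * patches, dim)
print(layer(tokens).shape)
```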
- CLIP-guided Prototype Modulating for Few-shot Action Recognition [code] (see the prototype-classification sketch after this list)
- D2ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
- Few-shot Action Recognition with Captioning Foundation Models
- GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph [code]
- Knowledge Prompting for Few-shot Action Recognition
- Multimodal Adaptation of CLIP for Few-Shot Action Recognition
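A common building block in this few-shot setting is prototype-based classification over pretrained video/text features: average the support-set embeddings of each class into a prototype and assign each query to the nearest prototype by cosine similarity. The sketch below illustrates only this generic baseline, with random features standing in for real CLIP video embeddings; it is not the method of any specific paper above.

```python
# Generic prototype-based few-shot classification sketch (illustrative only).
import torch
import torch.nn.functional as F

def prototype_classify(support_feats, support_labels, query_feats, num_classes):
    """support_feats: (N, D), support_labels: (N,), query_feats: (Q, D) -> (Q,) predictions."""
    support_feats = F.normalize(support_feats, dim=-1)
    query_feats = F.normalize(query_feats, dim=-1)
    prototypes = torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                     # (num_classes, D) class prototypes
    prototypes = F.normalize(prototypes, dim=-1)
    logits = query_feats @ prototypes.T                    # cosine similarities
    return logits.argmax(dim=-1)

# Toy 5-way 1-shot episode with random features standing in for video embeddings.
support = torch.randn(5, 512)
labels = torch.arange(5)
queries = torch.randn(10, 512)
print(prototype_classify(support, labels, queries, num_classes=5))
```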
- [3D ResNet PyTorch]
- [PyTorch Video Research]
- [M-PACT: Michigan Platform for Activity Classification in Tensorflow]
- [Inflated models on PyTorch]
- [I3D models transferred from Tensorflow to PyTorch]
- [A Two Stream Baseline on Kinetics dataset]
- [MMAction]
- [MMAction2]
- [PySlowFast]
- [Decord] Efficient video reader for Python (see the loading sketch after this list)
- [I3D models converted from Tensorflow to Core ML]
- [Extract frames and optical flow from videos, #docker]
- [NVIDIA-DALI, video loading pipelines]
- [NVIDIA optical-flow SDK]
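For reference, a minimal Decord loading sketch (illustrative; the file path and 16-frame sampling are placeholders): set the bridge so frames come back as torch tensors and use `get_batch` to read many frames in one call.

```python
# Minimal Decord usage sketch (illustrative only). Assumes `pip install decord` and a local
# video file; the torch bridge makes get_batch return torch tensors directly.
import decord
from decord import VideoReader, cpu

decord.bridge.set_bridge("torch")

vr = VideoReader("example.mp4", ctx=cpu(0))                # hypothetical video file
print(len(vr), vr.get_avg_fps())                           # frame count and average FPS

step = max(1, len(vr) // 16)
indices = list(range(0, len(vr), step))[:16]               # ~16 roughly uniform frame indices
frames = vr.get_batch(indices)                             # (T, H, W, C) uint8 torch tensor
frames = frames.permute(0, 3, 1, 2).float() / 255.0        # (T, C, H, W) in [0, 1] for a model
print(frames.shape)
```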
- What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment - P. Parmar and B. T. Morris, CVPR2019.
- PathTrack: Fast Trajectory Annotation with Path Supervision - S. Manen et al., ICCV2017.
- CortexNet: a Generic Network Family for Robust Visual Temporal Representations - A. Canziani and E. Culurciello, arXiv2017. [code] [project web]
- Slicing Convolutional Neural Network for Crowd Video Understanding - J. Shao et al., CVPR2016. [code]
- Two-Stream (RGB and Flow) pretrained model weights
License