PVLR


Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization (ACM MM 2024)
Geuntaek Lim (Sejong Univ.), Hyunwoo Kim (Sejong Univ.), Joonsoo Kim (ETRI), and Yukyung Choi† (Sejong Univ.)

Abstract: Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods.
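
For intuition, the sketch below illustrates the general idea of a probabilistic embedding as described above: features are mapped to Gaussian distributions (a mean and a variance) rather than deterministic points, and two embeddings are compared with a statistical similarity (here, a closed-form 2-Wasserstein distance between diagonal Gaussians). The module name, dimensions, and distance choice are illustrative assumptions, not the repository's actual implementation; please refer to the paper and code for the exact formulation.

# Illustrative sketch of probabilistic embeddings (assumptions, not the repo's code):
# map features to diagonal Gaussians and compare them with a statistical distance.
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    """Maps a feature vector to a Gaussian embedding (mean, variance)."""
    def __init__(self, in_dim=2048, emb_dim=512):
        super().__init__()
        self.mu = nn.Linear(in_dim, emb_dim)
        self.logvar = nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return self.mu(x), self.logvar(x).exp()

def wasserstein2(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between diagonal Gaussians (closed form)."""
    return ((mu1 - mu2) ** 2).sum(-1) + ((var1.sqrt() - var2.sqrt()) ** 2).sum(-1)

head = ProbabilisticHead()
vis = torch.randn(4, 2048)    # e.g., snippet-level visual features
txt = torch.randn(4, 2048)    # e.g., projected action-category text features
dist = wasserstein2(*head(vis), *head(txt))
print(dist.shape)             # torch.Size([4])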

Prerequisites

Recommended Environment

  • We strongly recommend using the environment below; matching it closely is important for reproducing the reported results.

    • OS: Ubuntu 18.04
    • CUDA: 10.2
    • Python: 3.7.16
    • PyTorch 1.7.1, torchvision 0.8.2
    • GPU: NVIDIA Tesla V100 (32GB)
  • Required packages are listed in environment.yaml. You can install them by running:

conda env create -f environment.yaml
conda activate PVLR
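
To confirm the environment matches the versions listed above, you can run a quick check like the following (a minimal sanity-check sketch, not part of the repository):

# Minimal sanity check: confirm the active environment provides the versions above.
import torch
import torchvision

print("PyTorch:", torch.__version__)              # expected 1.7.1
print("torchvision:", torchvision.__version__)    # expected 0.8.2
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))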

Data Preparation

  • For convenience, we provide the features we used. You can find them here.
  • The feature directory should be organized as follows:
├── PVLR
│   ├── data
│   │   ├── thumos
│   │   │   ├── Thumos14_CLIP
│   │   │   ├── Thumos14-Annotations
│   │   │   ├── Thumos14reduced
│   │   │   └── Thumos14reduced-Annotations
│   │   └── annet
│   │       ├── Anet_CLIP
│   │       ├── ActivityNet1.2-Annotations
│   │       └── ActivityNet1.3
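
As a quick sanity check (a minimal sketch; the paths follow the tree above and assume the data directory sits inside the repository root), you can verify the feature folders are in place before training:

# Hypothetical sanity check: verify the feature directories from the tree above exist.
import os

expected = [
    "data/thumos/Thumos14_CLIP",
    "data/thumos/Thumos14-Annotations",
    "data/thumos/Thumos14reduced",
    "data/thumos/Thumos14reduced-Annotations",
    "data/annet/Anet_CLIP",
    "data/annet/ActivityNet1.2-Annotations",
    "data/annet/ActivityNet1.3",
]
missing = [p for p in expected if not os.path.isdir(p)]
print("All feature directories found." if not missing else f"Missing: {missing}")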
  • Because model initialization can differ across experimental setups (e.g., a different GPU configuration), exact reproduction is difficult, so we provide the initialized model parameters we used.

  • Please note that the provided parameters are the initial weights before any training has been conducted; a minimal loading sketch is given after the tree below.

  • The checkpoint files should be organized as follows:

├── PVLR
│   ├── data
│   │   ├── ...
│   │   ├── ...
│   │   ├── init_thumos.pth
│   │   └── init_annet.pth
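
The snippet below is a minimal sketch of how the provided initialization could be loaded before training; the checkpoint layout (a plain state dict) and the model construction are assumptions, so adapt it to how main.py actually builds and initializes the model.

# Hypothetical sketch: load the provided initial parameters before training.
# Assumes the .pth file stores a plain state dict; adjust to match main.py.
import torch

state_dict = torch.load("data/init_thumos.pth", map_location="cpu")
# model = build_model(...)            # build the model exactly as main.py does
# model.load_state_dict(state_dict)   # start training from the provided initialization
print("Loaded", len(state_dict), "entries from the initialization checkpoint.")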

Run

Training

OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 python main.py --model-name PVLR

Inference

OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 python eval/inference.py --pretrained-ckpt output/ckpt/PVLR/Best_model.pkl

References

We referenced the repos below for the code.

✉ Contact

If you have any questions or comments, please open an issue.
