The dataset is collected from YouTube; the ID of each video can be found in the annotation files.
We use VGGish to extract audio features, and ResNet18 and R(2+1)D-18 to extract visual features.
VGGish feature: Google Drive, Baidu Drive (pwd: lfav), (~662 MB).
ResNet18 feature: Google Drive, Baidu Drive (pwd: lfav), (~2.6 GB).
R(2+1)D-18 feature: Google Drive, Baidu Drive (pwd: lfav), (~2.6 GB).
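A minimal sketch of loading the pre-extracted features with NumPy, assuming each archive unpacks to one .npy file per video named by its YouTube ID; the directory names below are placeholders, so adjust them to the actual layout after extraction:

```python
import numpy as np

# Minimal sketch, assuming one .npy file per video named by its YouTube ID.
# The directory names are placeholders, not the actual archive layout.
video_id = "xxxxxxxxxxx"  # a video ID taken from an annotation file

audio_feat = np.load(f"feats/vggish/{video_id}.npy")       # VGGish embeddings, 128-D per segment
res_feat = np.load(f"feats/resnet18/{video_id}.npy")        # ResNet18 features, 512-D per snippet
r21d_feat = np.load(f"feats/r2plus1d_18/{video_id}.npy")    # R(2+1)D-18 features, 512-D per snippet

print(audio_feat.shape, res_feat.shape, r21d_feat.shape)
```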
Label files are in the folder LFAV_dataset.
# LFAV training set annotations
cd LFAV_dataset
cd ./train
train_audio_weakly.csv: video-level audio annotations of training set
train_visual_weakly.csv: video-level visual annotations of training set
train_weakly.csv: video-level annotations (union of video-level audio annotations and visual annotations) of training set
# LFAV validation set annotations
cd LFAV_dataset
cd ./val
val_audio_weakly.csv: video-level audio annotations of validation set
val_visual_weakly.csv: video-level visual annotations of validation set
val_weakly_av.csv: video-level annotations (union of video-level audio annotations and visual annotations) of validation set
val_audio.csv: event-level audio annotations of validation set
val_visual.csv: event-level visual annotations of validation set
# LFAV testing set annotations
cd LFAV_dataset
cd ./test
test_audio_weakly.csv: video-level audio annotations of testing set
test_visual_weakly.csv: video-level visual annotations of testing set
test_weakly_av.csv: video-level annotations (union of video-level audio annotations and visual annotations) of testing set
test_audio.csv: event-level audio annotations of testing set
test_visual.csv: event-level visual annotations of testing set
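A minimal sketch for inspecting the annotation files with pandas; it does not assume any particular column layout and simply prints whatever the CSVs contain:

```python
import pandas as pd

# Minimal sketch: load and inspect the annotation CSVs without assuming
# their exact column layout.
weak = pd.read_csv("LFAV_dataset/train/train_weakly.csv")   # video-level (weak) labels
print(weak.columns.tolist())
print(weak.head())

events = pd.read_csv("LFAV_dataset/val/val_audio.csv")      # event-level audio labels (validation)
print(events.columns.tolist())
print(events.head())
```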
Source code is in the folder src.
The script for training all three phases is:
src/scripts/train_s3.sh
If you want to train only one or two phases, set the argument "num_stages" to 1 or 2.
The script for testing all three phases is:
src/scripts/test_s3.sh
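For reference, a minimal sketch of invoking both scripts from the repository root (assuming a bash environment and that all other arguments are already set inside the scripts):

```python
import subprocess

# Minimal sketch: run the provided shell scripts from the repository root.
subprocess.run(["bash", "src/scripts/train_s3.sh"], check=True)  # train all three phases
subprocess.run(["bash", "src/scripts/test_s3.sh"], check=True)   # evaluate all three phases
```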
We also provide trained weights for the complete method (all three phases): Google Drive, Baidu Drive (pwd: lfav).
If you find our work useful in your research, please cite our paper.
@article{hou2024toward,
title={Toward Long Form Audio-Visual Video Understanding},
author={Hou, Wenxuan and Li, Guangyao and Tian, Yapeng and Hu, Di},
journal={ACM Transactions on Multimedia Computing, Communications and Applications},
volume={20},
number={9},
pages={1--26},
year={2024},
publisher={ACM New York, NY}
}
This research was supported by the National Natural Science Foundation of China (No. 62106272) and the Public Computing Cloud, Renmin University of China.
The source code is built with reference to AVVP-ECCV20.
This project is released under the CC BY-NC 4.0 License.