AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Junyu Xie¹, Tengda Han¹, Max Bain¹, Arsha Nagrani¹, Gül Varol¹ ², Weidi Xie¹ ³, Andrew Zisserman¹

¹ Visual Geometry Group, Department of Engineering Science, University of Oxford
² LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
³ CMIC, Shanghai Jiao Tong University

Project page | Dataset

Requirements

  • Basic dependencies: pytorch==2.0.0, Pillow, pandas, decord, opencv, moviepy==1.0.3, transformers==4.37.2, accelerate==0.26.1

  • VideoLLaMA2: After installing VideoLLaMA2, update sys.path.append("/path/to/VideoLLaMA2") in stage1/main.py and stage1/utils.py to point to your local copy. Please download the VideoLLaMA2-7B checkpoint here.

  • Set the model cache path (for LLaMA3, etc.) by modifying os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/" in stage1/main.py and stage2/main.py. Both edits are sketched below.
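
For reference, a minimal sketch of the two edits above (both paths are placeholders; replace them with your own locations):

# in stage1/main.py and stage1/utils.py: make the local VideoLLaMA2 clone importable
import sys
sys.path.append("/path/to/VideoLLaMA2")    # placeholder: your VideoLLaMA2 directory

# in stage1/main.py and stage2/main.py: cache directory for Hugging Face models (LLaMA3, etc.)
import os
os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"    # placeholder: your cache directory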

Datasets

In this work, we evaluate our model on CMD-AD, MAD-Eval, and TV-AD.

Video Frames

  • CMD-AD can be downloaded here.
  • MAD-Eval can be downloaded here.
  • TV-AD adopts a subset of TVQA as its visual source (3 fps) and can be downloaded here. Each folder of .jpg video frames needs to be converted to a .tar file, which can be done with the script tools/compress_subdir.py; a sketch of the conversion is given after the example command below.
    For example,
    python tools/compress_subdir.py \
    --root_dir="resources/example_file_structures/tvad_raw/" \   # raw .jpg frame folders downloaded from TVQA
    --save_dir="resources/example_file_structures/tvad/"         # destination for the compressed .tar files
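
The conversion packs each frame folder into its own .tar archive. A minimal sketch of the idea (the actual tools/compress_subdir.py may differ in details such as argument parsing or archive naming):

import os
import tarfile

# pack every subdirectory of root_dir into save_dir/<name>.tar
def compress_subdirs(root_dir, save_dir):
    os.makedirs(save_dir, exist_ok=True)
    for name in sorted(os.listdir(root_dir)):
        subdir = os.path.join(root_dir, name)
        if not os.path.isdir(subdir):
            continue
        # e.g. "clip_0001/" (a folder of .jpg frames) -> "clip_0001.tar"
        with tarfile.open(os.path.join(save_dir, name + ".tar"), "w") as tar:
            tar.add(subdir, arcname=name)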
    

Ground Truth AD Annotations

  • All annotations can be found in resources/annotations.

Results

  • The AutoAD-Zero predictions can be downloaded here.

Inference

Stage I: VLM-Based Dense Video Description

python stage1/main.py \
--dataset={dataset} \                  #e.g. "cmdad"
--video_dir={video_dir} \
--anno_path={anno_path} \              #e.g. "resources/annotations/cmdad_anno_with_face_0.2_0.4.csv"
--charbank_path={charbank_path} \      #e.g. "resources/charbanks/cmdad_charbank.json" 
--model_path={videollama2_ckpt_path} \
--output_dir={output_dir}

--dataset: choices are cmdad, madeval, and tvad.
--video_dir: directory of the video dataset; example file structures can be found in resources/example_file_structures (files are empty, for reference only).
--anno_path: path to AD annotations (with predicted face IDs and bounding boxes), available in resources/annotations.
--charbank_path: path to external character banks, available in resources/charbanks.
--model_path: path to the VideoLLaMA2 checkpoint.
--output_dir: directory to save the output CSV.
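
For example, a Stage I run on CMD-AD might look like this (the video, checkpoint, and output paths below are placeholders for your own locations):

python stage1/main.py \
--dataset="cmdad" \
--video_dir="/path/to/CMD-AD/" \
--anno_path="resources/annotations/cmdad_anno_with_face_0.2_0.4.csv" \
--charbank_path="resources/charbanks/cmdad_charbank.json" \
--model_path="/path/to/VideoLLaMA2-7B/" \
--output_dir="outputs/stage1/"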

Stage II: LLM-Based AD Summary

python stage2/main.py \
--dataset={dataset} \             #e.g. "cmdad"
--pred_path={stage1_result_path} 

--dataset: choices are cmdad, madeval, and tvad.
--pred_path: path to the CSV file saved by Stage I.
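
For example (the Stage I output filename below is hypothetical; point --pred_path at whichever CSV Stage I saved to your --output_dir):

python stage2/main.py \
--dataset="cmdad" \
--pred_path="outputs/stage1/cmdad_stage1_output.csv"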

Citation

If you find this repository helpful, please consider citing our work:

@article{xie2024autoad0,
	title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
	author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G\"ul Varol and Weidi Xie and Andrew Zisserman},
	journal={arXiv preprint arXiv:2407.15850},
	year={2024}
}

References

VideoLLaMA2: https://github.com/DAMO-NLP-SG/VideoLLaMA2
LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
