TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering

This repo contains the code for the paper TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering, published at EMNLP 2024. Check out our project page here!

Setup

Follow these steps to set up the repo.

  1. Set up a Conda environment with the required packages.
conda create -n traveler python=3.9
conda activate traveler
pip install -r requirements.txt
  2. Create a .env file that contains your OPENAI_API_KEY (see the example after these steps).

  3. Log in to your wandb account.

wandb login
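
The .env file from step 2 only needs to contain your API key; a minimal example (the value shown is a placeholder):

# .env
OPENAI_API_KEY=sk-your-key-here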

Datasets

For each dataset, place all of the videos in a single directory. This directory is the data_path in the config files.
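
For example, for NExT-QA (the paths below are placeholders; use wherever you downloaded the videos):

# collect all of the videos into a single directory, then set data_path to it
mkdir -p /data/nextqa/videos
mv /path/to/downloaded/nextqa/*.mp4 /data/nextqa/videos/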

Models

We use a LaViLa checkpoint, provided by LLoVi, that is trained on the Ego4D videos that do not overlap with EgoSchema. You can find the model checkpoint here. To serve LaViLa, clone the LaViLa repo and copy launch/launch_lavila.py into the root directory of the cloned repo.
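
For example, from the root of this repo (the repository URL is an assumption; clone whichever LaViLa repo matches the checkpoint you downloaded):

# clone LaViLa and copy the launch script into its root directory
git clone https://github.com/facebookresearch/LaViLa.git
cp launch/launch_lavila.py LaViLa/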

We serve LLaVA-1.6 using SGLang, and Llama 3 using vLLM.

Experiments

Config

The config files are found under each experiment's directory. These config files specify the dataset being evaluated, the models, various paths, and other hyperparameters.

If you want to evaluate TraveLER on a new dataset, be sure to set the correct dataset info, including the name, path, and query file. You will also need to create your own query file; examples can be found under data/.
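
As a rough illustration, the dataset-related fields might look like the following. This is a hypothetical sketch assuming a YAML-style config; only data_path is named explicitly above, and the actual key names and file format in this repo may differ.

# hypothetical dataset fields for a new dataset
dataset_name: my_new_dataset
data_path: /data/my_new_dataset/videos        # directory containing all of the videos
query_file: data/my_new_dataset_queries.csv   # query file you create, following the examples under data/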

Launching Servers

Before we run experiments, we need to launch the servers for the VLM/LLMs.

  1. Launch the VLM server. The port number can be changed to your preference.
# LLaVA-1.6
CUDA_VISIBLE_DEVICES=0 bash launch/launch_llava.sh --port 30000

# LaViLa (from root of LaViLa repo)
CUDA_VISIBLE_DEVICES=0 python3 launch_lavila.py --port 30000
  2. For models served by SGLang (LLaVA-1.6, not LaViLa), launch a wrapper script that forwards incoming asynchronous requests to SGLang. The sglang_port must match the port number set above, while the wrapper_port can be changed to your preference. Requests to the VLM are sent to the wrapper_port instead of the sglang_port.
python3 launch/launch_wrapper.py --sglang_port 30000 --wrapper_port 8000
  3. (Optional) Launch an LLM server for local LLMs.
# Llama 3
CUDA_VISIBLE_DEVICES=1 vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123

Benchmark

In general, this is the command to use for running experiments.

python3 main.py --exp <experiment_name> --start_sample <start sample> --max_samples <max samples> --outfile_name <batch_name> --vlm_port 8000

Note: The vlm_port should match the port number set for the wrapper_port, not the sglang_port. If a local LLM is also used, set the LLM's port number using llm_port.
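
For instance, if Llama 3 is served locally on port 8001 (the flag name below is assumed to mirror vlm_port; check main.py for the actual argument name):

# hypothetical: point the run at both the VLM wrapper and the local LLM
python3 main.py --exp nextqa --start_sample 0 --max_samples 100 --outfile_name batch_0 --vlm_port 8000 --llm_port 8001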

For example, to evaluate NExT-QA:

# first 100 examples
python3 main.py --exp nextqa --start_sample 0 --max_samples 100 --outfile_name batch_0 --vlm_port 8000

# next 100 examples
python3 main.py --exp nextqa --start_sample 100 --max_samples 100 --outfile_name batch_1 --vlm_port 8000

...

We manually define the start sample and max number of samples to allow fine-grained control over how the dataset is partitioned across GPUs. For our workload on NVIDIA RTX 6000 Ada GPUs, we could run 5 processes sharing the same GPU for the VLM. This number will change depending on your GPU, since the amount of memory available for the KV cache differs.
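
For example, one way to launch several batches in parallel against the same VLM server (the batch size and process count here are illustrative):

# illustrative only: 5 parallel batches of 100 examples, all using the wrapper on port 8000
for i in 0 1 2 3 4; do
  python3 main.py --exp nextqa --start_sample $((i * 100)) --max_samples 100 \
    --outfile_name batch_$i --vlm_port 8000 &
done
wait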

After the output files are generated, we can find the accuracy by running the eval script.

# python3 eval.py --exp <experiment_name>
python3 eval.py --exp nextqa

Acknowledgements

Code for configs, dataloaders, and query files is adapted from RVP and ViperGPT.

Citation

Our paper can be cited as:

@misc{shang2024traveler,
    title={TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering},
    author={Chuyi Shang and Amos You and Sanjay Subramanian and Trevor Darrell and Roei Herzig},
    year={2024},
    eprint={2404.01476},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2404.01476}, 
}
