
Seeking the limitations of "LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models"

A. Tragadouras, E.G. Lionis, T. Aslanidis, V. Karlis, O. Neut


This repository contains the code to reproduce and extend the findings of the paper "LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models".

Structure of Repository

The repository is kept as modular as possible while staying close to the original authors' layout, so as not to cause any problems with dependencies. It must follow the structure below:

repo
|-- readme.md
|
|-- blogpost.md
|
|-- src
|   |-- llamavid
|   |-- scripts
|   |-- work_dirs
|   |   |-- llama-vid
|   |   |   |-- llama-vid-13b-full-336
|   |   |   |-- ...
|   |   |-- llama-vid-7b-full-224-video-fps-1_grounding_finetuning
|   |   |-- llama-vid-7b-full-224-video-fps-1-small-ucf-finetuning
|   |   |-- ... (more folders with the output)
|   |-- model_zoo
|   |   |-- LLM
|   |   |   |-- vicuna
|   |   |   |   |-- 7B-V1.5
|   |   |   |   |-- 13B-V1.5
|   |   |-- LAVIS
|   |   |   |-- eva_vit_g.pth
|   |   |   |-- instruct_blip_vicuna7b_trimmed.pth
|   |   |   |-- instruct_blip_vicuna13b_trimmed.pth
|   |-- new_datasets_processing
|   |   |-- animal_dataset
|   |   |-- ucf_dataset
|   |-- data
|   |   |-- LLaMA-VID-Eval
|   |   |   |-- animal-kingdom
|   |   |   |-- ucf-crime
|   |   |   |-- gqa
|   |   |   |-- ... (more folders)
|   |   |-- LLaMA-VID-Pretrain
|   |   |   |-- animal-kingdom
|   |   |   |-- ucf-crime
|   |   |   |-- gqa
|   |   |   |-- ... (more folders)
|   |   |-- LLaMA-VID-Finetune
|   |   |   |-- animal-kingdom
|   |   |   |-- ucf-crime
|   |   |   |-- gqa
|   |   |   |-- ... (more folders)
|   |-- out
|   |   |-- ... (output files of the job scripts)
|   |-- pyproject.toml
|   |-- requirements_llama_three.txt
|   |-- install_llama_environment.job
|   |-- install_environment.job

More details on some of the files and folders:

  • readme.md: Description of the repository
  • blogpost.md: Blogpost report
  • src: Contains the main project files
    • model_zoo: Folder with all the imported weights for the LLM and the vision encoder
    • llamavid: The code defining the model architecture, executed by the scripts
    • scripts: The executable code
    • work_dirs: The output of training, fine-tuning and evaluation. If a pre-trained model is used, its weights must be placed in this folder.
      • llama-vid: The weights of the LLaMA-VID model from the original authors
      • llama-vid-7b-full-224-video-fps-1_grounding_finetuning: The weights from fine-tuning on the animal grounding dataset
      • llama-vid-7b-full-224-video-fps-1-small-ucf-finetuning: The weights from fine-tuning on the small-ucf dataset
    • new_datasets_processing: The code for processing the new datasets
    • data: The folder with all data
      • LLaMA-VID-Eval: The data for evaluation
      • LLaMA-VID-Pretrain: The data for pre-training
      • LLaMA-VID-Finetune: The data for fine-tuning
    • requirements_llama_three.txt: The Python requirements for the llama_3_instruction environment
    • install_llama_environment.job: Job script to install the llama_3_instruction conda environment on a cluster
    • install_environment.job: Job script to install the llamavid conda environment on a cluster

Download Data

The data used in this project can be split into the reproduction data and the extension data.

Original benchmarks

The reproduction data consists of the benchmarks the original authors used, which can be found in their repository. To download them, follow the steps the original authors describe for the image and short-video evaluations.

Extended data

The extension data come from two datasets. The first is the Animal Kingdom dataset, from which we took the Video Grounding subset and used it for fine-tuning and inference. It contains 50 hours of annotated videos for localizing relevant animal behaviour segments in long videos. To download it, request access through the form on the dataset's page and use the SharePoint link you receive. The second is the UCF-Crime dataset, which consists of 1900 long, untrimmed real-world surveillance videos covering 13 activities. To download this benchmark, execute the scripts/download_ucf_crime.sh script or download it manually from the Dropbox URL provided on the dataset's page.
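For example, the UCF-Crime download can be run as follows (a minimal sketch; the target folders below are taken from the repository structure above and should match the experiment you plan to run):

cd src                                 # the scripts folder lives under src in this repository
bash scripts/download_ucf_crime.sh     # downloads the UCF-Crime videos
# then place/extract the videos where the jobs expect them, e.g.
# data/LLaMA-VID-Finetune/ucf-crime and data/LLaMA-VID-Eval/ucf-crime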

Download Pre-trained Weights

The steps to download the pre-trained weights are split into two parts. The base weights are needed for both training and inference. There are also weights that have already been fine-tuned; these are not necessary if you plan to retrain the model from scratch.

Necessary pre-trained weights

The original authors have a section describing how to download the necessary weights. They must be placed under model_zoo as defined in the structure above. We used the vicuna-7b-v1.5 weights, which have to be placed under src/model_zoo/LLM/vicuna. We also downloaded EVA-ViT-G, QFormer-7b and QFormer-13b and placed them under src/model_zoo/LAVIS.
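A minimal sketch of the expected layout (folder and file names follow the structure section above; the actual download links are in the original authors' instructions):

cd src
mkdir -p model_zoo/LLM/vicuna/7B-V1.5 model_zoo/LAVIS
# after downloading, the folders should look like:
# model_zoo/LLM/vicuna/7B-V1.5/                         <- vicuna-7b-v1.5 weights
# model_zoo/LAVIS/eva_vit_g.pth                         <- EVA-ViT-G
# model_zoo/LAVIS/instruct_blip_vicuna7b_trimmed.pth    <- QFormer-7b
# model_zoo/LAVIS/instruct_blip_vicuna13b_trimmed.pth   <- QFormer-13b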

Pretrained finetuned original models

The trained model weights offered by the original authors can be found in the table in their repository. Choose one of those models and download it under src/work_dirs. If the work_dirs folder does not exist, create it first.
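For example (a sketch; the checkpoint name is illustrative and should match whichever model you pick from the authors' table):

cd src
mkdir -p work_dirs/llama-vid
# download the chosen checkpoint into work_dirs/llama-vid/, e.g.
# work_dirs/llama-vid/llama-vid-13b-full-336/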

Pretrained finetuned extension models

The weights of the fine-tuned extension models can be found on SharePoint and have to be downloaded into src/work_dirs. If there is a problem with the folder, send an email to akis.lionis at student.uva.nl to request access.

Conda Environments

In this project, we had to create two different conda environments: llamavid and llama_3_instruct. This necessity initially arose when reproducing the performance of LLaMA-VID on the zero-shot video QA (Question Answering) benchmarks MSVD-QA and MSRVTT-QA. The authors utilize the text-generation capabilities of 'gpt-3.5-turbo' to evaluate the predictions made by the LLaMA-VID model on these visual question-answering tasks. Given an annotated pair of a question and a single-word answer related to the video, we need to evaluate whether the prediction made by the proposed model agrees with the annotated single-word answer used as a label. Since 'gpt-3.5-turbo' is a closed model and we did not have access to an API key, we substituted it with Meta-Llama-3-8B-Instruct, one of the newest open-source LLMs released by Meta with text-generation capabilities. The caveat is that the latter LLM depends on a newer transformers version that is not compatible with the transformers version used in LLaVA, the backbone that LLaMA-VID builds upon, which justifies our decision to use two different environments. So whenever we need to evaluate the correctness of the LLaMA-VID model predictions on QA tasks, we use the llama_3_instruct environment.

Both environments have a job script for building them on a cluster, found under src: src/install_environment.job and src/install_llama3_instruct_environment.job. To install them locally, run the following commands:
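On a Slurm cluster the job scripts can be submitted as shown below (a sketch under the assumption that the cluster uses Slurm; the mapping between job script and environment follows the file names):

cd src
sbatch install_environment.job                   # builds the llamavid environment
sbatch install_llama3_instruct_environment.job   # builds the llama_3_instruct environment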

llamavid installation

cd src

conda create -n llamavid python=3.10 -y
source activate llamavid

pip install --upgrade pip
pip install -e .

pip install ninja
pip install flash-attn==2.1.2.post3 --no-build-isolation

llama_3_instruction installation

cd src

conda create -n llama_3_instruction python=3.10 -y
source activate llama_3_instruction

pip install --upgrade pip
pip install -r requirements_llama_three.txt
pip install openpyxl
pip install transformers==4.40.0
conda install -c conda-forge code-server

How to reproduce the authors' findings

The authors used numerous benchmarks to obtain their findings. Precise steps are given in the evaluation part of their repository. To run one, execute the bash script scripts/{image_only|video}/eval/{benchmark}.sh, selecting image_only or video depending on where the benchmark is located and replacing {benchmark} with the benchmark's name.
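For example (the benchmark names in the comments are only illustrative; use whichever script matches the benchmark you want to reproduce):

cd src
bash scripts/image_only/eval/{benchmark}.sh   # image benchmarks, e.g. gqa
bash scripts/video/eval/{benchmark}.sh        # video benchmarks, e.g. msvd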

How to reproduce the extension

To reproduce the extension, the reader has to start by requesting access to the respective datasets.

Animal dataset

For the animal dataset, the data can be requested here: https://sutdcv.github.io/Animal-Kingdom/, as described in the Download Data section. In this work we focus on the grounding sub-part of the Animal Kingdom dataset, so the corresponding data should be placed in LLaMA-VID/data/LLaMA-VID-Finetune/animal-grounding. Following that, inference_data_grounding_v2.ipynb should be run to build the inference dataset and create_inputs_grounding_v2.ipynb to build the fine-tuning data. Within these notebooks the required train/test split is already handled.
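The two notebooks can also be executed from the command line, for instance (a sketch; we assume they live under src/new_datasets_processing/animal_dataset, following the structure above):

cd src/new_datasets_processing/animal_dataset                                          # assumed notebook location
jupyter nbconvert --to notebook --execute --inplace create_inputs_grounding_v2.ipynb   # build the fine-tuning data
jupyter nbconvert --to notebook --execute --inplace inference_data_grounding_v2.ipynb  # build the inference data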

Finetuning animal dataset

Once the data has been built, the reader should run src/scripts/video/train/stage_instruction_tuning_grounding_animals_fps1.job and src/scripts/video/train/stage_instruction_tuning_grounding_animals_fps6.job to perform instruction tuning at 1 fps and 6 fps respectively.
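On a Slurm cluster (an assumption; adapt to your scheduler) the two jobs can be submitted with:

cd src
sbatch scripts/video/train/stage_instruction_tuning_grounding_animals_fps1.job   # instruction tuning at 1 fps
sbatch scripts/video/train/stage_instruction_tuning_grounding_animals_fps6.job   # instruction tuning at 6 fps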

Inference animal dataset

Finally, to run inference, the reader should run src/scripts/eval/ft_animal_ground_s1.job to test the results of the fine-tuned model and zero_shot_animal_ground_s2.job for the zero-shot results.
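For example (assuming the same Slurm setup, and assuming the zero-shot job lives in the same eval folder):

cd src
sbatch scripts/eval/ft_animal_ground_s1.job          # inference with the fine-tuned model
sbatch scripts/eval/zero_shot_animal_ground_s2.job   # zero-shot inference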

UCF-Crime dataset

The UCF-Crime data can be requested here: https://paperswithcode.com/dataset/ucf-crime

Inference for UCF-Crime

We created new inference and evaluation files for this dataset and our experiments. You can run them through the automated job scripts in LLaMa-VID/scripts/video/eval: the crime_eval_*.job scripts run zero-shot UCF-Crime inference, the crime_ft_eval_*.job scripts run inference with the fine-tuned model, and the crime_llama*.job scripts run the LLaMA evaluator.
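A minimal sketch of submitting them on a Slurm cluster (an assumption on the scheduler; the wildcard patterns are the ones listed above):

cd src/scripts/video/eval                                # eval scripts folder, per the repository structure above
for job in crime_eval_*.job;    do sbatch "$job"; done   # zero-shot UCF-Crime inference
for job in crime_ft_eval_*.job; do sbatch "$job"; done   # inference with the fine-tuned model
for job in crime_llama*.job;    do sbatch "$job"; done   # LLaMA-3 evaluator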

Finetune UCF-Crime

In LLaMa-VID/scripts/video/train there are jobs such as stage_instruction_tuning_ucf_crime_small_fps*.job that use deepspeed to fine-tune the model on the UCF-Crime training set.
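For example (a sketch under the same Slurm assumption; the fps suffix depends on which configuration you want to run):

cd src
for job in scripts/video/train/stage_instruction_tuning_ucf_crime_small_fps*.job; do
    sbatch "$job"   # each job fine-tunes the model on the UCF-Crime training set at a different fps (via deepspeed)
done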
