A Second Pair of Eyes: Composing Foundation Models for Egocentric Video QA

Final Project for CS231N by Ronak Malde and Arjun Karanam.

Abstract

In this paper, we tackle Egocentric Video Question Answering, with the goal of building Augmented Reality systems that a user can query for more information about the world around them. Rather than the traditional approach of jointly training across all modalities (here, egocentric video and language), we compose multiple foundation models using multimodal-informed captioning. This lets us leverage the powerful priors in these foundation models while finetuning only one component, the Vision Language Model, on our egocentric data. We find that PromptCap (a multimodal Vision Language Model) finetuned on data-augmented egocentric videos and captions, composed with GPT-3, yields the best results on the task set by the EgoVQA dataset. Using a separate vision model to generate captions and GPT-3 to answer the questions does not perform as well, demonstrating that there is still merit to jointly training a model on egocentric video and QA data for the Egocentric Video Question Answering task.

Environment

Our testing was done using Python 3.8. To set up the environment, install the necessary requirements from requirements.txt:

pip install -r requirements.txt

Usage

All tests can be run from main.py by simply running:

python main.py

Parameters are set in params.py; edit this file to run specific tests. The paper explores changing the parameter CaptionerParams.question_type, which can take any value in CaptionerParams.Configs and selects the prompt used by the captioning module. The paper also explores the parameter BaselineParams.n_caption_frames, which sets how many frames are used to construct the world state.
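As a rough illustration of how these two parameters interact, the sketch below enumerates every combination of prompt type and frame count. It is not the contents of params.py: the class layout, the two Configs members (CAPTION, QUESTION_GUIDED), the default values, and the sweep helper are all assumptions made for the example. Only the names CaptionerParams.question_type, CaptionerParams.Configs, and BaselineParams.n_caption_frames come from this repository.

```python
from enum import Enum
from itertools import product


class CaptionerParams:
    class Configs(Enum):
        # Hypothetical prompt types; the real members are defined in params.py.
        CAPTION = "caption"
        QUESTION_GUIDED = "question_guided"

    # Assumed default; selects the prompt used by the captioning module.
    question_type = Configs.CAPTION


class BaselineParams:
    # Assumed default; number of frames used to construct the world state.
    n_caption_frames = 4


def sweep(frame_counts):
    """Yield one (question_type, n_caption_frames) pair per combination."""
    for question_type, n_frames in product(CaptionerParams.Configs, frame_counts):
        yield question_type, n_frames


# Hypothetical grid of frame counts to try.
runs = list(sweep([1, 4, 8]))
for question_type, n_frames in runs:
    print(f"question_type={question_type.name}, n_caption_frames={n_frames}")
```

To run one of these configurations, set the corresponding values in params.py and then run main.py as described above.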


rmalde/Ego-QA-231
