What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Yan Zeng*, Hanbo Zhang*, Jiani Zheng*, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
*Equal Contribution

update

Jul 2023: Release preprint in arXiv, and page

Lynx (8B parameters):

results on Open-VQA image testsets

results on Open-VQA video testsets && OwlEval human eval && MME benchmark

ablation result

Quick Start

environment

conda env create -f environment.yml
conda activate lynx

prepare data

step 1: prepare annotation file

Open-VQA annotations file is under the path data/Open_VQA_images.jsonl and data/Open_VQA_videos.jsonl, there is an example:

{
  "dataset": "Open_VQA_images", # the dataset name of your data
  "question": "What is in the image?", 
  "answer": ["platform or tunnel"], # list
  "index": 1, 
  "image": "images/places365/val_256/Places365_val_00000698.jpg", # relative path of image
  "origin_dataset": "places365", 
  "class": "Place", # eight image VQA types and two video VQA types correspond to the open_VQA dataset
}

You can also convert your own data in jsonl format, the keys origin_dataset and class are optional.

step 2: prepare images

Download raw images from corresponding websites: Places365(256x256), VQAv2, OCRVQA, Something-Something-v.2, MSVD-QA, NeXT-QA and MSRVTT-QA.

step 3: modify the default setting in the code

You need the check some import settings in the configs configs/LYNX.yaml, for example:

# change this prompt for different task, this is the default prompt
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, need to change to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"

prepare checkpoint

step 1: download the eva_vit_1b on official website and put it under the data/, rename it as eva_vit_g.pth
step 2: prepare the vicuna-7b and put it under the data/
- method 1: download from huggingface directly.
- method 2:
  - download Vicuna’s delta weight from v1.1 version (use git-lfs)
  - get LLaMA-7b from here or from the Internet.
  - install FastChat pip install git+https://github.com/lm-sys/FastChat.git
  - run python -m fastchat.model.apply_delta --base /path/to/llama-7b-hf/ --target ./data/vicuna-7b/ --delta /path/to/vicuna-7b-delta-v1.1/
step 3: download the pretrain_lynx.pt or finetune_lynx.pt and put it under the data/(please check the checkpoint in the config is match the file you download.)

organize the files like this:

lynx-llm/
    data/
        Open_VQA_images.jsonl
        Open_VQA_videos.jsonl
        eva_vit_g.pth
        vicuna-7b/
        finetune_lynx.pt
        pretrain_lynx.pt
    images/
        vqav2/val2014/*.jpg
        places365/val_256/*.jpg
        ocrvqa/images/*.jpg
        sthsthv2cap/val/*.mp4
        msvdqa/test/*.mp4
        nextqa/*.mp4
        msrvttqa/*.mp4

infer

sh generate.sh

Citation

If you find this repository useful, please considering giving ⭐ or citing:

@article{zeng2023matters,
  title={What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?},
  author={Zeng, Yan and Zhang, Hanbo and Zheng, Jiani and Xia, Jiangnan and Wei, Guoqiang and Wei, Yang and Zhang, Yuchen and Kong, Tao},
  journal={arXiv preprint arXiv:2307.02469},
  year={2023}
}

Contact

For issues using this code, please submit a GitHub issue.

License

This project is licensed under the Apache-2.0 License.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
configs		configs
data		data
dataset		dataset
images		images
licenses		licenses
models		models
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
environment.yml		environment.yml
generate.py		generate.py
generate.sh		generate.sh
requirements.txt		requirements.txt
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Quick Start

environment

prepare data

step 1: prepare annotation file

step 2: prepare images

step 3: modify the default setting in the code

prepare checkpoint

infer

Citation

Contact

License

About

Releases

Packages

Languages

License

bytedance/lynx-llm

Folders and files

Latest commit

History

Repository files navigation

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Quick Start

environment

prepare data

step 1: prepare annotation file

step 2: prepare images

step 3: modify the default setting in the code

prepare checkpoint

infer

Citation

Contact

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages