This is the official release of the paper F-LMM: Grounding Frozen Large Multimodal Models. The repository is currently under construction.
F-LMM: Grounding Frozen Large Multimodal Models,
Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy
Bibtex
- Training code
- Evaluation code and checkpoints
- Interactive Demo
- This project is built on Xtuner. The segmentation modules, including the U-Net, and the training losses are from MMSegmentation and MMDetection. Please refer to the official documentation of these toolkits for installation guidance.
- This project uses transformers v4.39.1. We find that versions v4.40.0 and above cannot reproduce the reported performance (we are debugging this issue).
- Accelerate is used to build the evaluation pipeline of our models. Please refer to its official webpage for installation.
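For reference, a minimal environment setup might look like the following (a sketch, assuming a CUDA-ready PyTorch is already installed; apart from transformers, the package versions are not pinned by us):

# install the OpenMMLab toolkits via openmim (the mmcv pin below is an assumption, not pinned by us)
pip install -U openmim
mim install mmengine "mmcv>=2.0.0" mmdet mmsegmentation
# install Xtuner and Accelerate
pip install xtuner accelerate
# pin transformers; versions >= v4.40.0 currently do not reproduce our results
pip install transformers==4.39.1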
PNG Dataset. Download the images train2017 and val2017 from COCO's official website and put them under data/coco. Download the annotation files png_coco_train2017.json and png_coco_val2017.json from PNG's project page and put them under data/coco/annotations. Download the mask annotations panoptic_train2017(.json) and panoptic_val2017(.json) from COCO's official website and put them under data/coco/annotations.
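For convenience, the COCO images and panoptic annotations can be fetched roughly as follows (a sketch based on COCO's public download links; the PNG json files still have to be downloaded manually from PNG's project page):

mkdir -p data/coco/annotations
# COCO images
wget http://images.cocodataset.org/zips/train2017.zip -P data/coco
wget http://images.cocodataset.org/zips/val2017.zip -P data/coco
unzip -q data/coco/train2017.zip -d data/coco
unzip -q data/coco/val2017.zip -d data/coco
# COCO panoptic annotations: the archive contains the json files plus zipped mask folders
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip -P data/coco
unzip -q data/coco/panoptic_annotations_trainval2017.zip -d data/coco
unzip -q data/coco/annotations/panoptic_train2017.zip -d data/coco/annotations
unzip -q data/coco/annotations/panoptic_val2017.zip -d data/coco/annotations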
RefCOCO Series. Please refer to MMDetection's tutorial to prepare RefCOCO datasets.
VisCoT. We have prepared the test images on Google Drive. Download and extract the zip files under data/cot.
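After downloading, the archives can be extracted along these lines (a sketch; the archive name below is a placeholder, use the actual file names from the Drive folder):

mkdir -p data/cot
unzip -q path/to/viscot_images.zip -d data/cot  # repeat for each downloaded zip file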
F-LMM/
├── data
    ├── cot
    ├── coco
        ├── annotations
            ├── panoptic_train2017.json
            ├── panoptic_val2017.json
            ├── png_coco_train2017.json
            ├── png_coco_val2017.json
            ├── panoptic_train2017  # panoptic masks
            ├── panoptic_val2017  # panoptic masks
        ├── refcoco
            ├── instances.json
            ├── refs(unc).p
        ├── refcoco+
            ├── instances.json
            ├── refs(unc).p
        ├── refcocog
            ├── instances.json
            ├── refs(umd).p
        ├── train2017
        ├── val2017
        ├── train2014
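Before training, a quick sanity check like the one below can confirm that the core files are in place (a sketch; extend the path list to the datasets you actually use):

# print any expected dataset path that is still missing
for p in data/coco/train2017 data/coco/val2017 \
         data/coco/annotations/png_coco_train2017.json \
         data/coco/annotations/png_coco_val2017.json \
         data/coco/annotations/panoptic_train2017.json \
         data/coco/annotations/panoptic_train2017; do
  [ -e "$p" ] || echo "missing: $p"
done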
SAM. Please obtain the checkpoint sam_vit_l_0b3195.pth of the pretrained SAM model from SAM's official webpage.
F-LMM/
├── checkpoints
    ├── sam_vit_l_0b3195.pth
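If wget is available, the checkpoint can be fetched directly; the URL below is the publicly listed SAM ViT-L download link, please verify it against the official webpage:

mkdir -p checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth -P checkpoints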
Large Multimodal Models. Weights of off-the-shelf LMMs are automatically downloaded from Hugging Face when running training or evaluation.
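If you prefer to pre-download weights (e.g., on a cluster without internet access at training time), huggingface-cli can be used; the repository id below is illustrative and should be replaced by the LMM referenced in your config:

huggingface-cli download deepseek-ai/deepseek-vl-1.3b-chat  # cached under ~/.cache/huggingface by default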
export PYTHONPATH=.
NPROC_PER_NODE=8 xtuner train configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py --deepspeed deepspeed_zero2
Currently there are bugs when deepspeed_zero3 is used; we will resolve this issue in the future.
Checkpoints. The checkpoints of our trained models are available on Google Drive. Download and put them under checkpoints/.
| # | LMM | Configs | Checkpoints |
|---|---|---|---|
| 1 | LLaVA-1.5-7B | frozen_llava_1_5_vicuna_7b_unet_sam_l_refcoco_png | model |
| 2 | LLaVA-Next-Vicuna-7B | frozen_llava_next_vicuna_7b_unet_sam_l_refcoco_png | model |
| 3 | LLaVA-Next-Mistral-7B | frozen_llava_next_mistral_7b_unet_sam_l_refcoco_png | model |
| 4 | DeepSeekVL-1.3B | frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png | model |
| 5 | DeepSeekVL-7B | frozen_deepseek_vl_7b_chat_unet_sam_l_refcoco_png | model |
| 6 | MiniGemini-2B | frozen_mgm_gemma_2b_unet_sam_l_refcoco_png | model |
| 7 | MiniGemini-7B | frozen_mgm_vicuna_7b_unet_sam_l_refcoco_png | model |
| 8 | MiniGemini-HD-7B | frozen_mgm_vicuna_7b_hd_unet_sam_l_refcoco_png | model |
| 9 | HPT-Air | frozen_hpt_air_unet_sam_l_refcoco_png | model |
| 10 | HPT-Air-1.5 | frozen_hpt_air_1_5_unet_sam_l_refcoco_png | model |
Panoptic Narrative Grounding (PNG).
export PYTHONPATH=.
accelerate launch scripts/multiprocess_eval_png.py \
configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py \
--checkpoint checkpoints/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.pth
Referring Expression Segmentation (RES).
export PYTHONPATH=.
accelerate launch scripts/multiprocess_eval_refcoco.py \
configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py \
--checkpoint checkpoints/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.pth --concat
Visual Chain-of-Thought Reasoning.
For now, we only implement visual CoT on DeepSeekVL models, which work well with multi-image inputs. Some examples of visual CoT are shown below.
1. Inference.
export PYTHONPATH=.
accelerate launch scripts/visual_cot/visual_cot_inference.py configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py \
--checkpoint checkpoints/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.pth \
--version v1 --save_folder the/directory/of/result/json/files --discard_sam
2. Evaluate using ChatGPT.
export OPENAI_API_KEY="your_openai_api_key"
python scripts/visual_cot/gpt_eval_cot_score_single.py --result_file a/single/json/file # evaluate a single json file
python scripts/visual_cot/gpt_eval_cot_score.py --result_dir the/directory/of/all/json/files # evaluate all json files
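As an end-to-end sketch, the folder passed to --save_folder in step 1 is reused as --result_dir in step 2 (the folder name below is arbitrary):

export PYTHONPATH=.
accelerate launch scripts/visual_cot/visual_cot_inference.py configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py \
  --checkpoint checkpoints/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.pth \
  --version v1 --save_folder results/viscot_deepseek_1_3b --discard_sam
export OPENAI_API_KEY="your_openai_api_key"
python scripts/visual_cot/gpt_eval_cot_score.py --result_dir results/viscot_deepseek_1_3b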
Grounded Human-AI Conversation. An interactive demo is coming soon. Below are some examples of grounded conversation.
@misc{wu2024flmm,
title={F-LMM: Grounding Frozen Large Multimodal Models},
author={Size Wu and Sheng Jin and Wenwei Zhang and Lumin Xu and Wentao Liu and Wei Li and Chen Change Loy},
year={2024},
eprint={2406.05821},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project is licensed under NTU S-Lab License 1.0.
This project would not be possible without the community's open-source efforts on large multimodal models, including LLaVA, DeepSeek-VL, MiniGemini and HPT. In addition, we thank the open-source codebases from the transformers and OpenMMLab teams that facilitated the development of this project.