- Release training and evaluation code.
- Release LaSagnA-7B model weights.
Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, two constraints restrict the further application of these vLLMs: the incapability of handling multiple targets per query and the failure to identify the absence of query objects in the image. In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries. Consequently, we define a general sequence format for complex queries. We then incorporate a semantic segmentation task into the current pipeline to fulfill the requirements of training data. Furthermore, we present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format. The effectiveness of our model in processing complex queries is validated by comparable results with conventional methods on both closed-set and open-set semantic segmentation datasets. Additionally, we outperform a series of vLLMs in reasoning and referring segmentation, showcasing our model's remarkable capabilities.
- We identify a crucial limitation of recent vLLM-based segmentation assistants: they struggle with queries that involve multiple arbitrary targets, which may or may not exist in the image. To overcome this limitation, we introduce a sequence format that takes into account multiple classes and negative classes (see the illustrative sketch after this list). With the proposed format, the assistant can be readily trained on semantic segmentation datasets.
- To address the challenges associated with training on the semantic segmentation task, we present three innovative techniques: random classes list, sequence augmentation, and order following. By employing these strategies, the vLLM can effectively utilize segmentation datasets, significantly improving its overall segmentation performance.
- We conduct experiments on three distinct tasks and demonstrate the capability of the proposed model to handle complex queries. We reveal the potential of vLLM-based segmentation assistants in the fundamental perception task, namely, semantic segmentation. Moreover, we surpass a series of vLLMs in reasoning and referring segmentation.
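For intuition, here is a minimal, hypothetical sketch of how such a multi-class query and its target sequence could be assembled from a semantic segmentation sample. The prompt wording, the `[SEG]` placeholder, the negative-class phrasing, and the helper name are illustrative assumptions, not the exact format used by LaSagnA.

```python
import random

# Hypothetical sketch: build a complex query / target pair from a segmentation
# sample. The template strings and the "[SEG]" placeholder are assumptions.
def build_sequence(present_classes, all_classes, num_negatives=2, seed=None):
    rng = random.Random(seed)
    # Sample negative (absent) classes and shuffle the queried class list,
    # loosely mirroring the "random classes list" strategy.
    negatives = [c for c in all_classes if c not in present_classes]
    queried = list(present_classes) + rng.sample(negatives, k=min(num_negatives, len(negatives)))
    rng.shuffle(queried)

    query = "Please segment the following classes: " + ", ".join(queried) + "."
    # The answer lists classes in the same order as the query ("order following"),
    # emitting a segmentation token for present classes and flagging absent ones.
    parts = [f"{c}: [SEG]" if c in present_classes else f"{c}: not present" for c in queried]
    answer = "; ".join(parts) + "."
    return query, answer

# Example with COCO-style class names
q, a = build_sequence({"person", "dog"}, ["person", "dog", "car", "cat", "bus"], num_negatives=2, seed=0)
print(q)
print(a)
```

The shuffled class list and the fixed query/answer ordering are only meant to convey the spirit of the strategies listed above, not their precise implementation.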
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Except for the Cityscapes and OpenImages data, the training data are the same as in LISA.
├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── mapillary
│   │   ├── config_v2.0.json
│   │   ├── testing
│   │   ├── training
│   │   └── validation
│   ├── cityscapes
│   │   ├── gtFine
│   │   │   ├── cityscapes_panoptic_val.json
│   │   │   ├── train
│   │   │   └── val
│   │   └── leftImg8bit
│   │       ├── train
│   │       └── val
│   ├── OpenImageV6
│   │   ├── folder_train
│   │   ├── train-masks
│   │   └── train_mask.json
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
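Optionally, a small script like the following (not part of the repository) can verify that the expected folders are in place before training; the path list simply mirrors the tree above, and `./dataset` should match `--dataset_dir`.

```python
# Optional sanity check: verify that the dataset folders referenced by
# --dataset_dir follow the layout shown above.
from pathlib import Path

DATASET_DIR = Path("./dataset")
EXPECTED = [
    "ade20k/annotations", "ade20k/images",
    "coco/train2017", "cocostuff/train2017",
    "llava_dataset/llava_instruct_150k.json",
    "mapillary", "cityscapes/gtFine", "cityscapes/leftImg8bit",
    "OpenImageV6", "reason_seg/ReasonSeg",
    "refer_seg/images", "vlpart/paco/annotations", "vlpart/pascal_part",
]

missing = [p for p in EXPECTED if not (DATASET_DIR / p).exists()]
print("All expected paths found." if not missing else f"Missing: {missing}")
```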
Training requires LLaVA's pre-trained weights. Specifically, we use LLaVA-Lightning-7B-v1-1 for our LaSagnA-7B model. The mask generator module additionally requires the SAM ViT-H weights.
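If the SAM checkpoint is not already available locally, it can be fetched with a small helper like the one below. The URL is the publicly released Segment Anything ViT-H checkpoint; the destination filename is only a suggestion and should match `--vision_pretrained` in the training command that follows.

```python
# Example helper for fetching the SAM ViT-H checkpoint used by the mask generator.
import urllib.request

SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
urllib.request.urlretrieve(SAM_URL, "sam_vit_h_4b8939.pth")
print("Downloaded SAM ViT-H checkpoint to sam_vit_h_4b8939.pth")
```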
deepspeed --master_port=24999 train_ds.py \
--version="PATH_TO_LLaVA" \
--dataset_dir='./dataset' \
--vision_pretrained="PATH_TO_SAM" \
--vision_tower='PATH_TO_CLIP' \
--dataset="sem_seg||refer_seg||vqa||reason_seg" \
--exp_name="LaSagnA-7b" \
--batch_size=2 \
--model_max_length=1024 \
  --num_all_classes=80
When training is finished, run the following to get the full model weights:
cd ./runs/LaSagnA-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
Merge the LoRA weights of pytorch_model.bin and save the resulting model to your desired path in the Hugging Face format:
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
--version="PATH_TO_LLaVA" \
--weight="PATH_TO_pytorch_model.bin" \
--save_path="PATH_TO_SAVED_MODEL"
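As an optional sanity check (not part of the released scripts), the merged checkpoint can be opened with the Hugging Face tokenizer API. The `[SEG]` token name below follows the LISA-style convention and is an assumption rather than documented LaSagnA behavior.

```python
# Optional sanity check on the merged checkpoint (assumption: a LISA-style
# "[SEG]" token was added to the vocabulary; the token name may differ).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PATH_TO_SAVED_MODEL")
seg_id = tokenizer.convert_tokens_to_ids("[SEG]")
print(f"vocab size: {len(tokenizer)}, [SEG] id: {seg_id}")
```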
deepspeed --master_port=24999 test_sem.py \
--version="PATH_TO_LaSagnA_MODEL" \
--dataset_dir='./dataset' \
  --eval_only
If you find this project useful in your research, please consider citing:
@article{wei2024lasagna,
title={LaSagnA: Language-based Segmentation Assistant for Complex Queries},
author={Wei, Cong and Tan, Haoxian and Zhong, Yujie and Yang, Yujiu and Ma, Lin},
journal={arXiv preprint arXiv:2404.08506},
year={2024}
}