This is the official repository for Multilingual Hallucination Removal (MHR), a straightforward yet notably effective approach aimed at alleviating multilingual hallucinations prevalent in Large Vision-Language Models (LVLMs).
- We propose the Multilingual Hallucination Removal (MHR) strategy, a straightforward yet effective framework for reducing hallucinations across various languages.
- The MHR framework comprises two stages: Multilingual Supervised Fine-Tuning and Multilingual Direct Preference Optimization.
conda create -n mhr python=3.9
conda activate mhr
cd MHR
pip install -r requirements.txt
pip install -e .
- Multilingual Supervised Fine-tuning:
  - 1.1 Prepare SFT data: PALO
  - 1.2 Train SFT on LVLM with the SFT script:
PROMPT_VERSION=v1
MODEL_VERSION=vicuna-v1-5-7b
LM_MODEL_CKPT=lmsys/vicuna-7b-v1.5

deepspeed mhr/alignment/models/llava_v1_5/train_sft.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path $LM_MODEL_CKPT \
    --version $PROMPT_VERSION \
    --data_path ${DATA_PATH} \
    --image_folder ${img_folder} \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ${vision_tower_path} \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ${output_dir} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1280 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --image_aspect_ratio 'pad'
- Generate preference data using the scripts under mhr/preprocess
  - 2.1 Prepare hallucination-based English data.
  - A. For hallucination alignment or language alignment:
    - 2.2 Sample LVLM responses using lvlm_sampling.py
    - 2.3 Calculate the alignment score using calculate_PPL_score.py or desc_calculate_ppl_score.py
    - 2.4 Extract DPO data using desc_extract_dpo_data.py or extract_dpo_data.py
  - B. For translation alignment:
    - 2.2 Translate the English hallucination preference dataset into other languages using translate.py
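To make the scoring and extraction steps in part A more concrete, here is a minimal sketch of how sampled responses could be ranked by perplexity and turned into a preference pair. It is not the logic of calculate_PPL_score.py or extract_dpo_data.py, which should be consulted directly. The function names, the `prompt`/`chosen`/`rejected` keys, and the rule that lower perplexity marks the preferred response are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def build_preference_pair(prompt: str,
                          candidates: Dict[str, List[float]]) -> Dict[str, str]:
    """Rank sampled responses by perplexity and keep the two extremes.

    ASSUMPTION (not taken from the repository): the response with the
    lowest perplexity is treated as "chosen" and the one with the highest
    perplexity as "rejected".
    """
    if len(candidates) < 2:
        raise ValueError("need at least two sampled responses")
    ranked = sorted(candidates, key=lambda text: perplexity(candidates[text]))
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}


if __name__ == "__main__":
    # Hypothetical sampled responses mapped to their per-token log-probs.
    samples = {
        "response A": [-0.1, -0.2],
        "response B": [-2.0, -3.0],
        "response C": [-1.0],
    }
    print(build_preference_pair("Describe the image.", samples))
```

In practice the token log-probabilities would come from a scoring model, and the choice of that model and of the ranking rule determines what the resulting pairs actually teach the LVLM.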
- Train with Preference Optimization
  - Train DPO on LVLM with the DPO script:
accelerate launch --config_file=${accelerate_config_file} ./train_dpo.py \
    --deepspeed ./scripts/deepspeed/zero3.json \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 0 \
    --model_name_or_path ${model_name_or_path} \
    --version v1 \
    --vision_tower ${vision_tower_path} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ${ckpt_save_path} \
    --num_train_epochs 9 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps ${save_steps} \
    --save_total_limit 5 \
    --learning_rate 2e-6 \
    --weight_decay 0. \
    --warmup_steps 0 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --report_to wandb \
    --run_name ${ckpt_name} \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --beta 0.1 \
    --hallucination_data_path ${hallucination_data} \
    --hallucination_data_type "dir_of_jsonl_desc" \
    --hallucination_ratio 1 \
    --preference_data_path ${preference_data} \
    --preference_ratio 1 \
    --preference_data_type "dir_of_jsonl_desc" \
    --translation_data_path ${translation_data} \
    --translation_ratio 1 \
    --translation_data_type "dir_of_json_desc" \
    --image_folder ${image_folder} \
    --vg_path ${vg_annotation_path} \
    --resume_from_checkpoint ${resume_from_checkpoint}
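For readers unfamiliar with the `--beta 0.1` flag, the sketch below shows the standard per-example Direct Preference Optimization loss in plain Python. It illustrates the general DPO objective only and is not the implementation in train_dpo.py; the function name and argument names are hypothetical, and the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favors the chosen response over the
    # rejected one, measured relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy has moved toward the
    # chosen response, so the loss falls below the log(2) starting point.
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0, beta=0.1))
```

A larger `beta` penalizes the policy more sharply for drifting from the reference model's preferences, while a smaller value lets the preference data dominate.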
- Evaluation
- We evaluate our method using lmms-eval. Please follow the instructions to add the tasks and data for evaluation.
- MHR significantly mitigates the multilingual hallucination issue across different languages.
Table 1. Performance of the enhanced LLaVA-1.5 models on all three datasets of the POPE benchmark. We test on the “popular” type. Average scores of the current partition are marked in gray, and bold text denotes the best results for the same backbone.
- MHR achieves remarkable performance on the MME hallucination subset.
Table 2. Results on the hallucination subset of MME. Higher scores indicate better performance and fewer hallucinations. The best performances within each setting are bolded. Limited by space, we only present 4 languages here, including high-resource languages ru and zh, and low-resource languages uk and bg. To help understand the overall performance comparison, we also report the average results for all 13 languages.
Figure 2. Performance on the full MME set, which consists of 14 tasks. Each graph compares LLaVA-1.5 with our MHR model. We present results in four languages (uk, zh, bg, and ru), as in Table 2.
- Please refer to our paper for detailed experimental results.
Figure 3. Illustration of hallucination removal by our proposed MHR, with 7 languages as examples. Hallucinated parts of the responses are marked in yellow and correct parts in green.