- We are the first to comprehensively study jailbreaking against MLLMs. Our approach exhibits a strong data-universal property and notable model-transferability, enabling the jailbreaking of various models in a black-box manner.
- We propose a construction-based method to harness our approach for LLM-jailbreaks, demonstrating superior efficiency compared to existing LLM-jailbreaking methods.
To date, there is no multimodal dataset available for evaluating MLLM-jailbreaks, although text-only datasets for LLM-jailbreaking evaluation exist, such as AdvBench. We therefore construct a multimodal dataset, AdvBench-M, based on AdvBench.
We group all the harmful behaviors within AdvBench into 8 distinct semantic categories: “Bombs or Explosives”, “Drugs”, “Self-harm and Suicide”, “Cybersecurity and Privacy Issues”, “Physical Assault”, “Terrorism and Societal Tensions”, “Stock Market and Economy”, and “Firearms and Ammunition”. For each category, we retrieve 30 semantically relevant images from the Internet via the Google search engine and pair them with the corresponding harmful behaviors.
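To make the pairing concrete, here is a minimal loading sketch. The directory layout and file names below (a `behaviors.csv` of category/behavior rows plus one image folder per category) are illustrative assumptions, not the released AdvBench-M format:

```python
import csv
import os

def load_advbench_m(root="AdvBench-M"):
    """Pair every harmful behavior with the images retrieved for its category."""
    pairs = []
    with open(os.path.join(root, "behaviors.csv"), newline="") as f:
        for category, behavior in csv.reader(f):
            img_dir = os.path.join(root, "images", category)
            for name in sorted(os.listdir(img_dir)):
                pairs.append({"category": category,
                              "behavior": behavior,
                              "image": os.path.join(img_dir, name)})
    return pairs
```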
1. Prepare the code and the environment
Git clone our repository, create a Python environment, and activate it via the following commands
git clone https://github.com/abc03570128/Jailbreaking-Attack-against-Multimodal-Large-Language-Model.git
cd MLLMs-jailbreaks
conda env create -f environment.yml
conda activate minigptv
2. Prepare the pretrained LLM weights
We examine several popular multimodal LLMs, including MiniGPT-4, MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2. Download the corresponding LLM weights from the Hugging Face repositories listed below by cloning each repository with git-lfs (see the example after the table).
MiniGPT-4 has three variants built around three distinct LLMs, i.e., Vicuna-7B, Vicuna-13B, and LLaMA2, while MiniGPT-v2 employs only LLaMA2 as its LLM. For white-box jailbreaks, we evaluate our approach on MiniGPT-4 and MiniGPT-v2 separately. For evaluating model-transferability, we generate the imgJP on MiniGPT-4 and subsequently employ it for black-box attacks on MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2.
Model name | Hugging Face Repo |
---|---|
MiniGPT-4(Vicuna7B) | Vision-CAIR/vicuna-7b |
MiniGPT-4(Vicuna13B) | Vision-CAIR/vicuna |
MiniGPT-4(LLaMA2) | meta-llama/Llama-2-7b-chat-hf |
MiniGPT-v2 | meta-llama/Llama-2-7b-chat-hf |
InstructBLIP | lmsys/vicuna-7b-v1.1 |
LLaVA | liuhaotian/llava-v1.5-13b |
mPLUG-Owl2 | MAGAer13/mplug-owl2-llama2-7b |
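For example, the LLaMA2 weights used by MiniGPT-4 (LLaMA2) and MiniGPT-v2 can be fetched as follows (git-lfs must be installed, and access to the meta-llama repository must have been granted on Hugging Face):

```
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
```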
Then, set the variable llama_model in the model config file to the LLM weight path (see the example snippet after this list).
- For MiniGPT-v2, set the LLM path here at Line 14.
- For MiniGPT-4 (LLaMA2), set the LLM path here at Line 15.
- For MiniGPT-4 (Vicuna), set the LLM path here at Line 18.
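The entry in each model config file looks roughly like the following; the path is a placeholder for wherever you stored the downloaded weights:

```yaml
llama_model: "/path/to/Llama-2-7b-chat-hf"  # local path to the downloaded LLM weights
```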
3. Prepare the pretrained model checkpoints
Download the pretrained model checkpoints
MiniGPT-4 (Vicuna 7B) | MiniGPT-4 (Vicuna 13B) | MiniGPT-4 (LLaMA-2 Chat 7B) | MiniGPT-v2 (online developing demo) |
---|---|---|---|
Download | Download | Download | Download |
For MiniGPT-v2, set the path to the pretrained checkpoint in the evaluation config file eval_configs/minigptv2_eval.yaml at Line 8.
For MiniGPT-4, set the path to the pretrained checkpoint in the evaluation config file eval_configs/minigpt4_eval.yaml at Line 8 for the Vicuna version, or eval_configs/minigpt4_llama2_eval.yaml for the LLaMA2 version.
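For example, assuming the checkpoint path key is named ckpt as in the MiniGPT-4 evaluation configs (verify against the actual file), the edit looks like:

```yaml
ckpt: "/path/to/pretrained_checkpoint.pth"  # downloaded MiniGPT checkpoint
```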
In the fig folder, we showcase numerous successful jailbreaking instances, encompassing white-box attacks on MiniGPT-4 (LLaMA2) as well as examples of black-box transfer attacks.
1. imgJP-based Jailbreak (Multiple Harmful Behaviors)
For MiniGPT-4 (LLaMA2), run
python v1_mprompt.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml --gpu-id 0
For MiniGPT-4 (LLaMA2 + Img-suffix), run
python v1_mprompt_img_suffix.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml --gpu-id 0
For MiniGPT-v2, run
python v2_mprompt.py --cfg-path eval_configs/minigptv2_eval.yaml --gpu-id 0
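Conceptually, these scripts optimize a single jailbreaking image (imgJP) so that the MLLM produces the target affirmative responses to many harmful behaviors at once. Below is a minimal PGD-style sketch of that idea; `mllm_loss` is a placeholder for the model-specific loss (the negative log-likelihood of the target response), and the hyperparameters are illustrative, not the repository's actual code:

```python
import torch

def optimize_imgjp(mllm_loss, behaviors, targets, image,
                   eps=32 / 255, alpha=1 / 255, steps=500):
    """PGD-style optimization of one jailbreaking image over many behaviors.

    mllm_loss(image, prompt, target) -> NLL of `target` given (image, prompt).
    """
    img_jp = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        # Summing the loss over all (behavior, target) pairs yields a
        # data-universal imgJP that works across many harmful behaviors.
        loss = sum(mllm_loss(img_jp, p, t) for p, t in zip(behaviors, targets))
        loss.backward()
        with torch.no_grad():
            img_jp -= alpha * img_jp.grad.sign()              # gradient-sign step
            img_jp.clamp_(min=image - eps, max=image + eps)   # stay in the eps-ball
            img_jp.clamp_(0, 1)                               # keep valid pixel values
        img_jp.grad = None
    return img_jp.detach()
```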
2. deltaJP-based Jailbreak
For MiniGPT-4 (LLaMA2), run
python v1_Mprompt_Mimage.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml --gpu-id 0
For MiniGPT-v2, run
python v2_Mprompt_Mimage.py --cfg-path eval_configs/minigptv2_eval.yaml --gpu-id 0
We generate the imgJP on surrogate models (e.g., Vicuna and LLaMA2) and use it to jailbreak various target models (e.g., mPLUG-Owl2, LLaVA, MiniGPT-v2, and InstructBLIP) in a black-box manner.
run
python v1_Mprompt_Mmodel.py --gpu-id 0
With the generated imgJP, we execute black-box attacks on all four target models using their default conversation templates. Taking mPLUG-Owl2 as an example, run
python mPLUG-Owl2_demo.py --gpu-id 0
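Schematically, the transfer attack simply pairs the saved imgJP with each harmful instruction and queries the target model through its ordinary chat interface. In the sketch below, `target_mllm` is a placeholder callable, not any model's real API:

```python
from PIL import Image

def transfer_attack(target_mllm, imgjp_path, harmful_behaviors):
    """Query a black-box target MLLM with the surrogate-optimized imgJP."""
    img_jp = Image.open(imgjp_path).convert("RGB")
    # The target model is used exactly as a normal user would:
    # default conversation template, image plus text prompt.
    return [target_mllm(image=img_jp, prompt=b) for b in harmful_behaviors]
```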
Here we explore jailbreaking LLaMA2 itself. First, we construct an MLLM that encapsulates it, as shown in the following figure.
Secondly, we perform our MLLM-jailbreak to acquire the imgJP while concurrently recording the corresponding embedding embJP, i.e., run
python v1_mprompt_img_suffix.py --gpu-id 0
Thirdly, embJP is reversed into text space through De-embedding and De-tokenizer operations (a sketch of this step follows the command below). We then execute LLM-jailbreaks with LLaMA2's default conversation template.
run
python Test_Llama2_image_suffix.py
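The De-embedding step can be thought of as a nearest-neighbor lookup in the LLM's token-embedding matrix, followed by de-tokenization back to text. A simplified sketch of that idea, assuming embJP is a [num_tokens, hidden_dim] tensor (this is an illustration, not the repository's implementation):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def deembed(emb_jp: torch.Tensor) -> str:
    """Map each embedding vector to its nearest vocabulary token, then decode."""
    vocab_emb = model.get_input_embeddings().weight          # [vocab_size, hidden_dim]
    # Cosine similarity between every embJP vector and every vocabulary embedding.
    sims = torch.nn.functional.normalize(emb_jp, dim=-1) @ \
           torch.nn.functional.normalize(vocab_emb, dim=-1).T
    token_ids = sims.argmax(dim=-1)                          # nearest token per position
    return tokenizer.decode(token_ids.tolist())              # the "De-tokenizer" step
```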
- MiniGPT-4: This repository is built upon MiniGPT-4!
- llm-attacks: Andy Zou's outstanding work found that a specific prompt suffix allows the jailbreaking of most popular LLMs. Don't forget to check out this great open-source work if you haven't seen it before!
- adversarial-attacks-pytorch: Torchattacks is a PyTorch library that provides adversarial attacks to generate adversarial examples.