Official code and data of our paper:
Dissecting Adversarial Robustness of Multimodal LM Agents
Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
Carnegie Mellon University
Preprint, Jun 2024
Oral presentation at NeurIPS 2024 Open-World Agents Workshop
[Paper link] | [Website] | [Data]
Compared to (A) attacks on image classifiers and (B) jailbreaking attacks on LLMs, attacks on agents have limited access to the input space (e.g., only one image in the environment), and the target output depends on the environment rather than a specific prediction. The attacker can manipulate the agent through (C) illusioning, which makes it appear to the agent that it is in a different state, or (D) goal misdirection, which makes the agent pursue a targeted goal different from the original user goal.
Our code requires two repositories, including this one. The file structure should look like this:
.
├── agent-attack # This repository
└── visualwebarena
You can skip this step if you only want to run the lightweight step-wise evaluation (e.g., for early development) or the attacks.
VisualWebArena is required if you want to run the episode-wise evaluation that reproduces the results in our paper. It requires at least 200 GB of disk space and Docker to run.
The original version of VisualWebArena can be found here, but we modified it to support perturbations to the trigger images. Clone the modified version and install it:
git clone git@github.com:ChenWu98/visualwebarena.git
cd visualwebarena/
# Install based on the README.md of https://github.com/ChenWu98/visualwebarena
# Make sure that `pytest -x` passes
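For reference, installation typically amounts to something like the following (a sketch only; the upstream README is authoritative and assumes Python 3.10+):
pip install -r requirements.txt
playwright install
pip install -e .
pytest -x  # should pass before moving on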
Clone the repository and install with pip:
git clone git@github.com:ChenWu98/agent-attack.git
cd agent-attack/
python -m pip install -e .
You may need to install PyTorch according to your CUDA version.
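For example, for a CUDA 12.1 setup (adjust the index URL to your CUDA version; see pytorch.org for the exact command):
python -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121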
Important
You need to set the corresponding API keys each time before running the code.
Configure the OpenAI API key:
export OPENAI_API_KEY=<your-openai-api-key>
If using Claude, configure the Anthropic API key:
export ANTHROPIC_API_KEY=<your-anthropic-api-key>
If using Gemini, first install the gcloud CLI. Set up a Google Cloud project and get its ID from the Google Cloud console, and get an AI Studio API key from the AI Studio console. Then authenticate with Google Cloud and configure the AI Studio API key:
gcloud auth login
gcloud config set project <your-google-cloud-project-id>
export VERTEX_PROJECT=<your-google-cloud-project-id> # Same as above
export AISTUDIO_API_KEY=<your-aistudio-api-key>
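As an optional sanity check before launching a long run, you can verify that the keys are visible to the scripts (only the keys for the model you plan to run are needed):
python -c "import os; print({k: bool(os.environ.get(k)) for k in ['OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'VERTEX_PROJECT', 'AISTUDIO_API_KEY']})"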
You only need to do this once.
Copy the raw data files to the experiment data directory:
cp -r data/ exp_data/
The adversarial examples will later be saved to the exp_data/ directory.
You can skip this step if you want to see how the attacks break the agent without running the attacks yourself; we provide pre-generated adversarial examples.
This section describes how to reproduce the attacks in our paper. Each attack on an image takes about an hour on a single GPU. For reference, we used an NVIDIA A100 (80GB) for the captioner attack and an NVIDIA A6000 for the CLIP attack.
To run the captioner attack:
python scripts/run_cap_attack.py
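For intuition, here is a minimal sketch of the idea behind the captioner attack: an iterative signed-gradient (BIM) perturbation of the trigger image that pushes a surrogate captioner toward an attacker-chosen caption. The BLIP checkpoint, target caption, file names, and hyperparameters below are illustrative assumptions; scripts/run_cap_attack.py is the actual implementation.
import torch
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to(device).eval()

image = Image.open("trigger.png").convert("RGB")   # hypothetical trigger image
target_caption = "this item is sold out"           # hypothetical target caption

inputs = processor(images=image, text=target_caption, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values                 # already resized and normalized
eps, alpha, steps = 0.1, 0.01, 100                 # illustrative budget, in normalized space
delta = torch.zeros_like(pixel_values, requires_grad=True)

for _ in range(steps):
    out = model(
        pixel_values=pixel_values + delta,
        input_ids=inputs.input_ids,
        labels=inputs.input_ids,   # teacher-force the target caption and minimize its loss
    )
    out.loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()   # BIM step toward the target caption
        delta.clamp_(-eps, eps)              # keep the perturbation bounded
        delta.grad.zero_()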
To run the CLIP attack, run the corresponding script for each model:
python scripts/run_clip_attack.py --model gpt-4-vision-preview
python scripts/run_clip_attack.py --model gemini-1.5-pro-latest
python scripts/run_clip_attack.py --model claude-3-opus-20240229
python scripts/run_clip_attack.py --model gpt-4o-2024-05-13
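Similarly, here is a minimal sketch of an L_inf PGD attack that moves the trigger image's CLIP embedding toward the text embedding of a target caption, which is the spirit of the CLIP attack. It assumes the open_clip_torch package; the model choice, target text, file names, and budget are illustrative, and scripts/run_clip_attack.py is the actual implementation.
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to(device).eval()

# Work in [0, 1] pixel space so the L_inf budget is interpretable.
to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
normalize = transforms.Normalize(  # standard CLIP normalization constants
    mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711)
)
image = to_tensor(Image.open("trigger.png").convert("RGB")).unsqueeze(0).to(device)  # hypothetical file
target = tokenizer(["a sold out sign on the product"]).to(device)                    # hypothetical target text

with torch.no_grad():
    target_emb = F.normalize(model.encode_text(target), dim=-1)

eps, alpha, steps = 16 / 255, 2 / 255, 100  # illustrative L_inf budget and step schedule
delta = torch.zeros_like(image, requires_grad=True)

for _ in range(steps):
    adv = (image + delta).clamp(0, 1)
    img_emb = F.normalize(model.encode_image(normalize(adv)), dim=-1)
    loss = -(img_emb * target_emb).sum()  # maximize cosine similarity to the target text
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()  # signed-gradient (PGD) step
        delta.clamp_(-eps, eps)             # project back into the L_inf ball
        delta.grad.zero_()

adv_final = (image + delta).clamp(0, 1).detach().squeeze(0).cpu()
transforms.ToPILImage()(adv_final).save("trigger_adv.png")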
The generated adversarial examples will be saved to files in the exp_data/ directory.
Important
You need to set the URLs each time before running the code.
Configure the URLs for each website:
export CLASSIFIEDS="http://127.0.0.1:9980"
# Default reset token for classifieds site, change if you edited its docker-compose.yml
export CLASSIFIEDS_RESET_TOKEN="4b61655535e7ed388f0d40a93600254c"
export SHOPPING="http://127.0.0.1:7770"
export REDDIT="http://127.0.0.1:9999"
export WIKIPEDIA="http://127.0.0.1:8888"
export HOMEPAGE="http://127.0.0.1:4399"
You can replace http://127.0.0.1 with the actual IP address you are using.
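For example, if the websites are hosted on a remote machine at 192.168.0.2 (an illustrative address), the exports would look like:
export SHOPPING="http://192.168.0.2:7770"
export REDDIT="http://192.168.0.2:9999"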
You only need to process the data files once.
Process the data files (e.g., replace the URL placeholders with the actual URLs):
python scripts/process_data.py --data_dir exp_data/
Run the episode-wise evaluation for the GPT-4V + SoM agent:
# Episode-wise, benign
bash episode_scripts/gpt4v_benign.sh
# Episode-wise, benign, no captioning
bash episode_scripts/gpt4v_benign_no_cap.sh
# Episode-wise, benign, self-caption
bash episode_scripts/gpt4v_benign_self_cap.sh
# Episode-wise, with captioner attack
bash episode_scripts/gpt4v_bim_caption_attack.sh
# Episode-wise, with CLIP attack
bash episode_scripts/gpt4v_clip_attack_self_cap.sh
# Episode-wise, with CLIP attack, no captioning
bash episode_scripts/gpt4v_clip_attack_no_cap.sh
Run the episode-wise evaluation for the GPT-4o (05-13) + SoM agent:
# Episode-wise, benign
bash episode_scripts/gpt4o_benign.sh
# Episode-wise, benign, no captioning
bash episode_scripts/gpt4o_benign_no_cap.sh
# Episode-wise, benign, self-caption
bash episode_scripts/gpt4o_benign_self_cap.sh
# Episode-wise, with captioner attack
bash episode_scripts/gpt4o_bim_caption_attack.sh
# Episode-wise, with CLIP attack
bash episode_scripts/gpt4o_clip_attack_self_cap.sh
# Episode-wise, with CLIP attack, no captioning
bash episode_scripts/gpt4o_clip_attack_no_cap.sh
Run the episode-wise evaluation for the Gemini 1.5 Pro + SoM agent:
# Episode-wise, benign
bash episode_scripts/gemini1.5pro_benign.sh
# Episode-wise, benign, no captioning
bash episode_scripts/gemini1.5pro_benign_no_cap.sh
# Episode-wise, benign, self-caption
bash episode_scripts/gemini1.5pro_benign_self_cap.sh
# Episode-wise, with captioner attack
bash episode_scripts/gemini1.5pro_bim_caption_attack.sh
# Episode-wise, with CLIP attack
bash episode_scripts/gemini1.5pro_clip_attack_self_cap.sh
# Episode-wise, with CLIP attack, no captioning
bash episode_scripts/gemini1.5pro_clip_attack_no_cap.sh
Run the episode-wise evaluation for the Claude 3 Opus + SoM agent:
# Episode-wise, benign
bash episode_scripts/claude3opus_benign.sh
# Episode-wise, benign, no captioning
bash episode_scripts/claude3opus_benign_no_cap.sh
# Episode-wise, benign, self-caption
bash episode_scripts/claude3opus_benign_self_cap.sh
# Episode-wise, with captioner attack
bash episode_scripts/claude3opus_bim_caption_attack.sh
# Episode-wise, with CLIP attack
bash episode_scripts/claude3opus_clip_attack_self_cap.sh
# Episode-wise, with CLIP attack, no captioning
bash episode_scripts/claude3opus_clip_attack_no_cap.sh
Run the step-wise evaluation for the GPT-4V + SoM agent:
# Step-wise, benign
bash step_scripts/gpt4v_benign.sh
# Step-wise, benign, no captioning
bash step_scripts/gpt4v_benign_no_cap.sh
# Step-wise, with captioner attack
bash step_scripts/gpt4v_bim_caption_attack.sh
# Step-wise, with CLIP attack
bash step_scripts/gpt4v_clip_attack_self_cap.sh
# Step-wise, with CLIP attack, no captioning
bash step_scripts/gpt4v_clip_attack_no_cap.sh
Run the step-wise evaluation for the GPT-4o (05-13) + SoM agent:
# Step-wise, benign
bash step_scripts/gpt4o_benign.sh
# Step-wise, benign, no captioning
bash step_scripts/gpt4o_benign_no_cap.sh
# Step-wise, with captioner attack
bash step_scripts/gpt4o_bim_caption_attack.sh
# Step-wise, with CLIP attack
bash step_scripts/gpt4o_clip_attack_self_cap.sh
# Step-wise, with CLIP attack, no captioning
bash step_scripts/gpt4o_clip_attack_no_cap.sh
Run the step-wise evaluation for the Gemini 1.5 Pro + SoM agent:
# Step-wise, benign
bash step_scripts/gemini1.5pro_benign.sh
# Step-wise, benign, no captioning
bash step_scripts/gemini1.5pro_benign_no_cap.sh
# Step-wise, with captioner attack
bash step_scripts/gemini1.5pro_bim_caption_attack.sh
# Step-wise, with CLIP attack
bash step_scripts/gemini1.5pro_clip_attack_self_cap.sh
# Step-wise, with CLIP attack, no captioning
bash step_scripts/gemini1.5pro_clip_attack_no_cap.sh
Run the step-wise evaluation for the Claude 3 Opus + SoM agent:
# Step-wise, benign
bash step_scripts/claude3opus_benign.sh
# Step-wise, benign, no captioning
bash step_scripts/claude3opus_benign_no_cap.sh
# Step-wise, with captioner attack
bash step_scripts/claude3opus_bim_caption_attack.sh
# Step-wise, with CLIP attack
bash step_scripts/claude3opus_clip_attack_self_cap.sh
# Step-wise, with CLIP attack, no captioning
bash step_scripts/claude3opus_clip_attack_no_cap.sh
See the FIXME comments in the code for some hard-coded hacks we used to work around slight differences in the environment.
If you find this code useful, please consider citing our paper:
@article{wu2024agentattack,
title={Adversarial Attacks on Multimodal Agents},
author={Wu, Chen Henry and Koh, Jing Yu and Salakhutdinov, Ruslan and Fried, Daniel and Raghunathan, Aditi},
journal={arXiv preprint arXiv:2406.12814},
year={2024}
}