We propose a new method for defending LLMs against jailbreaking attacks by ``backtranslation''. Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that could have led to that response. This inferred prompt, called the backtranslated prompt, tends to reveal the actual intent of the original prompt, since it is generated from the LLM's response and is not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. The proposed defense offers several advantages in both effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines, particularly in cases that are hard for the baselines, while having little impact on generation quality for benign input prompts.
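As a rough illustration of the defense (a minimal sketch only, not the library's implementation; query_llm, query_backtranslation_model, and is_refusal are hypothetical helpers standing in for the actual model calls and refusal check):

def defend_with_backtranslation(prompt, query_llm, query_backtranslation_model, is_refusal):
    # Step 1: get the target LLM's initial response to the (possibly adversarial) prompt.
    response = query_llm(prompt)
    if is_refusal(response):
        return response  # the model already refused, so return the refusal directly

    # Step 2: "backtranslate" -- ask a language model to infer an input prompt
    # that could have led to this response (the prompt wording here is illustrative).
    backtranslated_prompt = query_backtranslation_model(
        "Please guess the user's request that the following response answers:\n" + response
    )

    # Step 3: run the target LLM again on the backtranslated prompt.
    backtranslated_response = query_llm(backtranslated_prompt)

    # Step 4: refuse the original prompt if the model refuses the backtranslated prompt;
    # otherwise, return the initial response.
    if is_refusal(backtranslated_response):
        return "I'm sorry, but I cannot assist with that request."
    return response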
BibTeX for our paper:
@article{wang2024defending,
title={Defending LLMs against Jailbreaking Attacks via Backtranslation},
author={Wang, Yihan and Shi, Zhouxing and Bai, Andrew and Hsieh, Cho-Jui},
journal={arXiv preprint arXiv:2402.16459},
year={2024}
}
Our implementation is based on the llm-jailbreaking-defense
library developed by us.
The library provides general interfaces for wrapping an LLM with a jailbreaking defense,
including our proposed defense by backtranslation.
Please install the library first, following its setup guide.
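As a rough usage sketch of wrapping a target model with the backtranslation defense via this library (the class and function names here are assumptions; check the library's setup guide for the exact interface; the model name and token limit are just example values):

# Assumed interface of llm-jailbreaking-defense; verify names against the library's setup guide.
from llm_jailbreaking_defense import TargetLM, DefendedTargetLM
from llm_jailbreaking_defense import BacktranslationConfig, load_defense

# Wrap a target model (example model name and token limit) with the backtranslation defense.
target_model = TargetLM(model_name='vicuna-13b-v1.5', max_n_tokens=300)
defense = load_defense(BacktranslationConfig())
defended_target_model = DefendedTargetLM(target_model, defense)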
We have considered attacks including GCG, PAIR, AutoDAN, PAP, and jailbreakchat.com.
Running GCG requires a separate environment with fschat==0.2.20 (it does not work with the latest fschat).
Install GCG by:
cd GCG
pip install -e .
Run an individual attack by:
cd GCG/experiments
python -u main.py \
--config="configs/individual_{target_model}.py" \
--config.train_data="../../data/harmful_behaviors_custom.csv" \
--config.result_prefix="output/attacks/gcg/gcg_individual_{target_model}"
{target_model} can be chosen from vicuna_13B and llama2_13B.
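For example, for Vicuna-13B, the template above becomes:
python -u main.py \
--config="configs/individual_vicuna_13B.py" \
--config.train_data="../../data/harmful_behaviors_custom.csv" \
--config.result_prefix="output/attacks/gcg/gcg_individual_vicuna_13B"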
Run the transfer attacks against Vicuna and Guanaco:
cd GCG/experiments
# Vicuna-7B and Vicuna-13B, with the first random seed
# (2 GPUs are needed for the 2 models, respectively.)
python main.py \
--config configs/transfer_vicuna.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_1
# Vicuna-7B and Vicuna-13B, with the second random seed
# (2 GPUs are needed for the 2 models, respectively.)
python main.py \
--config configs/transfer_vicuna.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_2
# Vicuna-7B, Vicuna-13B, Guanaco-7B, Guanaco-13B
# (4 GPUs are needed for the 4 models, respectively.)
python main.py \
--config configs/transfer_vicuna_guanaco.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_guanaco
Then post-process the raw GCG outputs for inference:
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_individual_vicuna_13B*.json \
--output_file output/attacks/gcg_processed/gcg_individual_vicuna_13B.json
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_individual_llama2_13B*.json \
--output_file output/attacks/gcg_processed/gcg_individual_llama2_13B.json
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_1*.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_1.json
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_2*.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_2.json
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_guanaco.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_guanaco.json
Install additional dependencies for running PAIR:
cd PAIR
pip install -r requirements.txt
cd ..
Run wandb login beforehand to log in to your WandB account and log your data.
We run PAIR by specifying a target_model chosen from vicuna-13b-v1.5, llama-2-13b, and gpt-3.5-turbo.
We can optionally run PAIR with a defense method by setting --defense_method.
See the options for --defense_method in the "Inference" section below.
python -m attacks.pair \
--load_data_path ./data/harmful_behaviors_custom.json \
--save_result_path {attack_output} \
--target_model {target_model} \
--defense_method {defense_method}
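For example, to run PAIR against Vicuna-13B defended by backtranslation (the output path here is only illustrative):
python -m attacks.pair \
--load_data_path ./data/harmful_behaviors_custom.json \
--save_result_path output/attacks/pair_vicuna_backtranslation.json \
--target_model vicuna-13b-v1.5 \
--defense_method backtranslation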
Run the AutoDAN attack by:
python attacks/autodan.py \
--harmful_behavior_path ./data/harmful_behaviors_custom.json \
--save_output_path {attack_output} \
--target_model {target_model}
{target_model} can be chosen from vicuna-13b-v1.5 and llama-2-13b.
For gpt-3.5-turbo, we adopt a transfer attack setting where the attacks are generated using vicuna-13b-v1.5 as the target model.
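For example, to run AutoDAN against Vicuna-13B (the output path here is only illustrative):
python attacks/autodan.py \
--harmful_behavior_path ./data/harmful_behaviors_custom.json \
--save_output_path output/attacks/autodan_vicuna.json \
--target_model vicuna-13b-v1.5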
To run the attack and inference for GCG, fschat==0.2.20 is required (a separate virtual environment is recommended).
Run inference by:
python run_inference.py \
--load_data_path {attack_output} \
--save_result_path {inference_output} \
--target_model {target_model} \
--defense_method {defense_method} \
--prompt_key {key_of_jailbreaking_prompt} \
--target_max_n_tokens {target_max_n_tokens} \
--verbose
- {attack_output} is the JSON output of the attack.
- {target_model} can be chosen from vicuna-13b-v1.5, llama-2-13b, gpt-3.5-turbo, and gpt-3.5-turbo-0301 (for AutoDAN).
- {defense_method} can be chosen from None (no defense), SmoothLLM, paraphrase_prompt (paraphrasing), response_check (response check), and backtranslation.
  - For backtranslation, we can specify the threshold $\gamma$ by setting --backtranslation_threshold {gamma}.
  - For response_check, we can specify the threshold by setting --response_check_threshold {threshold}.
- {key_of_jailbreaking_prompt} is the JSON key of the jailbreaking prompt in {attack_output}.
- {target_max_n_tokens} is the maximum number of tokens in the inference output.
  - For AdvBench data from data/harmful_behaviors_custom.json, we use --target_max_n_tokens 300.
  - For MT-Bench data from benign_behaviors_custom.json, we use --target_max_n_tokens 1024.
To run inference without jailbreaking, set --prompt_key goal and --load_data_path data/harmful_behaviors_custom.json.
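For example, to run the backtranslation defense on the AdvBench prompts without a jailbreaking attack (the output path here is only illustrative):
python run_inference.py \
--load_data_path data/harmful_behaviors_custom.json \
--save_result_path output/inference/no_attack_vicuna_backtranslation.json \
--target_model vicuna-13b-v1.5 \
--defense_method backtranslation \
--prompt_key goal \
--target_max_n_tokens 300 \
--verbose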
Finally, we run a judge on the inference output:
python run_judge.py \
--judge_name {judge_model} \
--load_data_path {inference_output} \
--save_result_path {judge_output} \
--judge_max_n_tokens {max_n_tokens} \
--goal_key {goal_key} \
--response_key {response_key} \
--verbose
judge_name is the name of the judge method with an optional judge model: [judge_method]@[judge_model].
Options for judge_method include:
- pair: the judge introduced in PAIR, with --judge_max_n_tokens 10.
- quality: the judge used to rate the quality of a response, introduced in MT-Bench, with --judge_max_n_tokens 2048.
- openai_policy: the judge adopted by PAP, with --judge_max_n_tokens 1024.
- gcg_matching: the judge adopted by GCG.
goal_key and response_key are the keys of the jailbreaking attack goals and the responses to be judged in the inference output JSON file.
Options for judge_model include gpt-4 and gpt-3.5.
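For example, to judge jailbreaking success with the PAIR judge and GPT-4 (the paths are only illustrative, and the goal and response values for --goal_key and --response_key are assumptions that should match the keys in your inference output file):
python run_judge.py \
--judge_name pair@gpt-4 \
--load_data_path output/inference/no_attack_vicuna_backtranslation.json \
--save_result_path output/judge/no_attack_vicuna_backtranslation_pair.json \
--judge_max_n_tokens 10 \
--goal_key goal \
--response_key response \
--verbose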