We propose a new method for defending LLMs against jailbreaking attacks by ``backtranslation''. Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that could have led to that response. This inferred prompt, called the backtranslated prompt, tends to reveal the actual intent of the original prompt, since it is generated from the LLM's response and is not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. The proposed defense offers several advantages in both effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines, particularly in cases that are hard for the baselines, while having little impact on generation quality for benign input prompts.
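As a rough illustration of the defense (a minimal sketch only, not the library's implementation; query_llm, query_backtranslation_model, and is_refusal are hypothetical helpers standing in for the actual model calls and refusal check):

def defend_with_backtranslation(prompt, query_llm, query_backtranslation_model, is_refusal):
    # Step 1: get the target LLM's initial response to the (possibly adversarial) prompt.
    response = query_llm(prompt)
    if is_refusal(response):
        return response  # the model already refused, so return the refusal directly

    # Step 2: "backtranslate" -- ask a language model to infer an input prompt
    # that could have led to this response (the prompt wording here is illustrative).
    backtranslated_prompt = query_backtranslation_model(
        "Please guess the user's request that the following response answers:\n" + response
    )

    # Step 3: run the target LLM again on the backtranslated prompt.
    backtranslated_response = query_llm(backtranslated_prompt)

    # Step 4: refuse the original prompt if the model refuses the backtranslated prompt;
    # otherwise, return the initial response.
    if is_refusal(backtranslated_response):
        return "I'm sorry, but I cannot assist with that request."
    return response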
BibTeX for our paper:
@article{wang2024defending,
title={Defending LLMs against Jailbreaking Attacks via Backtranslation},
author={Wang, Yihan and Shi, Zhouxing and Bai, Andrew and Hsieh, Cho-Jui},
journal={arXiv preprint arXiv:2402.16459},
year={2024}
}
Our implementation is based on the llm-jailbreaking-defense
library developed by us.
The library provides general interfaces for wrapping an LLM with a jailbreaking defense,
including our proposed defense by backtranslation.
Please install the library first, following its setup guide.
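As a rough usage sketch of wrapping a target model with the backtranslation defense via this library (the class and function names here are assumptions; check the library's setup guide for the exact interface; the model name and token limit are just example values):

# Assumed interface of llm-jailbreaking-defense; verify names against the library's setup guide.
from llm_jailbreaking_defense import TargetLM, DefendedTargetLM
from llm_jailbreaking_defense import BacktranslationConfig, load_defense

# Wrap a target model (example model name and token limit) with the backtranslation defense.
target_model = TargetLM(model_name='vicuna-13b-v1.5', max_n_tokens=300)
defense = load_defense(BacktranslationConfig())
defended_target_model = DefendedTargetLM(target_model, defense)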
We have considered attacks including GCG, PAIR, AutoDAN, PAP, and jailbreakchat.com.
Running GCG requires a separate environment with fschat==0.2.20 (it does not work with the latest fschat).
Install GCG by:
cd GCG
pip install -e .
Run an individual attack by:
cd GCG/experiments
python -u main.py \
--config="configs/individual_{target_model}.py" \
--config.train_data="../../data/harmful_behaviors_custom.csv" \
--config.result_prefix="output/attacks/gcg/gcg_individual_{target_model}"
{target_model} can be chosen from vicuna_13B and llama2_13B.
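For example, for Vicuna-13B, the template above becomes:
python -u main.py \
--config="configs/individual_vicuna_13B.py" \
--config.train_data="../../data/harmful_behaviors_custom.csv" \
--config.result_prefix="output/attacks/gcg/gcg_individual_vicuna_13B"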
Run the transfer attacks against Vicuna and Guanaco:
cd GCG/experiments
# Vicuna-7B and Vicuna-13B, with the first random seed
# (2 GPUs are needed for the 2 models, respectively.)
python main.py \
--config configs/transfer_vicuna.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_1
# Vicuna-7B and Vicuna-13B, with the second random seed
# (2 GPUs are needed for the 2 models, respectively.)
python main.py \
--config configs/transfer_vicuna.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_2
# Vicuna-7B, Vicuna-13B, Guanaco-7B, Guanaco-13B
# (4 GPUs are needed for the 4 models, respectively.)
python main.py \
--config configs/transfer_vicuna_guanaco.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_guanaco
Then post-process the raw GCG outputs for inference:
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_individual_vicuna_13B*.json \
--output_file output/attacks/gcg_processed/gcg_individual_vicuna_13B.json
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_individual_llama2_13B*.json \
--output_file output/attacks/gcg_processed/gcg_individual_llama2_13B.json
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_1*.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_1.json
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_2*.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_2.json
python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_guanaco.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_guanaco.json
Install additional dependencies for running PAIR:
cd PAIR
pip install -r requirements.txt
cd ..
Run wandb login beforehand to log in to your WandB account and log your data.
We run PAIR by specifying a target_model chosen from vicuna-13b-v1.5, llama-2-13b, and gpt-3.5-turbo.
We can optionally run PAIR with a defense method by setting --defense_method.
See the options for --defense_method in the "Inference" section below.
python -m attacks.pair \
--load_data_path ./data/harmful_behaviors_custom.json \
--save_result_path {attack_output} \
--target_model {target_model} \
--defense_method {defense_method}
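For example, to run PAIR against Vicuna-13B defended by backtranslation (the output path here is only illustrative):
python -m attacks.pair \
--load_data_path ./data/harmful_behaviors_custom.json \
--save_result_path output/attacks/pair_vicuna_backtranslation.json \
--target_model vicuna-13b-v1.5 \
--defense_method backtranslation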
Run the AutoDAN attack by:
python attacks/autodan.py \
--harmful_behavior_path ./data/harmful_behaviors_custom.json \
--save_output_path {attack_output} \
--target_model {target_model}
{target_model} can be chosen from vicuna-13b-v1.5 and llama-2-13b.
For gpt-3.5-turbo, we adopt a transfer attack setting where the attacks are generated using vicuna-13b-v1.5 as the target model.
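For example, to run AutoDAN against Vicuna-13B (the output path here is only illustrative):
python attacks/autodan.py \
--harmful_behavior_path ./data/harmful_behaviors_custom.json \
--save_output_path output/attacks/autodan_vicuna.json \
--target_model vicuna-13b-v1.5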
To run the attack and inference for GCG, fschat==0.2.20 is required (a separate virtual environment is recommended).
Run inference by:
python run_inference.py \
--load_data_path {attack_output} \
--save_result_path {inference_output} \
--target_model {target_model} \
--defense_method {defense_method} \
--prompt_key {key_of_jailbreaking_prompt} \
--target_max_n_tokens {target_max_n_tokens} \
--verbose
- {attack_output} is the JSON output of the attack.
- {target_model} can be chosen from vicuna-13b-v1.5, llama-2-13b, gpt-3.5-turbo, and gpt-3.5-turbo-0301 (for AutoDAN).
- {defense_method} can be chosen from None (no defense), SmoothLLM, paraphrase_prompt (paraphrasing), response_check (response check), and backtranslation.
  - For backtranslation, we can specify the threshold $\gamma$ by setting --backtranslation_threshold {gamma}.
  - For response_check, we can specify the threshold by setting --response_check_threshold {threshold}.
- {key_of_jailbreaking_prompt} is the JSON key of the jailbreaking prompt in {attack_output}.
- {target_max_n_tokens} is the maximum number of tokens in the inference output.
  - For AdvBench data from data/harmful_behaviors_custom.json, we use --target_max_n_tokens 300.
  - For MT-Bench data from benign_behaviors_custom.json, we use --target_max_n_tokens 1024.
To run inference without jailbreaking, set --prompt_key goal and --load_data_path data/harmful_behaviors_custom.json.
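For example, to run the backtranslation defense on the AdvBench prompts without a jailbreaking attack (the output path here is only illustrative):
python run_inference.py \
--load_data_path data/harmful_behaviors_custom.json \
--save_result_path output/inference/no_attack_vicuna_backtranslation.json \
--target_model vicuna-13b-v1.5 \
--defense_method backtranslation \
--prompt_key goal \
--target_max_n_tokens 300 \
--verbose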
Finally, we run a judge on the inference output:
python run_judge.py \
--judge_name {judge_model} \
--load_data_path {inference_output} \
--save_result_path {judge_output} \
--judge_max_n_tokens {max_n_tokens} \
--goal_key {goal_key} \
--response_key {response_key} \
--verbose
judge_name is the name of the judge method with an optional judge model: [judge_method]@[judge_model].
Options for judge_method include:
- pair: the judge introduced in PAIR, with --judge_max_n_tokens 10.
- quality: the judge used to rate the quality of a response, introduced in MT-Bench, with --judge_max_n_tokens 2048.
- openai_policy: the judge adopted by PAP, with --judge_max_n_tokens 1024.
- gcg_matching: the judge adopted by GCG.
goal_key and response_key are the keys of the jailbreaking attack goals and the responses to be judged in the inference output JSON file.
Options for judge_model include gpt-4 and gpt-3.5.
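For example, to judge jailbreaking success with the PAIR judge and GPT-4 (the paths are only illustrative, and the goal and response values for --goal_key and --response_key are assumptions that should match the keys in your inference output file):
python run_judge.py \
--judge_name pair@gpt-4 \
--load_data_path output/inference/no_attack_vicuna_backtranslation.json \
--save_result_path output/judge/no_attack_vicuna_backtranslation_pair.json \
--judge_max_n_tokens 10 \
--goal_key goal \
--response_key response \
--verbose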