Defending LLMs against Jailbreaking Attacks via Backtranslation

arXiv: https://arxiv.org/abs/2402.16459

We propose a new method for defending LLMs against jailbreaking attacks via "backtranslation". Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that could have led to that response. The inferred prompt, called the backtranslated prompt, tends to reveal the actual intent of the original prompt, since it is generated from the LLM's response and is not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. The proposed defense provides several benefits in both effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines, particularly on cases that are hard for the baselines, while having little impact on the generation quality for benign input prompts.

Bibtex for our paper:

@article{wang2024defending,
  title={Defending LLMs against Jailbreaking Attacks via Backtranslation},
  author={Wang, Yihan and Shi, Zhouxing and Bai, Andrew and Hsieh, Cho-Jui},
  journal={arXiv preprint arXiv:2402.16459},
  year={2024}
}

Setup

Our implementation is based on the llm-jailbreaking-defense library developed by us. The library provides general interfaces for wrapping an LLM with a jailbreaking defense, including the backtranslation defense proposed in our paper. Please install the library first, following its setup guide.
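
As a rough sketch, installing the library from source might look like the following; the repository URL and editable-install command are assumptions, and the library's own setup guide is authoritative:

# Assumed repository location and install method; defer to the library's setup guide if it differs.
git clone https://github.com/YihanWang617/llm-jailbreaking-defense
cd llm-jailbreaking-defense
pip install -e .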

Attacks

We consider attacks including GCG, PAIR, AutoDAN, PAP, and prompts from jailbreakchat.com.

GCG

Setup

Running GCG requires a separate environment with fschat==0.2.20 (it doesn't work with the latest fschat).

Install GCG by:

cd GCG
pip install -e .

Individual attacks

Run an individual attack by:

cd GCG/experiments
python -u main.py \
--config="configs/individual_{target_model}.py" \
--config.train_data="../../data/harmful_behaviors_custom.csv" \
--config.result_prefix="output/attacks/gcg/gcg_individual_{target_model}"

{target_model} can be chosen from vicuna_13B and llama2_13B.
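
For example, an individual attack on Vicuna-13B uses the configuration configs/individual_vicuna_13B.py, and the result prefix below matches the post-processing commands later in this README:

cd GCG/experiments
python -u main.py \
--config="configs/individual_vicuna_13B.py" \
--config.train_data="../../data/harmful_behaviors_custom.csv" \
--config.result_prefix="output/attacks/gcg/gcg_individual_vicuna_13B"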

Transfer attacks

Run the transfer attacks against Vicuna and Guanaco:

cd GCG/experiments

# Vicuna-7B and Vicuna-13B, with the first random seed
# (2 GPUs are needed for the 2 models, respectively.)
python main.py \
--config configs/transfer_vicuna.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_1

# Vicuna-7B and Vicuna-13B, with the second random seed
# (2 GPUs are needed for the 2 models, respectively.)
python main.py \
--config configs/transfer_vicuna.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_2

# Vicuna-7B, Vicuna-13B, Guanaco-7B, Guanaco-13B
# (4 GPUs are needed for the 4 models, respectively.)
python main.py \
--config configs/transfer_vicuna_guanaco.py \
--config.train_data ../../data/harmful_behaviors_custom.csv \
--config.result_prefix output/attacks/gcg/gcg_transfer_vicuna_guanaco

Post-processing the attack outputs

python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_individual_vicuna_13B*.json \
--output_file output/attacks/gcg_processed/gcg_individual_vicuna_13B.json

python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_individual_llama2_13B*.json \
--output_file output/attacks/gcg_processed/gcg_individual_llama2_13B.json

python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_1*.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_1.json

python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_2*.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_2.json

python attacks/gcg.py \
--input_file output/attacks/gcg/gcg_transfer_vicuna_guanaco.json \
--output_file output/attacks/gcg_processed/gcg_transfer_vicuna_guanaco.json

PAIR

Setup

Install additional dependencies for running PAIR:

cd PAIR
pip install -r requirements.txt
cd ..

Run wandb login beforehand to log in to your WandB account so that your data is logged.

Run Attack

We run PAIR by specifying a target_model chosen from vicuna-13b-v1.5, llama-2-13b, and gpt-3.5-turbo.

We can optionally run PAIR with a defense method by setting --defense_method. See options for --defense_method in the "Inference" section below.

python -m attacks.pair \
--load_data_path ./data/harmful_behaviors_custom.json \
--save_result_path {attack_output} \
--target_model {target_model} \
--defense_method {defense_method}
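
For example, to run PAIR against vicuna-13b-v1.5 defended by backtranslation (the output path below is only illustrative):

# The save path below is an example output location.
python -m attacks.pair \
--load_data_path ./data/harmful_behaviors_custom.json \
--save_result_path output/attacks/pair/pair_vicuna_backtranslation.json \
--target_model vicuna-13b-v1.5 \
--defense_method backtranslation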

AutoDAN

Run the AutoDAN attack by:

python attacks/autodan.py \
--harmful_behavior_path ./data/harmful_behaviors_custom.json \
--save_output_path {attack_output} \
--target_model {target_model}

{target_model} can be chosen from: vicuna-13b-v1.5 and llama-2-13b. For gpt-3.5-turbo, we adopt a transfer attack setting where the attacks are generated using vicuna-13b-v1.5 as the target model.
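
As a sketch of the transfer setting described above: generate the AutoDAN prompts with vicuna-13b-v1.5 as the target model, then run inference on a GPT-3.5 target with the resulting attack output (the output path below is only illustrative):

# Generate AutoDAN prompts using vicuna-13b-v1.5 as the target model.
# The save path below is an example output location.
python attacks/autodan.py \
--harmful_behavior_path ./data/harmful_behaviors_custom.json \
--save_output_path output/attacks/autodan/autodan_vicuna.json \
--target_model vicuna-13b-v1.5
# Then pass the saved file as {attack_output} to run_inference.py
# with a gpt-3.5-turbo target model (see the "Inference" section below).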

Inference

To run the attack and inference for GCG, fschat==0.2.20 is required (a separate virtual environment is recommended). Run inference by:

python run_inference.py \
--load_data_path {attack_output} \
--save_result_path {inference_output} \
--target_model {target_model} \
--defense_method {defense_method} \
--prompt_key {key_of_jailbreaking_prompt} \
--target_max_n_tokens {target_max_n_tokens} \
--verbose
  • {attack_output} is the JSON output of the attack.
  • {target_model} can be chosen from vicuna-13b-v1.5, llama-2-13b, gpt-3.5-turbo and gpt-3.5-turbo-0301 (for AutoDAN).
  • {defense_method} can be chosen from None (No defense), SmoothLLM, paraphrase_prompt (Paraphrasing), response_check (Response check), and backtranslation.
    • For backtranslation, we can specify the threshold $\gamma$ by setting --backtranslation_threshold {gamma}.
    • For response_check, we can specify the threshold by setting --response_check_threshold {threshold}.
  • {key_of_jailbreaking_prompt} is the JSON key of the jailbreaking prompt in {attack_output}.
  • {target_max_n_tokens} is the maximum number of tokens in the inference output.
    • For AdvBench data from data/harmful_behaviors_custom.json, we use --target_max_n_tokens 300.
    • For MT-Bench data from benign_behaviors_custom.json, we use --target_max_n_tokens 1024.

To run inference without jailbreaking, set --prompt_key goal and --load_data_path data/harmful_behaviors_custom.json.
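
For example, the following runs vicuna-13b-v1.5 with the backtranslation defense on the harmful behaviors without any jailbreaking attack (the output path is only illustrative):

# The save path below is an example output location.
python run_inference.py \
--load_data_path data/harmful_behaviors_custom.json \
--save_result_path output/inference/no_attack_vicuna_backtranslation.json \
--target_model vicuna-13b-v1.5 \
--defense_method backtranslation \
--prompt_key goal \
--target_max_n_tokens 300 \
--verbose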

Judge

Finally, we run a judge on the inference output:

python run_judge.py \
--judge_name {judge_name} \
--load_data_path {inference_output} \
--save_result_path {judge_output} \
--judge_max_n_tokens {max_n_tokens} \
--goal_key {goal_key} \
--response_key {response_key} \
--verbose

judge_name is the name of the judge method with an optional judge model: [judge_method]@[judge_model]. Options for judge_method include:

  • pair: The judge introduced in PAIR with --judge_max_n_tokens 10.
  • quality: The judge used to rate the quality of a response, introduced in MT-Bench with --judge_max_n_tokens 2048.
  • openai_policy: The judge adopted by PAP with --judge_max_n_tokens 1024.
  • gcg_matching: The judge adopted by GCG.

goal_key and response_key are the keys of the jailbreaking attack goals and the responses to be judged in the inference output JSON file.

Options for judge_model include gpt-4 and gpt-3.5.
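
For example, to judge an inference output with the PAIR judge backed by GPT-4 (the file paths are illustrative, and the goal/response key names are assumptions that depend on the keys in your inference output file):

# File paths are example locations; goal_key/response_key must match your inference output.
python run_judge.py \
--judge_name pair@gpt-4 \
--load_data_path output/inference/no_attack_vicuna_backtranslation.json \
--save_result_path output/judge/no_attack_vicuna_backtranslation_judge.json \
--judge_max_n_tokens 10 \
--goal_key goal \
--response_key response \
--verbose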
