Representation Space Jailbreak

This repo contains the code used in our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis. (arXiv)

Install Dependencies

Create a new virtual environment with conda:

conda create -n YOUR_ENVIRONMENT_NAME python=3.11

Switch to it:

conda activate YOUR_ENVIRONMENT_NAME

Install pip requirements:

WARNING: It is possible that the following command installs the CPU version of PyTorch. In that case, remove the torch=... line in requirements.txt, and install PyTorch+CUDA manually.

pip install -r requirements.txt

(Optional) By default fp16 is enabled for model loading. If your hardware supports FlashAttention2 (requires >= Ampere architecture GPU), you may want to enable it to speed up the process. In that case, you will need flash-attn (DON'T DO THIS IF YOUR GPU ARCHITECTURE IS LOWER THAN AMPERE!):

conda install cuda-toolkit -c nvidia
pip install flash-attn --no-build-isolation
# If the second line above gives an error, try again with this:
# MAX_JOBS=4 pip install flash-attn==2.4.* --no-build-isolation --verbose --no-cache-dir --force-reinstall --no-deps

If you don't have FlashAttention2 installed, that's okay. It'll automatically resolve back to the good old attention implementation.

Usage

All the launcher scripts are located in ./scripts except SLURM related ones.

If you are using SLURM to submit jobs, modify ./sbatch_direct_submit_{EXPERIMENT}.sh to your need.

IMPORTANT: For black-box models, you should provide your own OpenAI API key in ./.env (create if not exist). See ./.env_template for an example. You can apply for an OpenAI API key at their website.

Main Experiments (Section 5.2)
- Baseline and Ours
Run ./sbatch_direct_submit_{EXPERIMENT}.sh to submit SLURM jobs. Alternatively, run ./scripts/directrun_{EXPERIMENT}.sh [PARAMETERS...] with parameters (see contents in the scripts to determine) to run experiments on local machine.

By default, individual results of each prompt will be output to ./results/{METHOD_NAME}/{MODEL_NICKNAME}/ as JSON files. Run ./scripts/merge_result.sh to merge all JSONs into one CSV and calculate the ASR. Modify the merging script to your need.
- Clean
Run ./scripts/generate_clean.sh. It runs on local machine (does not submit SLURM jobs).
- DAN
Run ./scripts/attack_dan{_api}.sh. Modify the parameters in the scripts to your need.

For black-box models, gpt-3.5-turbo-0125 or gpt-4-0125-preview are used in our experiments. Replace the --model_target argument to reproduce our results.
Visualization (Section 3)

After you obtained jailbreak results by running the main experiment, use ./tools/extract_visualization_from_result.ipynb to generate visualization datasets from your results. Modify the filepaths in the jupyter notebook.

Run ./scripts/visualize_anchored.sh. Modify the parameters in the scripts to your need.

In the following cases, change python visualizer_anchored.py to python visualizer_anchored_{var, var_first2comp, emptydatasets}.py in this script:
- visualizer_anchored_var.py: Calculates the overall between-class/within-class variance ratio.
- visualizer_anchored_var_first2comp.py: Calculates the between-class/within-class variance ratio over the first 2 principal components/the other n_components - 2 principal components. Set n_components = 200 to approximate the "full" dimensions (~1.0 PCA explained variance ratio over first 200 principal components). Actual full dimensions will suffer from the curse of dimensions and produce NaN values.
- visualizer_anchored_emptydatasets.py: Supports visualizaion with no --datasets provided, or some of the datasets containing no samples. Especially useful when you want to visualize only the anchor datasets, or your attack produces 0%/100% ASR in some categories.
Defense (Section 5.3)

Run ./scripts/defense_{perplexity, paraphrase}.sh. Modify the parameters in the scripts to your need.
Transfer Attack (Section 5.4)

Run ./scripts/transfer{_api}.sh. Modify the parameters in the scripts to your need.

For black-box models, gpt-3.5-turbo-0125 or gpt-4-0125-preview are used in our experiments. Replace the --model_target argument to reproduce our results.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
scripts		scripts
tools		tools
.env_template		.env_template
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
defense_paraphrase.py		defense_paraphrase.py
defense_perplexity.py		defense_perplexity.py
download_model.py		download_model.py
evaluate_llm.py		evaluate_llm.py
evaluate_stringmatch.py		evaluate_stringmatch.py
generate_clean_response.py		generate_clean_response.py
generate_transfer_response.py		generate_transfer_response.py
generate_transfer_response_api.py		generate_transfer_response_api.py
jailbreak.py		jailbreak.py
jailbreak_autodan.py		jailbreak_autodan.py
jailbreak_dan.py		jailbreak_dan.py
jailbreak_dan_api.py		jailbreak_dan_api.py
jailbreak_gcg.py		jailbreak_gcg.py
jailbreak_ours_autodan.py		jailbreak_ours_autodan.py
jailbreak_ours_gcg.py		jailbreak_ours_gcg.py
merge_result.py		merge_result.py
requirements.txt		requirements.txt
sbatch_direct_submit_autodan.sh		sbatch_direct_submit_autodan.sh
sbatch_direct_submit_gcg.sh		sbatch_direct_submit_gcg.sh
sbatch_direct_submit_ours_autodan.sh		sbatch_direct_submit_ours_autodan.sh
sbatch_direct_submit_ours_gcg.sh		sbatch_direct_submit_ours_gcg.sh
utils.py		utils.py
visualize_gcg_trace.py		visualize_gcg_trace.py
visualizer_anchored.py		visualizer_anchored.py
visualizer_anchored_emptydatasets.py		visualizer_anchored_emptydatasets.py
visualizer_anchored_var.py		visualizer_anchored_var.py
visualizer_anchored_var_first2comp.py		visualizer_anchored_var_first2comp.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Representation Space Jailbreak

Install Dependencies

Usage

About

Releases

Packages

Languages

License

yuplin2333/representation-space-jailbreak

Folders and files

Latest commit

History

Repository files navigation

Representation Space Jailbreak

Install Dependencies

Usage

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages