# Safety in Pruning

This repository contains code for replicating the experiments from our paper *Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning*.

## Getting Started

Install the dependencies and obtain a Wanda-pruned model checkpoint as described in the original repository.
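For orientation, Wanda scores each weight by its magnitude times the L2 norm of the corresponding input activations over a calibration set, then removes the lowest-scoring weights in each output row. The sketch below illustrates that scoring rule only; it is not the original implementation, and the function name and sparsity default are assumptions.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Zero out the lowest-scoring weights in each output row.

    W: (out_features, in_features) weight matrix
    X: (n_samples, in_features) calibration activations
    The score for weight W[i, j] is |W[i, j]| * ||X[:, j]||_2 (the Wanda metric).
    """
    scores = np.abs(W) * np.linalg.norm(X, axis=0)  # broadcast per input column
    k = int(W.shape[1] * sparsity)                  # weights to drop per row
    pruned = W.copy()
    for i in range(W.shape[0]):
        drop = np.argsort(scores[i])[:k]            # lowest-scoring weights in row i
        pruned[i, drop] = 0.0
    return pruned
```

Note that, unlike global magnitude pruning, the activation norm lets a small weight survive if it multiplies a consistently large input.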

## Generating outputs for our jailbreaking dataset

Run the following command to generate model responses to our jailbreaking dataset (`integrated.yaml`). Set `--template` to `llama`, `vicuna`, or `mistral`, matching the base model, for correct inference.

```shell
python inference.py \
  --model path/to/model \
  --dataset path/to/dataset \
  --template llama|vicuna|mistral
```
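The `--template` flag controls how each prompt is wrapped before it is sent to the model. As an illustration only (the exact template strings here are assumptions, not copied from `inference.py`), a dispatcher might look like:

```python
# Hypothetical prompt wrappers; the real templates live in inference.py.
TEMPLATES = {
    "llama": "[INST] {prompt} [/INST]",
    "vicuna": "USER: {prompt} ASSISTANT:",
    "mistral": "[INST] {prompt} [/INST]",
}

def apply_template(prompt: str, template: str) -> str:
    """Wrap a raw prompt in the chat format the base model was trained on."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    return TEMPLATES[template].format(prompt=prompt)
```

Using the wrong wrapper for a chat model typically degrades response quality, which is why the flag must match the base model.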

## Benchmarking the model

We provide scripts for running several benchmarks. To run the AltQA long-context test or the WikiText perplexity test, run the following. As above, set `--template` to `llama`, `vicuna`, or `mistral`, matching the base model, for correct inference.

```shell
python evaluate.py \
  --model_path path/to/model \
  --output_path path/to/output/directory \
  --template llama|vicuna|mistral \
  --benchmark altqa|wikitext
```
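For reference, WikiText perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch of that final computation (assuming the per-token NLLs have already been collected from the model; this is not the code from `evaluate.py`):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean negative log-likelihood over all tokens)."""
    return math.exp(sum(nlls) / len(nlls))
```

Lower is better: a pruned model whose perplexity stays close to the dense baseline has retained most of its language-modeling ability.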