# Safety in Pruning

This repository contains code for replicating the experiments from our paper *Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning*.

## Getting Started

Install the dependencies and obtain a Wanda-pruned model checkpoint as described in the original repository.
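For orientation, Wanda scores each weight by its magnitude times the L2 norm of the corresponding input activations over a calibration set, then removes the lowest-scoring weights in each output row. The sketch below illustrates that scoring rule only; it is not the original implementation, and the function name and sparsity default are assumptions.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Zero out the lowest-scoring weights in each output row.

    W: (out_features, in_features) weight matrix
    X: (n_samples, in_features) calibration activations
    The score for weight W[i, j] is |W[i, j]| * ||X[:, j]||_2 (the Wanda metric).
    """
    scores = np.abs(W) * np.linalg.norm(X, axis=0)  # broadcast per input column
    k = int(W.shape[1] * sparsity)                  # weights to drop per row
    pruned = W.copy()
    for i in range(W.shape[0]):
        drop = np.argsort(scores[i])[:k]            # lowest-scoring weights in row i
        pruned[i, drop] = 0.0
    return pruned
```

Note that, unlike global magnitude pruning, the activation norm lets a small weight survive if it multiplies a consistently large input.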

## Generating outputs for our jailbreaking dataset

Run the following command to generate model responses to our jailbreaking dataset (`integrated.yaml`). Set `--template` to `llama`, `vicuna`, or `mistral`, matching the base model, for correct inference.

```shell
python inference.py \
  --model path/to/model \
  --dataset path/to/dataset \
  --template llama|vicuna|mistral
```
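The `--template` flag controls how each prompt is wrapped before it is sent to the model. As an illustration only (the exact template strings here are assumptions, not copied from `inference.py`), a dispatcher might look like:

```python
# Hypothetical prompt wrappers; the real templates live in inference.py.
TEMPLATES = {
    "llama": "[INST] {prompt} [/INST]",
    "vicuna": "USER: {prompt} ASSISTANT:",
    "mistral": "[INST] {prompt} [/INST]",
}

def apply_template(prompt: str, template: str) -> str:
    """Wrap a raw prompt in the chat format the base model was trained on."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    return TEMPLATES[template].format(prompt=prompt)
```

Using the wrong wrapper for a chat model typically degrades response quality, which is why the flag must match the base model.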

## Benchmarking the model

We provide scripts for running several benchmarks. To run the AltQA long-context test or the WikiText perplexity test, run the following. As above, set `--template` to `llama`, `vicuna`, or `mistral`, matching the base model, for correct inference.

```shell
python evaluate.py \
  --model_path path/to/model \
  --output_path path/to/output/directory \
  --template llama|vicuna|mistral \
  --benchmark altqa|wikitext
```
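For reference, WikiText perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch of that final computation (assuming the per-token NLLs have already been collected from the model; this is not the code from `evaluate.py`):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean negative log-likelihood over all tokens)."""
    return math.exp(sum(nlls) / len(nlls))
```

Lower is better: a pruned model whose perplexity stays close to the dense baseline has retained most of its language-modeling ability.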