Detoxifying Large Language Models via Knowledge Editing
Overview • DINM • How to Run • NLPCC 2024 • Citation • Paper • Website
Detoxifying an LLM aims to build a safe and trustworthy large language model (LLM). Knowledge editing makes permanent adjustments to specific behaviors of a model without compromising its overall performance. Detoxifying an LLM via knowledge editing therefore leverages a small amount of data, usually a single instance, to correct the toxic behaviors of the LLM, so that the edited LLM can defend against various malicious inputs.
We extend the evaluation metrics to Defense Success (DS), Defense Generalization (DG), and General Performance.

- Defense Success (DS): the detoxification success rate of the edited LLM on the adversarial input (attack prompt + harmful question) that is used to modify the LLM.
- Defense Generalization (DG): the detoxification success rate of the edited LLM on out-of-domain (OOD) malicious inputs.
  - DG of only harmful question ($\mathrm{DG}_\text{onlyQ}$): the detoxification success rate on the harmful question alone.
  - DG of other attack prompts ($\mathrm{DG}_\text{otherA}$): the detoxification success rate on unseen attack prompts.
  - DG of other harmful questions ($\mathrm{DG}_\text{otherQ}$): the detoxification success rate on unseen harmful questions.
  - DG of other attack prompts and questions ($\mathrm{DG}_\text{otherAQ}$): the detoxification success rate on unseen attack prompts and harmful questions.
- General Performance: the side effects on unrelated tasks.
We evaluate DS and DG with the SafeEdit-Safety-Classifier; its usage is detailed in Safety Classifier Preparation. The statistics for Fluency can be obtained via our EasyEdit. We evaluate KQA and CSM with OpenCompass.
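Concretely, each defense metric is the fraction of generated responses that the SafeEdit-Safety-Classifier judges to be safe. For a test set of $N$ malicious inputs (the notation below is ours, added for clarity):

$$\mathrm{DS} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[f_{\text{cls}}(y_i) = \text{safe}\big],$$

where $f_{\text{cls}}$ denotes the SafeEdit-Safety-Classifier and $y_i$ is the edited model's response to the $i$-th adversarial input; the DG variants are computed in the same way over their respective OOD inputs.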
Inspired by intraoperative neurophysiological monitoring, we design a simple yet effective knowledge editing baseline called Detoxifying with Intraoperative Neural Monitoring (DINM). DINM uses an instance to locate and edit toxic regions of the LLM.
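As a rough illustration of the locating step (a simplified sketch based on our reading of the method, not the actual EasyEdit implementation; model, tokenizer, and the response strings are placeholders), one can contrast the hidden states produced by a safe and an unsafe response to the same adversarial input and treat the layer where they diverge most as the toxic layer:

```python
import torch

def locate_toxic_layer(model, tokenizer, adversarial_input, safe_resp, unsafe_resp):
    """Sketch: pick the layer whose hidden states differ most between the
    safe and the unsafe continuation of the same adversarial input."""
    def layer_states(text):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # one vector per layer: hidden states averaged over the sequence
        return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

    safe_states = layer_states(adversarial_input + safe_resp)
    unsafe_states = layer_states(adversarial_input + unsafe_resp)
    distances = [torch.dist(s, u).item() for s, u in zip(safe_states, unsafe_states)]
    return max(range(len(distances)), key=distances.__getitem__)  # candidate toxic layer
```

DINM then edits the located toxic region using the editing instance; the full procedure is implemented in EasyEdit.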
To get started, simply install conda and run:
```shell
git clone https://github.com/zjunlp/EasyEdit.git
conda create -n EasyEdit python=3.9.7
...
conda activate EasyEdit
pip install -r requirements.txt
```
❗️❗️ If you intend to use Mistral, please update the transformers library to version 4.34.0 manually: pip install transformers==4.34.0.
Dataset for detoxifying LLM via knowledge editing: SafeEdit. You can download it from Hugging Face and put the data in the folder "./data". "SafeEdit_test.json" is the test file and contains 1350 instances.
The SafeEdit-Safety-Classifier, which we use for judgment, is hosted on Hugging Face. You can load the safety classifier as follows:
```python
from transformers import RobertaForSequenceClassification, RobertaTokenizer

safety_classifier_dir = 'zjunlp/SafeEdit-Safety-Classifier'
safety_classifier_model = RobertaForSequenceClassification.from_pretrained(safety_classifier_dir)
safety_classifier_tokenizer = RobertaTokenizer.from_pretrained(safety_classifier_dir)
```
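As a minimal sketch of how the classifier might be applied (this is not the exact evaluation code shipped with EasyEdit, and the mapping from class ids to safe/unsafe should be checked against the model card), you can score a batch of generated responses and take the fraction judged safe as the detoxification success rate:

```python
import torch

def judge_responses(responses):
    """Return the predicted class id for each generated response."""
    inputs = safety_classifier_tokenizer(responses, padding=True, truncation=True,
                                         max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = safety_classifier_model(**inputs).logits
    return logits.argmax(dim=-1).tolist()

labels = judge_responses(["I cannot help with that request.",
                          "Sure, here is how to ..."])
# Map the class ids to safe/unsafe according to the model card, then, e.g.,
# DS = (number of responses judged safe) / (number of responses).
```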
You can also download SafeEdit-Safety-Classifier and save the judgment model to a path of your own. When running run_safety_editing.py, you only need to pass that path as safety_classifier_dir to use this classifier.
Before running the program, ensure that the necessary files are present and properly set up, specifically the ./data and ./hparams directories.
Our method supports multi-GPU editing. To enable it, set model_parallel to true in the configuration file ../hparams/DINM/mistral-7b.yaml. Then run:
```shell
python run_safety_editing.py --editing_method=DINM --edited_model=mistral-7b --hparams_dir=../hparams/DINM/mistral-7b --safety_classifier_dir=zjunlp/SafeEdit-Safety-Classifier --metrics_save_dir=../safety_results
```
❗️❗️ You can download SafeEdit-Safety-Classifier manually to your own path and set safety_classifier_dir to that local path. Then, you can obtain the evaluation results for DS, DG, and Fluency in ../safety_results. For the KQA and CSM evaluations, please use OpenCompass.
❗️❗️ A friendly reminder: if you use the SafeEdit dataset for evaluation, it is recommended to set max_output_length to 600 in mistral-7b.yaml (or in your own .yaml file if you are not using mistral-7b.yaml). For some role-playing attack prompts, LLMs may initially generate safe responses and then suddenly produce toxic text, so the maximum context length of some LLMs may not suffice; in that case you can truncate the input, keeping the rightmost tokens, since harmful questions typically appear on the right.
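If you do truncate, a minimal sketch with a Hugging Face tokenizer is to truncate on the left so that the rightmost part of the prompt, where the harmful question usually sits, is preserved (tokenizer, adversarial_input, and the max_length value here are placeholders):

```python
# Keep the rightmost tokens of an over-long adversarial input.
tokenizer.truncation_side = "left"   # drop tokens from the left, keep the right end
inputs = tokenizer(adversarial_input, truncation=True,
                   max_length=1024,   # placeholder; set to your model's context budget
                   return_tensors="pt")
```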
Here is a demo of detoxifying Mistral-7B-v0.1 on a single A800 GPU using DINM. You can download the demo video and use SafeEdit_demo to get started quickly.
- Click the button Edit: DINM uses an instance to locate and edit the toxic regions of Mistral-7B-v0.1. We then obtain the toxic layer of Mistral-7B-v0.1 and the edited Mistral-7B-v0.1.
- Click the button Generate of Defense Success: the edited Mistral-7B-v0.1 generates a response to the adversarial input, which is used for the Defense Success metric.
- Click the button Generate of Defense Generalization: the edited Mistral-7B-v0.1 generates a response to an out-of-domain malicious input, which is used for the Defense Generalization metric.
Please refer to this link for the code of SFT and DPO.
For the DINM method, you should first complete the Data Preparation.
Second, move the file train_DINM_for_NLPCC.py to ./ (we will later modify the code so that it runs from its current directory), and run:
```shell
python train_DINM_for_NLPCC.py --hparams_dir ./hparams/DINM/llama-7b --results_save_dir ./safety_results
```
To evaluate the detoxifying performance of the edited model, move the file test_detoxify_generate_for_NLPCC.py to ./ (we will later modify the code so that it runs from its current directory), and run:
```shell
python test_detoxify_generate.py --edited_LLM_ckpt ./safety_results/dinm_llama2-chat --tok_ckpt ./hugging_cache/llama-2-7b --results_save_dir ./safety_results
```
❗️❗️ Please set max_output_length to 600 in llama-7b.yaml. For some role-playing attack prompts, LLMs may initially generate safe responses and then suddenly produce toxic text, so you should set a sufficiently large max_output_length to evaluate the safety of the LLM.
Please cite our paper if you use SafeEdit, SafeEdit-Safety-Classifier, or DINM in your work.
```bibtex
@misc{wang2024SafeEdit,
  title={Detoxifying Large Language Models via Knowledge Editing},
  author={Mengru Wang and Ningyu Zhang and Ziwen Xu and Zekun Xi and Shumin Deng and Yunzhi Yao and Qishen Zhang and Linyi Yang and Jindong Wang and Huajun Chen},
  year={2024},
  eprint={2403.14472},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
We are deeply grateful to Yue Zhang from Westlake University and Xing Xie from Microsoft Research Asia for their insightful feedback and constructive suggestions, which greatly enhanced the quality of this paper. We would also like to express our heartfelt gratitude to Minlie Huang and his team members from Tsinghua University for their contributions to safety benchmarks and assessment, to Tatsunori B. Hashimoto and his team for their contributions to instruction-following data, and to Jiahao Yu, Yang Li, Shujian Huang, Danqi Chen, and Jacob Steinhardt for their contributions to security attack techniques. We utilize portions of their attack prompts and unsafe categories in this paper and express sincere gratitude. We also extend our thanks to Andrew Lee; inspired by his research, we delve into a preliminary mechanistic analysis of SFT, DPO, and our DINM. Finally, we extend special thanks to Zhexin Zhang from Tsinghua University for providing valuable insights on conducting fair comparisons between traditional methods and knowledge editing methods in our experiments.