- Create a virtual environment with Python 3.9, using `requirements.txt`:

  ```
  conda create -n <env_name> python=3.9
  conda activate <env_name>
  pip install -r requirements.txt
  ```
- Depending on which model you want to run, pick the corresponding config file in `configs/`. Then change the filepaths in your config file to match your local setup.
- Edit the config file to set the parameters for the projection edit method. An illustrative sketch of a config file is shown after this list.
  - `cuda_visible_devices` is a comma-separated list of GPU ids to use.
  - Model configurations:
    - `model_name`: Currently supports `gpt2`, `mistral`, `zephyr-sft`, `opt`, `gptj`.
    - `save_edited_model`: If True, saves the edited model.
    - `save_model_name`: Str. The name to save the edited model under.
  - Dataset configurations:
  - Configurations to find P_toxic:
    - `pref_data_dps`: How many datapoints to use for calculating the preference matrices.
    - `centering`: If True, the preference matrix is projected away from the first singular vector of the preferred embeddings.
  - Edit configurations:
    - `edit_keys`: If True, edits the keys of the MLP layer (not recommended; does not reduce toxicity).
    - `edit_values`: If True, edits the values of the MLP layer.
    - `lowest_layer_to_edit`: The lowest layer to edit (zero-indexed). If -1, all layers are edited.
    - `highest_layer_to_edit`: The highest layer to edit. If -1, all layers are edited.
  - Evaluation configurations:
    - `return_perplexity`: If True, returns the perplexity of the edited model on the data.
    - `return_toxicity`: If True, returns the toxicity of the edited model on the data.
    - `return_sample_generations`: If True, returns the generations of the edited model on 3 samples.
  - Keys:
    - `hf_token`: Your token for the HuggingFace model hub. Required to access Mistral models.
    - `azure_openai_endpoint` and `azure_openai_api_key`: Required to calculate win-rate using GPT-4 Azure services.
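For reference, the sketch below shows roughly how such a config file could be laid out. It is only an illustration: the section headers, example values, and grouping are assumptions made here for readability, so use the files in `configs/` as the authoritative format.

```ini
; Illustrative sketch only -- section names and values are assumptions.
; Refer to the files in configs/ for the authoritative format.
[general]
cuda_visible_devices = 0,1

[model]
model_name = gpt2
save_edited_model = False
save_model_name = gpt2-edited

[projection]
pref_data_dps = 500
centering = True

[edit]
edit_keys = False
edit_values = True
lowest_layer_to_edit = -1
highest_layer_to_edit = -1

[evaluation]
return_perplexity = True
return_toxicity = True
return_sample_generations = True

[keys]
hf_token = <your_hf_token>
azure_openai_endpoint = <your_azure_endpoint>
azure_openai_api_key = <your_azure_key>
```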
- The file `detox.py` contains the edit method. To apply it and evaluate, run the following command:

  ```
  python baselines/detox_edit.py --config_file <name_of_config_file>
  ```

  For example, if you want to edit the GPT-2 model, run:

  ```
  python baselines/detox_edit.py --config_file gpt2-medium.ini
  ```

  The script will print the results to the console. A conceptual sketch of the underlying projection step is included below.
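For intuition about what the edit does, here is a minimal, self-contained sketch of the projection idea described by the config options above, assuming the preferred side corresponds to the non-toxic continuations. It is not the code in `detox.py`: the function names, the `rank` hyperparameter, and the toy data are all hypothetical.

```python
# Minimal, illustrative sketch of the projection idea (NOT the repo's
# implementation; names and the `rank` hyperparameter are hypothetical).
import torch

def build_p_toxic(preferred, non_preferred, rank=2, centering=True):
    """Estimate a projection matrix onto a 'toxic' subspace from paired embeddings.

    preferred / non_preferred: (n_datapoints, hidden_dim) hidden representations
    of the preferred (non-toxic) and non-preferred (toxic) continuations.
    """
    # Preference matrix: per-pair difference between toxic and non-toxic embeddings.
    pref = non_preferred - preferred
    if centering:
        # Project the preference matrix away from the first singular vector
        # of the preferred embeddings (the `centering` option above).
        _, _, vh = torch.linalg.svd(preferred, full_matrices=False)
        mu = vh[0]                                   # dominant direction, (hidden_dim,)
        pref = pref - torch.outer(pref @ mu, mu)
    # The top right singular vectors of the preference matrix span the subspace
    # associated with toxicity.
    _, _, vh = torch.linalg.svd(pref, full_matrices=False)
    toxic_dirs = vh[:rank]                           # (rank, hidden_dim)
    return toxic_dirs.T @ toxic_dirs                 # P_toxic, (hidden_dim, hidden_dim)

def project_out(weight, p_toxic):
    """Remove the toxic subspace from a weight matrix whose output dimension is
    hidden_dim (nn.Linear convention: weight shape (hidden_dim, fan_in))."""
    eye = torch.eye(p_toxic.shape[0], dtype=weight.dtype)
    return (eye - p_toxic.to(weight.dtype)) @ weight

if __name__ == "__main__":
    torch.manual_seed(0)
    n, hidden, fan_in = 128, 64, 256
    preferred = torch.randn(n, hidden)
    non_preferred = preferred + torch.randn(n, hidden)   # toy "toxic" shift
    p_toxic = build_p_toxic(preferred, non_preferred, rank=2, centering=True)
    w_value = torch.randn(hidden, fan_in)                # toy MLP value weight
    w_edited = project_out(w_value, p_toxic)
    # After editing, the weight has (numerically) no component left
    # in the estimated toxic subspace.
    print(torch.norm(p_toxic @ w_edited))                # ~0
```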
We compare our method against the following baselines:

For each baseline, we either include our implementation in `baselines/` or use the authors' implementation. To run a specific baseline:

```
python baselines/<baseline_of_your_choice>.py --config_file <name_of_config_file>
```
If you find our work useful, please cite our paper:
```bibtex
@inproceedings{uppaal2025profs,
  title={Model editing as a robust and denoised variant of DPO: A case study on toxicity},
  author={Uppaal, Rheeya and Dey, Apratim and He, Yiting and Zhong, Yiqiao and Hu, Junjie},
  booktitle={The Thirteenth International Conference on Learning Representations 2025},
  year={2025}
}
```
We use the preference and evaluation data from:
```bibtex
@article{lee2024mechanistic,
  title={A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity},
  author={Lee, Andrew and Bai, Xiaoyan and Pres, Itamar and Wattenberg, Martin and Kummerfeld, Jonathan K and Mihalcea, Rada},
  journal={arXiv preprint arXiv:2401.01967},
  year={2024}
}
```