
Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering.


Zhang-Yihao/Adversarial-Representation-Engineering


Adversarial Representation Engineering (NeurIPS 2024)

This is the official implementation repository for the paper Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models (NeurIPS 2024).

arXiv: pdf. See details below.

Authors: Yihao Zhang, Zeming Wei, Jun Sun, Meng Sun.

Introduction

This minimal-scale demo is still in the testing phase. It provides the implementation for two sections of the paper: Section 5.1, "Alignment: To Generate (Harmful Responses) or Not to Generate", and Section 5.2, "Hallucination: To Hallucinate or Not to Hallucinate".

Setup

Parameters are currently hardcoded in main.py. To change them, edit main.py directly. Command-line support via argparse is planned.
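Until command-line options are available, one way to expose the hardcoded parameters could look like the sketch below. It is only an illustration of the argparse pattern, not the repository's actual interface: `model_path` is the one name taken from this README, while `--epochs`, `--lr`, and all default values are hypothetical placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser. Every flag except --model_path is a hypothetical placeholder."""
    parser = argparse.ArgumentParser(description="Sketch of a CLI for main.py")
    # model_path is named in this README; the default here is a placeholder.
    parser.add_argument("--model_path", type=str, default="path/to/model")
    # Hypothetical hyperparameters, included only to illustrate the pattern.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--lr", type=float, default=1e-4)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv (a list of strings); None falls back to sys.argv[1:]."""
    return build_parser().parse_args(argv)


# Example: pass an explicit argument list instead of reading the real command line.
example_args = parse_args(["--model_path", "my/model", "--epochs", "3"])
print(vars(example_args))
```

Keeping the parser in its own function makes it easy to call from main.py and to test with an explicit argument list.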

Execution

Currently, you can run the program by executing:

python main.py

You can change the model by modifying model_path in main.py. Note that the default parameters may not be suitable for larger models and may need to be adjusted for your setup. A demo for decreasing hallucination is provided in hallucination.ipynb.

Dependencies

Install the necessary libraries, including:

transformers
torch>=2.0
numpy
datasets
peft
pandas
tqdm
scikit-learn (imported as sklearn)
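Before running main.py, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the check does not verify versions (for example, it will not confirm torch>=2.0). The import name `sklearn` corresponds to the scikit-learn package.

```python
import importlib.util

# Import names for the dependencies listed above.
# Note: the scikit-learn package is imported as "sklearn".
REQUIRED = [
    "transformers",
    "torch",
    "numpy",
    "datasets",
    "peft",
    "pandas",
    "tqdm",
    "sklearn",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed dependencies appear to be installed.")
```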

Additional Information

More code and details will be available upon publication of our paper. Code for processing the TruthfulQA dataset is partly borrowed from This Repo.

Citation

@InProceedings{zhang2024towards,
  title={Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models},
  author={Zhang, Yihao and Wei, Zeming and Sun, Jun and Sun, Meng},
  booktitle = {NeurIPS},
  year={2024}
}
