Zhen Xiang1 ,
Fengqing Jiang2 ,
Zidi Xiong1
Bhaskar Ramasubramanian3 ,
Radha Poovendran2 ,
Bo Li1
1University of Illinois Urbana-Champaign 2University of Washington 3Western Washington University
ICLR 2024
We propose BadChain,the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. In particular, a subset of demonstrations will be manipulated to incorporate a backdoor reasoning step in COT prompting. Consequently, given any query prompt containing the backdoor trigger, the LLM will be misled to output unintended content. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning.
- Make sure to set your api key before running experiment. See the placeholder at
utils.py
A example running command is as follow:
python run.py --llm gpt-3.5 --task gsm8k
More details can be found at run.py
If you find our work is useful in your research, please consider citing:
@misc{xiang2024badchain,
title={BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models},
author={Zhen Xiang and Fengqing Jiang and Zidi Xiong and Bhaskar Ramasubramanian and Radha Poovendran and Bo Li},
year={2024},
eprint={2401.12242},
archivePrefix={arXiv},
primaryClass={cs.CR}
}