BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang¹ , Fengqing Jiang² , Zidi Xiong¹
Bhaskar Ramasubramanian³ , Radha Poovendran² , Bo Li¹
¹University of Illinois Urbana-Champaign ²University of Washington ³Western Washington University

ICLR 2024

[arXiv] [OpenReview]

Overview

We propose BadChain,the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. In particular, a subset of demonstrations will be manipulated to incorporate a backdoor reasoning step in COT prompting. Consequently, given any query prompt containing the backdoor trigger, the LLM will be misled to output unintended content. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning.

Experiment

Setup

Make sure to set your api key before running experiment. See the placeholder at utils.py

Running

A example running command is as follow:

python run.py --llm gpt-3.5 --task gsm8k

More details can be found at run.py

Citation

If you find our work is useful in your research, please consider citing:

@misc{xiang2024badchain,
    title={BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models}, 
    author={Zhen Xiang and Fengqing Jiang and Zidi Xiong and Bhaskar Ramasubramanian and Radha Poovendran and Bo Li},
    year={2024},
    eprint={2401.12242},
    archivePrefix={arXiv},
    primaryClass={cs.CR}
}

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
ans_eval		ans_eval
assets		assets
data		data
index_file		index_file
lib_prompt		lib_prompt
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
attack.py		attack.py
defense.py		defense.py
run.py		run.py
triggers.json		triggers.json
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Overview

Experiment

Setup

Running

Citation

About

Releases

Packages

Languages

License

Django-Jiang/BadChain

Folders and files

Latest commit

History

Repository files navigation

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Overview

Experiment

Setup

Running

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages