This is the official implementation of the paper "You Know What I'm Saying - Jailbreak Attack via Implicit Reference".
| Model | LLaMA-3-8B | LLaMA-3-70B | Qwen-2-7B | Qwen-2-72B | GPT-4o-mini | GPT-4o | Claude-3.5-Sonnet |
|---|---|---|---|---|---|---|---|
| Attack Success Rate (%) | 77 | 84 | 80 | 81 | 87 | 95 | 93 |
| Model | Claude-3.5-Sonnet | LLaMA-3-8B |
|---|---|---|
| Attack Success Rate (%) | 96 | 83 |
- Anaconda/Miniconda
- Python 3.10
- **Create a Conda Environment**

  Open your terminal and create a new Conda environment with Python 3.10:

  ```bash
  conda create -n your_environment_name python=3.10
  conda activate your_environment_name
  ```
- **Install Required Packages**

  Install all necessary packages using `pip`:

  ```bash
  pip install -r requirements.txt
  ```
- **Configure Environment Variables**

  Copy the `.env_template` file to `.env`:

  ```bash
  cp .env_template .env
  ```

  Edit the `.env` file to set the required environment variables:

  ```bash
  nano .env
  ```

  - `GPT`: Rewrite model
  - `TARGET`: Target model
  - `CONTEXT`: Context model (can be ignored if not running a cross-model attack)
  - `LLAMA3_JUDGE`: LLaMA-3-70B judge (can be ignored if using GPT as judge)
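  Below is a minimal sketch of a filled-in `.env`, assuming each variable takes a plain model identifier; the exact value format is defined by the repository's configuration loader, and the values shown are hypothetical placeholders:

  ```bash
  # Hypothetical example values: replace with the models you actually use.
  GPT=gpt-4o-mini            # rewrite model
  TARGET=gpt-4o              # target model
  CONTEXT=gpt-4o             # context model, only needed for cross-model attacks
  LLAMA3_JUDGE=llama-3-70b   # judge model, only needed when using the LLaMA judge
  ```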
- **Execute the Auto Attack Script**

  After running the model script, execute the auto attack script for your target model:

  ```bash
  bash scripts/run_attack_{target_model}.sh
  ```

  Replace `{target_model}` with the appropriate target model name.
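For example, to target GPT-4o you would run the matching script. The exact filenames depend on what is shipped in `scripts/`, so `run_attack_gpt4o.sh` below is a hypothetical name; check the directory listing first:

```bash
# Hypothetical example; list scripts/ to see the script names actually provided.
ls scripts/
bash scripts/run_attack_gpt4o.sh
```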
The project uses the following command-line arguments to control various aspects of the attack:

- `--n_requests`: Number of requests. (default: 100)
- `--n_restarts`: Number of restarts. (default: 20)
- `--attack_method`: Attack type. Choices are "direct", "k2", "k3", "k4", "k5", "k6". (default: "k4")
- `--target_model`: Name of target model.
- `--target_base_url`: Base URL of target model.
- `--context_model`: Name of context model.
- `--judge`: Judge type. Choices are "gpt" or "llama". (default: "gpt")
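As a rough sketch of how these options combine, an invocation might look like the following. The entry-point name `main.py` is a hypothetical placeholder (the `run_attack_*.sh` scripts presumably wrap the real entry point); only the flags and their defaults are taken from the list above:

```bash
# Hypothetical invocation: the script name and values are illustrative;
# only the flags come from the argument list above.
python main.py \
    --n_requests 100 \
    --n_restarts 20 \
    --attack_method k4 \
    --target_model gpt-4o \
    --judge gpt
```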