GitHub - kangmintong/R-2-Guard: Code implementation of R^2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Evaluations on standard safety benchmarks

bash scripts.sh

Evaluations against jailbreaks

For the GCG attack:

bash scripts.sh

In scripts.sh, specify the adv_string that corresponds to the GCG variant being evaluated (adv_string_1: GCG-U1; adv_string_2: GCG-U2; adv_string_3: GCG-V; adv_string_4: GCG-L; adv_string_5: GCG-R).
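As a quick reference, the mapping above can be written as a small shell lookup. This is only a sketch: the helper name `gcg_variant` is hypothetical and not part of the repository, and it does not show how scripts.sh actually consumes adv_string.

```shell
#!/usr/bin/env bash
# Lookup table copied from the README: adv_string name -> GCG attack variant.
# The helper name gcg_variant is illustrative; it is not defined in the repository.
gcg_variant() {
  case "$1" in
    adv_string_1) echo "GCG-U1" ;;
    adv_string_2) echo "GCG-U2" ;;
    adv_string_3) echo "GCG-V" ;;
    adv_string_4) echo "GCG-L" ;;
    adv_string_5) echo "GCG-R" ;;
    *) echo "unknown adv_string: $1" >&2; return 1 ;;
  esac
}

gcg_variant adv_string_3   # prints GCG-V
```

Running the helper before editing scripts.sh is one way to double-check which variant a given adv_string name refers to.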

For the AutoDAN, TAP, and PAIR attacks:

bash jailbreak.sh

Note: The evaluations are extensive, so you may need to edit the scripts (commenting out or uncommenting commands) so that only the evaluation you want runs. The scripts will be reorganized after the paper review.
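Before launching a long run, it can help to check which commands in a script are currently active. A minimal sketch, assuming a standard `grep` is available; the helper name `active_commands` is hypothetical and not part of the repository:

```shell
#!/usr/bin/env bash
# Print the lines of a shell script that are neither blank nor comments,
# i.e. the commands that would actually execute.
# The helper name active_commands is illustrative, not from the repository.
active_commands() {
  grep -vE '^[[:space:]]*(#|$)' "$1"
}

# Usage (assuming the current directory is the repository root):
#   active_commands scripts.sh
#   active_commands jailbreak.sh
```

If the output lists more evaluations than intended, comment out the extra lines in the script and check again.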

Folders and files:

data
AutoDAN.py
README.md
TAP.py
data_loading.py
distill_r2guard.py
fuse_advbench_benign.py
jailbreak.py
jailbreak.sh
knowledge_guardrail_inference.py
knowledge_guardrail_model.py
run_knowledge_models.py
run_weight_optimization.py
save_autodan_prompt.py
scripts.sh
test.py
utils.py
