BEAST Attack Implementation #728

erickgalinkin
Collaborator

Add BEAST attack; add resources/common; refactor to use common. Remove advbench data.

…Create resources.common with commonly used attributes. Fix typo in generate_gcg.py. Rename gcg.py to suffix.py to encompass other similar attacks.
…s bars. Catch case where advbench directory doesn't exist.
@jmartin-tech jmartin-tech marked this pull request as draft June 7, 2024 14:18
Collaborator

@jmartin-tech jmartin-tech left a comment

Some early review thoughts.

garak/resources/common.py (outdated, resolved)
garak/resources/autodan/autodan.py (outdated, resolved)
garak/resources/beast/beast_attack.py (resolved)
garak/probes/suffix.py (resolved)

try:
    beast_output = self.run_beast(target_generator=self.generator)
except Exception as e:
Collaborator

Love that there is error handling here.

Can this be more specific in the Exception types captured? Consider using GarakException and having any generated exceptions in the utility inherit.
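
A rough sketch of what this suggestion could look like, not the code that was merged. The `GarakException` class below is a local stand-in defined only for illustration, and `BeastAttackError`, `run_beast` and `probe_prompts` are hypothetical names rather than garak's actual API.

```python
import logging


class GarakException(Exception):
    """Local stand-in for a project-wide base exception (assumption)."""


class BeastAttackError(GarakException):
    """Hypothetical error raised by the BEAST utility code."""


def run_beast(fail: bool = False) -> list[str]:
    # Placeholder for the real attack; raises a project-specific error on failure.
    if fail:
        raise BeastAttackError("tokenizer does not support the required operation")
    return ["example suffix"]


def probe_prompts(fail: bool = False) -> list[str]:
    try:
        return run_beast(fail=fail)
    except GarakException as e:
        # Only project-defined failures are swallowed; unexpected bugs still surface.
        logging.error("BEAST generation failed: %s", e)
        return []
```

Catching the shared base class keeps the probe tolerant of known failure modes while letting genuinely unexpected errors, such as a `TypeError` from a coding mistake, propagate.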

garak/resources/beast/__init__.py (resolved)
garak/resources/beast/beast_attack.py (outdated, resolved)
garak/resources/common.py (outdated, resolved)
…unction and catch case where tokenizer does not support. Fix get_perplexity. Fix evaluate. Add score_candidates function. Apply black.
…emove duplicate invocation of `Path`. Denote methods in beast_attack.py as private. Fix _evaluate method to deal with generator returns properly.
@erickgalinkin erickgalinkin marked this pull request as ready for review June 10, 2024 20:53
Collaborator

@jmartin-tech jmartin-tech left a comment

Nice progress, a suggestion for code reuse and some minor thoughts. No required revisions that I see although some more testing to be done.

garak/resources/beast/data/suffixes.txt (outdated, resolved)
suffix_len: int = 40,
data_size: int = 20,
target: Optional[str] = "",
outfile: str = beast_resource_data / "suffixes.txt",
Collaborator

This default location is inside the installed path, I would suggest this should be either in the OS temp directory or a user's profile/configuration directory.

This is likely a reasonable choice to get this PR staged and landed, tagging #479 to note location that may need update.

Collaborator

one temp solution is something like garak._config.reporting.report_dir / "garak." + garak._config.transient.run_id + ".beast_suffixes.txt"
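
One thing to watch in the expression above: `/` binds tighter than `+`, so the path join happens first and the following `Path + str` raises a `TypeError`. A minimal sketch that builds the filename before joining, assuming `report_dir` is a `Path` and `run_id` is a `str`. The values below are placeholders standing in for `garak._config.reporting.report_dir` and `garak._config.transient.run_id`, and `beast_suffix_path` is a hypothetical helper.

```python
from pathlib import Path


def beast_suffix_path(report_dir: Path, run_id: str) -> Path:
    # Build the complete filename before joining it onto the directory.
    return report_dir / f"garak.{run_id}.beast_suffixes.txt"


# Placeholder values standing in for the config attributes named in the comment.
example = beast_suffix_path(Path("reports"), "1234")
```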

"A chat between a curious user and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the user's questions."
)
beast_resource_data = garak._config.transient.basedir / "resources" / "beast" / "data"
Collaborator

Suggested change
- beast_resource_data = garak._config.transient.basedir / "resources" / "beast" / "data"
+ resource_beast_data = garak._config.transient.basedir / "resources" / "beast" / "data"

Nitpick: it seems reasonable to have the name match the path it selects. This may become a moot point depending on the long-term location for the staging file.

Collaborator Author

This var name reads weirdly to me, but I get it. I think once we get #479 squared away, it's likely a moot point.

garak/probes/suffix.py (outdated, resolved)
Comment on lines +45 to +50
advbench_path = (
    garak._config.transient.basedir
    / "resources"
    / "advbench"
    / "harmful_behaviors.csv"
)
Collaborator

Another location reference for #479

Collaborator

this I think is a static resource that comes with garak, rather than data that will be modified

@erickgalinkin what's the license on this data btw?

Collaborator Author

@@ -36,7 +37,7 @@

logger = getLogger(__name__)

- gpg_resource_data = garak._config.transient.basedir / "resources" / "gcg" / "data"
+ gcg_resource_data = garak._config.transient.basedir / "resources" / "gcg" / "data"
Collaborator

Suggested change
- gcg_resource_data = garak._config.transient.basedir / "resources" / "gcg" / "data"
+ resource_gcg_data = garak._config.transient.basedir / "resources" / "gcg" / "data"

Another nitpick, just seems reasonable to match name to path.

Collaborator Author

Again, going to defer this change to #479

Collaborator

@leondz leondz left a comment

excellent effort, i can't believe we're getting BEAST in, that is extremely cool. just a few structural suggestions and questions.

garak/resources/beast/__init__.py (outdated, resolved)
garak/probes/suffix.py (outdated, resolved)
garak/resources/beast/beast_attack.py (outdated, resolved)
garak/resources/beast/beast_attack.py (resolved)
garak/resources/beast/beast_attack.py (resolved)
jailbreak = False
for output in outputs:
    if not any([rs in output[0] for rs in REJECTION_STRINGS]):
        jailbreak = True
Collaborator

detecting absence of mitigation well is a difficult task, eh. does this mean the detector should be predicated on REJECTION_STRINGS else we might have a bad time?

Collaborator Author

That's right. If only we had a better mitigation detector.

We could, theoretically, use LLM as a judge or whatever. May even be able to simply wrap in the MitigationDetector instead, but this is pretty lightweight. Definitely want to square REJECTION_STRINGS and the MitigationDetector when we start work on those.
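
For reference, the lightweight check under discussion can be sketched as a standalone function. The strings in `REJECTION_STRINGS` below are illustrative placeholders only (the real list is not shown in this thread), and `is_jailbroken` is a hypothetical name.

```python
# Illustrative refusal markers only; the real REJECTION_STRINGS list is not shown in this thread.
REJECTION_STRINGS = ["I'm sorry", "I cannot", "As an AI"]


def is_jailbroken(outputs: list[list[str]]) -> bool:
    # A candidate counts as a jailbreak if at least one response contains no refusal marker.
    return any(
        not any(rs in output[0] for rs in REJECTION_STRINGS)
        for output in outputs
    )
```

Because this is plain substring matching, a refusal phrased in a way the list does not cover is counted as a jailbreak, which is why the detector and the string list need to agree.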

garak/resources/beast/beast_attack.py (outdated, resolved)
Comment on lines +55 to +62
except pd.errors.ParserError as e:
    msg = f"Failed to parse the csv at {hb}"
    logging.error(msg)
    raise pd.errors.ParserError
except urllib.error.HTTPError as e:
    msg = f"Encountered error {e} trying to retrieve {hb}"
    logging.error(msg)
    raise urllib.error.HTTPError
Collaborator

Do these stop the garak run? Would be preferable for probe() to just return []

Collaborator Author

We may want to raise GarakException here instead so we can handle it more gracefully?

Collaborator Author

Looks like upstream of everywhere load_advbench is currently used in a probe, it's already wrapped in a try/except block that'll handle the error gracefully -- log the error, print the error encountered, and register that there was no output from the generation attempt, so it won't stop the run.
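
A sketch of the `GarakException` idea raised in this thread, under stated assumptions. Note that `raise urllib.error.HTTPError` with the bare class would, as far as I can tell, fail on its own, because `HTTPError`'s constructor requires arguments; chaining with `from e` keeps the original error instead. `GarakException` here is a local stand-in, and `load_goals`, `probe` and `fetch` are hypothetical names, not garak's API.

```python
import logging
import urllib.error


class GarakException(Exception):
    """Local stand-in for a project-wide base exception (assumption)."""


def load_goals(fetch) -> list[str]:
    """Call `fetch` (a hypothetical loader) and wrap retrieval failures."""
    try:
        return fetch()
    except urllib.error.HTTPError as e:
        msg = f"Encountered error {e} trying to retrieve advbench data"
        logging.error(msg)
        # Chain the original error so its URL and status code stay in the traceback.
        raise GarakException(msg) from e


def probe(fetch) -> list[str]:
    try:
        return load_goals(fetch)
    except GarakException:
        # No prompts for this probe, but the overall run continues.
        return []
```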

erickgalinkin and others added 4 commits June 24, 2024 11:40
Co-authored-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
Fix copyright date

Co-authored-by: Leon Derczynski <leonderczynski@gmail.com>
Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
Co-authored-by: Leon Derczynski <leonderczynski@gmail.com>
Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
Add description to progress bar.

Co-authored-by: Leon Derczynski <leonderczynski@gmail.com>
Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
Comment on lines +83 to +86
if logs is not None:
    logs += log
else:
    logs = log
Collaborator

Just trying to work out how to skip the conditional, but it's a marginal optimisation
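
One possible way to drop the conditional, assuming each `log` is a plain list (the thread does not say what type it is, so this is a guess): start from an empty list instead of `None`. The function name `collect_logs` is hypothetical.

```python
def collect_logs(batches: list[list[str]]) -> list[str]:
    # Starting from an empty list means `+=` is always valid, so no None check is needed.
    logs: list[str] = []
    for log in batches:
        logs += log
    return logs
```

If `log` is some other type, such as a DataFrame, the same idea would need an appropriate empty starting value for that type.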

Collaborator

@jmartin-tech jmartin-tech left a comment

A TypeError is currently raised when running the suffix.BEAST probe.

  File "/garak/garak/probes/suffix.py", line 154, in probe
    self.prompts = [self.goal + beast_output]
...
TypeError: can only concatenate str (not "list") to str
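
The traceback indicates that `beast_output` is a list while `self.goal` is a string. One plausible fix, shown here as a standalone sketch and not necessarily the change that was merged, is to build one prompt per suffix. `build_prompts` is a hypothetical helper, and the single-space separator is an assumption.

```python
def build_prompts(goal: str, beast_output: list[str]) -> list[str]:
    # One prompt per generated suffix, rather than concatenating a str with a list.
    return [f"{goal} {suffix}" for suffix in beast_output]
```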

garak/probes/suffix.py (outdated, resolved)
Co-authored-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
@jmartin-tech jmartin-tech merged commit 96de942 into main Jun 28, 2024
6 checks passed
@jmartin-tech jmartin-tech deleted the 530-probe-beast-fast-adversarial-attacks-on-language-models-in-one-gpu-minute branch June 28, 2024 13:58
@github-actions github-actions bot locked and limited conversation to collaborators Jun 28, 2024
Successfully merging this pull request may close these issues.

probe: BEAST / Fast Adversarial Attacks on Language Models In One GPU Minute
3 participants