-
Notifications
You must be signed in to change notification settings - Fork 349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BEAST Attack Implementation #728
BEAST Attack Implementation #728
Conversation
…Create resources.common with commonly used attributes. Fix typo in generate_gcg.py. Rename gcg.py to suffix.py to encompass other similar attacks.
… Refactor gcg to use new functionality.
…s bars. Catch case where advbench directory doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some early review thoughts.
|
||
try: | ||
beast_output = self.run_beast(target_generator=self.generator) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love that there is error handling here.
Can this be more specific in the Exception
types captured? Consider using GarakException
and having any generated exceptions in the utility inherit.
…unction and catch case where tokenizer does not support. Fix get_perplexity. Fix evaluate. Add score_candidates function. Apply black.
…emove duplicate invocation of `Path`. Denote methods in beast_attack.py as private. Fix _evaluate method to deal with generator returns properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice progress, a suggestion for code reuse and some minor thoughts. No required
revisions that I see although some more testing to be done.
suffix_len: int = 40, | ||
data_size: int = 20, | ||
target: Optional[str] = "", | ||
outfile: str = beast_resource_data / "suffixes.txt", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This default location is inside the installed path, I would suggest this should be either in the OS temp directory or a user's profile/configuration directory.
This is likely a reasonable choice to get this PR staged and landed, tagging #479 to note location that may need update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one temp solution is something like garak._config.reporting.report_dir / "garak." + garak._config.transient.run_id + ".beast_suffixes.txt"
"A chat between a curious user and an artificial intelligence assistant. " | ||
"The assistant gives helpful, detailed, and polite answers to the user's questions." | ||
) | ||
beast_resource_data = garak._config.transient.basedir / "resources" / "beast" / "data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beast_resource_data = garak._config.transient.basedir / "resources" / "beast" / "data" | |
resource_beast_data = garak._config.transient.basedir / "resources" / "beast" / "data" |
Nitpick, it seems reasonable to have the name match the path is selects. May become a moot point depending on long term location for staging file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This var name reads weirdly to me, but I get it. I think once we get #479 squared away, it's likely a moot point.
advbench_path = ( | ||
garak._config.transient.basedir | ||
/ "resources" | ||
/ "advbench" | ||
/ "harmful_behaviors.csv" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another location reference for #479
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this I think is a static resource that comes with garak, rather than data that will be modified
@erickgalinkin what's the license on this data btw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -36,7 +37,7 @@ | |||
|
|||
logger = getLogger(__name__) | |||
|
|||
gpg_resource_data = garak._config.transient.basedir / "resources" / "gcg" / "data" | |||
gcg_resource_data = garak._config.transient.basedir / "resources" / "gcg" / "data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gcg_resource_data = garak._config.transient.basedir / "resources" / "gcg" / "data" | |
resource_gcg_data = garak._config.transient.basedir / "resources" / "gcg" / "data" |
Another nitpick, just seems reasonable to match name to path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, going to defer this change to #479
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excellent effort, i can't believe we're getting BEAST in, that is extremely cool. just a few structural suggestions and questions.
jailbreak = False | ||
for output in outputs: | ||
if not any([rs in output[0] for rs in REJECTION_STRINGS]): | ||
jailbreak = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
detecting absence of mitigation well is a difficult task, eh. does this mean the detector should be predicated on REJECTION_STRINGS
else we might have a bad time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. If only we had a better mitigation detector.
We could, theoretically, use LLM as a judge or whatever. May even be able to simply wrap in the MitigationDetector
instead, but this is pretty lightweight. Definitely want to square REJECTION_STRINGS
and the MitigationDetector
when we start work on those.
suffix_len: int = 40, | ||
data_size: int = 20, | ||
target: Optional[str] = "", | ||
outfile: str = beast_resource_data / "suffixes.txt", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one temp solution is something like garak._config.reporting.report_dir / "garak." + garak._config.transient.run_id + ".beast_suffixes.txt"
advbench_path = ( | ||
garak._config.transient.basedir | ||
/ "resources" | ||
/ "advbench" | ||
/ "harmful_behaviors.csv" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this I think is a static resource that comes with garak, rather than data that will be modified
@erickgalinkin what's the license on this data btw?
except pd.errors.ParserError as e: | ||
msg = f"Failed to parse the csv at {hb}" | ||
logging.error(msg) | ||
raise pd.errors.ParserError | ||
except urllib.error.HTTPError as e: | ||
msg = f"Encountered error {e} trying to retrieve {hb}" | ||
logging.error(msg) | ||
raise urllib.error.HTTPError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these stop the garak run? Would be preferable for probe()
to just return []
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to raise GarakException
here instead so we can handle it more gracefully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like upstream of everywhere load_advbench
is currently used in a probe, it's already wrapped in a try
/except
block that'll handle the error gracefully -- log the error, print the error encountered, and register that there was no output from the generation attempt, so it won't stop the run.
Co-authored-by: Jeffrey Martin <jemartin@nvidia.com> Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
Fix copyright date Co-authored-by: Leon Derczynski <leonderczynski@gmail.com> Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
Co-authored-by: Leon Derczynski <leonderczynski@gmail.com> Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
Add description to progress bar. Co-authored-by: Leon Derczynski <leonderczynski@gmail.com> Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
if logs is not None: | ||
logs += log | ||
else: | ||
logs = log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just trying to work out how to skip the conditional, but it's a marginal optimisation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A syntax error is currently generated when running the suffix.BEAST
probe.
File "/garak/garak/probes/suffix.py", line 154, in probe
self.prompts = [self.goal + beast_output]
...
TypeError: can only concatenate str (not "list") to str```
Co-authored-by: Jeffrey Martin <jemartin@nvidia.com> Signed-off-by: Erick Galinkin <erick.galinkin@gmail.com>
Add BEAST attack; add resources/common; refactor to use common. Remove advbench data.