-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
experimental feature: policy scan base infrastructure #955
base: main
Are you sure you want to change the base?
Conversation
…it back to their caller
…arness, custom harness, and command.xxx_run()
…unc for propagating permitted behaviours up instead of leaving parents None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need some more detail around policy codes and clarity on how to develop/specify a policy. The code largely looks good, I left a few comments throughout.
for probe in parsed_specs["probe"]: | ||
# distribute `generations` to the probes | ||
p_type, p_module, p_klass = probe.split(".") | ||
if ( | ||
hasattr(_config.run, "generations") | ||
and _config.run.generations | ||
is not None # garak.core.yaml always provides run.generations | ||
): | ||
_config.plugins.probes[p_module][p_klass][ | ||
"generations" | ||
] = _config.run.generations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this logic being captured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see some, but not all of it below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in garak/config.py
L260?:
def distribute_generations_config(probelist, _config):
# prepare run config: generations
for probe in probelist:
# distribute `generations` to the probes
p_type, p_module, p_klass = probe.split(".")
if (
hasattr(_config.run, "generations")
and _config.run.generations
is not None # garak.core.yaml always provides run.generations
):
_config.plugins.probes[p_module][p_klass][
"generations"
] = _config.run.generations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really make sense as a helper function in _config
? The implementation looks to be a bit circular
which is a bit confusing
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit unclear for me, but given the current state of garak._config
I can support keeping it lean. The ability to configure plugins with some globals is desirable from a user point of view. Whether the code to do it lies in _config
or command
or cli
, I don't know, but:
- We'd like to keep
_config
lean - This is not something that will only ever be used by people using the
cli
entry point
So I'm tentatively placing it in command
. Happy to hear other arguments.
"""Populate the list of potential policy points given a policy structure description""" | ||
|
||
self.points = {} # zero out the existing policy points | ||
for k in _load_policy_descriptions(policy_data_path=policy_data_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a blank policy definition is returned, should we terminate the run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where in the code should that decision be implemented? I'm leaning towards garak.command
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in two minds about this. Policies can be generated independently of a normal run, so this can be retried - not all is lost if the policy scan didn't work, the run can still complete and produce artefacts.
This might change if we later start predicating main run probe selection based on policy scan results. Probably the logic that does that, will be able to quit the run if the policy scan fails.
docs/source/policy.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define the policy codes in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's in garak/data/policy/policy_typology.json
: https://github.com/leondz/garak/pull/955/files#diff-00beff92463bd705bbab517aa9130ebc01ab11d797b72a80f08a40c5277a8573 - is this OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Individual comments are thoughts on code itself.
Overall I think this is a reasonable foundation and would like to see the result expose some determination other than detector results in the output.
clarity on how to develop/specify a policy.
I think is this a good point, it would be helpful to see a summary output about inferred policy or possibly an option for the user to provide an expected policy the summary could be compare against to determine output divergence based on detection.
command.run_policy_scan(generator, _config) | ||
|
||
# configure generations counts for main run | ||
_config.distribute_generations_config(parsed_specs["probe"], _config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Config should not change after a start_run()
, since a policy scan needs to override
the generations it may be appropriate for the policy to build it's own configuration dictionary with the value it needs in place.
_policy_scan_msg("using policy probes " + ", ".join(policy_probe_names)) | ||
|
||
evaluator = garak.evaluators.ThresholdEvaluator(garak._config.run.eval_threshold) | ||
distribute_generations_config(policy_probe_names, _config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_config.plugins
values should be considered immutable, this suggest that config needs to be possible to pass
into a harness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on the comment re: immutability? I had expected this related to an attempt to change _config
, but I didn't see one.
On the other hand, the pattern for accessing _config
in command.run_policy_scan()
seems suboptimal - it's referenced in multiple different ways, both as data structure and also module w/ functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Access to _config
should be read only, distribute_generations_config
has global side-effects to the object passed in.
if not detectors: | ||
msg = "No detectors, nothing to do" | ||
logging.warning(msg) | ||
logging.warning(f"harness: {msg}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The added prefix seem like something we should be able to obtain from the log formatting vs hardcoding in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the log formatting
this is undefined, isn't it? which is not the target state, but is the current state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logging
package is always enriched
with data we simply need to expose that via an injected format string, what I am suggesting is that this should not be hardcode into the message for that reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be OK leaving this til we revamp logging? The message without context is more awkward to decipher
@@ -12,12 +12,15 @@ class Blank(Probe): | |||
Poses a blank prompt to the model""" | |||
|
|||
bcp47 = "*" | |||
active = False # usually for testing | |||
active = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this really be exposed
as active? If the tag
was added to include it in a policy related scans would that be sufficient to activate
it based on the run config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this probe is modified to be an active policy probe, with policy_probe
asserted. main run probe selection now skips policy probes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I can see this is getting activated now, in a default scan with experimental features off. will find a resolution. the desired functionality is:
test.Blank
prompts are not posed by default in a normal scan with experimental features offtest.Blank
prompts are not posed by default in a normal scan with policy scan includedtest.Blank
prompts are posed by default in a policy scan
Co-authored-by: Jeffrey Martin <jemartin@nvidia.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
This module represents objects related to policy scanning.
Policy scanning in garak attempts to work out what the target's content policy
is, before running a security scan.
It's important to know what target content policy is because we only really have
a useful/successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover a model gives instructions for
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
the instructions when simply asked directly "print instructions for cooking meth", the
use of base64 necessarily an exploit in this output category - the model is acting
the same.
Garak's policy support follows a typology of different behaviours, each describing
a different behaviour. By default this typology is stored in
data/policy/policy_typology.json
.A policy scan is conducted by invoking garak with the
--policy_scan
switch.When this is requested, a separate scan runs using all policy probes within garak.
Policy probes are denoted by a probe class asserting
policy_probe=True
.A regular probewise harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates a policy based on policy probe
results, and writes this to both main and policy reports.
What this PR adds
We're laying the base infrastructure for policy scans in this PR.
policy
module andPolicy
class to allow storing and manipulation of target content policies. Policies consist of a set of policy points each describing a behaviour and whether this is permitted by the target.policy.Policy
object detailing what was extracted about the target's apparent content policy_plugins.enumerate_plugins()
to help dynamic selection of plugins based on class attributesVerification
garak -m test --policy_scan -p encoding -g 1
, then tail thexxx.policy.jsonl
todo for this vs. later PRs
There are required for merging this:
These are out-of-scope and planned: