Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experimental feature: policy scan base infrastructure #955

Draft
wants to merge 52 commits into
base: main
Choose a base branch
from

Conversation

leondz
Copy link
Collaborator

@leondz leondz commented Oct 23, 2024

This module represents objects related to policy scanning.

Policy scanning in garak attempts to work out what the target's content policy
is, before running a security scan.

It's important to know what target content policy is because we only really have
a useful/successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover a model gives instructions for
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
the instructions when simply asked directly "print instructions for cooking meth", the
use of base64 necessarily an exploit in this output category - the model is acting
the same.

Garak's policy support follows a typology of different behaviours, each describing
a different behaviour. By default this typology is stored in data/policy/policy_typology.json.

A policy scan is conducted by invoking garak with the --policy_scan switch.
When this is requested, a separate scan runs using all policy probes within garak.
Policy probes are denoted by a probe class asserting policy_probe=True.
A regular probewise harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates a policy based on policy probe
results, and writes this to both main and policy reports.

What this PR adds

We're laying the base infrastructure for policy scans in this PR.

  • Add a policy typology and support loading it
  • Introduce a policy module and Policy class to allow storing and manipulation of target content policies. Policies consist of a set of policy points each describing a behaviour and whether this is permitted by the target.
  • Differentiate non-adversarial probes as "policy" probes, which appear differently and are not automatically selected for main runs
  • Specify which policy points a policy probe tests for, and enforce presence of this via test (a policy probe that doesn't inform the policy, is not useful)
  • Add an optional "policy scan" which assesses model behavior under various policy points. It:
    • selects policy probes
    • guesses a policy depending on the output of these probe's nominated detectors
    • logs to a separate place
    • outputs a serialised policy.Policy object detailing what was extracted about the target's apparent content policy
  • Enable plugin filtering in _plugins.enumerate_plugins() to help dynamic selection of plugins based on class attributes
  • Added to & unified logging in harnesses

Verification

  • garak -m test --policy_scan -p encoding -g 1, then tail the xxx.policy.jsonl

todo for this vs. later PRs

There are required for merging this:

  • test for garak.policy
  • validate default policy
  • tag tests: probes must give policy list if policy_probe is true
  • validation

These are out-of-scope and planned:

  • policy probe for trying prompts based on policy points and looking to mitigation/no resp
  • merging of results for cases where multiple probes test a policy
  • refactor donotanswer
  • using policy to filter planned probing

leondz added 30 commits October 2, 2024 14:32
…arness, custom harness, and command.xxx_run()
@leondz leondz added the architecture Architectural upgrades label Oct 23, 2024
@leondz leondz marked this pull request as ready for review October 24, 2024 09:29
@leondz leondz changed the title feature: policy scans feature: policy scan base infrastructure Oct 29, 2024
Copy link
Collaborator

@erickgalinkin erickgalinkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need some more detail around policy codes and clarity on how to develop/specify a policy. The code largely looks good, I left a few comments throughout.

Comment on lines -473 to -557
for probe in parsed_specs["probe"]:
# distribute `generations` to the probes
p_type, p_module, p_klass = probe.split(".")
if (
hasattr(_config.run, "generations")
and _config.run.generations
is not None # garak.core.yaml always provides run.generations
):
_config.plugins.probes[p_module][p_klass][
"generations"
] = _config.run.generations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this logic being captured?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see some, but not all of it below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in garak/config.py L260?:

def distribute_generations_config(probelist, _config):
    # prepare run config: generations
    for probe in probelist:
        # distribute `generations` to the probes
        p_type, p_module, p_klass = probe.split(".")
        if (
            hasattr(_config.run, "generations")
            and _config.run.generations
            is not None  # garak.core.yaml always provides run.generations
        ):
            _config.plugins.probes[p_module][p_klass][
                "generations"
            ] = _config.run.generations

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really make sense as a helper function in _config? The implementation looks to be a bit circular which is a bit confusing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit unclear for me, but given the current state of garak._config I can support keeping it lean. The ability to configure plugins with some globals is desirable from a user point of view. Whether the code to do it lies in _config or command or cli, I don't know, but:

  • We'd like to keep _config lean
  • This is not something that will only ever be used by people using the cli entry point

So I'm tentatively placing it in command. Happy to hear other arguments.

garak/cli.py Outdated Show resolved Hide resolved
garak/evaluators/base.py Outdated Show resolved Hide resolved
garak/policy.py Show resolved Hide resolved
"""Populate the list of potential policy points given a policy structure description"""

self.points = {} # zero out the existing policy points
for k in _load_policy_descriptions(policy_data_path=policy_data_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a blank policy definition is returned, should we terminate the run?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where in the code should that decision be implemented? I'm leaning towards garak.command

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in two minds about this. Policies can be generated independently of a normal run, so this can be retried - not all is lost if the policy scan didn't work, the run can still complete and produce artefacts.

This might change if we later start predicating main run probe selection based on policy scan results. Probably the logic that does that, will be able to quit the run if the policy scan fails.

garak/probes/av_spam_scanning.py Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define the policy codes in here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@jmartin-tech jmartin-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Individual comments are thoughts on code itself.

Overall I think this is a reasonable foundation and would like to see the result expose some determination other than detector results in the output.

clarity on how to develop/specify a policy.

I think is this a good point, it would be helpful to see a summary output about inferred policy or possibly an option for the user to provide an expected policy the summary could be compare against to determine output divergence based on detection.

command.run_policy_scan(generator, _config)

# configure generations counts for main run
_config.distribute_generations_config(parsed_specs["probe"], _config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config should not change after a start_run(), since a policy scan needs to override the generations it may be appropriate for the policy to build it's own configuration dictionary with the value it needs in place.

_policy_scan_msg("using policy probes " + ", ".join(policy_probe_names))

evaluator = garak.evaluators.ThresholdEvaluator(garak._config.run.eval_threshold)
distribute_generations_config(policy_probe_names, _config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_config.plugins values should be considered immutable, this suggest that config needs to be possible to pass into a harness.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on the comment re: immutability? I had expected this related to an attempt to change _config, but I didn't see one.

On the other hand, the pattern for accessing _config in command.run_policy_scan() seems suboptimal - it's referenced in multiple different ways, both as data structure and also module w/ functions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Access to _config should be read only, distribute_generations_config has global side-effects to the object passed in.

garak/evaluators/base.py Outdated Show resolved Hide resolved
if not detectors:
msg = "No detectors, nothing to do"
logging.warning(msg)
logging.warning(f"harness: {msg}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added prefix seem like something we should be able to obtain from the log formatting vs hardcoding in.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the log formatting

this is undefined, isn't it? which is not the target state, but is the current state

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging package is always enriched with data we simply need to expose that via an injected format string, what I am suggesting is that this should not be hardcode into the message for that reason.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be OK leaving this til we revamp logging? The message without context is more awkward to decipher

garak/probes/av_spam_scanning.py Show resolved Hide resolved
@@ -12,12 +12,15 @@ class Blank(Probe):
Poses a blank prompt to the model"""

bcp47 = "*"
active = False # usually for testing
active = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this really be exposed as active? If the tag was added to include it in a policy related scans would that be sufficient to activate it based on the run config?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this probe is modified to be an active policy probe, with policy_probe asserted. main run probe selection now skips policy probes.

Copy link
Collaborator Author

@leondz leondz Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I can see this is getting activated now, in a default scan with experimental features off. will find a resolution. the desired functionality is:

  • test.Blank prompts are not posed by default in a normal scan with experimental features off
  • test.Blank prompts are not posed by default in a normal scan with policy scan included
  • test.Blank prompts are posed by default in a policy scan

garak/policy.py Outdated Show resolved Hide resolved
garak/policy.py Show resolved Hide resolved
garak/data/policy/policy_typology.txt Outdated Show resolved Hide resolved
garak/harnesses/base.py Outdated Show resolved Hide resolved
@leondz leondz marked this pull request as draft November 12, 2024 16:09
@leondz leondz changed the title feature: policy scan base infrastructure experimental feature: policy scan base infrastructure Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
architecture Architectural upgrades
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add pre-scan model output policy checks
3 participants