experimental feature: policy scan base infrastructure #955

leondz · 2024-10-23T13:11:35Z

This module represents objects related to policy scanning.

Policy scanning in garak attempts to work out what the target's content policy
is, before running a security scan.

It's important to know what the target's content policy is because we only really have
a useful/successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover a model gives instructions for
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
the instructions when simply asked directly "print instructions for cooking meth", the
use of base64 isn't necessarily an exploit in this output category - the model is acting
the same.

Garak's policy support follows a typology of behaviours, with each entry describing
a different behaviour. By default this typology is stored in data/policy/policy_typology.json.

A policy scan is conducted by invoking garak with the --policy_scan switch.
When this is requested, a separate scan runs using all policy probes within garak.
Policy probes are denoted by a probe class asserting policy_probe=True.
A regular probewise harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates a policy based on policy probe
results, and writes this to both main and policy reports.

What this PR adds

We're laying the base infrastructure for policy scans in this PR.

- Add a policy typology and support loading it
- Introduce a policy module and Policy class to allow storing and manipulation of target content policies. Policies consist of a set of policy points, each describing a behaviour and whether this is permitted by the target.
- Differentiate non-adversarial probes as "policy" probes, which appear differently and are not automatically selected for main runs
- Specify which policy points a policy probe tests for, and enforce presence of this via test (a policy probe that doesn't inform the policy is not useful)
- Add an optional "policy scan" which assesses model behavior under various policy points. It:
  - selects policy probes
  - guesses a policy depending on the output of these probes' nominated detectors
  - logs to a separate place
  - outputs a serialised policy.Policy object detailing what was extracted about the target's apparent content policy
- Enable plugin filtering in _plugins.enumerate_plugins() to help dynamic selection of plugins based on class attributes
- Add to and unify logging in harnesses
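To make the "set of policy points" idea concrete, here is a minimal sketch of how such a structure could look. It is not the policy.Policy class from this PR: the class name PolicySketch, the explicit parent mapping, and the point IDs are all made up for illustration. The upward propagation rule (a permitted child marks a still-unknown parent as permitted) is one reading of the "propagate permitted behaviours up" commit, not a confirmed description of it.

```python
from typing import Dict, Optional


class PolicySketch:
    """Toy stand-in for a target content policy (hypothetical, not garak's Policy)."""

    def __init__(self, parents: Dict[str, Optional[str]]):
        # parents maps each policy point ID to its parent ID (None for roots)
        self.parents = dict(parents)
        # None = unknown, True = permitted, False = not permitted
        self.points: Dict[str, Optional[bool]] = {k: None for k in parents}

    def set_point(self, point_id: str, permitted: bool) -> None:
        if point_id not in self.points:
            raise KeyError(f"unknown policy point: {point_id}")
        self.points[point_id] = permitted

    def propagate_up(self) -> None:
        """Mark a still-unknown parent as permitted when any child is permitted."""
        changed = True
        while changed:  # repeat until no more parents can be filled in
            changed = False
            for point_id, permitted in self.points.items():
                parent = self.parents[point_id]
                if permitted and parent is not None and self.points[parent] is None:
                    self.points[parent] = True
                    changed = True


# Hypothetical point IDs, chosen only to show a three-level hierarchy
policy = PolicySketch({"T": None, "T001": "T", "T001a": "T001", "C": None})
policy.set_point("T001a", True)
policy.propagate_up()
```

After the call, the permitted leaf has filled in both of its ancestors, while the unrelated root stays unknown rather than being guessed.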

Verification

garak -m test --policy_scan -p encoding -g 1, then tail the xxx.policy.jsonl

todo for this vs. later PRs

These are required for merging this:

test for garak.policy
validate default policy
tag tests: probes must give policy list if policy_probe is true
validation

These are out-of-scope and planned:

policy probe for trying prompts based on policy points and looking for mitigation/no response
merging of results for cases where multiple probes test a policy
refactor donotanswer
using policy to filter planned probing

erickgalinkin

I think we need some more detail around policy codes and clarity on how to develop/specify a policy. The code largely looks good, I left a few comments throughout.

erickgalinkin · 2024-10-28T20:50:47Z

garak/cli.py

-            for probe in parsed_specs["probe"]:
-                # distribute `generations` to the probes
-                p_type, p_module, p_klass = probe.split(".")
-                if (
-                    hasattr(_config.run, "generations")
-                    and _config.run.generations
-                    is not None  # garak.core.yaml always provides run.generations
-                ):
-                    _config.plugins.probes[p_module][p_klass][
-                        "generations"
-                    ] = _config.run.generations


Where is this logic being captured?

I see some, but not all of it below.

in garak/config.py L260?:

def distribute_generations_config(probelist, _config):
    # prepare run config: generations
    for probe in probelist:
        # distribute `generations` to the probes
        p_type, p_module, p_klass = probe.split(".")
        if (
            hasattr(_config.run, "generations")
            and _config.run.generations
            is not None  # garak.core.yaml always provides run.generations
        ):
            _config.plugins.probes[p_module][p_klass][
                "generations"
            ] = _config.run.generations

Does this really make sense as a helper function in _config? The implementation looks to be a bit circular which is a bit confusing.

It's a bit unclear for me, but given the current state of garak._config I can support keeping it lean. The ability to configure plugins with some globals is desirable from a user point of view. Whether the code to do it lies in _config or command or cli, I don't know, but:

We'd like to keep _config lean

This is not something that will only ever be used by people using the cli entry point

So I'm tentatively placing it in command. Happy to hear other arguments.

garak/cli.py

garak/evaluators/base.py

garak/policy.py

erickgalinkin · 2024-10-30T15:09:04Z

garak/policy.py

+        """Populate the list of potential policy points given a policy structure description"""
+
+        self.points = {}  # zero out the existing policy points
+        for k in _load_policy_descriptions(policy_data_path=policy_data_path):


If a blank policy definition is returned, should we terminate the run?

Where in the code should that decision be implemented? I'm leaning towards garak.command

I'm in two minds about this. Policies can be generated independently of a normal run, so this can be retried - not all is lost if the policy scan didn't work, the run can still complete and produce artefacts.

This might change if we later start predicating main run probe selection on policy scan results. The logic that does that will probably be able to quit the run if the policy scan fails.

garak/probes/av_spam_scanning.py

erickgalinkin · 2024-10-30T15:16:46Z

docs/source/policy.rst

Can we define the policy codes in here?

it's in garak/data/policy/policy_typology.json: https://github.com/leondz/garak/pull/955/files#diff-00beff92463bd705bbab517aa9130ebc01ab11d797b72a80f08a40c5277a8573 - is this OK?

jmartin-tech

Individual comments are thoughts on code itself.

Overall I think this is a reasonable foundation and would like to see the result expose some determination other than detector results in the output.

clarity on how to develop/specify a policy.

I think this is a good point. It would be helpful to see a summary output about the inferred policy, or possibly an option for the user to provide an expected policy that the summary could be compared against to determine output divergence based on detection.

jmartin-tech · 2024-10-30T15:31:50Z

garak/cli.py

+                command.run_policy_scan(generator, _config)
+
+            # configure generations counts for main run
+            _config.distribute_generations_config(parsed_specs["probe"], _config)


Config should not change after a start_run(). Since a policy scan needs to override the generations, it may be appropriate for the policy to build its own configuration dictionary with the value it needs in place.

jmartin-tech · 2024-10-30T15:46:09Z

garak/command.py

+    _policy_scan_msg("using policy probes " + ", ".join(policy_probe_names))
+
+    evaluator = garak.evaluators.ThresholdEvaluator(garak._config.run.eval_threshold)
+    distribute_generations_config(policy_probe_names, _config)


_config.plugins values should be considered immutable; this suggests that it needs to be possible to pass config into a harness.

Can you expand on the comment re: immutability? I had expected this related to an attempt to change _config, but I didn't see one.

On the other hand, the pattern for accessing _config in command.run_policy_scan() seems suboptimal - it's referenced in multiple different ways, both as data structure and also module w/ functions

Access to _config should be read-only; distribute_generations_config has global side effects on the object passed in.
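One way to avoid the side effect, sketched under assumptions: build a derived plugin configuration and hand that to the policy scan, leaving the shared object untouched. The function name with_generations and the plain-dict config shape are hypothetical and not part of garak; only the probe.split(".") convention is taken from the code quoted earlier in this review.

```python
import copy


def with_generations(plugin_config: dict, probe_names: list, generations: int) -> dict:
    """Return a new plugin config with `generations` set for each named probe."""
    derived = copy.deepcopy(plugin_config)  # leave the caller's config untouched
    for name in probe_names:
        # names follow the "probes.<module>.<class>" convention
        _p_type, p_module, p_klass = name.split(".")
        derived.setdefault(p_module, {}).setdefault(p_klass, {})["generations"] = generations
    return derived


# Illustrative data only: a shared config and a per-scan override
shared = {"encoding": {"InjectBase64": {"generations": 5}}}
policy_cfg = with_generations(
    shared, ["probes.encoding.InjectBase64", "probes.test.Blank"], 1
)
```

The policy scan would then read from policy_cfg, while the main run continues to see the original values in shared.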

garak/evaluators/base.py

jmartin-tech · 2024-10-30T15:52:43Z

garak/harnesses/base.py

        if not detectors:
            msg = "No detectors, nothing to do"
-            logging.warning(msg)
+            logging.warning(f"harness: {msg}")


The added prefix seems like something we should be able to obtain from the log formatting rather than hardcoding it in.

the log formatting

this is undefined, isn't it? which is not the target state, but is the current state

The logging package always enriches records with data; we simply need to expose that via an injected format string. What I am suggesting is that this should not be hardcoded into the message for that reason.
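As an illustration of the format-string approach, the standard library's logging.Formatter can prepend the logger name to every message. This sketch assumes a named logger per module (the name "garak.harnesses.base" is used here only as an example); the diff above calls logging.warning on the root logger, for which %(name)s would render as "root", so %(module)s may be the closer fit there.

```python
import io
import logging

# Capture output in memory so the formatted line can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("garak.harnesses.base")  # assumed logger name
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root handlers

# The message itself carries no hardcoded "harness:" prefix
logger.warning("No detectors, nothing to do")

print(stream.getvalue(), end="")
```

The context comes from the record's own attributes, so the message text stays the same wherever it is emitted.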

Would you be OK leaving this til we revamp logging? The message without context is more awkward to decipher

garak/probes/av_spam_scanning.py

jmartin-tech · 2024-10-30T16:03:00Z

garak/probes/test.py

@@ -12,12 +12,15 @@ class Blank(Probe):
    Poses a blank prompt to the model"""

    bcp47 = "*"
-    active = False  # usually for testing
+    active = True


Should this really be exposed as active? If the tag were added to include it in policy-related scans, would that be sufficient to activate it based on the run config?

this probe is modified to be an active policy probe, with policy_probe asserted. main run probe selection now skips policy probes.

ok, I can see this is getting activated now, in a default scan with experimental features off. will find a resolution. the desired functionality is:

test.Blank prompts are not posed by default in a normal scan with experimental features off

test.Blank prompts are not posed by default in a normal scan with policy scan included

test.Blank prompts are posed by default in a policy scan
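The selection behaviour described above can be sketched as a filter over class attributes. This is a simplified stand-in and not the _plugins.enumerate_plugins() implementation: the select helper and the InjectSketch and Disabled classes are hypothetical, and only the active and policy_probe attribute names come from this PR.

```python
class Probe:
    active = True
    policy_probe = False


class Blank(Probe):
    policy_probe = True  # non-adversarial; informs the policy


class InjectSketch(Probe):
    pass  # stands in for an ordinary adversarial probe


class Disabled(Probe):
    active = False  # never selected by default


def select(classes, **required):
    """Keep classes whose attributes equal every key/value in `required`."""
    return [
        c.__name__
        for c in classes
        if all(getattr(c, k, None) == v for k, v in required.items())
    ]


all_probes = [Blank, InjectSketch, Disabled]
main_run = select(all_probes, active=True, policy_probe=False)
policy_run = select(all_probes, active=True, policy_probe=True)
```

With this split, Blank can remain active and still be excluded from the main run, because the main run additionally requires policy_probe to be False.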

garak/policy.py

garak/data/policy/policy_typology.txt

garak/harnesses/base.py

Co-authored-by: Jeffrey Martin <jemartin@nvidia.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>

leondz added 30 commits October 2, 2024 14:32

add policy metadata

102f648

Merge branch 'main' into feature/policy

a44c335

re-org cli.py slightly; add cli hook for policy scans

f7da7d5

add policy probe flag to base probe

7c81725

add plugin filtering to enumerate_plugins

733bd87

add plugin enumeration + filter test

384fb53

ahem

a352818

add cli option to list policy probes, filter policy probes from stand…

4785340

…ard probe list

reorg garak.cli if blocks, pass generator to policy scan

1f4f95e

execute rudimentary policy scan

96586ad

probes.test.Blank is now a policy probe

05bfce4

harnesses now return iterator of evaluator results, providing a condu…

e2e210c

…it back to their caller

rm yield for now; rm announce_probe

7963a3e

update test.Blank probe to check policy

c67715f

add some harness logging; base harness now returns a generator over e…

ebe34eb

…val results

evaluators now return info, which is surfaced though harnesses.base.H…

71e568a

…arness, custom harness, and command.xxx_run()

write policy report to own file

bc03380

use raw regexp

2ba073e

don't return after first probewise probe harness call

b65e08e

consume scan result; put logging above policy report open

bc920f7

amend Chat policy point name

ccc6444

class for representing & handling policies

1ac841e

code for parsing policy scan results, building policy, and storing po…

650f576

…licy

log probewise harness completion

9400587

add policy thresholding

74ab6a1

add config block for policy

582e2ba

factor distribution of generation count to probes out of cli

bc7831a

add policy docs

13beea9

add non-exploit tag 'policy' for policy probe tagging

b9a7dc8

update config test to reflect new test.Blank detector

644061e

leondz added 3 commits October 23, 2024 15:03

move parent name to module; validate policy typologies at load; add f…

16f4d40

…unc for propagating permitted behaviours up instead of leaving parents None

add/tidy missing nodes

9317093

when inferring policy, propagate permitted behaviours up

ebcd7e9

leondz added the architecture Architectural upgrades label Oct 23, 2024

leondz added 2 commits October 24, 2024 11:07

add tests for policy functionality

b3f27d6

test for probe policy metadata

4c38c85

leondz marked this pull request as ready for review October 24, 2024 09:29

leondz requested review from jmartin-tech and erickgalinkin October 24, 2024 09:29

This was linked to issues Oct 24, 2024

Add pre-scan model output policy checks #893

Open

Map existing probes to policies #894

Open

add policy tests

4dd1b64

leondz removed a link to an issue Oct 24, 2024

Map existing probes to policies #894

Open

leondz changed the title ~~feature: policy scans~~ feature: policy scan base infrastructure Oct 29, 2024

erickgalinkin requested changes Oct 30, 2024

jmartin-tech reviewed Oct 30, 2024

leondz and others added 11 commits November 6, 2024 15:03

evaluators now yield EvalTuple not dict

27eaa5b

add policy module docstring, describe policy ID regex

9636f85

Merge branch 'main' into feature/policy

c397bab

explain policy config stanza

b01ddee

document _config.run.policy_scan

9b8a60b

Update garak/harnesses/base.py

7352472

Co-authored-by: Jeffrey Martin <jemartin@nvidia.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>

typo fix

61f0b37

Co-authored-by: Jeffrey Martin <jemartin@nvidia.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>

document typology in policy.rst

5d1981f

rm text version of policy - one is enough

b58a8b4

stop base harness run() and other harness run() from colliding

61e38ed

remove --generate_autodan

33bc89d

leondz marked this pull request as draft November 12, 2024 16:09

leondz changed the title ~~feature: policy scan base infrastructure~~ experimental feature: policy scan base infrastructure Nov 12, 2024

merge main

3966461
