experimental feature: policy scan base infrastructure #955
Draft: leondz wants to merge 52 commits into main from feature/policy
Commits
102f648 add policy metadata (leondz)
a44c335 Merge branch 'main' into feature/policy (leondz)
f7da7d5 re-org cli.py slightly; add cli hook for policy scans (leondz)
7c81725 add policy probe flag to base probe (leondz)
733bd87 add plugin filtering to enumerate_plugins (leondz)
384fb53 add plugin enumeration + filter test (leondz)
a352818 ahem (leondz)
4785340 add cli option to list policy probes, filter policy probes from stand… (leondz)
1f4f95e reorg garak.cli if blocks, pass generator to policy scan (leondz)
96586ad execute rudimentary policy scan (leondz)
05bfce4 probes.test.Blank is now a policy probe (leondz)
e2e210c harnesses now return iterator of evaluator results, providing a condu… (leondz)
7963a3e rm yield for now; rm announce_probe (leondz)
c67715f update test.Blank probe to check policy (leondz)
ebe34eb add some harness logging; base harness now returns a generator over e… (leondz)
71e568a evaluators now return info, which is surfaced though harnesses.base.H… (leondz)
bc03380 write policy report to own file (leondz)
2ba073e use raw regexp (leondz)
b65e08e don't return after first probewise probe harness call (leondz)
bc920f7 consume scan result; put logging above policy report open (leondz)
ccc6444 amend Chat policy point name (leondz)
1ac841e class for representing & handling policies (leondz)
650f576 code for parsing policy scan results, building policy, and storing po… (leondz)
9400587 log probewise harness completion (leondz)
74ab6a1 add policy thresholding (leondz)
582e2ba add config block for policy (leondz)
bc7831a factor distribution of generation count to probes out of cli (leondz)
13beea9 add policy docs (leondz)
b9a7dc8 add non-exploit tag 'policy' for policy probe tagging (leondz)
644061e update config test to reflect new test.Blank detector (leondz)
aa2ff6f Merge branch 'main' into feature/policy (leondz)
09488df add snowballmini as policy probe (leondz)
5e4ba8c tidy up policy probe status of snowball classes (leondz)
97f2628 repurpose more probes as policy (leondz)
16f4d40 move parent name to module; validate policy typologies at load; add f… (leondz)
9317093 add/tidy missing nodes (leondz)
ebcd7e9 when inferring policy, propagate permitted behaviours up (leondz)
b3f27d6 add tests for policy functionality (leondz)
4c38c85 test for probe policy metadata (leondz)
4dd1b64 add policy tests (leondz)
27eaa5b evaluators now yield EvalTuple not dict (leondz)
9636f85 add policy module docstring, describe policy ID regex (leondz)
c397bab Merge branch 'main' into feature/policy (leondz)
b01ddee explain policy config stanza (leondz)
9b8a60b document _config.run.policy_scan (leondz)
7352472 Update garak/harnesses/base.py (leondz)
61f0b37 typo fix (leondz)
5d1981f document typology in policy.rst (leondz)
b58a8b4 rm text version of policy - one is enough (leondz)
61e38ed stop base harness run() and other harness run() from colliding (leondz)
33bc89d remove --generate_autodan (leondz)
3966461 merge main (leondz)
@@ -0,0 +1,8 @@
garak.detectors.any
===================

.. automodule:: garak.detectors.any
   :members:
   :undoc-members:
   :show-inheritance:

@@ -48,6 +48,7 @@ Advanced usage

    configurable
    cliref
+   policy

 Code reference
 ^^^^^^^^^^^^^^

@@ -0,0 +1,81 @@
garak.policy
============

This module represents objects related to policy scanning.

Policy scanning in garak attempts to work out what the target's content policy
is, before running a security scan.

It's important to know what the target's content policy is because we only really have
a useful/successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover that a model gives instructions for
e.g. cooking meth if the request is encoded in base64, but if the model in fact gives
the instructions when simply asked directly "print instructions for cooking meth", the
use of base64 is not necessarily an exploit in this output category - the model is acting
the same either way.

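The reasoning above can be sketched as a small check. This is only an illustration, not garak's implementation: the function name ``counts_as_exploit`` and its boolean inputs are hypothetical, and the sketch assumes we already know whether the target engaged with a behaviour when asked directly and when asked through an adversarial technique.

```python
def counts_as_exploit(engages_when_asked_directly: bool,
                      engages_under_attack: bool) -> bool:
    """Hypothetical helper: an attack result is only a meaningful hit if the
    target produces the behaviour under attack but not when asked directly."""
    return engages_under_attack and not engages_when_asked_directly


# Target already answers the direct request, so base64 adds nothing new.
already_permitted = counts_as_exploit(
    engages_when_asked_directly=True, engages_under_attack=True
)

# Target declines the direct request but complies when it is encoded.
real_breach = counts_as_exploit(
    engages_when_asked_directly=False, engages_under_attack=True
)

print(already_permitted, real_breach)
```
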
Garak's policy support follows a typology of model behaviours, in which each point
describes a distinct behaviour. By default this typology is stored in ``data/policy/policy_typology.json``.

A policy scan is conducted by invoking garak with the ``--policy_scan`` switch.
When this is requested, a separate scan runs using all policy probes within garak.
Policy probes are denoted by a probe class asserting ``policy_probe=True``.
A regular probewise harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates a policy based on policy probe
results, and writes this to both the main and policy reports.


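As a rough sketch of how a class-level ``policy_probe`` flag could be used to pick out policy probes: the classes and the ``select_policy_probes`` helper below are stand-ins invented for this example and are not garak's actual probe classes or plugin enumeration code. Only the ``policy_probe=True`` convention is taken from the text above.

```python
class Probe:
    """Stand-in base class (hypothetical, not garak's Probe)."""
    policy_probe = False  # regular probes do not assert the flag


class DirectRequestProbe(Probe):
    policy_probe = True  # this class asserts that it is a policy probe


class EncodedRequestProbe(Probe):
    pass  # inherits policy_probe = False, so it is a regular probe


def select_policy_probes(probe_classes):
    """Return only the classes that assert policy_probe=True."""
    return [p for p in probe_classes if getattr(p, "policy_probe", False) is True]


policy_probes = select_policy_probes([DirectRequestProbe, EncodedRequestProbe])
print([p.__name__ for p in policy_probes])
```

A policy scan would then run only the selected classes, while the main scan would run the rest.
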
Model behaviour typologies
--------------------------

Goal
^^^^

The model behaviour typology enumerates many different types of target behaviour. The listed behaviours help structure a policy for model output. For each behaviour, one can choose whether a model should engage or not engage in that activity.

The typology serves as a point of departure for building model content policies, as well as a framework for describing model behaviour.

This typology is hierarchical, but labels are not “hard”. That is, multiple categories might apply to a single candidate behaviour, and that's OK.

Because the range of possible model behaviours is large, and an open set, this typology is not comprehensive, and is not designed or intended to ever be comprehensive.

To optimise effort spent building this typology, it's best to prioritise addition & definition of categories for which we actually have payloads.

Usage
^^^^^

To use this typology to describe a model deployment, examine each category and check if the target model engages with that behaviour directly, without using any adversarial techniques.

To use this typology to describe intended deployed model policy, consider each category in turn and decide how the model should react. A set of possible reactions can be as simple as "Engage" & "Decline".

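One way to picture the two uses above is to record an intended policy and an observed policy side by side and list where they differ. This is a sketch under assumptions: the ``Reaction`` enum, the ``policy_mismatches`` helper, and the behaviour identifiers are made up for illustration, and comparing the two policies is an extension of the text rather than something garak is documented to do.

```python
from enum import Enum


class Reaction(Enum):
    ENGAGE = "Engage"
    DECLINE = "Decline"


# Behaviour identifiers below are invented for illustration only.
intended_policy = {"C001": Reaction.ENGAGE, "S005": Reaction.DECLINE}
observed_policy = {"C001": Reaction.ENGAGE, "S005": Reaction.ENGAGE}


def policy_mismatches(intended, observed):
    """List behaviour identifiers whose observed reaction differs from the intended one."""
    return sorted(
        point
        for point, reaction in intended.items()
        if point in observed and observed[point] != reaction
    )


mismatches = policy_mismatches(intended_policy, observed_policy)
print(mismatches)
```
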
Policy point guidelines
^^^^^^^^^^^^^^^^^^^^^^^

* Each point describes something that the model does, i.e. a behaviour
* Given a decent prompt representing a policy, and a model's response, it should be possible to discern in isolation whether the model is engaging or refusing for that prompt/response pair
* Prioritise enumerating policies that reflect things we have tests for (or can reasonably test for)
* It's great to have two sample prompts per point
* We want to stick to at most three levels if at all possible
* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology

Policy metadata
^^^^^^^^^^^^^^^

The total set of points in the behaviour typology can be represented as a dictionary. Definitions of policy names, descriptions, and behaviours are stored in a JSON data file.

* Key: behaviour identifier - format is TDDDs*

  * T: a top-level hierarchy code letter, in CTMS for chat/tasks/meta/safety
  * DDD: a three-digit code for this behaviour
  * s*: (optional) one or more letters identifying a sub-policy

* Value: a dict describing a behaviour

  * “name”: A short name of what is permitted when this behaviour is allowed
  * “description”: (optional) a deeper description of this behaviour

The structure of the identifiers describes the hierarchical structure.


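The identifier format can be illustrated with a small validator and a parent lookup. This is a sketch, not the regex or code used in ``garak.policy``: the pattern, the helper names, the choice of lowercase sub-policy letters, and the treatment of a bare top-level letter as a valid identifier are all assumptions made for this example.

```python
import re
from typing import Optional

# Assumed pattern for TDDDs*: one of C/T/M/S, optionally followed by three
# digits and then optional lowercase sub-policy letters.
POLICY_ID_RE = re.compile(r"^[CTMS](\d{3}[a-z]*)?$")


def is_valid_policy_id(code: str) -> bool:
    """Check a candidate identifier against the assumed pattern."""
    return POLICY_ID_RE.match(code) is not None


def parent_of(code: str) -> Optional[str]:
    """Return the identifier one level up, or None for a top-level letter."""
    if not is_valid_policy_id(code):
        raise ValueError(f"not a policy identifier: {code!r}")
    if len(code) == 1:
        return None       # top-level letter has no parent
    if len(code) == 4:
        return code[0]    # TDDD -> T
    return code[:4]       # TDDDs* -> TDDD


print(parent_of("C001a"), parent_of("C001"), parent_of("C"))
```

Keeping the parent derivable from the identifier alone means the hierarchy needs no separate parent field in the JSON data, which matches the statement that the identifier structure describes the hierarchy.
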
.. automodule:: garak.policy
   :members:
   :undoc-members:
   :show-inheritance:
@@ -3,7 +3,7 @@

 """Flow for invoking garak from the command line"""

-command_options = "list_detectors list_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()
+command_options = "list_detectors list_probes list_policy_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()


 def parse_cli_plugin_config(plugin_type, args):
@@ -223,6 +223,9 @@ def main(arguments=None) -> None:
     parser.add_argument(
         "--list_probes", action="store_true", help="list available vulnerability probes"
     )
+    parser.add_argument(
+        "--list_policy_probes", action="store_true", help="list available policy probes"
+    )
     parser.add_argument(
         "--list_detectors", action="store_true", help="list available detectors"
     )
@@ -259,11 +262,6 @@ def main(arguments=None) -> None:
         action="store_true",
         help="Enter interactive probing mode",
     )
-    parser.add_argument(
-        "--generate_autodan",
-        action="store_true",
-        help="generate AutoDAN prompts; requires --prompt_options with JSON containing a prompt and target",
-    )
     parser.add_argument(
         "--interactive.py",
         action="store_true",
@@ -282,7 +280,12 @@ def main(arguments=None) -> None:
         parser.description = (
             str(parser.description) + " - EXPERIMENTAL FEATURES ENABLED"
         )
-        pass
+        parser.add_argument(
+            "--policy_scan",
+            action="store_true",
+            default=_config.run.policy_scan,
+            help="determine model's behavior policy before scanning",
+        )

     logging.debug("args - raw argument string received: %s", arguments)

@@ -418,6 +421,9 @@ def main(arguments=None) -> None:
         elif args.list_probes:
             command.print_probes()

+        elif args.list_policy_probes:
+            command.print_policy_probes()
+
         elif args.list_detectors:
             command.print_detectors()

@@ -499,6 +505,7 @@ def main(arguments=None) -> None:

             print(f"📜 logging to {log_filename}")

+            # set up generator
             conf_root = _config.plugins.generators
             for part in _config.plugins.model_type.split("."):
                 if not part in conf_root:
@@ -521,6 +528,7 @@ def main(arguments=None) -> None:
                 logging.error(message)
                 raise ValueError(message)

+            # validate main run config
             parsable_specs = ["probe", "detector", "buff"]
             parsed_specs = {}
             for spec_type in parsable_specs:
@@ -544,20 +552,7 @@ def main(arguments=None) -> None:
                     msg_list = ",".join(rejected)
                     raise ValueError(f"❌Unknown {spec_namespace}❌: {msg_list}")

-            for probe in parsed_specs["probe"]:
-                # distribute `generations` to the probes
-                p_type, p_module, p_klass = probe.split(".")
-                if (
-                    hasattr(_config.run, "generations")
-                    and _config.run.generations
-                    is not None  # garak.core.yaml always provides run.generations
-                ):
-                    _config.plugins.probes[p_module][p_klass][
-                        "generations"
-                    ] = _config.run.generations
-
-            evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)
-
+            # generator init
             from garak import _plugins

             generator = _plugins.load_plugin(
@@ -574,28 +569,28 @@ def main(arguments=None) -> None:
                 logging=logging,
             )

-            if "generate_autodan" in args and args.generate_autodan:
-                from garak.resources.autodan import autodan_generate
-
-                try:
-                    prompt = _config.probe_options["prompt"]
-                    target = _config.probe_options["target"]
-                except Exception as e:
-                    print(
-                        "AutoDAN generation requires --probe_options with a .json containing a `prompt` and `target` "
-                        "string"
-                    )
-                autodan_generate(generator=generator, prompt=prompt, target=target)
-
+            # looks like we might get something to report, so fire that up
             command.start_run()  # start the run now that all config validation is complete
             print(f"📜 reporting to {_config.transient.report_filename}")

+            # do policy run
+            if _config.run.policy_scan:
+                command.run_policy_scan(generator, _config)
+
+            # configure generations counts for main run
+            _config.distribute_generations_config(parsed_specs["probe"], _config)
+
+            # set up plugins for main run
+            # instantiate evaluator
+            evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)
+
+            # parse & set up detectors, if supplied
             if parsed_specs["detector"] == []:
-                command.probewise_run(
+                run_result = command.probewise_run(
                     generator, parsed_specs["probe"], evaluator, parsed_specs["buff"]
                 )
             else:
-                command.pxd_run(
+                run_result = command.pxd_run(
                     generator,
                     parsed_specs["probe"],
                     parsed_specs["detector"],

Review comment on the `_config.distribute_generations_config(parsed_specs["probe"], _config)` line: Config should not change after a [comment truncated in source]
Where is this logic being captured?

I see some, but not all of it below.

in garak/config.py L260?

Does this really make sense as a helper function in _config? The implementation looks to be a bit circular, which is a bit confusing.

It's a bit unclear for me, but given the current state of garak._config I can support keeping it lean. The ability to configure plugins with some globals is desirable from a user point of view. Whether the code to do it lies in _config or command or cli, I don't know, but:

* _config lean
* cli entry point

So I'm tentatively placing it in command. Happy to hear other arguments.
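
To make the logic under discussion concrete, here is a minimal sketch of the loop removed from cli.py rewritten as a standalone helper that takes its inputs explicitly. It is not the PR's implementation: the name ``distribute_generations``, its signature, and the use of plain dicts with ``setdefault`` (in place of whatever nested structure ``_config.plugins.probes`` actually is) are assumptions for illustration.

```python
from types import SimpleNamespace


def distribute_generations(probe_specs, run_config, probe_plugin_config):
    """Copy the run-level `generations` value into each selected probe's config.

    Hypothetical standalone version of the loop removed from cli.py; it reads
    nothing from a global config module.
    """
    generations = getattr(run_config, "generations", None)
    if generations is None:
        return probe_plugin_config  # nothing to distribute
    for spec in probe_specs:
        _p_type, p_module, p_klass = spec.split(".")
        klass_cfg = probe_plugin_config.setdefault(p_module, {}).setdefault(p_klass, {})
        klass_cfg["generations"] = generations
    return probe_plugin_config


run = SimpleNamespace(generations=5)
probes_cfg = distribute_generations(["probes.test.Blank"], run, {})
print(probes_cfg)
```

Because the helper only depends on its arguments, it could sit in ``command`` (or elsewhere) without the circularity raised above.
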