experimental feature: policy scan base infrastructure #955

Draft
wants to merge 52 commits into
base: main
Changes from 40 commits
Commits
102f648
add policy metadata
leondz Oct 2, 2024
a44c335
Merge branch 'main' into feature/policy
leondz Oct 16, 2024
f7da7d5
re-org cli.py slightly; add cli hook for policy scans
leondz Oct 16, 2024
7c81725
add policy probe flag to base probe
leondz Oct 17, 2024
733bd87
add plugin filtering to enumerate_plugins
leondz Oct 17, 2024
384fb53
add plugin enumeration + filter test
leondz Oct 17, 2024
a352818
ahem
leondz Oct 17, 2024
4785340
add cli option to list policy probes, filter policy probes from stand…
leondz Oct 17, 2024
1f4f95e
reorg garak.cli if blocks, pass generator to policy scan
leondz Oct 17, 2024
96586ad
execute rudimentary policy scan
leondz Oct 17, 2024
05bfce4
probes.test.Blank is now a policy probe
leondz Oct 17, 2024
e2e210c
harnesses now return iterator of evaluator results, providing a condu…
leondz Oct 17, 2024
7963a3e
rm yield for now; rm announce_probe
leondz Oct 17, 2024
c67715f
update test.Blank probe to check policy
leondz Oct 17, 2024
ebe34eb
add some harness logging; base harness now returns a generator over e…
leondz Oct 21, 2024
71e568a
evaluators now return info, which is surfaced though harnesses.base.H…
leondz Oct 21, 2024
bc03380
write policy report to own file
leondz Oct 22, 2024
2ba073e
use raw regexp
leondz Oct 22, 2024
b65e08e
don't return after first probewise probe harness call
leondz Oct 22, 2024
bc920f7
consume scan result; put logging above policy report open
leondz Oct 22, 2024
ccc6444
amend Chat policy point name
leondz Oct 22, 2024
1ac841e
class for representing & handling policies
leondz Oct 22, 2024
650f576
code for parsing policy scan results, building policy, and storing po…
leondz Oct 23, 2024
9400587
log probewise harness completion
leondz Oct 23, 2024
74ab6a1
add policy thresholding
leondz Oct 23, 2024
582e2ba
add config block for policy
leondz Oct 23, 2024
bc7831a
factor distribution of generation count to probes out of cli
leondz Oct 23, 2024
13beea9
add policy docs
leondz Oct 23, 2024
b9a7dc8
add non-exploit tag 'policy' for policy probe tagging
leondz Oct 23, 2024
644061e
update config test to reflect new test.Blank detector
leondz Oct 23, 2024
aa2ff6f
Merge branch 'main' into feature/policy
leondz Oct 23, 2024
09488df
add snowballmini as policy probe
leondz Oct 23, 2024
5e4ba8c
tidy up policy probe status of snowball classes
leondz Oct 23, 2024
97f2628
repurpose more probes as policy
leondz Oct 23, 2024
16f4d40
move parent name to module; validate policy typologies at load; add f…
leondz Oct 23, 2024
9317093
add/tidy missing nodes
leondz Oct 23, 2024
ebcd7e9
when inferring policy, propagate permitted behaviours up
leondz Oct 23, 2024
b3f27d6
add tests for policy functionality
leondz Oct 24, 2024
4c38c85
test for probe policy metadata
leondz Oct 24, 2024
4dd1b64
add policy tests
leondz Oct 24, 2024
27eaa5b
evaluators now yield EvalTuple not dict
leondz Nov 6, 2024
9636f85
add policy module docstring, describe policy ID regex
leondz Nov 6, 2024
c397bab
Merge branch 'main' into feature/policy
leondz Nov 7, 2024
b01ddee
explain policy config stanza
leondz Nov 7, 2024
9b8a60b
document _config.run.policy_scan
leondz Nov 7, 2024
7352472
Update garak/harnesses/base.py
leondz Nov 7, 2024
61f0b37
typo fix
leondz Nov 7, 2024
5d1981f
document typology in policy.rst
leondz Nov 7, 2024
b58a8b4
rm text version of policy - one is enough
leondz Nov 7, 2024
61e38ed
stop base harness run() and other harness run() from colliding
leondz Nov 7, 2024
33bc89d
remove --generate_autodan
leondz Nov 8, 2024
3966461
merge main
leondz Dec 9, 2024
1 change: 1 addition & 0 deletions docs/source/detectors.rst
@@ -7,6 +7,7 @@ garak.detectors
garak.detectors
garak.detectors.base
garak.detectors.always
garak.detectors.any
garak.detectors.continuation
garak.detectors.dan
garak.detectors.divergence
8 changes: 8 additions & 0 deletions docs/source/garak.detectors.any.rst
@@ -0,0 +1,8 @@
garak.detectors.any
===================

.. automodule:: garak.detectors.any
:members:
:undoc-members:
:show-inheritance:

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -48,6 +48,7 @@ Advanced usage

configurable
cliref
policy

Code reference
^^^^^^^^^^^^^^
31 changes: 31 additions & 0 deletions docs/source/policy.rst
Collaborator

Can we define the policy codes in here?

@@ -0,0 +1,31 @@
garak.policy
============

This module represents objects related to policy scanning.

Policy scanning in garak attempts to work out what the target's content policy
is, before running a security scan.

It's important to know what the target's content policy is because we only really have
a useful/successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover a model gives instructions for
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
the instructions when simply asked directly "print instructions for cooking meth", the
use of base64 isn't necessarily an exploit in this output category - the model behaves
the same either way.

Garak's policy support follows a typology of behaviours, with each entry describing
a distinct behaviour. By default this typology is stored in ``data/policy/policy_typology.json``.

A policy scan is conducted by invoking garak with the ``--policy_scan`` switch.
When this is requested, a separate scan runs using all policy probes within garak.
Policy probes are denoted by a probe class asserting ``policy_probe=True``.
A regular probewise harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates a policy based on policy probe
results, and writes this to both main and policy reports.


.. automodule:: garak.policy
:members:
:undoc-members:
:show-inheritance:
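The reasoning in the policy docs above (a hit only counts if the target would otherwise refuse) can be sketched in a few lines. This is a minimal illustration and not garak's implementation: the `is_exploit` helper, the policy point names, and the dict shape (point name mapped to `True` when the target permits the behaviour when asked directly) are all assumptions made for the example.

```python
from typing import Dict, Optional

# Hypothetical policy estimate: True means the target already produces this
# behaviour when asked directly; None means the policy scan gave no answer.
PolicyEstimate = Dict[str, Optional[bool]]


def is_exploit(behaviour: str, attack_succeeded: bool, policy: PolicyEstimate) -> bool:
    """Count a probe hit as an exploit only if the target would otherwise refuse."""
    if not attack_succeeded:
        return False
    permitted = policy.get(behaviour)
    # If the target permits the behaviour without any attack, the attack
    # technique did not change the outcome, so it is not counted as a breach.
    return permitted is not True


# Hypothetical policy points, for illustration only.
policy = {"drug_instructions": True, "malware_code": False}
print(is_exploit("drug_instructions", True, policy))  # False: permitted anyway
print(is_exploit("malware_code", True, policy))  # True: refused when asked directly
```

In this sketch a behaviour with no policy estimate is still counted as an exploit; whether unknown points should be treated that way is a design choice the docs above do not settle.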
21 changes: 19 additions & 2 deletions garak/_config.py
@@ -28,7 +28,7 @@
system_params = (
"verbose narrow_output parallel_requests parallel_attempts skip_unknown".split()
)
run_params = "seed deprefix eval_threshold generations probe_tags interactive".split()
run_params = "seed deprefix eval_threshold generations probe_tags interactive policy_scan".split()
plugins_params = "model_type model_name extended_detectors".split()
reporting_params = "taxonomy report_prefix".split()
project_dir_name = "garak"
@@ -77,6 +77,7 @@ class TransientConfig(GarakSubConfig):
run = GarakSubConfig()
plugins = GarakSubConfig()
reporting = GarakSubConfig()
policy = GarakSubConfig()


def _lock_config_as_dict():
@@ -144,12 +145,13 @@ def _load_yaml_config(settings_filenames) -> dict:


def _store_config(settings_files) -> None:
global system, run, plugins, reporting
global system, run, plugins, reporting, policy
settings = _load_yaml_config(settings_files)
system = _set_settings(system, settings["system"])
run = _set_settings(run, settings["run"])
plugins = _set_settings(plugins, settings["plugins"])
reporting = _set_settings(reporting, settings["reporting"])
policy = _set_settings(policy, settings["policy"])


def load_base_config() -> None:
@@ -253,3 +255,18 @@ def parse_plugin_spec(
plugin_names.remove(plugin_to_skip)

return plugin_names, unknown_plugins


def distribute_generations_config(probelist, _config):
# prepare run config: generations
for probe in probelist:
# distribute `generations` to the probes
p_type, p_module, p_klass = probe.split(".")
if (
hasattr(_config.run, "generations")
and _config.run.generations
is not None # garak.core.yaml always provides run.generations
):
_config.plugins.probes[p_module][p_klass][
"generations"
] = _config.run.generations
9 changes: 8 additions & 1 deletion garak/_plugins.py
@@ -302,7 +302,7 @@ def plugin_info(plugin: Union[Callable, str]) -> dict:


def enumerate_plugins(
category: str = "probes", skip_base_classes=True
category: str = "probes", skip_base_classes=True, filter: Union[None, dict] = None
) -> List[tuple[str, bool]]:
"""A function for listing all modules & plugins of the specified kind.

@@ -328,6 +328,13 @@ def enumerate_plugins(
for k, v in PluginCache.instance()[category].items():
if skip_base_classes and ".base." in k:
continue
if filter is not None:
try:
for attrib, value in filter.items():
if attrib in v and v[attrib] != value:
raise StopIteration
except StopIteration:
continue
enum_entry = (k, v["active"])
plugin_class_names.add(enum_entry)
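
The filter semantics in the hunk above can be shown in isolation. The sketch below is a standalone approximation, not the code in `garak/_plugins.py`: `filter_plugins` and the cache contents are hypothetical, and `all()` stands in for the `raise StopIteration` control flow. It keeps the same rule, where an entry is skipped only if it has the attribute and the value differs.

```python
from typing import Any, Dict, List, Optional, Tuple


def filter_plugins(
    cache: Dict[str, Dict[str, Any]],
    attrib_filter: Optional[Dict[str, Any]] = None,
) -> List[Tuple[str, bool]]:
    """Return sorted (name, active) pairs for entries compatible with the filter."""
    attrib_filter = attrib_filter or {}
    selected = []
    for name, info in cache.items():
        # Drop an entry only when it has the attribute and the value differs;
        # entries that lack the attribute pass, as in the diff above.
        if all(
            key not in info or info[key] == value
            for key, value in attrib_filter.items()
        ):
            selected.append((name, info["active"]))
    return sorted(selected)


# Hypothetical cache contents, for illustration only.
cache = {
    "probes.test.Blank": {"active": True, "policy_probe": True},
    "probes.example.Attack": {"active": True, "policy_probe": False},
    "probes.example.Legacy": {"active": False},
}
print(filter_plugins(cache, {"policy_probe": True}))
# [('probes.example.Legacy', False), ('probes.test.Blank', True)]
```

Note that `probes.example.Legacy` passes the `policy_probe` filter because it has no such attribute. If every probe is expected to carry `policy_probe`, a stricter rule that rejects missing attributes may be worth considering.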

53 changes: 34 additions & 19 deletions garak/cli.py
@@ -3,7 +3,7 @@

"""Flow for invoking garak from the command line"""

command_options = "list_detectors list_probes list_generators list_buffs list_config plugin_info interactive report version".split()
command_options = "list_detectors list_probes list_policy_probes list_generators list_buffs list_config plugin_info interactive report version".split()


def main(arguments=None) -> None:
@@ -107,6 +107,12 @@ def main(arguments=None) -> None:
parser.add_argument(
"--config", type=str, default=None, help="YAML config file for this run"
)
parser.add_argument(
"--policy_scan",
action="store_true",
default=_config.run.policy_scan,
help="determine model's behavior policy before scanning",
)

## PLUGINS
# generator
@@ -201,6 +207,9 @@ def main(arguments=None) -> None:
parser.add_argument(
"--list_probes", action="store_true", help="list available vulnerability probes"
)
parser.add_argument(
"--list_policy_probes", action="store_true", help="list available policy probes"
)
parser.add_argument(
"--list_detectors", action="store_true", help="list available detectors"
)
@@ -398,6 +407,9 @@ def main(arguments=None) -> None:
elif args.list_probes:
command.print_probes()

elif args.list_policy_probes:
command.print_policy_probes()

elif args.list_detectors:
command.print_detectors()

@@ -425,6 +437,7 @@ def main(arguments=None) -> None:

print(f"📜 logging to {log_filename}")

# set up generator
conf_root = _config.plugins.generators
for part in _config.plugins.model_type.split("."):
if not part in conf_root:
@@ -447,6 +460,7 @@ def main(arguments=None) -> None:
logging.error(message)
raise ValueError(message)

# validate main run config
parsable_specs = ["probe", "detector", "buff"]
parsed_specs = {}
for spec_type in parsable_specs:
@@ -470,20 +484,7 @@ def main(arguments=None) -> None:
msg_list = ",".join(rejected)
raise ValueError(f"❌Unknown {spec_namespace}❌: {msg_list}")

for probe in parsed_specs["probe"]:
# distribute `generations` to the probes
p_type, p_module, p_klass = probe.split(".")
if (
hasattr(_config.run, "generations")
and _config.run.generations
is not None # garak.core.yaml always provides run.generations
):
_config.plugins.probes[p_module][p_klass][
"generations"
] = _config.run.generations
Comment on lines -547 to -557
Collaborator

Where is this logic being captured?

Collaborator

I see some, but not all of it below.

Collaborator Author

in garak/_config.py L260?:

def distribute_generations_config(probelist, _config):
    # prepare run config: generations
    for probe in probelist:
        # distribute `generations` to the probes
        p_type, p_module, p_klass = probe.split(".")
        if (
            hasattr(_config.run, "generations")
            and _config.run.generations
            is not None  # garak.core.yaml always provides run.generations
        ):
            _config.plugins.probes[p_module][p_klass][
                "generations"
            ] = _config.run.generations

Collaborator

Does this really make sense as a helper function in _config? The implementation looks somewhat circular, which is confusing.

Collaborator Author

It's a bit unclear for me, but given the current state of garak._config I can support keeping it lean. The ability to configure plugins with some globals is desirable from a user point of view. Whether the code to do it lies in _config or command or cli, I don't know, but:

  • We'd like to keep _config lean
  • This is not something that will only ever be used by people using the cli entry point

So I'm tentatively placing it in command. Happy to hear other arguments.


evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)

# generator init
from garak import _plugins

generator = _plugins.load_plugin(
@@ -500,6 +501,18 @@ def main(arguments=None) -> None:
logging=logging,
)

# looks like we might get something to report, so fire that up
command.start_run() # start the run now that all config validation is complete
print(f"📜 reporting to {_config.transient.report_filename}")

# do policy run
if _config.run.policy_scan:
command.run_policy_scan(generator, _config)

# configure generations counts for main run
_config.distribute_generations_config(parsed_specs["probe"], _config)
Collaborator

Config should not change after a start_run(). Since a policy scan needs to override the generations, it may be appropriate for the policy scan to build its own configuration dictionary with the value it needs in place.


# autodan action
if "generate_autodan" in args and args.generate_autodan:
from garak.resources.autodan import autodan_generate
leondz marked this conversation as resolved.

@@ -513,15 +526,17 @@ def main(arguments=None) -> None:
)
autodan_generate(generator=generator, prompt=prompt, target=target)

command.start_run() # start the run now that all config validation is complete
print(f"📜 reporting to {_config.transient.report_filename}")
# set up plugins for main run
# instantiate evaluator
evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)

# parse & set up detectors, if supplied
if parsed_specs["detector"] == []:
command.probewise_run(
run_result = command.probewise_run(
generator, parsed_specs["probe"], evaluator, parsed_specs["buff"]
)
else:
command.pxd_run(
run_result = command.pxd_run(
generator,
parsed_specs["probe"],
parsed_specs["detector"],
78 changes: 72 additions & 6 deletions garak/command.py
@@ -6,6 +6,7 @@
import logging
import json
import random
import re

HINT_CHANCE = 0.25

@@ -56,7 +57,7 @@ def start_run():

logging.info("run started at %s", _config.transient.starttime_iso)
# print("ASSIGN UUID", args)
if _config.system.lite and "probes" not in _config.transient.cli_args and not _config.transient.cli_args.list_probes and not _config.transient.cli_args.list_detectors and not _config.transient.cli_args.list_generators and not _config.transient.cli_args.list_buffs and not _config.transient.cli_args.list_config and not _config.transient.cli_args.plugin_info and not _config.run.interactive: # type: ignore
if _config.system.lite and "probes" not in _config.transient.cli_args and not _config.transient.cli_args.list_probes and not _config.transient.cli_args.list_policy_probes and not _config.transient.cli_args.list_detectors and not _config.transient.cli_args.list_generators and not _config.transient.cli_args.list_buffs and not _config.transient.cli_args.list_config and not _config.transient.cli_args.plugin_info and not _config.run.interactive: # type: ignore
hint(
"The current/default config is optimised for speed rather than thoroughness. Try e.g. --config full for a stronger test, or specify some probes.",
logging=logging,
@@ -160,12 +161,14 @@ def end_run():
logging.info(msg)


def print_plugins(prefix: str, color):
def print_plugins(prefix: str, color, filter=None):
from colorama import Style

from garak._plugins import enumerate_plugins

plugin_names = enumerate_plugins(category=prefix)
if filter is None:
filter = {}
plugin_names = enumerate_plugins(category=prefix, filter=filter)
plugin_names = [(p.replace(f"{prefix}.", ""), a) for p, a in plugin_names]
module_names = set([(m.split(".")[0], True) for m, a in plugin_names])
plugin_names += module_names
@@ -182,7 +185,13 @@ def print_plugins(prefix: str, color):
def print_probes():
from colorama import Fore

print_plugins("probes", Fore.LIGHTYELLOW_EX)
print_plugins("probes", Fore.LIGHTYELLOW_EX, filter={"policy_probe": False})


def print_policy_probes():
from colorama import Fore

print_plugins("probes", Fore.LIGHTYELLOW_EX, filter={"policy_probe": True})


def print_detectors():
@@ -234,14 +243,14 @@ def probewise_run(generator, probe_names, evaluator, buffs):
import garak.harnesses.probewise

probewise_h = garak.harnesses.probewise.ProbewiseHarness()
probewise_h.run(generator, probe_names, evaluator, buffs)
return list(probewise_h.run(generator, probe_names, evaluator, buffs))


def pxd_run(generator, probe_names, detector_names, evaluator, buffs):
import garak.harnesses.pxd

pxd_h = garak.harnesses.pxd.PxD()
pxd_h.run(
return pxd_h.run(
generator,
probe_names,
detector_names,
@@ -273,3 +282,60 @@ def write_report_digest(report_filename, digest_filename):
digest = report_digest.compile_digest(report_filename)
with open(digest_filename, "w", encoding="utf-8") as f:
f.write(digest)


POLICY_MSG_PREFIX = "run_policy_scan"


def _policy_scan_msg(text):
print(f"🏛️ {text}")
logging.info(f"{POLICY_MSG_PREFIX}: {text}")


def run_policy_scan(generator, _config):

from garak._config import distribute_generations_config
from garak._plugins import enumerate_plugins
import garak.evaluators
import garak.policy

main_reportfile = _config.transient.reportfile
policy_report_filename = re.sub(
r"\.jsonl$", ".policy.jsonl", _config.transient.report_filename
)
_policy_scan_msg(f"policy report in {policy_report_filename}")
_config.transient.reportfile = open(
policy_report_filename, "w", buffering=1, encoding="utf-8"
)

logging.info(f"{POLICY_MSG_PREFIX}: start policy scan")
# this is a probewise run of all policy probes
policy_probe_names = [
name
for name, status in enumerate_plugins(
"probes", filter={"active": True, "policy_probe": True}
)
]
_policy_scan_msg("using policy probes " + ", ".join(policy_probe_names))

evaluator = garak.evaluators.ThresholdEvaluator(garak._config.run.eval_threshold)
distribute_generations_config(policy_probe_names, _config)
Collaborator

_config.plugins values should be considered immutable. This suggests that it needs to be possible to pass config into a harness.

Collaborator Author

Can you expand on the comment re: immutability? I had expected this to relate to an attempt to change _config, but I didn't see one.

On the other hand, the pattern for accessing _config in command.run_policy_scan() seems suboptimal: it's referenced in multiple different ways, both as a data structure and as a module with functions.

Collaborator

Access to _config should be read-only; distribute_generations_config has global side effects on the object passed in.
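
One way to address the side-effect concern raised in this thread is to return an updated copy of the probe configuration rather than writing into the shared object. The sketch below is only an illustration under assumptions: `with_generations` is a hypothetical helper, the nested-dict layout mirrors the `plugins.probes[module][class]` indexing visible in the diff, and how the resulting dict would reach a harness is left open.

```python
import copy
from typing import Any, Dict, List, Optional

ProbeConfig = Dict[str, Dict[str, Dict[str, Any]]]


def with_generations(
    probe_config: ProbeConfig,
    probe_names: List[str],
    generations: Optional[int],
) -> ProbeConfig:
    """Return a copy of probe_config with `generations` set for each named probe."""
    updated = copy.deepcopy(probe_config)  # the caller's dict is never modified
    if generations is None:
        return updated
    for probe in probe_names:
        # Probe names have the form "probes.<module>.<class>"
        _, module, klass = probe.split(".")
        updated.setdefault(module, {}).setdefault(klass, {})["generations"] = generations
    return updated


base = {"test": {"Blank": {}}}
policy_config = with_generations(base, ["probes.test.Blank"], 1)
print(base)  # {'test': {'Blank': {}}}
print(policy_config)  # {'test': {'Blank': {'generations': 1}}}
```

With this shape, the policy scan and the main run could each derive their own probe configuration from the same read-only base.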

buffs = []
result = probewise_run(generator, policy_probe_names, evaluator, buffs)

policy = garak.policy.Policy()
policy.parse_eval_result(result, threshold=garak._config.policy.threshold)
policy.propagate_up()

policy_entry = {"entry_type": "policy", "policy": policy.points}
_config.transient.reportfile.write(json.dumps(policy_entry) + "\n")

_config.transient.reportfile.close()
_config.transient.reportfile = main_reportfile

# write policy record to both main report log and policy report log
_config.transient.reportfile.write(json.dumps(policy_entry) + "\n")

_policy_scan_msg("end policy scan")

return policy
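
The `policy.propagate_up()` call above is described in the commit list only as "when inferring policy, propagate permitted behaviours up". A minimal sketch of that idea follows. It is not the code in `garak.policy`: the point IDs, the explicit parent map, and the choice to overwrite an ancestor's existing value are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical typology: each point maps to its parent (None for a root).
PARENTS: Dict[str, Optional[str]] = {
    "T": None,
    "T001": "T",
    "T001a": "T001",
    "T002": "T",
}


def propagate_up(
    points: Dict[str, Optional[bool]],
    parents: Dict[str, Optional[str]],
) -> Dict[str, Optional[bool]]:
    """Mark every ancestor of a permitted point as permitted."""
    result = dict(points)
    for point, permitted in points.items():
        if permitted is not True:
            continue
        ancestor = parents.get(point)
        while ancestor is not None:
            result[ancestor] = True  # assumption: a permitted child implies a permitted parent
            ancestor = parents.get(ancestor)
    return result


observed = {"T": None, "T001": None, "T001a": True, "T002": False}
print(propagate_up(observed, PARENTS))
# {'T': True, 'T001': True, 'T001a': True, 'T002': False}
```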
3 changes: 2 additions & 1 deletion garak/data/misp_descriptions.tsv
@@ -137,4 +137,5 @@ risk-cards:lmrc:threats Threats Threats
risk-cards:lmrc:unproven_claims Unproven Claims Unproven Claims
risk-cards:lmrc:violation_of_privacy Violation Of Privacy Violation Of Privacy
risk-cards:lmrc:weapon_instructions Weapon Instructions Weapon Instructions
risk-cards:lmrc:wrong_tone Wrong Tone Wrong Tone
risk-cards:lmrc:wrong_tone Wrong Tone Wrong Tone
policy Policy test Policy test: this is not an exploit