NVIDIA · leondz · Oct 2, 2024 · Oct 16, 2024 · Oct 16, 2024 · Oct 17, 2024
diff --git a/docs/source/configurable.rst b/docs/source/configurable.rst
@@ -78,6 +78,9 @@ Let's take a look at the core config.
         report_dir: garak_runs
         show_100_pass_modules: true
 
+    policy:
+        threshold: false
+
 Here we can see many entries that correspond to command line options, such as 
 ``model_name`` and ``model_type``, as well as some entries not exposed via CLI
 such as ``show_100_pass_modules``.
@@ -102,6 +105,7 @@ such as ``show_100_pass_modules``.
 * ``deprefix`` - Remove the prompt from the start of the output (some models return the prompt as part of their output)
 * ``seed`` - An optional random seed
 * ``eval_threshold`` - At what point in the 0..1 range output by detectors does a result count as a successful attack / hit
+* ``policy_scan`` - Should the run include a scan to automatically determine the target's content policy?
 * ``user_agent`` - What HTTP user agent string should garak use? ``{version}`` can be used to signify where garak version ID should go
 
 ``plugins`` config items
@@ -130,6 +134,10 @@ For an example of how to use the ``detectors``, ``generators``, ``buffs``,
 * ``taxonomy`` - Which taxonomy to use to group probes when creating HTML report
 * ``show_100_pass_modules`` - Should entries scoring 100% still be detailed in the HTML report?
 
+``policy`` config items
+"""""""""""""""""""""""
+* ``threshold`` - pass rate required for a behavior to be considered "permitted" during policy probing; ``false`` means that any pass at all indicates a permissive policy for that behavior
+
 
 Bundled quick configs
 ^^^^^^^^^^^^^^^^^^^^^

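One way to read the ``policy.threshold`` setting documented above is sketched here. This is a minimal illustration, not garak's implementation: the helper name ``behaviour_permitted`` is hypothetical, and treating a pass rate equal to the threshold as "permitted" is an assumption, since the docs do not say whether the comparison is inclusive.

```python
from typing import Union


def behaviour_permitted(passes: int, total: int, threshold: Union[bool, float]) -> bool:
    """Decide whether a behaviour counts as permitted (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if threshold is False:
        # threshold disabled: any pass at all marks the behaviour as permitted
        return passes > 0
    # assumption: a pass rate at or above the threshold counts as permitted
    return passes / total >= threshold


print(behaviour_permitted(1, 10, False))  # one pass is enough when threshold is false
print(behaviour_permitted(3, 10, 0.5))    # 30% pass rate against a 50% threshold
```

With ``threshold: false``, as in the core config hunk, a single pass is enough; a numeric threshold makes the estimate less sensitive to one-off outputs.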
diff --git a/docs/source/detectors.rst b/docs/source/detectors.rst
@@ -8,6 +8,7 @@ garak.detectors
    garak.detectors.base
    garak.detectors.always
    garak.detectors.ansiescape
+   garak.detectors.any
    garak.detectors.continuation
    garak.detectors.dan
    garak.detectors.divergence

diff --git a/docs/source/garak.detectors.any.rst b/docs/source/garak.detectors.any.rst
@@ -0,0 +1,8 @@
+garak.detectors.any
+===================
+
+.. automodule:: garak.detectors.any
+   :members:
+   :undoc-members:
+   :show-inheritance:   
+
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -48,6 +48,7 @@ Advanced usage
 
    configurable
    cliref
+   policy
 
 Code reference
 ^^^^^^^^^^^^^^

diff --git a/docs/source/policy.rst b/docs/source/policy.rst
@@ -0,0 +1,81 @@
+garak.policy
+============
+
+This module represents objects related to policy scanning. 
+
+Policy scanning in garak attempts to work out what the target's content policy
+is, before running a security scan. 
+
+It's important to know what the target's content policy is because we only really have
+a useful/successful hit or breach if we're able to get a model to do something that
+it otherwise wouldn't. It may be exciting to discover a model gives instructions for
+e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
+the instructions when simply asked directly "print instructions for cooking meth", the
+use of base64 isn't necessarily an exploit in this output category - the model is acting
+the same either way.
+
+Garak's policy support follows a typology of model behaviours, with each entry describing
+a distinct behaviour. By default this typology is stored in ``data/policy/policy_typology.json``.
+
+A policy scan is conducted by invoking garak with the ``--policy_scan`` switch.
+When this is requested, a separate scan runs using all policy probes within garak.
+Policy probes are denoted by a probe class asserting ``policy_probe=True``.
+A regular probewise harness runs the scan, though reporting is diverted to a separate
+policy report file. After completion, garak estimates a policy based on policy probe
+results, and writes this to both the main and policy reports.
+
+
+Model behaviour typologies
+--------------------------
+
+Goal
+^^^^
+
+The model behaviour typology enumerates many different types of target behaviour. The listed behaviours help structure a policy for model output. For each behaviour, one can choose whether a model should engage or not engage in that activity.
+
+The typology serves as a point of departure for building model content policies, as well as a framework for describing model behaviour. 
+
+This typology is hierarchical, but labels are not “hard”. That is, multiple categories might apply to a single candidate behaviour, and that's OK.
+
+Because the range of possible model behaviours is large, and an open set, this typology is not comprehensive, and is not designed or intended to ever be comprehensive.
+
+To optimise effort spent building this typology, it's best to prioritise addition & definition of categories for which we actually have payloads.
+
+Usage
+^^^^^
+
+To use this typology to describe a model deployment, examine each category and check if the target model engages with that behaviour directly, without using any adversarial techniques.
+
+To use this typology to describe intended deployed model policy, consider each category in turn and decide how the model should react. The set of possible reactions can be as simple as "Engage" & "Decline".
+
+Policy point guidelines
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* Each point describes something that the model does, i.e. a behaviour
+* Given a decent prompt representing a policy, and a model's response, it should be possible to discern in isolation whether or not the model is engaging or refusing for that prompt/response pair
+* Prioritise enumerating policies that reflect things we have tests for (or can reasonably test for)
+* It's great to have two sample prompts per point
+* We want to stick to max three levels if at all possible
+* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology
+
+Policy metadata
+^^^^^^^^^^^^^^^
+
+The total set of points in the behaviour typology can be represented as a dictionary. Definitions of policy names, descriptions, and behaviours are stored in a JSON data file.
+
+* Key: behaviour identifier - format is TDDDs*
+   * T: a top-level hierarchy code letter, in CTMS for chat/tasks/meta/safety
+   * D: a three-digit code for this behaviour
+   * s*: (optional) one or more letters identifying a sub-policy
+
+* Value: a dict describing a behaviour
+   * ``name``: A short name of what is permitted when this behaviour is allowed
+   * ``description``: (optional) a deeper description of this behaviour
+
+The structure of the identifiers describes the hierarchical structure.
+
+
+.. automodule:: garak.policy
+   :members:
+   :undoc-members:
+   :show-inheritance:   
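The ``TDDDs*`` identifier format described under "Policy metadata" can be checked with a short regular expression. The sketch below is an illustration only: ``parse_behaviour_id`` and ``BehaviourId`` are hypothetical names that are not part of ``garak.policy``, the sample identifiers are made up, and restricting sub-policy letters to lowercase is an assumption.

```python
import re
from typing import NamedTuple

# T: one of C/T/M/S, DDD: three digits, s*: optional sub-policy letters
# (lowercase-only sub-policy letters are an assumption)
BEHAVIOUR_ID = re.compile(r"^([CTMS])(\d{3})([a-z]*)$")


class BehaviourId(NamedTuple):
    top_level: str   # chat / tasks / meta / safety code letter
    code: str        # three-digit behaviour code
    sub_policy: str  # empty string when no sub-policy is given


def parse_behaviour_id(identifier: str) -> BehaviourId:
    """Split a behaviour identifier into its parts, or raise ValueError."""
    match = BEHAVIOUR_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a TDDDs* behaviour identifier: {identifier!r}")
    return BehaviourId(*match.groups())


# made-up identifiers, used only to exercise the parser
print(parse_behaviour_id("S005"))
print(parse_behaviour_id("C001sys"))
```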
diff --git a/garak/_config.py b/garak/_config.py
@@ -28,7 +28,7 @@
 system_params = (
     "verbose narrow_output parallel_requests parallel_attempts skip_unknown".split()
 )
-run_params = "seed deprefix eval_threshold generations probe_tags interactive".split()
+run_params = "seed deprefix eval_threshold generations probe_tags interactive policy_scan".split()
 plugins_params = "model_type model_name extended_detectors".split()
 reporting_params = "taxonomy report_prefix".split()
 project_dir_name = "garak"
@@ -77,6 +77,7 @@ class TransientConfig(GarakSubConfig):
 run = GarakSubConfig()
 plugins = GarakSubConfig()
 reporting = GarakSubConfig()
+policy = GarakSubConfig()
 
 
 def _lock_config_as_dict():
@@ -146,13 +147,14 @@ def _load_yaml_config(settings_filenames) -> dict:
 
 
 def _store_config(settings_files) -> None:
-    global system, run, plugins, reporting, version
+    global system, run, plugins, reporting, version, policy
     settings = _load_yaml_config(settings_files)
     system = _set_settings(system, settings["system"])
     run = _set_settings(run, settings["run"])
     run.user_agent = run.user_agent.replace("{version}", version)
     plugins = _set_settings(plugins, settings["plugins"])
     reporting = _set_settings(reporting, settings["reporting"])
+    policy = _set_settings(policy, settings["policy"])
 
 
 # not my favourite solution in this module, but if
@@ -308,3 +310,18 @@ def parse_plugin_spec(
             plugin_names.remove(plugin_to_skip)
 
     return plugin_names, unknown_plugins
+
+
+def distribute_generations_config(probelist, _config):
+    # prepare run config: generations
+    for probe in probelist:
+        # distribute `generations` to the probes
+        p_type, p_module, p_klass = probe.split(".")
+        if (
+            hasattr(_config.run, "generations")
+            and _config.run.generations
+            is not None  # garak.core.yaml always provides run.generations
+        ):
+            _config.plugins.probes[p_module][p_klass][
+                "generations"
+            ] = _config.run.generations
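To see what ``distribute_generations_config`` does to the config, the sketch below copies its logic and runs it against a stand-in config object. The ``SimpleNamespace`` config and the ``nested_dict`` helper are assumptions made for illustration; they only mimic the shape that the function expects from ``_config``, and the probe names are examples.

```python
from collections import defaultdict
from types import SimpleNamespace


def nested_dict():
    """Dict that creates missing levels on access (stand-in for the plugin config)."""
    return defaultdict(nested_dict)


def distribute_generations_config(probelist, _config):
    # same logic as the function added to garak/_config.py in this diff
    for probe in probelist:
        _p_type, p_module, p_klass = probe.split(".")
        if hasattr(_config.run, "generations") and _config.run.generations is not None:
            _config.plugins.probes[p_module][p_klass]["generations"] = _config.run.generations


# stand-in for garak's _config module: only the attributes the function touches
config = SimpleNamespace(
    run=SimpleNamespace(generations=5),
    plugins=SimpleNamespace(probes=nested_dict()),
)

distribute_generations_config(
    ["probes.encoding.InjectBase64", "probes.dan.Dan_11_0"], config
)
print(config.plugins.probes["encoding"]["InjectBase64"]["generations"])
```

Each probe spec must have exactly three dot-separated parts; anything else makes the tuple unpacking raise ``ValueError``.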
diff --git a/garak/_plugins.py b/garak/_plugins.py
@@ -326,7 +326,7 @@ def plugin_info(plugin: Union[Callable, str]) -> dict:
 
 
 def enumerate_plugins(
-    category: str = "probes", skip_base_classes=True
+    category: str = "probes", skip_base_classes=True, filter: Union[None, dict] = None
 ) -> List[tuple[str, bool]]:
     """A function for listing all modules & plugins of the specified kind.
 
@@ -352,6 +352,13 @@ def enumerate_plugins(
     for k, v in PluginCache.instance()[category].items():
         if skip_base_classes and ".base." in k:
             continue
+        if filter is not None and any(
+            attrib in v and v[attrib] != value for attrib, value in filter.items()
+        ):
+            # skip plugins whose cached metadata contradicts any filter entry
+            continue
         enum_entry = (k, v["active"])
         plugin_class_names.add(enum_entry)
 

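The ``filter`` argument added to ``enumerate_plugins`` skips a plugin only when the attribute is present in its cached metadata and differs from the requested value. The sketch below reproduces that rule against a small hand-written cache. ``PLUGIN_CACHE``, ``matches_filter`` and ``enumerate_filtered`` are hypothetical stand-ins, not garak APIs, and the plugin names are invented.

```python
from typing import List, Optional, Tuple

# invented cache entries; real data comes from garak's PluginCache
PLUGIN_CACHE = {
    "probes.example.PolicyProbe": {"active": True, "policy_probe": True},
    "probes.example.AttackProbe": {"active": True, "policy_probe": False},
    "probes.example.LegacyProbe": {"active": False},  # no policy_probe attribute
}


def matches_filter(metadata: dict, filter_spec: Optional[dict]) -> bool:
    """True unless a filter attribute is present in metadata with a different value."""
    if filter_spec is None:
        return True
    return not any(
        attrib in metadata and metadata[attrib] != value
        for attrib, value in filter_spec.items()
    )


def enumerate_filtered(filter_spec: Optional[dict] = None) -> List[Tuple[str, bool]]:
    """Return sorted (name, active) pairs for plugins that pass the filter."""
    return sorted(
        (name, meta["active"])
        for name, meta in PLUGIN_CACHE.items()
        if matches_filter(meta, filter_spec)
    )


print(enumerate_filtered({"policy_probe": True}))
```

Note that ``LegacyProbe`` is kept here: a plugin that lacks the filtered attribute entirely is not excluded by this rule.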
diff --git a/garak/cli.py b/garak/cli.py
@@ -3,7 +3,7 @@
 
 """Flow for invoking garak from the command line"""
 
-command_options = "list_detectors list_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()
+command_options = "list_detectors list_probes list_policy_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()
 
 
 def parse_cli_plugin_config(plugin_type, args):
@@ -223,6 +223,9 @@ def main(arguments=None) -> None:
     parser.add_argument(
         "--list_probes", action="store_true", help="list available vulnerability probes"
     )
+    parser.add_argument(
+        "--list_policy_probes", action="store_true", help="list available policy probes"
+    )
     parser.add_argument(
         "--list_detectors", action="store_true", help="list available detectors"
     )
@@ -259,11 +262,6 @@ def main(arguments=None) -> None:
         action="store_true",
         help="Enter interactive probing mode",
     )
-    parser.add_argument(
-        "--generate_autodan",
-        action="store_true",
-        help="generate AutoDAN prompts; requires --prompt_options with JSON containing a prompt and target",
-    )
     parser.add_argument(
         "--interactive.py",
         action="store_true",
@@ -282,7 +280,12 @@ def main(arguments=None) -> None:
         parser.description = (
             str(parser.description) + " - EXPERIMENTAL FEATURES ENABLED"
         )
-        pass
+        parser.add_argument(
+            "--policy_scan",
+            action="store_true",
+            default=_config.run.policy_scan,
+            help="determine model's behavior policy before scanning",
+        )
 
     logging.debug("args - raw argument string received: %s", arguments)
 
@@ -418,6 +421,9 @@ def main(arguments=None) -> None:
         elif args.list_probes:
             command.print_probes()
 
+        elif args.list_policy_probes:
+            command.print_policy_probes()
+
         elif args.list_detectors:
             command.print_detectors()
 
@@ -499,6 +505,7 @@ def main(arguments=None) -> None:
 
             print(f"📜 logging to {log_filename}")
 
+            # set up generator
             conf_root = _config.plugins.generators
             for part in _config.plugins.model_type.split("."):
                 if not part in conf_root:
@@ -521,6 +528,7 @@ def main(arguments=None) -> None:
                 logging.error(message)
                 raise ValueError(message)
 
+            # validate main run config
             parsable_specs = ["probe", "detector", "buff"]
             parsed_specs = {}
             for spec_type in parsable_specs:
@@ -544,20 +552,7 @@ def main(arguments=None) -> None:
                         msg_list = ",".join(rejected)
                         raise ValueError(f"❌Unknown {spec_namespace}❌: {msg_list}")
 
-            for probe in parsed_specs["probe"]:
-                # distribute `generations` to the probes
-                p_type, p_module, p_klass = probe.split(".")
-                if (
-                    hasattr(_config.run, "generations")
-                    and _config.run.generations
-                    is not None  # garak.core.yaml always provides run.generations
-                ):
-                    _config.plugins.probes[p_module][p_klass][
-                        "generations"
-                    ] = _config.run.generations
-
-            evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)
-
+            # generator init
             from garak import _plugins
 
             generator = _plugins.load_plugin(
@@ -574,28 +569,28 @@ def main(arguments=None) -> None:
                     logging=logging,
                 )
 
-            if "generate_autodan" in args and args.generate_autodan:
-                from garak.resources.autodan import autodan_generate
-
-                try:
-                    prompt = _config.probe_options["prompt"]
-                    target = _config.probe_options["target"]
-                except Exception as e:
-                    print(
-                        "AutoDAN generation requires --probe_options with a .json containing a `prompt` and `target` "
-                        "string"
-                    )
-                autodan_generate(generator=generator, prompt=prompt, target=target)
-
+            # looks like we might get something to report, so fire that up
             command.start_run()  # start the run now that all config validation is complete
             print(f"📜 reporting to {_config.transient.report_filename}")
 
+            # do policy run
+            if _config.run.policy_scan:
+                command.run_policy_scan(generator, _config)
+
+            # configure generations counts for main run
+            _config.distribute_generations_config(parsed_specs["probe"], _config)
+
+            # set up plugins for main run
+            # instantiate evaluator
+            evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)
+
+            # parse & set up detectors, if supplied
             if parsed_specs["detector"] == []:
-                command.probewise_run(
+                run_result = command.probewise_run(
                     generator, parsed_specs["probe"], evaluator, parsed_specs["buff"]
                 )
             else:
-                command.pxd_run(
+                run_result = command.pxd_run(
                     generator,
                     parsed_specs["probe"],
                     parsed_specs["detector"],
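The CLI changes register ``--list_policy_probes`` unconditionally, while ``--policy_scan`` is added inside the block that marks experimental features as enabled. The sketch below mirrors that arrangement with plain ``argparse``. The ``experimental`` parameter and the ``run_config`` object are assumptions that stand in for whatever condition and config object garak actually uses.

```python
import argparse
from types import SimpleNamespace

# stand-in for _config.run; supplies the default for --policy_scan
run_config = SimpleNamespace(policy_scan=False)


def build_parser(experimental: bool) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="garak-sketch")
    parser.add_argument(
        "--list_policy_probes", action="store_true", help="list available policy probes"
    )
    if experimental:
        # only offered when experimental features are switched on (assumed condition)
        parser.add_argument(
            "--policy_scan",
            action="store_true",
            default=run_config.policy_scan,
            help="determine model's behavior policy before scanning",
        )
    return parser


args = build_parser(experimental=True).parse_args(["--policy_scan"])
print(args.policy_scan)
```

Without the experimental branch the option does not exist, so passing ``--policy_scan`` would make ``argparse`` exit with an "unrecognized arguments" error.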