BabyBLUE (Benchmark for Reliability and JailBreak halLUcination Evaluation) is a novel benchmark designed to assess the susceptibility of large language models (LLMs) to hallucinations and jailbreak attempts. Unlike traditional benchmarks that may misinterpret hallucinated outputs as genuine security threats, BabyBLUE focuses on distinguishing actual harmful outputs from benign hallucinations.
- Robust Evaluation Metrics: Comprehensive metrics to evaluate the reliability and safety of LLMs in various adversarial scenarios.
- Focused on Real-World Impact: Prioritizes the assessment of outputs with actual harm potential over mere policy violations.
- Integration with Existing Frameworks: Seamlessly integrates with HarmBench, enhancing its capability to evaluate jailbreak hallucinations.
- Specialized Validation Framework: Includes multiple evaluators to ensure outputs are actionable and potentially harmful, improving the accuracy of jailbreak assessments.
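As a rough illustration of the validation idea described above, the sketch below chains several checks so that only coherent, actionable, and harmful completions count as true jailbreaks. It is a minimal sketch, not BabyBLUE's actual implementation: the function names, the toy heuristics, and the `harm_classifier` interface are all assumptions for illustration.

```python
# Minimal sketch of a multi-stage jailbreak validator. All names and
# heuristics below are illustrative; they are not BabyBLUE's actual API.
from typing import Callable


def looks_coherent(completion: str) -> bool:
    # Toy check: discard empty or degenerate generations (e.g. repeated tokens).
    words = completion.split()
    return len(words) > 10 and len(set(words)) > 5


def looks_actionable(completion: str) -> bool:
    # Toy check: a real evaluator would verify the instructions could actually
    # be carried out rather than being fabricated or incoherent detail.
    return any(marker in completion.lower() for marker in ("step", "first", "then"))


def is_true_jailbreak(completion: str, harm_classifier: Callable[[str], bool]) -> bool:
    """Count a completion as a real jailbreak only if every stage agrees,
    so benign hallucinations are filtered out rather than scored as attacks."""
    return (
        looks_coherent(completion)
        and looks_actionable(completion)
        and harm_classifier(completion)
    )
```

In practice each stage would be backed by a trained evaluator rather than a heuristic; the point is that a completion must pass every stage before it is scored as a genuine jailbreak.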
To evaluate the completions generated by your models, follow these steps:
- Prepare Results File:
  - Ensure that the completions result file is placed in the `results` directory. The file should contain the output from the models you wish to evaluate.
- Run Evaluation Script:
  - Execute the evaluation script by running the following command in your terminal:

    ```bash
    python evaluate.py
    ```

  - This script will process the results file and provide an evaluation based on the BabyBLUE benchmark metrics.
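As a pre-flight check before running the script, the sketch below loads a completions file and verifies that each entry has a generation. The file name (`results/completions.json`) and the HarmBench-style schema (`{behavior: [{"test_case": ..., "generation": ...}, ...]}`) are assumptions; adjust them to match the results file you actually produced.

```python
# Sanity-check a completions file before running `python evaluate.py`.
import json
from pathlib import Path

# Hypothetical file name; use whatever your completions file is actually called.
results_path = Path("results") / "completions.json"

with results_path.open() as f:
    # Assumed shape: {behavior: [{"test_case": ..., "generation": ...}, ...]}
    results = json.load(f)

for behavior, entries in results.items():
    for entry in entries:
        if "generation" not in entry:
            raise ValueError(f"missing 'generation' for behavior {behavior!r}")

print(f"{len(results)} behaviors, "
      f"{sum(len(v) for v in results.values())} completions ready for evaluation")
```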
To generate completions using the benchmark, follow these steps:
- Clone HarmBench Repository:
  - First, clone the HarmBench repository to your local machine:

    ```bash
    git clone https://github.com/centerforaisafety/HarmBench
    ```
- Setup Environment:
  - Copy all the project files from this repository into the HarmBench directory:

    ```bash
    cp -r * /path/to/HarmBench
    ```

  - Ensure that the files are correctly placed within the HarmBench directory structure.
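A quick way to confirm the copy landed where expected is a check like the one below; the path and the required-file list (`evaluate.py`, `results/`) are assumptions based on the steps in this README.

```python
# Verify that the BabyBLUE files were copied into the HarmBench checkout.
from pathlib import Path

# Replace with the location of your HarmBench checkout.
harmbench_dir = Path("/path/to/HarmBench")

# Files/directories this README expects to exist after the copy; extend as needed.
required = ["evaluate.py", "results"]

missing = [name for name in required if not (harmbench_dir / name).exists()]
if missing:
    print("Missing from the HarmBench directory:", ", ".join(missing))
else:
    print("BabyBLUE files found in", harmbench_dir)
```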
- Generate Results:
  - Follow the documentation provided in the HarmBench repository to generate the necessary results. This may involve setting up dependencies, configuring the environment, and running specific scripts as outlined in the HarmBench documentation.
- Run Pipeline on SLURM Cluster:
  - To generate and evaluate completions using a SLURM cluster, execute the following command:

    ```bash
    python ./scripts/run_pipeline.py --methods ZeroShot,PEZ,TAP --models baichuan2_7b,mistral_7b,llama2_70b --step 2_and_3 --mode slurm
    ```

  - This command will run the specified methods and models, generating completions and evaluating them according to the BabyBLUE benchmark.
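Once the pipeline has produced evaluation labels, a small summary script can aggregate them into an attack success rate. The sketch below assumes a hypothetical `results/evaluation_labels.json` mapping each behavior to a list of 0/1 labels; the actual output files and schema are determined by HarmBench and this project's evaluators, so treat this only as a template.

```python
# Aggregate per-completion jailbreak labels into an attack success rate (ASR).
import json
from pathlib import Path

# Hypothetical output file and schema:
# {behavior: [1 if judged a true jailbreak else 0, ...]}
labels_path = Path("results") / "evaluation_labels.json"

with labels_path.open() as f:
    labels = json.load(f)

total = sum(len(v) for v in labels.values())
successes = sum(sum(v) for v in labels.values())
print(f"Attack success rate: {successes}/{total} = {successes / max(total, 1):.2%}")
```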
This project builds on the work provided by the following repositories:

- HarmBench: https://github.com/centerforaisafety/HarmBench
Please refer to these projects for additional tools and insights related to the field of AI safety and large language model evaluation.