LLM-Based-Judging-Architectures

A framework for evaluating LLM outputs using LLMs as 🗣️ advocates, featuring a ⚖️ judge and jury system for dynamic assessment.

[Architectures Diagram]

Background

As large language models (LLMs) evolve, evaluating their outputs becomes increasingly complex. Traditional methods, such as human assessments, are often costly and inconsistent, while automated metrics may fail to capture the nuances of LLM performance.

To address these challenges, we propose a novel framework inspired by courtroom dynamics, in which LLMs act as advocates, judges, and juries to deliver a comprehensive, multi-perspective evaluation of model outputs.

This framework integrates decision theory, bounded rationality, and economic incentives to offer a robust evaluation method. Subsequent sections will detail the architecture and experiments validating its effectiveness.

Data

Dataset

This project uses the lmsys/mt_bench_human_judgments dataset from Hugging Face.

Preprocessing

The script src/preprocess_mt_bench.py processes the raw data into an Excel file (data/mt_bench_human_judgments.xlsx) with the following columns:

  • Question: Aggregated user questions.
  • Response_A: Responses from Model A.
  • Response_B: Responses from Model B.
  • Model_A_Score: Binary score for Model A (1 for a win, 0 for a loss).
  • Model_B_Score: Binary score for Model B (1 for a win, 0 for a loss).

Run the preprocessing with:

python src/preprocess_mt_bench.py
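
For downstream analysis, the processed file can be loaded with pandas. This is a minimal sketch, assuming you run it from the repository root and have openpyxl installed (pandas needs it to read .xlsx files); the column names are taken from the list above:

import pandas as pd

# Load the preprocessed MT-Bench judgments (pandas needs openpyxl for .xlsx).
df = pd.read_excel("data/mt_bench_human_judgments.xlsx")

# Sanity-check the expected columns listed above.
expected = ["Question", "Response_A", "Response_B", "Model_A_Score", "Model_B_Score"]
assert all(col in df.columns for col in expected), list(df.columns)
print(df.head())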

Installation and Setup

Our agentic architecture is built on MetaGPT, a framework designed to efficiently manage interactions between agents using shared memory.

To use the MetaGPT framework, follow these steps:

  1. Clone the Repository

First, clone the repository and navigate to the relevant directory:

git clone https://github.com/abirharrasse/LLM-Judging-Architectures && cd LLM-Judging-Architectures/MetaGPT_LLM_advocates
  2. Install Dependencies

Install the necessary packages:

pip install --upgrade -e .
pip install together -q
  3. Initialize Configuration

Set up the configuration file for the MetaGPT framework:

metagpt --init-config

Before running this command, ensure you're in the correct directory:

import os
os.chdir('/content/LLM-Judging-Architectures/MetaGPT_LLM_advocates')
print(os.getcwd())
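
After initialization, MetaGPT stores its configuration (model choice, API base, keys) in a YAML file; in recent MetaGPT versions the default location is ~/.metagpt/config2.yaml, though the exact path may differ in this fork. A quick check, assuming that default:

from pathlib import Path

# Typical default location for the MetaGPT config file (may vary by version/fork).
config_path = Path.home() / ".metagpt" / "config2.yaml"
print(config_path, "exists:", config_path.exists())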
  4. Set API Keys

Set the required API keys to run your experiments:

# Fill in a key for each provider you plan to use.
os.environ['OPENAI_API_KEY'] = 'sk-'
os.environ['TOGETHER_API_KEY'] = ''
os.environ['CLAUDE_API_KEY'] = ''
os.environ['GEMINI_API_KEY'] = ''
os.environ['COHERE_API_KEY'] = ''
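
To avoid hardcoding keys in a shared notebook, they can instead be entered interactively; a minimal sketch using Python's standard getpass module:

import os
from getpass import getpass

# Prompt for each key instead of embedding it in the notebook.
for var in ['OPENAI_API_KEY', 'TOGETHER_API_KEY', 'CLAUDE_API_KEY',
            'GEMINI_API_KEY', 'COHERE_API_KEY']:
    os.environ[var] = getpass(f'{var}: ')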

Running the Evaluation Architectures

Single Advocate Multi-Round Evaluation (SAMRE)

To run the SAMRE evaluation framework, first change into the repository root:

import os
os.chdir('/content/LLM-Judging-Architectures')
print(os.getcwd())

Then launch the SAMRE experiment via utils/experiments.py with the appropriate parameters:

! python utils/experiments.py --experiment samre --model_pl "mistral" --temperature 0.7 --question "What is the impact of AI on healthcare?" --answer1 "AI can improve diagnostic accuracy." --answer2 "AI might introduce bias in diagnosis." --investment 0.1 --n_rounds 4 --n_juries 3

Where:

  • model_pl: One of the models supported by our framework, accessible via Together or OpenAI.
  • temperature: The sampling temperature used by the model.
  • question: The question to which answer1 and answer2 respond.
  • answer1 and answer2: The answers to be evaluated.
  • investment: The maximum cost allocated for the experiment.
  • n_rounds: The number of evaluation rounds to conduct.
  • n_juries: The number of juries involved in the evaluation.
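
The same experiment can also be scripted from Python, which is convenient for parameter sweeps. This sketch relies only on the CLI shown above (all flag names are taken verbatim from that command):

import subprocess

# Run a single SAMRE evaluation through the documented CLI.
cmd = [
    "python", "utils/experiments.py",
    "--experiment", "samre",
    "--model_pl", "mistral",
    "--temperature", "0.7",
    "--question", "What is the impact of AI on healthcare?",
    "--answer1", "AI can improve diagnostic accuracy.",
    "--answer2", "AI might introduce bias in diagnosis.",
    "--investment", "0.1",
    "--n_rounds", "4",
    "--n_juries", "3",
]
subprocess.run(cmd, check=True)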

Multi-Advocate One-Round Evaluation (MORE)

To run the MORE evaluation framework, again begin from the repository root:

import os
os.chdir('/content/LLM-Judging-Architectures')
print(os.getcwd())

Then launch the MORE experiment with the appropriate parameters:

! python utils/experiments.py --experiment more --model_pl "mistral" --temperature 0.7 --question "How does AI influence education?" --answer1 "AI personalizes learning." --answer2 "AI could limit creativity." --n_advocates 2 --investment 3  --n_rounds 1

Where:

  • n_advocates: The number of advocates involved in the evaluation.
  • n_rounds: The number of evaluation rounds to conduct (1 for the one-round MORE setting).

The remaining flags have the same meaning as in the SAMRE command above.
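
As with SAMRE, MORE runs can be scripted; for instance, a small sweep over the number of advocates. A sketch using only the flags documented above:

import subprocess

# Sweep the advocate count for the MORE architecture.
for n_advocates in (2, 3, 4):
    subprocess.run(
        [
            "python", "utils/experiments.py",
            "--experiment", "more",
            "--model_pl", "mistral",
            "--temperature", "0.7",
            "--question", "How does AI influence education?",
            "--answer1", "AI personalizes learning.",
            "--answer2", "AI could limit creativity.",
            "--n_advocates", str(n_advocates),
            "--investment", "3",
            "--n_rounds", "1",
        ],
        check=True,
    )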

Full Dataset Evaluation

To run our architectures on the entire MT-Bench dataset and compare them against the baseline judge (basemodel.py), use the following command:

!python utils/main.py
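
If you prefer to drive individual examples yourself rather than via utils/main.py, the preprocessed Excel file can be iterated row by row and fed to the SAMRE CLI. A sketch under those assumptions (the result handling and aggregation in utils/main.py may differ):

import subprocess
import pandas as pd

# Iterate the preprocessed MT-Bench pairs and evaluate each with SAMRE.
df = pd.read_excel("data/mt_bench_human_judgments.xlsx")
for _, row in df.head(5).iterrows():  # first 5 rows as a smoke test
    subprocess.run(
        [
            "python", "utils/experiments.py",
            "--experiment", "samre",
            "--model_pl", "mistral",
            "--temperature", "0.7",
            "--question", str(row["Question"]),
            "--answer1", str(row["Response_A"]),
            "--answer2", str(row["Response_B"]),
            "--investment", "0.1",
            "--n_rounds", "4",
            "--n_juries", "3",
        ],
        check=True,
    )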
