H2O Large Language Model (LLM) Evaluation

As Large Language Models (LLMs) are adopted for an increasingly diverse range of applications, comprehensive evaluation and comparison of these models has become critical. This repository contributes to that effort by providing an evaluation method and a toolkit for assessing LLMs.

Please read the Blog Post for more context.

EvalGPT.ai
Docker Compose Setup
Local Setup
Reproducing Leaderboard Results
Roadmap

EvalGPT.ai

evalgpt.ai hosts a leaderboard of some of the top LLMs, ranked by their Elo scores. The leaderboard is updated frequently and aims to provide a comprehensive and fair assessment of Large Language Models. The features of the website are described below.

Elo Leaderboard

The Elo Leaderboard provides a ranking of the top LLMs based on their Elo scores. The Elo scores are computed from the results of A/B tests, wherein the LLMs are pitted against each other in a series of games. The ranking system employed is based on the Elo Rating System. The procedure for Elo score computation closely follows the methodology outlined at this resource.
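The standard Elo update behind this kind of ranking can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function names, the K-factor of 32, and the starting rating of 1000 are assumptions chosen for the example, and the leaderboard notebook may use different constants or extra steps.

```python
from typing import Dict, Iterable, Tuple

K = 32.0        # assumed update step size (not taken from the project)
BASE = 1000.0   # assumed starting rating for every model


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def compute_elo(games: Iterable[Tuple[str, str, float]],
                k: float = K, base: float = BASE) -> Dict[str, float]:
    """Replay games in order and return the final rating of each model.

    Each game is (model_a, model_b, score_a), where score_a is
    1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in games:
        r_a = ratings.setdefault(model_a, base)
        r_b = ratings.setdefault(model_b, base)
        e_a = expected_score(r_a, r_b)
        # The two updates are equal and opposite, so total rating is conserved.
        ratings[model_a] = r_a + k * (score_a - e_a)
        ratings[model_b] = r_b - k * (score_a - e_a)
    return ratings


if __name__ == "__main__":
    # Hypothetical model names, for illustration only.
    print(compute_elo([("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)]))
```

Ratings computed this way depend on the order in which games are replayed, so rankings built from a small number of games can shift when the order changes.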

Prompts

The Prompts tab lists the 60 prompts used to evaluate the LLMs. The prompts are grouped into categories by the type of task they are designed for.

Responses

In the Responses section, you can see the responses generated by the LLMs for the prompts. You can also select the LLMs and prompts to compare the responses.

Click on the "Select Models" button to select the LLMs to compare. You can also select a different prompt using the "Previous" and "Next" buttons.

For any two selected models and a given prompt, click the "Show GPT Eval" button at the top right to see the evaluation by GPT-4.

A/B Tests

"Which is Better: A or B?" provides an interface for human evaluation of the LLMs. Each A/B test consists of a prompt and two responses generated by two different LLMs. The user is asked to select the better of the two responses.
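As a rough illustration of the structure described above, an A/B test can be modeled as a prompt plus two labeled responses, with the user's selection mapped back to the models that produced them. The class and field names here are hypothetical and are not taken from the application's code or database schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ABTest:
    """One A/B test: a prompt and two responses from two different models."""
    prompt: str
    model_a: str
    response_a: str
    model_b: str
    response_b: str


def resolve_choice(test: ABTest, choice: str) -> Tuple[str, str]:
    """Map the user's 'A' or 'B' selection to (winner_model, loser_model)."""
    if choice == "A":
        return test.model_a, test.model_b
    if choice == "B":
        return test.model_b, test.model_a
    raise ValueError(f"choice must be 'A' or 'B', got {choice!r}")
```

The resulting (winner, loser) pair is the kind of game outcome that an Elo computation consumes.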

Docker Compose Setup

1. Clone the repository

git clone https://github.com/h2oai/h2o-LLM-eval.git
cd h2o-LLM-eval

2. Run Docker Compose

docker compose up -d

Then navigate to http://localhost:10101/ in your browser.

Local Setup

1. Clone the repository

git clone https://github.com/h2oai/h2o-LLM-eval.git
cd h2o-LLM-eval

2. Set up the database

a. Create a docker volume for the database

docker volume create llm-eval-db-data

b. Start PostgreSQL 14 in docker

docker run -d --name=llm-eval-db -p 5432:5432 -v llm-eval-db-data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=pgpassword postgres:14

c. Install PostgreSQL client

On Ubuntu:

sudo apt update
sudo apt install postgresql-client

On macOS:

brew install libpq
echo 'export PATH="$(brew --prefix libpq)/bin:$PATH"' >> ~/.zshrc

d. Load the latest data dump into the database

PGPASSWORD=pgpassword psql --host=localhost --port=5432 --username=postgres < data/10_init.sql

3. Set up the environment

The setup is tested on Python 3.10.

python -m venv .venv

. .venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

4. Run the App

POSTGRES_HOST=localhost POSTGRES_USER=maker POSTGRES_PASSWORD=makerpassword POSTGRES_DB=llm_eval_db H2O_WAVE_NO_LOG=true wave run llm_eval/app.py

Then navigate to http://localhost:10101/ in your browser.
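If you want to reuse the same database settings from your own script or notebook, the environment variables passed in step 4 can be assembled into a PostgreSQL connection URL. The sketch below is an assumption-laden helper, not part of the application: `postgres_url` is a hypothetical name, `POSTGRES_PORT` is a hypothetical variable (the run command above does not set it), and how the app itself consumes these variables is not documented here.

```python
import os
from typing import Mapping
from urllib.parse import quote


def postgres_url(env: Mapping[str, str] = os.environ) -> str:
    """Build a postgresql:// URL from the POSTGRES_* variables used in step 4."""
    host = env.get("POSTGRES_HOST", "localhost")
    # POSTGRES_PORT is hypothetical; 5432 matches the port published by the
    # docker run command in step 2b.
    port = env.get("POSTGRES_PORT", "5432")
    # Percent-encode values so special characters do not break the URL.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = quote(env["POSTGRES_DB"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

A missing required variable raises `KeyError`, which surfaces a misconfigured environment early.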

Reproducing Leaderboard Results

We provide notebooks to generate leaderboard results and reproduce evalgpt.ai.

Run run_all_evaluations.ipynb to evaluate every A/B test that has not yet been evaluated by a chosen evaluation model and to insert the outcomes into the database. An A/B test counts as unevaluated by that model if the model has no evaluation on record for the given combination of two models and a prompt. After you add a new model, running this notebook evaluates all A/B tests pairing that model with every other model.
Run all cells in calculate_elo_rating_public_leaderboard.ipynb to produce the Elo leaderboard and the related charts from the evaluations in the database.
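The rule for deciding which A/B tests are still unevaluated can be sketched as a set difference: enumerate every unordered pair of models for every prompt, then drop the combinations that the chosen evaluation model has already judged. The function names and the tuple layout below are illustrative assumptions; the notebook's real queries and schema may differ.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

# (model_a, model_b, prompt_id), with model_a sorted before model_b
Key = Tuple[str, str, int]


def make_key(model_1: str, model_2: str, prompt_id: int) -> Key:
    """Normalize a model pair so that (x, y) and (y, x) are the same test."""
    a, b = sorted((model_1, model_2))
    return (a, b, prompt_id)


def unevaluated_tests(models: Iterable[str],
                      prompt_ids: Iterable[int],
                      evaluated: Iterable[Tuple[str, str, int]]) -> List[Key]:
    """Return model-pair and prompt combinations with no evaluation yet.

    `evaluated` holds the combinations already judged by one evaluation model.
    """
    done = {make_key(m1, m2, p) for m1, m2, p in evaluated}
    prompts = list(prompt_ids)
    pending: List[Key] = []
    # combinations() over a sorted list yields pairs already in key order.
    for m1, m2 in combinations(sorted(set(models)), 2):
        for p in prompts:
            if (m1, m2, p) not in done:
                pending.append((m1, m2, p))
    return pending
```

With this rule, adding one new model to a fully evaluated set leaves only the pairs that involve the new model pending, which matches the behavior described above.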

Roadmap

Models

Add FreeWilly2 to the Leaderboard

Application

v2 architecture
Option for users to submit new models

Eval

More prompts in each category
Document Q/A and Retrieval Category with ground truth
Document Summarization Category
