Large Language Models (LLMs) are being adopted for a wide range of applications, which makes systematic evaluation and comparison of these models increasingly important. This repository provides an evaluation method and a toolkit for assessing LLMs.
Please read the Blog Post for more context.
evalgpt.ai hosts a leaderboard of some of the top LLMs, ranked by their Elo scores. The leaderboard is updated frequently and is intended to provide a comprehensive and fair assessment of Large Language Models. The main features of the website are described below.
The Elo Leaderboard provides a ranking of the top LLMs based on their Elo scores. The Elo scores are computed from the results of A/B tests, wherein the LLMs are pitted against each other in a series of games. The ranking system employed is based on the Elo Rating System. The procedure for Elo score computation closely follows the methodology outlined at this resource.
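As a rough illustration of how Elo scores can be derived from a series of A/B games, here is a minimal sketch of the standard Elo update. The K-factor of 32, the starting rating of 1000, and the function names are assumptions made for this example; they are not taken from this repository's notebooks, which may use different values or a different procedure.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k=32 is an assumed K-factor, not a value taken from the repository.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the sum of ratings is preserved.
    return rating_a + delta, rating_b - delta


def elo_leaderboard(games, k: float = 32.0, initial: float = 1000.0):
    """Replay (model_a, model_b, score_a) games in order and rank the models.

    initial=1000 is an assumed starting rating for every model.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, score_a in games:
        ratings[model_a], ratings[model_b] = update_elo(
            ratings[model_a], ratings[model_b], score_a, k
        )
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical model names and outcomes, for illustration only.
    games = [
        ("model-x", "model-y", 1.0),
        ("model-x", "model-z", 0.5),
        ("model-y", "model-z", 0.0),
    ]
    for name, rating in elo_leaderboard(games):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on the ratings at the time of the game, a sequential replay like this one is sensitive to the order of the games; averaging over several shuffled orders is one common way to reduce that effect.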
The Prompts tab lists the 60 prompts used to evaluate the LLMs. The prompts are grouped into categories according to the type of task they are designed for.
In the Responses section, you can see the responses generated by the LLMs for the prompts. You can also select the LLMs and prompts to compare the responses.
Click on the "Select Models" button to select the LLMs to compare. You can also select a different prompt using the "Previous" and "Next" buttons.
For any two selected models and a prompt, click the "Show GPT Eval" button in the top right to see GPT-4's evaluation of the two responses.
"Which is Better: A or B?" provides the interface to perform human evaluation of the LLMs. Each A/B test consists of a prompt and two responses generated by two different LLMs. The user is asked to select the better response among the two.
git clone https://github.com/h2oai/h2o-LLM-eval.git
cd h2o-LLM-eval
docker compose up -d
Navigate to http://localhost:10101/ in your browser
git clone https://github.com/h2oai/h2o-LLM-eval.git
cd h2o-LLM-eval
docker volume create llm-eval-db-data
docker run -d --name=llm-eval-db -p 5432:5432 -v llm-eval-db-data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=pgpassword postgres:14
- On Ubuntu:
sudo apt update
sudo apt install postgresql-client
- On macOS:
brew install libpq
echo 'export PATH="/usr/local/opt/libpq/bin:$PATH"' >> ~/.zshrc
PGPASSWORD=pgpassword psql --host=localhost --port=5432 --username=postgres < data/10_init.sql
The setup has been tested with Python 3.10.
python -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
POSTGRES_HOST=localhost POSTGRES_USER=maker POSTGRES_PASSWORD=makerpassword POSTGRES_DB=llm_eval_db H2O_WAVE_NO_LOG=true wave run llm_eval/app.py
Navigate to http://localhost:10101/ in your browser
We provide notebooks to generate leaderboard results and reproduce evalgpt.ai.
- Run run_all_evaluations.ipynb to evaluate any A/B tests that have not yet been evaluated by a chosen evaluation model and to insert the outcomes into the database. An A/B test is considered unevaluated by a given evaluation model if no evaluation by that model exists for the given combination of models and prompt. After a new model is added, running this notebook evaluates all A/B tests between that model and every other model.
- Run all cells in calculate_elo_rating_public_leaderboard.ipynb to compute the Elo leaderboard and the related charts from the evaluations in the database.
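The rule that decides which A/B tests run_all_evaluations.ipynb still needs to evaluate can be sketched as follows. This is only an illustration of the idea: the function name, the in-memory set standing in for the database, and the choice to treat a pair of models as unordered are all assumptions, and the notebook's actual queries may differ.

```python
from itertools import combinations


def unevaluated_ab_tests(models, prompts, evaluated, evaluator):
    """List (model_a, model_b, prompt) triples that `evaluator` has not judged yet.

    `evaluated` is a hypothetical stand-in for the database: a set of
    (evaluator, frozenset({model_a, model_b}), prompt) entries.
    Model pairs are treated as unordered here, which is an assumption.
    """
    pending = []
    for model_a, model_b in combinations(sorted(models), 2):
        for prompt in prompts:
            key = (evaluator, frozenset((model_a, model_b)), prompt)
            if key not in evaluated:
                pending.append((model_a, model_b, prompt))
    return pending


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    done = {("evaluator-1", frozenset(("model-x", "model-y")), "prompt-1")}
    todo = unevaluated_ab_tests(
        ["model-x", "model-y", "model-new"],
        ["prompt-1", "prompt-2"],
        done,
        "evaluator-1",
    )
    for test in todo:
        print(test)
```

With this rule, adding a model makes every pairing of that model with an existing model pending for every prompt, which matches the behavior described for the notebook.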
- Add FreeWilly2 to the Leaderboard
- v2 architecture
- Option for users to submit new models
- More prompts in each category
- Document Q/A and Retrieval Category with ground truth
- Document Summarization Category