[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model #1751
Motivation
Adapt the Bradley-Terry rating system from FastChat to the subjective evaluation setting, replacing human evaluators with an LLM-as-a-judge.
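For context, below is a minimal, self-contained sketch of the FastChat-style Bradley-Terry fit: pairwise verdicts are turned into a logistic regression whose coefficients are mapped onto an Elo-like scale. This is an illustration using scikit-learn, not the actual FastChat or OpenCompass code; the battle records and the scale/base/init constants are assumptions chosen to mirror FastChat's defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical judge verdicts: each row is (model_a, model_b, winner).
battles = [
    ("model_1", "model_2", "model_1"),
    ("model_1", "model_2", "model_2"),
    ("model_2", "model_1", "model_2"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

# Design matrix: +1 for model_a, -1 for model_b; target is 1 if model_a won.
X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for row, (a, b, winner) in enumerate(battles):
    X[row, idx[a]] = +1.0
    X[row, idx[b]] = -1.0
    y[row] = 1.0 if winner == a else 0.0

lr = LogisticRegression(fit_intercept=False)
lr.fit(X, y)

# Map the BT coefficients onto the Elo-like scale FastChat reports
# (scale=400, base=10, initial rating 1000).
scale, base, init = 400.0, 10.0, 1000.0
ratings = init + scale * lr.coef_[0] / np.log(base)
print(dict(zip(models, ratings.round(1))))
```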
Modification
- Added `pairwise_bt_judge` for both single-turn and multi-turn evaluation for `compass_arena_subjective_bench`.
- Added a `keep_preds` argument to the init method of `LMEvaluator` for an option to keep LLM predictions (from the inference stage) when saving judge responses during the evaluation stage. This is useful when the postprocessor needs to calculate metadata or metrics based on each LLM's predictions (e.g. response length); see the config sketch after this list.
- Extended `CompassArenaBradleyTerrySummarizer.summarize` to fit another BT model with the first `base_model`, combining matches from all subsets to produce a single rating for each judge_model-LLM combination. The ratings are returned in this format (see the usage sketch after this list): `{'CompassArenaSubjBenchBradleyTerry': {'judge_model': {'model_1': xxx, 'model_2': xxx}}}`
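A hedged sketch of switching on the new `keep_preds` flag in a dataset's eval config. Only `keep_preds` comes from this PR; the surrounding field names follow typical OpenCompass subjective-eval configs and the judge prompt is elided, so treat everything else as an assumption:

```python
from opencompass.openicl.icl_evaluator import LMEvaluator

# Hypothetical eval_cfg fragment for a compass_arena_subjective_bench dataset.
subjective_eval_cfg = dict(
    evaluator=dict(
        type=LMEvaluator,
        keep_preds=True,  # keep each LLM's inference-stage predictions
                          # alongside the saved judge responses, so the
                          # postprocessor can compute per-model metadata
                          # such as response length
        # prompt_template=...,  # judge prompt, elided here
    ),
)
```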
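And a minimal illustration of consuming the summarizer's return value in the format quoted above; the `summarizer` instance (and how it was constructed) is assumed, and `xxx` stands in for the numeric ratings:

```python
# `results` is assumed to follow the quoted return format of
# CompassArenaBradleyTerrySummarizer.summarize.
results = summarizer.summarize()
bt_ratings = results['CompassArenaSubjBenchBradleyTerry']
for judge_model, model_ratings in bt_ratings.items():
    # One combined BT rating per judge_model-LLM pair, highest first.
    ranked = sorted(model_ratings.items(), key=lambda kv: kv[1], reverse=True)
    for model, rating in ranked:
        print(f'[{judge_model}] {model}: {rating:.1f}')
```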
BC-breaking (Optional)
No breaking changes.
Use cases (Optional)
Use cases and examples are shown in the README.
Checklist
Before PR: