[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model #1751
Motivation
Adapt the Bradley-Terry rating system from FastChat to the subjective evaluation setting, replacing human evaluators with an LLM-as-a-judge.
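For context, below is a minimal, self-contained sketch of the FastChat-style Bradley-Terry fit: pairwise verdicts are turned into a logistic regression whose coefficients are mapped onto an Elo-like scale. This is an illustration using scikit-learn, not the actual FastChat or OpenCompass code; the battle records and the scale/base/init constants are assumptions chosen to mirror FastChat's defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical judge verdicts: each row is (model_a, model_b, winner).
battles = [
    ("model_1", "model_2", "model_1"),
    ("model_1", "model_2", "model_2"),
    ("model_2", "model_1", "model_2"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

# Design matrix: +1 for model_a, -1 for model_b; target is 1 if model_a won.
X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for row, (a, b, winner) in enumerate(battles):
    X[row, idx[a]] = +1.0
    X[row, idx[b]] = -1.0
    y[row] = 1.0 if winner == a else 0.0

lr = LogisticRegression(fit_intercept=False)
lr.fit(X, y)

# Map the BT coefficients onto the Elo-like scale FastChat reports
# (scale=400, base=10, initial rating 1000).
scale, base, init = 400.0, 10.0, 1000.0
ratings = init + scale * lr.coef_[0] / np.log(base)
print(dict(zip(models, ratings.round(1))))
```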
Modification
- Added `pairwise_bt_judge` for both single-turn and multi-turn evaluation for `compass_arena_subjective_bench`.
- Added a `keep_preds` argument to the init method of `LMEvaluator` for an option to keep LLM predictions (from the inference stage) when saving judge responses during the evaluation stage. This is useful when the postprocessor needs to calculate metadata or metrics based on each LLM's predictions (e.g. response length); see the config sketch after this list.
- Extended `CompassArenaBradleyTerrySummarizer.summarize` to fit another BT model with the first `base_model`, combining matches from all subsets to produce a single rating for each judge_model-LLM combination. The ratings are returned in this format (see the usage sketch after this list): `{'CompassArenaSubjBenchBradleyTerry': {'judge_model': {'model_1': xxx, 'model_2': xxx}}}`
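A hedged sketch of switching on the new `keep_preds` flag in a dataset's eval config. Only `keep_preds` comes from this PR; the surrounding field names follow typical OpenCompass subjective-eval configs and the judge prompt is elided, so treat everything else as an assumption:

```python
from opencompass.openicl.icl_evaluator import LMEvaluator

# Hypothetical eval_cfg fragment for a compass_arena_subjective_bench dataset.
subjective_eval_cfg = dict(
    evaluator=dict(
        type=LMEvaluator,
        keep_preds=True,  # keep each LLM's inference-stage predictions
                          # alongside the saved judge responses, so the
                          # postprocessor can compute per-model metadata
                          # such as response length
        # prompt_template=...,  # judge prompt, elided here
    ),
)
```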
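And a minimal illustration of consuming the summarizer's return value in the format quoted above; the `summarizer` instance (and how it was constructed) is assumed, and `xxx` stands in for the numeric ratings:

```python
# `results` is assumed to follow the quoted return format of
# CompassArenaBradleyTerrySummarizer.summarize.
results = summarizer.summarize()
bt_ratings = results['CompassArenaSubjBenchBradleyTerry']
for judge_model, model_ratings in bt_ratings.items():
    # One combined BT rating per judge_model-LLM pair, highest first.
    ranked = sorted(model_ratings.items(), key=lambda kv: kv[1], reverse=True)
    for model, rating in ranked:
        print(f'[{judge_model}] {model}: {rating:.1f}')
```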
BC-breaking (Optional)
No breaking changes.
Use cases (Optional)
Use cases and examples are shown in the README.
Checklist
Before PR: