[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model #1751

Merged
4 commits merged on Dec 16, 2024

Conversation

@acylam (Collaborator) commented Dec 10, 2024

Motivation

Adapt the Bradley-Terry rating system from FastChat to the subjective evaluation setting, replacing human evaluators with an LLM-as-a-judge.

Modification

  • Added the Bradley-Terry rating method for subjective evaluation.
  • Added pairwise_bt_judge for both single-turn and multi-turn evaluation in compass_arena_subjective_bench.
  • Added the keep_preds argument to the __init__ method of LMEvaluator. It provides an option to keep the LLM predictions (from the inference stage) when saving judge responses during the evaluation stage. This is useful when the postprocessor needs to calculate metadata or metrics based on each LLM's predictions (e.g. response length).
  • Added a final step to CompassArenaBradleyTerrySummarizer.summarize that fits another BT model using the first base_model and the combined matches from all subsets, producing a single rating for each judge_model-LLM combination. The ratings are returned in this format: {'CompassArenaSubjBenchBradleyTerry': {'judge_model': {'model_1': xxx, 'model_2': xxx}}}
  • See README for more information.
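
For readers unfamiliar with the approach, here is a minimal sketch of the idea behind the last summarizer step: pool the pairwise matches from all subsets, fit one Bradley-Terry model, and return the ratings in the nested format quoted above. It is not the code from this PR. The function names (fit_bradley_terry, combine_and_rate), the match data, and the Elo-like scale (400 * log10(strength) + 1000) are all assumptions made for illustration, and the fit uses the classic MM iteration, whereas the PR may fit the model differently (for instance via logistic regression).

```python
import math
from collections import defaultdict


def fit_bradley_terry(matches, n_iters=1000, tol=1e-10):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration.

    `matches` is a list of (winner, loser) pairs. Returns {model: strength},
    normalised so that the strengths average to 1.
    """
    models = sorted({m for pair in matches for m in pair})
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    for winner, loser in matches:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(n_iters):
        new = {}
        for i in models:
            # Sum n_ij / (p_i + p_j) over every opponent j that i has faced.
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = wins[i] / denom if denom else strength[i]
        # Strengths are only defined up to a constant factor, so rescale.
        total = sum(new.values())
        new = {m: v * len(models) / total for m, v in new.items()}
        delta = max(abs(new[m] - strength[m]) for m in models)
        strength = new
        if delta < tol:
            break
    return strength


def combine_and_rate(matches_by_subset, judge_model):
    """Pool matches from all subsets and return ratings in the nested format."""
    all_matches = [m for subset in matches_by_subset.values() for m in subset]
    strengths = fit_bradley_terry(all_matches)
    # Assumed Elo-like scale; the floor guards against models with no wins.
    ratings = {
        model: round(400 * math.log10(max(s, 1e-12)) + 1000, 1)
        for model, s in strengths.items()
    }
    return {'CompassArenaSubjBenchBradleyTerry': {judge_model: ratings}}


# Hypothetical judge verdicts: each match is (winner, loser) against a base model.
matches_by_subset = {
    'singleturn': [('model_1', 'base'), ('model_1', 'base'), ('base', 'model_1'),
                   ('model_2', 'base'), ('base', 'model_2')],
    'multiturn': [('model_1', 'base'), ('base', 'model_2'), ('model_2', 'base')],
}
result = combine_and_rate(matches_by_subset, 'judge_model')
print(result)
```

In this toy data model_1 beats the base model in 3 of 4 matches and model_2 in 2 of 4, so the fitted rating of model_1 ends up above that of model_2, which lands close to the base model.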

BC-breaking (Optional)

No breaking changes.

Use cases (Optional)

Use cases and examples are shown in the README.

Checklist

Before PR:

  • [x] Pre-commit or other linting tools are used to fix the potential lint issues.
  • [x] Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
  • [x] The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • [x] The documentation has been modified accordingly, such as docstrings or example tutorials.

@acylam added the labels documentation (Improvements or additions to documentation) and Dataset (Support for new dataset) on Dec 10, 2024
@acylam merged commit 1bd594f into open-compass:main on Dec 16, 2024
8 checks passed

3 participants