[BFCL] How should we calculate the overall accuracy for V2-Live dataset? #602
HuanzhiMao
started this conversation in
General
Replies: 1 comment
-
The point on One way we can make overall_accuracy more realistic is by having five categories: |
-
Would love to hear the thoughts from the community on this matter:
BFCL V2 • Live dataset features a diverse set of 2,251 question-function-answer pairs. It comprises 258 simple, 1,037 multiple, 16 parallel, 24 parallel multiple, 875 irrelevance detection, and 41 relevance detection entries.
When determining the `ast_summary` score for the Live leaderboard (with only live categories), we take the average of the four AST categories (`simple`, `multiple`, `parallel`, `parallel multiple`), weighted by entry count. This is because we want the summary score to reflect the real-life composition. If people tend to do more `multiple`, then the `multiple` category should carry more weight.
The question is: How should we calculate the `overall_accuracy` for the BFCL V2 • Live dataset?
Currently, it is calculated as the weighted average of all Live categories (`simple`, `multiple`, `parallel`, `parallel multiple`, `irrelevance`, `relevance`). However, because the `irrelevance` category accounts for about half of the total entries, this "unfairly" benefits models that tend to produce parsing/decoding errors. Weak models that don't follow the formatting instructions (so the model handler cannot correctly decode their responses) get almost 100% in the `irrelevance` category, which means at least half of the points in the `overall_accuracy` field. This is not ideal because the entry counts for `irrelevance` and `relevance` do not reflect real-life distributions; they were picked by us. So weighting by entry count seems unfair.