[BFCL] How should we calculate the overall accuracy for V2-Live dataset? #602

HuanzhiMao (Collaborator) · 2024-08-25T00:32:56Z

Would love to hear the thoughts from the community on this matter:

The BFCL V2 • Live dataset features a diverse set of 2,251 question-function-answer pairs. It comprises 258 simple, 1,037 multiple, 16 parallel, 24 parallel multiple, 875 irrelevance detection, and 41 relevance detection entries.
When determining the ast_summary score for the Live leaderboard (with only live categories), we take the average of the four AST categories (simple, multiple, parallel, parallel multiple), weighted by entry count. This is because we want the summary score to reflect the real-life composition: if people tend to issue more multiple queries, then the multiple category should carry more weight.
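To make the weighting concrete, here is a minimal sketch of a count-weighted average next to a plain unweighted one. The function name `weighted_accuracy` and all counts and accuracies in the example are made up for illustration; they are not taken from the BFCL codebase or leaderboard.

```python
def weighted_accuracy(accuracy, counts):
    """Average per-category accuracies, weighting each by its entry count."""
    total = sum(counts[name] for name in accuracy)
    return sum(accuracy[name] * counts[name] for name in accuracy) / total


# Hypothetical counts and accuracies, for illustration only.
counts = {"simple": 100, "multiple": 300, "parallel": 50, "parallel_multiple": 50}
accuracy = {"simple": 0.90, "multiple": 0.70, "parallel": 0.60, "parallel_multiple": 0.50}

ast_summary = weighted_accuracy(accuracy, counts)     # dominated by "multiple"
unweighted = sum(accuracy.values()) / len(accuracy)   # every category counts equally

print(f"weighted:   {ast_summary:.3f}")
print(f"unweighted: {unweighted:.3f}")
```

With these toy numbers the largest category pulls the weighted score toward its own accuracy, which is the intended behavior when the counts mirror real usage.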

The question is: How should we calculate the overall_accuracy for the BFCL V2 • Live dataset?

Currently, it is calculated as the weighted average of all Live categories (simple, multiple, parallel, parallel multiple, irrelevance, relevance). However, because the irrelevance category accounts for roughly 39% of the total entries (875 of 2,251), this "unfairly" benefits models that tend to produce parsing/decoding errors. Weak models that don't follow the formatting instructions (so the model handler cannot correctly decode their responses) get almost 100% in the irrelevance category, which alone gives them close to 39% in the overall_accuracy field. This is not ideal because the entry counts for irrelevance and relevance do not reflect real-life distributions; they were picked by us. So weighting by entry count seems unfair.
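A rough sketch of the effect described above, using only the totals quoted in this post (2,251 entries overall, 875 irrelevance, 41 relevance). It assumes, for illustration, that an undecodable response is scored as correct on irrelevance and incorrect everywhere else; the function name is hypothetical and this is not the actual BFCL scoring code.

```python
TOTAL_ENTRIES = 2251
IRRELEVANCE_ENTRIES = 875
RELEVANCE_ENTRIES = 41
# Everything that is neither irrelevance nor relevance is an AST entry.
AST_ENTRIES = TOTAL_ENTRIES - IRRELEVANCE_ENTRIES - RELEVANCE_ENTRIES


def overall_accuracy(ast_acc, irrelevance_acc, relevance_acc):
    """Count-weighted accuracy over all Live entries."""
    correct = (
        ast_acc * AST_ENTRIES
        + irrelevance_acc * IRRELEVANCE_ENTRIES
        + relevance_acc * RELEVANCE_ENTRIES
    )
    return correct / TOTAL_ENTRIES


# Assumed worst case: no response can be decoded, so every AST and relevance
# entry fails while every irrelevance entry is counted as correct.
undecodable_model = overall_accuracy(ast_acc=0.0, irrelevance_acc=1.0, relevance_acc=0.0)
print(f"{undecodable_model:.1%}")
```

Under that assumption, a model that never produces a usable function call still starts from a sizeable overall score.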

akshita-sukhlecha · 2024-08-25T16:56:34Z

The point about the irrelevance category unfairly inflating the overall score is valid.

One way to make overall_accuracy more realistic is to use five categories: simple, multiple, parallel, parallel multiple, and effective_relevance.
Here effective_relevance would be the harmonic mean of the irrelevance and relevance scores (just like the F1 score).
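A minimal sketch of this proposal, to show how the harmonic mean behaves. The post does not say how the five categories would then be combined, so the unweighted mean in `overall_with_effective_relevance` is an assumption, and both function names are hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores in [0, 1]; 0.0 if both are zero."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def overall_with_effective_relevance(ast_accuracy, irrelevance_acc, relevance_acc):
    """Unweighted mean over the four AST categories plus effective_relevance.

    The unweighted mean is an assumption made for this sketch only.
    """
    effective_relevance = harmonic_mean(irrelevance_acc, relevance_acc)
    scores = list(ast_accuracy.values()) + [effective_relevance]
    return sum(scores) / len(scores)


# A model whose output never decodes: perfect irrelevance, zero relevance.
print(harmonic_mean(1.0, 0.0))

# A model that does reasonably on both sides.
print(round(harmonic_mean(0.9, 0.6), 4))

# Hypothetical per-category AST accuracies for the undecodable model.
failed_ast = {"simple": 0.0, "multiple": 0.0, "parallel": 0.0, "parallel_multiple": 0.0}
print(overall_with_effective_relevance(failed_ast, 1.0, 0.0))
```

Because the harmonic mean collapses to zero when either input is zero, a model that scores well on irrelevance only by failing to emit any call no longer earns credit for it.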
