Call compute_all_general_metrics on all requests, not just the last one #2172

Merged · 2 commits into main · Dec 23, 2023

Conversation

brianwgoldman (Contributor) commented:

This resolves issue #1989. There are five categories of Stats where more data will now be collected:

  • compute_efficiency_metrics - how many tokens were sent and how long the request took.
  • compute_finish_reason_metrics - whether the request finished due to length, stop tokens, etc.
  • compute_truncation_metrics - whether the request was truncated.
  • num_train_trials - no idea why this is even a Stat.
  • num_references - how many references the request had.

To me it seems like the first three should be per request and not per instance. The last two are more questionable, but it doesn't seem wrong to compute them per request either.
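Roughly, the effect of the change is sketched below. This is only an illustration: aside from compute_all_general_metrics, the class, field, and function names here are simplified stand-ins, not the actual HELM code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch only: RequestState and the stats returned here are
# simplified stand-ins for the real HELM types.


@dataclass
class RequestState:
    num_prompt_tokens: int
    finish_reason: str


def compute_all_general_metrics(state: RequestState) -> List[Tuple[str, float]]:
    # Stand-in for the five Stat categories listed above (efficiency,
    # finish reason, truncation, num_train_trials, num_references).
    return [
        ("num_prompt_tokens", state.num_prompt_tokens),
        (f"finish_reason_{state.finish_reason}", 1),
    ]


def evaluate_instance(request_states: List[RequestState]) -> List[Tuple[str, float]]:
    stats: List[Tuple[str, float]] = []
    # Before this PR, only the last request contributed general stats:
    #   stats.extend(compute_all_general_metrics(request_states[-1]))
    # After, every request for the instance (e.g. one per reference with the
    # ranking/calibration adapters) contributes its own stats:
    for state in request_states:
        stats.extend(compute_all_general_metrics(state))
    return stats


if __name__ == "__main__":
    states = [RequestState(120, "stop"), RequestState(118, "length")]
    print(evaluate_instance(states))
```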


percyliang (Contributor) left a comment:


This change seems right. Does it affect any of the metric computations in practice?

brianwgoldman (Contributor, Author) replied:

This change seems right. Does it affect any of the metric computations in practice?

My read of the code is that anything using BasicMetric in combination with BinaryRankingAdapter, MultipleChoiceSeparateAdapter, or MultipleChoiceCalibratedAdapter will see different values. The following run_spec_functions seem to use those:

  • msmarco
  • blimp - if you don't override the default
  • cleva - if the prompt template's meta has "mul_as_gen" set to False.

There are also around 8 more scenarios that accept method as a parameter and could be configured to use those adapters.

I haven't done any runs to compare output, as I'm not sure what the best way to do that would be.
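If it helps, one rough way to compare would be to run the same run spec before and after this change and diff the aggregated stats. A sketch is below; it assumes each run writes a stats.json containing a list of stat dicts with "name", "sum", and "count" fields, which may not match the actual output layout exactly.

```python
import json
import sys

# Rough diff of two stats.json files from two runs. Assumes each file holds a
# list of stat dicts with a "name" field (possibly a nested dict) and numeric
# "sum"/"count" fields; adjust if the actual layout differs.


def load_stats(path: str) -> dict:
    with open(path) as f:
        stats = json.load(f)
    # Use the JSON-serialized name as a stable key, since "name" may be a dict.
    return {
        json.dumps(s["name"], sort_keys=True): (s.get("sum"), s.get("count"))
        for s in stats
    }


def main(old_path: str, new_path: str) -> None:
    old, new = load_stats(old_path), load_stats(new_path)
    for key in sorted(old.keys() | new.keys()):
        if old.get(key) != new.get(key):
            print(f"{key}\n  old (sum, count): {old.get(key)}\n  new (sum, count): {new.get(key)}")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```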

percyliang merged commit 69c583f into main on Dec 23, 2023 (6 checks passed).
percyliang deleted the auxy/fix-general-metrics branch on December 23, 2023.
yifanmai pushed a commit that referenced this pull request on Jan 9, 2024.