Call compute_all_general_metrics on all requests, not just the last one #2172

Merged · 2 commits into main · Dec 23, 2023

Conversation

brianwgoldman (Contributor) commented:

This resolves issue #1989. There are five categories of Stats where more data will now be collected:

  • compute_efficiency_metrics - how many tokens were sent and how long the request took.
  • compute_finish_reason_metrics - whether the request finished due to length, stop tokens, etc.
  • compute_truncation_metrics - whether the request was truncated.
  • num_train_trials - no idea why this is even a Stat.
  • num_references - how many references the request had.

To me it seems like the first three should be per request and not per instance. The last two are more questionable, but it doesn't seem wrong to compute them per request either.
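Roughly, the effect of the change is sketched below. This is only an illustration: aside from compute_all_general_metrics, the class, field, and function names here are simplified stand-ins, not the actual HELM code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch only: RequestState and the stats returned here are
# simplified stand-ins for the real HELM types.


@dataclass
class RequestState:
    num_prompt_tokens: int
    finish_reason: str


def compute_all_general_metrics(state: RequestState) -> List[Tuple[str, float]]:
    # Stand-in for the five Stat categories listed above (efficiency,
    # finish reason, truncation, num_train_trials, num_references).
    return [
        ("num_prompt_tokens", state.num_prompt_tokens),
        (f"finish_reason_{state.finish_reason}", 1),
    ]


def evaluate_instance(request_states: List[RequestState]) -> List[Tuple[str, float]]:
    stats: List[Tuple[str, float]] = []
    # Before this PR, only the last request contributed general stats:
    #   stats.extend(compute_all_general_metrics(request_states[-1]))
    # After, every request for the instance (e.g. one per reference with the
    # ranking/calibration adapters) contributes its own stats:
    for state in request_states:
        stats.extend(compute_all_general_metrics(state))
    return stats


if __name__ == "__main__":
    states = [RequestState(120, "stop"), RequestState(118, "length")]
    print(evaluate_instance(states))
```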


percyliang (Contributor) left a comment:


This change seems right. Does it affect any of the metric computations in practice?

brianwgoldman (Contributor, Author) replied:

This change seems right. Does it affect any of the metric computations in practice?

My read of the code is that anything using BasicMetric in combination with BinaryRankingAdapter, MultipleChoiceSeparateAdapter, or MultipleChoiceCalibratedAdapter will see different values. The following run_spec_functions seem to use those:

  • msmarco
  • blimp - if you don't override the default
  • cleva - if the prompt template's meta has "mul_as_gen" set to False.

There are also around 8 more scenarios that accept method as a parameter and could be configured to use those adapters.

I haven't done any runs to compare output, as I'm not sure what the best way to do that would be.
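If it helps, one rough way to compare would be to run the same run spec before and after this change and diff the aggregated stats. A sketch is below; it assumes each run writes a stats.json containing a list of stat dicts with "name", "sum", and "count" fields, which may not match the actual output layout exactly.

```python
import json
import sys

# Rough diff of two stats.json files from two runs. Assumes each file holds a
# list of stat dicts with a "name" field (possibly a nested dict) and numeric
# "sum"/"count" fields; adjust if the actual layout differs.


def load_stats(path: str) -> dict:
    with open(path) as f:
        stats = json.load(f)
    # Use the JSON-serialized name as a stable key, since "name" may be a dict.
    return {
        json.dumps(s["name"], sort_keys=True): (s.get("sum"), s.get("count"))
        for s in stats
    }


def main(old_path: str, new_path: str) -> None:
    old, new = load_stats(old_path), load_stats(new_path)
    for key in sorted(old.keys() | new.keys()):
        if old.get(key) != new.get(key):
            print(f"{key}\n  old (sum, count): {old.get(key)}\n  new (sum, count): {new.get(key)}")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```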

percyliang merged commit 69c583f into main on Dec 23, 2023 (6 checks passed).
percyliang deleted the auxy/fix-general-metrics branch on December 23, 2023.
yifanmai pushed a commit that referenced this pull request on Jan 9, 2024.