Adding the IFEval scenario #3122
Conversation
Awesome, thanks! Left some comments.

Could you open a separate pull request to add IFEval to schema_lite_v2.yaml? Basically, follow what has already been done with gpqa. You'll also need to add ifeval_strict_accuracy to the metrics (follow exact_match).
super().__init__()

def get_instances(self, output_path: str) -> List[Instance]:
    # Get GPQA from HuggingFace
Delete this comment (it's a leftover reference to GPQA).
Addressed in the latest change.
Still seems to be here?
name = "ifeval"
description = "Instruction-Following Evaluation for Large Language Models"
tags = ["question answering"]
"instruction following"
Addressed in the latest change.
input=input,
references=[],
split=TEST_SPLIT,
extra_data={"instruction_id_list": row["instruction_id_list"], "question_kwargs": row["kwargs"]},
nit: rename "instruction_id_list" -> "instruction_ids" and "question_kwargs" -> "instruction_kwargs". Update the key names in the metrics as well.
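For concreteness, a sketch of what the instance construction could look like after this rename. The example row, the Input wrapper, and the extra_data usage are illustrative assumptions based on the snippet above, not code copied from the PR:

```python
from helm.benchmark.scenarios.scenario import TEST_SPLIT, Input, Instance

# Illustrative row in the shape of the IFEval dataset; real rows come from HuggingFace.
row = {
    "prompt": "Write a short poem. Do not use any commas.",
    "instruction_id_list": ["punctuation:no_comma"],
    "kwargs": [{}],
}

instance = Instance(
    input=Input(text=row["prompt"]),
    references=[],
    split=TEST_SPLIT,
    extra_data={
        "instruction_ids": row["instruction_id_list"],  # was "instruction_id_list"
        "instruction_kwargs": row["kwargs"],            # was "question_kwargs"
    },
)
```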
Addressed in the latest change.
def get_ifeval_metric_specs() -> List[MetricSpec]:
    return [MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")]
This is scenario-specific, so don't put it in common_metric_specs; just inline adapter_specs = [MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")] in the run spec function.
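A rough sketch of the suggested structure, assuming metric_specs (rather than adapter_specs) is what was meant here, as the reply below points out. The function name, import paths, and helper calls are assumptions and may not match the PR exactly:

```python
from typing import List

from helm.benchmark.adaptation.common_adapter_specs import get_generation_adapter_spec
from helm.benchmark.metrics.common_metric_specs import get_basic_metric_specs
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("ifeval")
def get_ifeval_spec() -> RunSpec:
    scenario_spec = ScenarioSpec(class_name="helm.benchmark.scenarios.ifeval_scenario.IFEvalScenario")
    # The IFEval metric is scenario-specific, so it is inlined here rather than
    # added to common_metric_specs; basic stats are still included.
    metric_specs: List[MetricSpec] = [
        MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")
    ] + get_basic_metric_specs([])
    return RunSpec(
        name="ifeval",
        scenario_spec=scenario_spec,
        adapter_spec=get_generation_adapter_spec(max_tokens=1024),  # reuse generation adaptation
        metric_specs=metric_specs,
        groups=["ifeval"],
    )
```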
Got it, thanks! I think you meant metric_specs = ..., not adapter? With that assumption, addressed in the latest change.
- Move these files to an ifeval subpackage within metrics (except ifeval_metrics.py, which can stay under metrics).
- Don't run the linter on this file; just reproduce the raw contents exactly (except for import statements). This makes it easier for someone to audit that the code is unchanged using diff.
- Add the following lines to the start of the file to skip the linter:

  # flake8: noqa
  # type: ignore
  # The following code has been reproduced with minor modifications to `import` statements from the following URL:
  # https://github.com/google-research/google-research/blob/c7f60c013623e613732a096e2a0c2872491ec912/instruction_following_eval/instructions.py

Tip: you can get the permalink version of the GitHub URL with the git hash by going to the latest version and pressing 'y' on your keyboard.

Likewise for the other ifeval_instructions* files.
Addressed in the latest change.
P.S. Thanks for the tip, it's convenient! The shortcut for me is Shift+Ctrl+, for some unknown reason, but it works too.
Might be an OS / browser specific thing.
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat

import src.helm.benchmark.metrics.ifeval_instructions_registry as instructions_registry
Remove src. from this import.
Addressed in the latest change.
import src.helm.benchmark.metrics.ifeval_instructions_util as instructions_util

from src.helm.benchmark.metrics.ifeval_instructions_util import LANGUAGE_CODES
Delete src. from these imports. Likewise for the other files.
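Applied to the lines above, the corrected imports would read as follows (these paths assume the instruction files stay directly under metrics; they would change again if the files move into the ifeval subpackage suggested earlier):

```python
# "src." removed from the module paths; otherwise identical to the imports above.
import helm.benchmark.metrics.ifeval_instructions_registry as instructions_registry
import helm.benchmark.metrics.ifeval_instructions_util as instructions_util
from helm.benchmark.metrics.ifeval_instructions_util import LANGUAGE_CODES
```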
Addressed in the latest change.
response = request_state.result.completions[0].text.strip()

is_following_list = []
for index, instruction_id in enumerate(instruction_id_list):
Addressed in the latest change.
else:
    is_following_list.append(0)

return [Stat(MetricName("strict_accuracy")).add(sum(is_following_list) / len(is_following_list))]
nit: "strict_accuracy"
-> "ifeval_strict_accuracy"
- the name is generic enough that we should probably namespace it.
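Pulling the fragments above together, the tail of the metric computation might look roughly like this after the rename. This is a sketch: it assumes the instruction classes follow the IFEval reference implementation's interface (INSTRUCTION_DICT, build_description, check_following), and the function and variable names are illustrative:

```python
from typing import Any, Dict, List

import helm.benchmark.metrics.ifeval_instructions_registry as instructions_registry
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.statistic import Stat


def strict_accuracy_stats(response: str, instruction_id_list: List[str], kwargs_list: List[Dict[str, Any]]) -> List[Stat]:
    # For each instruction attached to the prompt, check whether the model's response
    # follows it; strict accuracy is the fraction of instructions followed.
    is_following_list = []
    for index, instruction_id in enumerate(instruction_id_list):
        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
        instruction = instruction_cls(instruction_id)
        # Depending on the dataset format, None-valued kwargs may need to be filtered out first.
        instruction.build_description(**kwargs_list[index])
        if response.strip() and instruction.check_following(response):
            is_following_list.append(1)
        else:
            is_following_list.append(0)
    # Namespaced stat name, per the suggestion above.
    return [Stat(MetricName("ifeval_strict_accuracy")).add(sum(is_following_list) / len(is_following_list))]
```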
Addressed in the latest change.
Also, I have no idea why the tests are failing... I'll look into it.
I think this is because the IFEval code has two dependencies that are not in helm yet; I installed them locally to make it run.
Sure, I will do this after addressing all the comments.
I looked into the test failure errors and realized that the type checker was failing on the current implementation. @yifanmai Should we linearize IFEval's extra data, or should we update the type annotation of the extra_data field?
Let's change
@yifanmai This PR is ready for review and merging.
This reverts commit 3f53b7c.
LGTM
super().__init__()

def get_instances(self, output_path: str) -> List[Instance]:
    # Get GPQA from HuggingFace
Still seems to be here?
# - name: mmlu_pro
#   display_name: MMLU-Pro
#   description: MMLU-Pro
#   metric_groups:
#     - accuracy
#     - efficiency
#     - general_information
#   environment:
#     main_name: exact_match # non-CoT
#     main_split: test
#   taxonomy:
#     task: "?"
#     what: "?"
#     who: "?"
#     when: "?"
#     language: English
Revert this change (otherwise it will conflict with the MMLU-Pro pull request)
Makes sense. Both comments are addressed in the latest change.
Adding the scenario, run specs, and metric for IFEval. No new adapters were added; instead, the existing GenerationAdapter is reused.
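Since no new adapter was introduced, the run spec presumably points the existing generation adaptation method at the IFEval prompts. A minimal sketch of such an adapter spec (the parameter values here are illustrative assumptions, not necessarily the ones used in the PR):

```python
from helm.benchmark.adaptation.adapter_spec import ADAPT_GENERATION, AdapterSpec

# Reuse generation-style adaptation: zero-shot prompts, no extra prefixes,
# and enough output tokens for long instruction-following responses.
adapter_spec = AdapterSpec(
    method=ADAPT_GENERATION,
    input_prefix="",
    output_prefix="",
    max_train_instances=0,  # IFEval prompts are self-contained instructions
    num_outputs=1,
    max_tokens=1024,
    temperature=0.0,
)
```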