Adding the IFEval scenario #3122

Merged: liamjxu merged 18 commits from jialiang/ifeval into main on Nov 12, 2024

Conversation

liamjxu (Contributor) commented Oct 31, 2024:

Adding the scenario, run specs, and metric for IFEval. No new adapters were added; instead, the existing GenerationAdapter is reused.
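
For context, a minimal sketch (not the PR's actual code) of what such a scenario looks like. The HuggingFace dataset path is an assumption, and the tag and extra_data key names reflect the final state after the review comments below:

from typing import Any, Dict, List

from datasets import load_dataset

from helm.benchmark.scenarios.scenario import Input, Instance, Scenario, TEST_SPLIT


class IFEvalScenario(Scenario):
    """IFEval prompts with verifiable instructions; scoring is rule-based."""

    name = "ifeval"
    description = "Instruction-Following Evaluation for Large Language Models"
    tags = ["instruction following"]

    def get_instances(self, output_path: str) -> List[Instance]:
        # Dataset path is an assumption; each row carries a prompt plus the instruction
        # ids and kwargs that the metric later uses to verify the model's response.
        dataset = load_dataset("google/IFEval", split="train")
        instances: List[Instance] = []
        for row in dataset:
            extra_data: Dict[str, Any] = {
                "instruction_ids": row["instruction_id_list"],
                "instruction_kwargs": row["kwargs"],
            }
            instances.append(
                Instance(
                    input=Input(text=row["prompt"]),
                    references=[],  # rule-based scoring needs no gold references
                    split=TEST_SPLIT,
                    extra_data=extra_data,
                )
            )
        return instances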

liamjxu self-assigned this on Oct 31, 2024

yifanmai (Collaborator) left a comment:
Awesome, thanks! Left some comments.

Could you open a separate pull request to add IFEval to schema_lite_v2.yaml? Basically, follow what has already been done with gpqa. You'll also need to add ifeval_strict_accuracy to the metrics (follow exact_match).

super().__init__()

def get_instances(self, output_path: str) -> List[Instance]:
# Get GPQA from HuggingFace

yifanmai (Collaborator):
Delete this comment.

liamjxu (Contributor, Author):
Addressed in the latest change.

yifanmai (Collaborator):
Still seems to be here?


name = "ifeval"
description = "Instruction-Following Evaluation for Large Language Models"
tags = ["question answering"]

yifanmai (Collaborator):
"instruction following"

liamjxu (Contributor, Author):
Addressed in the latest change.

input=input,
references=[],
split=TEST_SPLIT,
extra_data={"instruction_id_list": row["instruction_id_list"], "question_kwargs": row["kwargs"]},

yifanmai (Collaborator):
nit:

"instruction_id_list" -> "instruction_ids"
"question_kwargs" -> "instruction_kwargs"

Update the key names in the metrics as well.

liamjxu (Contributor, Author):
Addressed in the latest change.

Comment on lines 170 to 171
def get_ifeval_metric_specs() -> List[MetricSpec]:
return [MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")]

yifanmai (Collaborator):
This is scenario-specific so don't put it in common_metric_specs; just inline adapter_specs = [MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")] in the run spec function.

liamjxu (Contributor, Author):
Got it, thanks!

I think you meant metric_specs = ..., not adapter?

With that assumption, addressed in the latest change.
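
For reference, a minimal sketch of the inlined version; the module paths and the get_generation_adapter_spec helper are assumptions about HELM's layout rather than the PR's exact code:

from helm.benchmark.adaptation.common_adapter_specs import get_generation_adapter_spec
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("ifeval")
def get_ifeval_spec() -> RunSpec:
    # Scenario class name is illustrative.
    scenario_spec = ScenarioSpec(class_name="helm.benchmark.scenarios.ifeval_scenario.IFEvalScenario")
    # Reuse the generic generation adapter; no new adapter is needed.
    adapter_spec = get_generation_adapter_spec(max_tokens=1024)  # max_tokens value is illustrative
    # Scenario-specific metric, inlined here rather than added to common_metric_specs:
    metric_specs = [MetricSpec(class_name="helm.benchmark.metrics.ifeval_metrics.IFEvalMetric")]
    return RunSpec(
        name="ifeval",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["ifeval"],
    )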

yifanmai (Collaborator):
  1. Move these files to an ifeval subpackage within metrics (except ifeval_metrics.py, which can stay under metrics).
  2. Don't run the linter on this file; just reproduce the raw contents exactly (except for import statements). This makes it easier for someone to audit that the code is unchanged using diff.
  3. Add the following lines to the start of the file to skip the linter:
# flake8: noqa
# type: ignore
# The following code has been reproduced with minor modifications to `import` statements from the following URL:
# https://github.com/google-research/google-research/blob/c7f60c013623e613732a096e2a0c2872491ec912/instruction_following_eval/instructions.py

Tip: you can get the permalink version of the GitHub URL with the githash by going to the latest version and pressing 'y' on your keyboard.

Likewise for the other ifeval_instructions* files.

liamjxu (Contributor, Author):
Addressed in the latest change.

P.S. Thanks for the tip, it's convenient! The shortcut for me is Shift+Ctrl+, for some unknown reason, but it works too.

yifanmai (Collaborator):
Might be an OS / browser specific thing.

from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat

import src.helm.benchmark.metrics.ifeval_instructions_registry as instructions_registry

yifanmai (Collaborator):
Remove src. from this import.

liamjxu (Contributor, Author):
Addressed in the latest change.

Comment on lines 27 to 29
import src.helm.benchmark.metrics.ifeval_instructions_util as instructions_util

from src.helm.benchmark.metrics.ifeval_instructions_util import LANGUAGE_CODES

yifanmai (Collaborator):
Delete src. from these imports.

Likewise for the other files.

liamjxu (Contributor, Author):
Addressed in the latest change.

response = request_state.result.completions[0].text.strip()

is_following_list = []
for index, instruction_id in enumerate(instruction_id_list):

yifanmai (Collaborator):
liamjxu (Contributor, Author):
Addressed in the latest change.

else:
is_following_list.append(0)

return [Stat(MetricName("strict_accuracy")).add(sum(is_following_list) / len(is_following_list))]

yifanmai (Collaborator):
nit: "strict_accuracy" -> "ifeval_strict_accuracy" - the name is generic enough that we should probably namespace it.

liamjxu (Contributor, Author):
Addressed in the latest change.
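
For reference, a minimal sketch of the strict-accuracy check after the renames above; the build_description/check_following API follows the original google-research code, and the registry module path follows the snippets earlier in this thread (both are assumptions here, not the PR's exact code):

from typing import List

from helm.benchmark.metrics.metric import Metric
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.statistic import Stat

import helm.benchmark.metrics.ifeval_instructions_registry as instructions_registry


class IFEvalMetric(Metric):
    def evaluate_generation(self, adapter_spec, request_state, metric_service, eval_cache_path) -> List[Stat]:
        # Instruction ids and per-instruction kwargs were stashed on the instance by the scenario.
        instruction_ids = request_state.instance.extra_data["instruction_ids"]
        instruction_kwargs = request_state.instance.extra_data["instruction_kwargs"]
        response = request_state.result.completions[0].text.strip()

        is_following_list = []
        for index, instruction_id in enumerate(instruction_ids):
            # Look up the checker class for this instruction and configure it with its kwargs.
            instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
            instruction = instruction_cls(instruction_id)
            kwargs = {k: v for k, v in (instruction_kwargs[index] or {}).items() if v is not None}
            instruction.build_description(**kwargs)
            # Strict mode: the raw response must satisfy the instruction as-is.
            if response and instruction.check_following(response):
                is_following_list.append(1)
            else:
                is_following_list.append(0)

        return [Stat(MetricName("ifeval_strict_accuracy")).add(sum(is_following_list) / len(is_following_list))]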

yifanmai (Collaborator) commented:
Also I have no idea why the tests are failing... I'll look into it.

liamjxu (Contributor, Author) commented Oct 31, 2024:

> Also I have no idea why the tests are failing... I'll look into it.

I think this is because the IFEval code has two dependencies that are not in helm yet: langdetect and immutabledict.

I installed them locally to make it run.

liamjxu (Contributor, Author) commented Oct 31, 2024:

> Awesome, thanks! Left some comments.
>
> Could you open a separate pull request to add IFEval to schema_lite_v2.yaml? Basically, follow what has already been done with gpqa. You'll also need to add ifeval_strict_accuracy to the metrics (follow exact_match).

Sure, I will do this after addressing all the comments.

liamjxu (Contributor, Author) commented Oct 31, 2024:
Found 12 errors in 2 files (checked 650 source files)
src/helm/benchmark/scenarios/test_ifeval_scenario.py:31: error: Invalid index type "str" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/scenarios/test_ifeval_scenario.py:32: error: Invalid index type "str | Any" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/scenarios/test_ifeval_scenario.py:33: error: Invalid index type "str" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/scenarios/test_ifeval_scenario.py:34: error: Invalid index type "str" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/scenarios/test_ifeval_scenario.py:35: error: Invalid index type "str | Any" for "str | Any"; expected type "SupportsIndex | slice"  [index]
src/helm/benchmark/metrics/ifeval_metrics.py:22: error: Value of type "dict[str, str] | None" is not indexable  [index]
src/helm/benchmark/metrics/ifeval_metrics.py:23: error: Value of type "dict[str, str] | None" is not indexable  [index]
src/helm/benchmark/metrics/ifeval_metrics.py:34: error: Module has no attribute "INSTRUCTION_DICT"  [attr-defined]
src/helm/benchmark/metrics/ifeval_metrics.py:37: error: Item "str" of "str | Any" has no attribute "items"  [union-attr]
Error: Process completed with exit code 1.

I looked into the test failures and realized that the type checker was failing.

In the current implementation, the extra_data field in the Instance class is annotated with type Optional[Dict[str, str]], yet in IFEval, instruction_ids maps to a list of strings and instruction_kwargs maps to a list of dictionaries.

@yifanmai Should we linearize IFEval's extra data, or should we update the type annotation of the extra_data field?

liamjxu requested a review from yifanmai on October 31, 2024 at 20:32.

yifanmai (Collaborator) commented:
Let's change extra_data to type Dict[str, Any] - does this make the type checker pass? My opinion is that the value can consist of any JSON-serializable object, including nested dicts and lists.
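
For reference, a minimal sketch of the annotation change under discussion, assuming Instance is a frozen dataclass in helm.benchmark.scenarios.scenario (the other fields are elided):

from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Instance:
    ...  # other fields (input, references, split, ...) unchanged

    # Previously Optional[Dict[str, str]]; widened so values may be any
    # JSON-serializable object, e.g. IFEval's list of instruction ids and
    # list of per-instruction kwargs dicts.
    extra_data: Optional[Dict[str, Any]] = None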

liamjxu force-pushed the jialiang/ifeval branch 4 times, most recently from 950eec1 to f542292, on November 2, 2024 at 18:34.
liamjxu (Contributor, Author) commented Nov 5, 2024:

@yifanmai This PR is ready for review and merging.

yifanmai (Collaborator) left a comment:
LGTM

super().__init__()

def get_instances(self, output_path: str) -> List[Instance]:
# Get GPQA from HuggingFace

yifanmai (Collaborator):
Still seems to be here?

Comment on lines 140 to 155
# - name: mmlu_pro
# display_name: MMLU-Pro
# description: MMLU-Pro
# metric_groups:
# - accuracy
# - efficiency
# - general_information
# environment:
# main_name: exact_match # non-CoT
# main_split: test
# taxonomy:
# task: "?"
# what: "?"
# who: "?"
# when: "?"
# language: English

yifanmai (Collaborator):
Revert this change (otherwise it will conflict with the MMLU-Pro pull request)

liamjxu (Contributor, Author):
Makes sense. Both comments are addressed in the latest change.

liamjxu merged commit f9c4498 into main on Nov 12, 2024 (12 checks passed).
liamjxu deleted the jialiang/ifeval branch on November 12, 2024 at 03:50.