Add Omni-MATH #3187

Merged: 38 commits into jialiang/bigcodebench on Dec 16, 2024
Conversation

@liamjxu (Contributor) commented Nov 28, 2024

Adds the scenario, metric, run specs, and annotator for Omni-MATH.

@liamjxu self-assigned this Nov 28, 2024
)

adapter_spec = AdapterSpec(
method=ADAPT_GENERATION, input_prefix="", output_prefix="", max_tokens=1000, num_outputs=1, temperature=0.0,

Collaborator:
Is max_tokens=1000 consistent with what the paper recommends? (Looks good to me, just checking.)

Contributor Author:

The paper does not make a recommendation about generation length; it seems they just take it as-is.

Collaborator:
1000 should be fine then.

method=ADAPT_GENERATION, input_prefix="", output_prefix="", max_tokens=1000, num_outputs=1, temperature=0.0,
)
annotator_specs = [AnnotatorSpec(class_name="helm.benchmark.annotation.omnimath_annotator.OmniMATHAnnotator")]
metric_specs = [MetricSpec(class_name="helm.benchmark.metrics.omnimath_metrics.OmniMATHMetric")]

Collaborator:
metric_specs = get_basic_metric_specs([]) + [MetricSpec(class_name="helm.benchmark.metrics.omnimath_metrics.OmniMATHMetric")]

Contributor Author:
addressed in the latest change
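
For context, the suggested line folds HELM's standard statistics in alongside the scenario-specific metric. A rough sketch of how the full run-spec function might look after that change; the function name, class paths, decorator usage, import paths, and RunSpec field names follow the usual HELM layout from memory and may differ from the actual code in this PR:

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.adapters.adapter_factory import ADAPT_GENERATION
from helm.benchmark.annotation.annotator import AnnotatorSpec
from helm.benchmark.metrics.common_metric_specs import get_basic_metric_specs
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("omni_math")
def get_omni_math_spec() -> RunSpec:
    # Scenario and annotator class paths are illustrative (post-rename to omni_math).
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.omni_math_scenario.OmniMATHScenario", args={}
    )
    adapter_spec = AdapterSpec(
        method=ADAPT_GENERATION, input_prefix="", output_prefix="", max_tokens=1000, num_outputs=1, temperature=0.0
    )
    annotator_specs = [AnnotatorSpec(class_name="helm.benchmark.annotation.omni_math_annotator.OmniMATHAnnotator")]
    # The reviewer's suggestion: HELM's basic metrics plus the Omni-MATH-specific metric.
    metric_specs = get_basic_metric_specs([]) + [
        MetricSpec(class_name="helm.benchmark.metrics.omni_math_metrics.OmniMATHMetric")
    ]
    return RunSpec(
        name="omni_math",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        annotators=annotator_specs,
        metric_specs=metric_specs,
        groups=["omni_math"],
    )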

import datasets
import os
from typing import List
from helm.benchmark.scenarios.scenario import (

Collaborator:
nit: newline before the first import, to follow PEP-8 convention.

Contributor Author:
addressed in the latest change
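
For reference, the PEP 8 grouping the nit points at is: standard-library imports first, then third-party, then local packages, with a blank line between groups. A sketch of the reordered header; the exact names pulled in from the helm modules are assumptions based on what the scenario appears to use, not a copy of the diff:

import os
from typing import List

import datasets

from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TEST_SPLIT,
    Input,
    Instance,
    Output,
    Reference,
    Scenario,
)
from helm.common.general import ensure_directory_exists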

(and potentially more) sub-domains and span across 10 distinct difficulty levels, enabling a nuanced
analysis of model performance across various mathematical disciplines and levels of complexity."""

name = "omnimath"

Collaborator:
nit: "omni_math" throughout, which is closer to the stylization that the authors used. Same for metric name and annotator name.

Contributor Author:
addressed in the latest change



class OmniMATHScenario(Scenario):
"""OmniMATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

Collaborator:
nit: Omni-MATH

Contributor Author:
addressed in the latest change

Comment on lines 26 to 28
def __init__(self):
super().__init__()

Collaborator:
Delete empty constructor.

Contributor Author:
addressed in the latest change

ensure_directory_exists(cache_dir)
dataset = datasets.load_dataset(
"KbsdJames/Omni-MATH",
trust_remote_code=True,

Collaborator:
Don't set trust_remote_code.

Contributor Author:
addressed in the latest change

references=[],
split=TEST_SPLIT,
extra_data={
"answer": row["answer"],

Collaborator:
Put the answer in references, rather than in extra_data.

Contributor Author:
addressed in the latest change
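
In HELM scenarios, putting the answer in references usually means attaching a single reference tagged as correct to each Instance. A minimal sketch under that assumption; the field name "problem" for the question text is a guess at the dataset schema, not something shown in this diff:

from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TEST_SPLIT,
    Input,
    Instance,
    Output,
    Reference,
)

# Inside get_instances(), for each dataset row:
instance = Instance(
    input=Input(text=row["problem"]),  # "problem" is an assumed column name
    references=[Reference(Output(text=row["answer"]), tags=[CORRECT_TAG])],
    split=TEST_SPLIT,
)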


def __init__(self, auto_client: AutoClient):
self._auto_client = auto_client
with open("src/helm/benchmark/annotation/omnimath/gpt_evaluation_template.txt") as f:

Collaborator:
There are two additional steps you need to take to make this work when someone uses pip install to install HELM:

  1. Add a pattern to the manifest that matches the path to gpt_evaluation_template.txt.
  2. Use importlib_resources to open the file. Refer to this example.

Contributor Author:
addressed in the latest change
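
For anyone following along, those two steps make the template file ship inside the installed package instead of being read from a source checkout. A rough sketch, assuming the template ends up under the omni_math annotation package; the MANIFEST.in pattern and resource path are illustrative:

# MANIFEST.in: add a pattern covering the template, e.g.
#   include src/helm/benchmark/annotation/omni_math/*.txt

from importlib_resources import files

# Read the evaluation prompt template from package data rather than a path
# relative to the current working directory.
PROMPT_TEMPLATE: str = (
    files("helm.benchmark.annotation.omni_math").joinpath("gpt_evaluation_template.txt").read_text()
)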

@liamjxu force-pushed the jialiang/bigcodebench branch from 49c0aa5 to a6788b7 on December 14, 2024

@yifanmai (Collaborator) left a comment:
Looks like a lot of unrelated changes got added to this branch - could you revert them and also resolve the merge conflicts?

@@ -0,0 +1,68 @@

Collaborator:
nit: newline after typing rather than before.

class OmniMATHAnnotator(Annotator):
"""The OmniMATH autograder."""

name = "omnimath"

Collaborator:
still seems to be "omnimath" rather than "omni_math" - did you miss a commit?



class OmniMATHAnnotator(Annotator):
"""The OmniMATH autograder."""

Collaborator:
Omni-MATH

setup.cfg Outdated
@@ -81,6 +81,7 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client==1.4.3 # For bigcodebench_metrics

Collaborator:
don't need this

Comment on lines 25 to 29
assert eval_instance.extra_data
messages = [
{"role": message["role"], "content": message["content"]}
for message in eval_instance.extra_data["conversation"]
]

Collaborator:
Can we add instance.input.messages and use that instead of extra_data before we launch? OK to do in a separate PR.

Collaborator:
Wait, this is for IFEval, not for OmniMath?

@liamjxu changed the title from "Add OmniMATH" to "Add Omni-MATH" on Dec 16, 2024

@yifanmai (Collaborator) left a comment:
Looks good, thanks!

Collaborator:
Make sure you reconcile this with main before merging.

import datasets
import os
from typing import List
from helm.benchmark.scenarios.scenario import (

Collaborator:
nit: newline before the helm imports

"KbsdJames/Omni-MATH",
cache_dir=cache_dir,
split="test",
)

Collaborator:
set the revision argument
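
Pinning a dataset revision keeps runs reproducible if the upstream Hugging Face repository changes. A sketch of the adjusted call; the revision value below is a placeholder, not the commit that was actually pinned:

dataset = datasets.load_dataset(
    "KbsdJames/Omni-MATH",
    cache_dir=cache_dir,
    split="test",
    revision="main",  # placeholder: replace with a specific commit SHA of the dataset repo
)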

@liamjxu merged commit 0239261 into jialiang/bigcodebench on Dec 16, 2024
12 checks passed
@liamjxu deleted the jialiang/omnimath branch on December 16, 2024 at 20:56