Add Omni-MATH #3187
Conversation
adapter_spec = AdapterSpec(
    method=ADAPT_GENERATION, input_prefix="", output_prefix="", max_tokens=1000, num_outputs=1, temperature=0.0,
)
Is max_tokens=1000 consistent with what the paper recommends? (Looks good to me, just checking.)
The paper did not make recommendations on the generation length; it seems they just take it as-is.
1000 should be fine then.
annotator_specs = [AnnotatorSpec(class_name="helm.benchmark.annotation.omnimath_annotator.OmniMATHAnnotator")]
metric_specs = [MetricSpec(class_name="helm.benchmark.metrics.omnimath_metrics.OmniMATHMetric")]
metric_specs = get_basic_metric_specs([]) + [MetricSpec(class_name="helm.benchmark.metrics.omnimath_metrics.OmniMATHMetric")]
addressed in the latest change
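For reference, here is a minimal sketch of how the suggested metric list might sit inside a complete run-spec function. The function name, registration name, and module paths are assumptions based on HELM's usual run-spec pattern, not code taken from this PR:

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.adapters.adapter_factory import ADAPT_GENERATION
from helm.benchmark.annotation.annotator import AnnotatorSpec
from helm.benchmark.metrics.common_metric_specs import get_basic_metric_specs
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("omni_math")  # hypothetical registration name
def get_omni_math_spec() -> RunSpec:
    scenario_spec = ScenarioSpec(class_name="helm.benchmark.scenarios.omni_math_scenario.OmniMATHScenario")
    adapter_spec = AdapterSpec(
        method=ADAPT_GENERATION, input_prefix="", output_prefix="", max_tokens=1000, num_outputs=1, temperature=0.0,
    )
    annotator_specs = [AnnotatorSpec(class_name="helm.benchmark.annotation.omni_math_annotator.OmniMATHAnnotator")]
    # The suggested change: generic HELM metrics plus the benchmark-specific metric.
    metric_specs = get_basic_metric_specs([]) + [
        MetricSpec(class_name="helm.benchmark.metrics.omni_math_metrics.OmniMATHMetric")
    ]
    return RunSpec(
        name="omni_math",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        annotators=annotator_specs,
        metric_specs=metric_specs,
        groups=["omni_math"],
    )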
import datasets
import os
from typing import List
from helm.benchmark.scenarios.scenario import (
nit: newline before the first import, to follow PEP-8 convention.
addressed in the latest change
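For clarity, the PEP-8 grouping being asked for looks like this: standard-library imports first, then third-party, then the local helm imports, with a blank line between groups (a sketch, not the exact file contents):

import os
from typing import List

import datasets

from helm.benchmark.scenarios.scenario import (
    Scenario,
)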
    (and potentially more) sub-domains and span across 10 distinct difficulty levels, enabling a nuanced
    analysis of model performance across various mathematical disciplines and levels of complexity."""
name = "omnimath"
nit: "omni_math" throughout, which is closer to the stylization that the authors used. Same for metric name and annotator name.
addressed in the latest change
class OmniMATHScenario(Scenario):
    """OmniMATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models
nit: Omni-MATH
addressed in the latest change
    def __init__(self):
        super().__init__()
Delete empty constructor.
addressed in the latest change
ensure_directory_exists(cache_dir)
dataset = datasets.load_dataset(
    "KbsdJames/Omni-MATH",
    trust_remote_code=True,
Don't set trust_remote_code.
addressed in the latest change
references=[],
split=TEST_SPLIT,
extra_data={
    "answer": row["answer"],
Put the answer in references, rather than in extra_data.
addressed in the latest change
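A minimal sketch of that change, using HELM's Reference/Output/CORRECT_TAG API; the "problem" field name is an assumption about the dataset columns:

from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    Input,
    Instance,
    Output,
    Reference,
    TEST_SPLIT,
)

# The answer becomes a tagged reference instead of living in extra_data.
instance = Instance(
    input=Input(text=row["problem"]),  # assumed column name
    references=[Reference(Output(text=row["answer"]), tags=[CORRECT_TAG])],
    split=TEST_SPLIT,
)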
def __init__(self, auto_client: AutoClient):
    self._auto_client = auto_client
    with open("src/helm/benchmark/annotation/omnimath/gpt_evaluation_template.txt") as f:
There are two additional steps you need to take to make this work when someone uses pip install to install HELM:
- Add a pattern to the manifest that matches the path to gpt_evaluation_template.txt.
- Use importlib_resources to open the file. Refer to this example.
addressed in the latest change
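A hedged sketch of those two steps; the manifest pattern and package path assume the renamed omni_math layout discussed above and may need adjusting:

# MANIFEST.in — one possible pattern so the template ships with pip installs:
#   include src/helm/benchmark/annotation/omni_math/*.txt

# Reading the template through importlib_resources instead of a repo-relative path:
import importlib_resources

template_text = (
    importlib_resources.files("helm.benchmark.annotation.omni_math")
    .joinpath("gpt_evaluation_template.txt")
    .read_text()
)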
Force-pushed from 49c0aa5 to a6788b7.
Looks like a lot of unrelated changes got added to this branch - could you revert them and also resolve the merge conflicts?
@@ -0,0 +1,68 @@
nit: newline after the typing import, rather than before it.
class OmniMATHAnnotator(Annotator):
    """The OmniMATH autograder."""

    name = "omnimath"
still seems to be "omnimath" rather than "omni_math" - did you miss a commit?
class OmniMATHAnnotator(Annotator):
    """The OmniMATH autograder."""
Omni-MATH
setup.cfg (Outdated)
@@ -81,6 +81,7 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client==1.4.3 # For bigcodebench_metrics
don't need this
assert eval_instance.extra_data
messages = [
    {"role": message["role"], "content": message["content"]}
    for message in eval_instance.extra_data["conversation"]
]
Can we add instance.input.messages and use that instead of extra_data before we launch? OK to do in a separate PR.
Wait, this is for IFEval, not for OmniMath?
Looks good, thanks!
Make sure you reconcile this with main before merging.
import datasets
import os
from typing import List
from helm.benchmark.scenarios.scenario import (
nit: newline before the helm imports.
"KbsdJames/Omni-MATH", | ||
cache_dir=cache_dir, | ||
split="test", | ||
) |
set the revision argument.
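Folding this together with the earlier trust_remote_code comment, the load call might end up looking like this; the revision value below is a placeholder, not the real pinned commit:

dataset = datasets.load_dataset(
    "KbsdJames/Omni-MATH",
    cache_dir=cache_dir,
    split="test",
    revision="<commit-sha>",  # placeholder: pin to a specific dataset commit
)
# trust_remote_code is simply omitted, per the earlier review comment.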
Adding the scenario, metric, run specs, and annotator for OmniMATH.