Add Omni-MATH #3187
Conversation
adapter_spec = AdapterSpec(
    method=ADAPT_GENERATION, input_prefix="", output_prefix="", max_tokens=1000, num_outputs=1, temperature=0.0,
)
Is max_tokens=1000 consistent with what the paper recommends? (Looks good to me, just checking.)
The paper did not make recommendations on the generation length; it seems they just take it as-is.
1000 should be fine then.
annotator_specs = [AnnotatorSpec(class_name="helm.benchmark.annotation.omnimath_annotator.OmniMATHAnnotator")]
metric_specs = [MetricSpec(class_name="helm.benchmark.metrics.omnimath_metrics.OmniMATHMetric")]
metric_specs = get_basic_metric_specs([]) + [MetricSpec(class_name="helm.benchmark.metrics.omnimath_metrics.OmniMATHMetric")]
addressed in the latest change
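For reference, here is a minimal sketch of how the suggested metric list might sit inside a complete run-spec function. The function name, registration name, and module paths are assumptions based on HELM's usual run-spec pattern, not code taken from this PR:

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.adapters.adapter_factory import ADAPT_GENERATION
from helm.benchmark.annotation.annotator import AnnotatorSpec
from helm.benchmark.metrics.common_metric_specs import get_basic_metric_specs
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("omni_math")  # hypothetical registration name
def get_omni_math_spec() -> RunSpec:
    scenario_spec = ScenarioSpec(class_name="helm.benchmark.scenarios.omni_math_scenario.OmniMATHScenario")
    adapter_spec = AdapterSpec(
        method=ADAPT_GENERATION, input_prefix="", output_prefix="", max_tokens=1000, num_outputs=1, temperature=0.0,
    )
    annotator_specs = [AnnotatorSpec(class_name="helm.benchmark.annotation.omni_math_annotator.OmniMATHAnnotator")]
    # The suggested change: generic HELM metrics plus the benchmark-specific metric.
    metric_specs = get_basic_metric_specs([]) + [
        MetricSpec(class_name="helm.benchmark.metrics.omni_math_metrics.OmniMATHMetric")
    ]
    return RunSpec(
        name="omni_math",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        annotators=annotator_specs,
        metric_specs=metric_specs,
        groups=["omni_math"],
    )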
import datasets
import os
from typing import List
from helm.benchmark.scenarios.scenario import (
nit: newline before the first import, to follow PEP-8 convention.
addressed in the latest change
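For clarity, the PEP-8 grouping being asked for looks like this: standard-library imports first, then third-party, then the local helm imports, with a blank line between groups (a sketch, not the exact file contents):

import os
from typing import List

import datasets

from helm.benchmark.scenarios.scenario import (
    Scenario,
)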
    (and potentially more) sub-domains and span across 10 distinct difficulty levels, enabling a nuanced
    analysis of model performance across various mathematical disciplines and levels of complexity."""
name = "omnimath"
nit: "omni_math" throughout, which is closer to the stylization that the authors used. Same for metric name and annotator name.
addressed in the latest change
class OmniMATHScenario(Scenario):
    """OmniMATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models
nit: Omni-MATH
addressed in the latest change
    def __init__(self):
        super().__init__()
Delete empty constructor.
addressed in the latest change
ensure_directory_exists(cache_dir)
dataset = datasets.load_dataset(
    "KbsdJames/Omni-MATH",
    trust_remote_code=True,
Don't set trust_remote_code.
addressed in the latest change
references=[],
split=TEST_SPLIT,
extra_data={
    "answer": row["answer"],
Put the answer in references, rather than in extra_data.
addressed in the latest change
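A minimal sketch of that change, using HELM's Reference/Output/CORRECT_TAG API; the "problem" field name is an assumption about the dataset columns:

from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    Input,
    Instance,
    Output,
    Reference,
    TEST_SPLIT,
)

# The answer becomes a tagged reference instead of living in extra_data.
instance = Instance(
    input=Input(text=row["problem"]),  # assumed column name
    references=[Reference(Output(text=row["answer"]), tags=[CORRECT_TAG])],
    split=TEST_SPLIT,
)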
def __init__(self, auto_client: AutoClient):
    self._auto_client = auto_client
    with open("src/helm/benchmark/annotation/omnimath/gpt_evaluation_template.txt") as f:
There are two additional steps you need to take to make this work when someone uses pip install to install HELM:
- Add a pattern to the manifest that matches the path to gpt_evaluation_template.txt.
- Use importlib_resources to open the file. Refer to this example.
addressed in the latest change
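A hedged sketch of those two steps; the manifest pattern and package path assume the renamed omni_math layout discussed above and may need adjusting:

# MANIFEST.in — one possible pattern so the template ships with pip installs:
#   include src/helm/benchmark/annotation/omni_math/*.txt

# Reading the template through importlib_resources instead of a repo-relative path:
import importlib_resources

template_text = (
    importlib_resources.files("helm.benchmark.annotation.omni_math")
    .joinpath("gpt_evaluation_template.txt")
    .read_text()
)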
Force-pushed from 49c0aa5 to a6788b7.
Looks like a lot of unrelated changes got added to this branch - could you revert them and also resolve the merge conflicts?
@@ -0,0 +1,68 @@
nit: newline after the typing import, rather than before it.
class OmniMATHAnnotator(Annotator):
    """The OmniMATH autograder."""

    name = "omnimath"
still seems to be "omnimath" rather than "omni_math" - did you miss a commit?
class OmniMATHAnnotator(Annotator):
    """The OmniMATH autograder."""
Omni-MATH
setup.cfg (Outdated)
@@ -81,6 +81,7 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client==1.4.3 # For bigcodebench_metrics
don't need this
assert eval_instance.extra_data
messages = [
    {"role": message["role"], "content": message["content"]}
    for message in eval_instance.extra_data["conversation"]
]
Can we add instance.input.messages and use that instead of extra_data before we launch? OK to do in a separate PR.
Wait, this is for IFEval, not for OmniMath?
Looks good, thanks!
Make sure you reconcile this with main before merging.
import datasets
import os
from typing import List
from helm.benchmark.scenarios.scenario import (
nit: newline before the helm imports.
"KbsdJames/Omni-MATH", | ||
cache_dir=cache_dir, | ||
split="test", | ||
) |
set the revision argument.
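Folding this together with the earlier trust_remote_code comment, the load call might end up looking like this; the revision value below is a placeholder, not the real pinned commit:

dataset = datasets.load_dataset(
    "KbsdJames/Omni-MATH",
    cache_dir=cache_dir,
    split="test",
    revision="<commit-sha>",  # placeholder: pin to a specific dataset commit
)
# trust_remote_code is simply omitted, per the earlier review comment.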
Adding the scenario, metric, run specs, and annotator for OmniMATH.