Add Omni-MATH #3187

Merged
merged 38 commits into from
Dec 16, 2024
Changes from 26 commits
Commits
46f7a7c
adding wildbench
liamjxu Nov 12, 2024
a18243c
satisfy type checker
liamjxu Nov 12, 2024
ca76693
adding adapter to use message on model inferencing
liamjxu Nov 12, 2024
d1ca4ea
minor fix
liamjxu Nov 12, 2024
6864183
formatting
liamjxu Nov 12, 2024
fb82d07
aligning with original repo
liamjxu Nov 12, 2024
ea4f765
formatting
liamjxu Nov 12, 2024
d48dbda
minor fix
liamjxu Nov 12, 2024
74772f3
scenario test
liamjxu Nov 13, 2024
0a6e112
minor fix for ifeval prod result
liamjxu Nov 13, 2024
a739d27
formatting
liamjxu Nov 13, 2024
7220e79
add support for messages
liamjxu Nov 14, 2024
87d1662
minor fix
liamjxu Nov 14, 2024
b74442e
modifications to satisfy type checker
liamjxu Nov 14, 2024
4b08fda
more fix
liamjxu Nov 14, 2024
15fa912
more fix
liamjxu Nov 14, 2024
2c9b89e
fix
liamjxu Nov 14, 2024
a3cac69
making type optional
liamjxu Nov 14, 2024
e129456
fix
liamjxu Nov 14, 2024
f004fa1
vertex client cache fix
liamjxu Nov 14, 2024
a9a38d6
formatting
liamjxu Nov 14, 2024
1b98fad
leaderboard
liamjxu Nov 20, 2024
1a0ef7e
scenario and test
liamjxu Nov 21, 2024
e0296b4
metric and annotator
liamjxu Nov 21, 2024
4696b34
omnimath
liamjxu Nov 22, 2024
82950dd
addressing comments
liamjxu Dec 9, 2024
d4444fd
Merge branch 'jialiang/bigcodebench' into jialiang/omnimath
liamjxu Dec 16, 2024
7b2d7b6
manual cleaning up
liamjxu Dec 16, 2024
a3ab0a6
renaming omnimath to omni_math
liamjxu Dec 16, 2024
e3557e0
updating the test cases
liamjxu Dec 16, 2024
e48e753
formatting
liamjxu Dec 16, 2024
f164d50
debugging to satisfy type checker
liamjxu Dec 16, 2024
c0bfcd7
further fix to satisfy type checker
liamjxu Dec 16, 2024
0f590ef
addressing comments
liamjxu Dec 16, 2024
477022a
pinpoint the actual commit hash to specify revision
liamjxu Dec 16, 2024
eaab5a6
pinpointing the commit hash
liamjxu Dec 16, 2024
388c1b1
adding schema entries
liamjxu Dec 16, 2024
dc6e4aa
addressing conflicts
liamjxu Dec 16, 2024
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -4,3 +4,4 @@ recursive-include src/helm/benchmark/ *.json
recursive-include src/helm/benchmark/static/ *.css *.html *.js *.png *.yaml
recursive-include src/helm/benchmark/static_build/ *.css *.html *.js *.png *.yaml
recursive-include src/helm/config/ *.yaml
recursive-include src/helm/benchmark/annotation/omnimath/ *.txt
1 change: 1 addition & 0 deletions setup.cfg
@@ -81,6 +81,7 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client==1.4.3 # For bigcodebench_metrics
Collaborator
don't need this


summarization =
summ-eval~=0.892 # For summarization_metrics
1 change: 1 addition & 0 deletions src/helm/benchmark/adaptation/adapter_spec.py
@@ -6,6 +6,7 @@

# Adaptation methods
ADAPT_GENERATION: str = "generation"
ADAPT_CHAT: str = "chat"
ADAPT_LANGUAGE_MODELING: str = "language_modeling"
ADAPT_MULTIPLE_CHOICE_JOINT: str = "multiple_choice_joint"
ADAPT_MULTIPLE_CHOICE_JOINT_CHAIN_OF_THOUGHT: str = "multiple_choice_joint_chain_of_thought"
4 changes: 4 additions & 0 deletions src/helm/benchmark/adaptation/adapters/adapter_factory.py
@@ -1,5 +1,6 @@
from helm.benchmark.adaptation.adapter_spec import (
    ADAPT_GENERATION,
    ADAPT_CHAT,
    ADAPT_GENERATION_MULTIMODAL,
    ADAPT_LANGUAGE_MODELING,
    ADAPT_MULTIPLE_CHOICE_JOINT,
@@ -13,6 +14,7 @@
from helm.benchmark.adaptation.adapters.adapter import Adapter
from helm.benchmark.adaptation.adapters.binary_ranking_adapter import BinaryRankingAdapter
from helm.benchmark.adaptation.adapters.generation_adapter import GenerationAdapter
from helm.benchmark.adaptation.adapters.chat_adapter import ChatAdapter
from helm.benchmark.adaptation.adapters.language_modeling_adapter import LanguageModelingAdapter
from helm.benchmark.adaptation.adapters.multimodal.generation_multimodal_adapter import GenerationMultimodalAdapter
from helm.benchmark.adaptation.adapters.multimodal.multiple_choice_joint_multimodal_adapter import (
@@ -38,6 +40,8 @@ def get_adapter(adapter_spec: AdapterSpec, tokenizer_service: TokenizerService)

    if method == ADAPT_GENERATION:
        adapter = GenerationAdapter(adapter_spec, tokenizer_service)
    elif method == ADAPT_CHAT:
        adapter = ChatAdapter(adapter_spec, tokenizer_service)
    elif method == ADAPT_LANGUAGE_MODELING:
        adapter = LanguageModelingAdapter(adapter_spec, tokenizer_service)
    elif method == ADAPT_MULTIPLE_CHOICE_JOINT:
52 changes: 52 additions & 0 deletions src/helm/benchmark/adaptation/adapters/chat_adapter.py
@@ -0,0 +1,52 @@
from typing import List

from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.scenarios.scenario import Instance
from helm.common.request import Request
from helm.benchmark.adaptation.adapters.in_context_learning_adapter import InContextLearningAdapter


class ChatAdapter(InContextLearningAdapter):
    """
    Each `Instance` in a `Scenario` has a history of the format:

    [
        {"role": "user", "content": <user-content>},
        {"role": "assistant", "content": <assistant-content>},
        {"role": "user", "content": <user-content>},
        ...
    ]

    """

    def generate_requests(
        self, eval_instance: Instance, train_trial_index: int, training_instances: List[Instance]
    ) -> List[RequestState]:
        assert eval_instance.extra_data
        messages = [
            {"role": message["role"], "content": message["content"]}
            for message in eval_instance.extra_data["conversation"]
        ]
Collaborator
Can we add instance.input.messages and use that instead of extra_data before we launch? OK to do in a separate PR.

Collaborator
Wait, this is for IFEval, not for OmniMath?

        request = Request(
            model=self.adapter_spec.model,
            model_deployment=self.adapter_spec.model_deployment,
            messages=messages,
            num_completions=self.adapter_spec.num_outputs,
            temperature=self.adapter_spec.temperature,
            max_tokens=self.adapter_spec.max_tokens,
            stop_sequences=self.adapter_spec.stop_sequences,
            random=self.adapter_spec.random,
            image_generation_parameters=self.adapter_spec.image_generation_parameters,
        )
        request_state = RequestState(
            instance=eval_instance,
            reference_index=None,
            request_mode=None,
            train_trial_index=train_trial_index,
            output_mapping=None,
            request=request,
            result=None,
            num_train_instances=0,
            prompt_truncated=False,
        )
        return [request_state]
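The message-building step in `generate_requests` can be tried in isolation. The following is a minimal sketch, not HELM code: `build_messages` is a hypothetical helper name, the sample conversation is invented, and a plain dict stands in for `Instance.extra_data`. It only illustrates that each turn is reduced to its `role` and `content` keys before being passed as `messages`.

```python
from typing import Any, Dict, List


def build_messages(extra_data: Dict[str, Any]) -> List[Dict[str, str]]:
    """Keep only the role and content of each turn, mirroring the adapter's list comprehension."""
    return [
        {"role": turn["role"], "content": turn["content"]}
        for turn in extra_data["conversation"]
    ]


# Hypothetical payload; the extra "turn_id" key is invented to show that it gets dropped.
extra_data = {
    "conversation": [
        {"role": "user", "content": "What is 2 + 2?", "turn_id": 0},
        {"role": "assistant", "content": "4", "turn_id": 1},
        {"role": "user", "content": "Now double it.", "turn_id": 2},
    ]
}

messages = build_messages(extra_data)
for message in messages:
    print(message)
```

Any keys beyond `role` and `content` are discarded, so a scenario would need to carry everything the model should see inside those two fields.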
127 changes: 127 additions & 0 deletions src/helm/benchmark/annotation/bigcodebench_annotator.py
Collaborator
Make sure you reconcile this with main before merging.

@@ -0,0 +1,127 @@
from typing import Any

from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.annotation.annotator import Annotator
from helm.common.request import Request
from gradio_client import Client, handle_file
from tempfile import TemporaryDirectory

import ast
import traceback
import time
import json


def syntax_check(code, verbose=False):
    try:
        ast.parse(code)
        return True
    except (SyntaxError, MemoryError):
        if verbose:
            traceback.print_exc()
        return False


def code_extract(text: str) -> str:
    lines = text.split("\n")
    longest_line_pair = (0, 0)
    longest_so_far = 0

    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            current_lines = "\n".join(lines[i : j + 1])
            if syntax_check(current_lines):
                current_length = sum(1 for line in lines[i : j + 1] if line.strip())
                if current_length > longest_so_far:
                    longest_so_far = current_length
                    longest_line_pair = (i, j)

    return "\n".join(lines[longest_line_pair[0] : longest_line_pair[1] + 1])


class BigCodeBenchAnnotator(Annotator):
    """The BigCodeBench autograder."""

    name = "bigcodebench"

    def __init__(self):
        self.remote_execute_api = "https://bigcode-bigcodebench-evaluator-2.hf.space/"
        self.split = "instruct"
        self.subset = "full"
        self.pass_k = "1"  # Original: "1,5,10"

    def annotate(self, request_state: RequestState) -> Any:
        assert request_state.result
        assert len(request_state.result.completions) == 1
        assert request_state.instance.extra_data
        model_output_text = request_state.result.completions[0].text
        solution = code_extract(model_output_text)

        pass_at_one: float
        with TemporaryDirectory() as tmpdir:

            # dump result to a jsonl in tmpdir using json library
            with open(f"{tmpdir}/result.jsonl", "w") as file:
                for i in range(1140):
                    line: str
                    if request_state.instance.extra_data["task_id"] == f"BigCodeBench/{i}":
                        escaped_solution = json.dumps(solution)[1:-1]
                        line = f'{{"task_id": "BigCodeBench/{i}", "solution": "{escaped_solution}"}}\n'
                    else:
                        line = f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n'
                    file.write(line)

            # with open(f"node_modules/temp_result.jsonl", "w") as file:
            #     for i in range(1140):
            #         line: str
            #         if request_state.instance.extra_data["task_id"] == f"BigCodeBench/{i}":
            #             escaped_solution = json.dumps(solution)[1:-1]
            #             line = f'{{"task_id": "BigCodeBench/{i}", "solution": "{escaped_solution}"}}\n'
            #         else:
            #             line = f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n'
            #         file.write(line)

            # # Following https://github.dev/bigcode-project/bigcodebench/blob/main/bigcodebench/evaluate.py
            # while True:
            #     try:
            #         client = Client(self.remote_execute_api)
            #         results, pass_at_k = client.predict(
            #             split=self.split,
            #             subset=self.subset,
            #             samples=handle_file(f"{tmpdir}/result.jsonl"),
            #             pass_k=self.pass_k,
            #             api_name="/predict"
            #         )
            #         break
            #     except Exception as e:
            #         print(f"Error Message: {e}. Retrying in 4s...")
            #         time.sleep(4)

            max_retries = 3
            retry_count = 0
            success = False  # Flag to indicate if the operation was successful

            while retry_count < max_retries:
                try:
                    client = Client(self.remote_execute_api)
                    results, pass_at_k = client.predict(
                        split=self.split,
                        subset=self.subset,
                        samples=handle_file(f"{tmpdir}/result.jsonl"),
                        pass_k=self.pass_k,
                        api_name="/predict"
                    )
                    success = True  # Operation succeeded
                    pass_at_one = pass_at_k["pass@1"]
                    break
                except Exception as e:
                    retry_count += 1
                    print(f"Attempt {retry_count} failed. Error Message: {e}. Retrying in 4s...")
                    time.sleep(4)

            if not success:
                print("Failed to complete the operation after 3 attempts.")
                pass_at_one = 0


        return {"pass_at_one": pass_at_one}
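The annotator builds each JSONL line by hand, escaping the solution with `json.dumps(solution)[1:-1]` and splicing it into an f-string. As a hedged alternative, the sketch below serializes a whole record with `json.dumps`, which should yield the same line without manual quoting. `build_submission_lines` and its parameters are hypothetical names for illustration, not part of HELM or BigCodeBench; the default task count of 1140 is taken from the loop bound in the diff.

```python
import json
from typing import List

NUM_TASKS = 1140  # Loop bound used by the annotator in the diff.


def build_submission_lines(task_id: str, solution: str, num_tasks: int = NUM_TASKS) -> List[str]:
    """Return one JSONL line per task; only the task being graded carries a solution."""
    lines = []
    for i in range(num_tasks):
        current_id = f"BigCodeBench/{i}"
        record = {
            "task_id": current_id,
            "solution": solution if current_id == task_id else "",
        }
        # json.dumps escapes quotes, newlines, and backslashes in one step.
        lines.append(json.dumps(record) + "\n")
    return lines


# Invented sample solution containing a newline and double quotes.
sample_solution = 'def greet():\n    return "hi"\n'
lines = build_submission_lines("BigCodeBench/3", sample_solution, num_tasks=5)

# Each line parses back to the original record.
print(json.loads(lines[3])["solution"] == sample_solution)
```

Serializing the full record keeps the escaping in one place, so a later change to the record shape does not require touching a hand-written format string.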
152 changes: 152 additions & 0 deletions src/helm/benchmark/annotation/omnimath/gpt_evaluation_template.txt
@@ -0,0 +1,152 @@
# CONTEXT #
I am a teacher, and I have some high-level math problems. I am tasked with evaluating the correctness of a student's answer.
Below, I am provided with a problem and a reference answer. Additionally, a student's answer is provided. My job is to assess whether the student's answer captures the same meaning as the reference answer, even when expressed with different wording or format.

# OBJECTIVE #
I need you to judge whether the student's answer is correct given the ground truth answer.

Your tasks include:
A. Identify Mathematical or Notational Equivalence: Pay special attention to any LaTeX expressions in both answers. Confirm that the mathematical relationships, variables, and operations conveyed are equivalent.
B. Provide a Justification: Conclude with a brief explanation as to why you believe the student's output is correct or incorrect, highlighting any key differences in meaning or content.

# STYLE #
Teaching report.

# TONE #
Professional, scientific.

# AUDIENCE #
Students. Enable them to better understand whether the answer they produce is correct.

# RESPONSE: MARKDOWN REPORT #
## Student Final Answer
[Extract the student's final answer, which is enclosed in "\\boxed{}".]
## Equivalence Judgement
[Whether the student's answer shares the same meaning as the reference answer. (TRUE or FALSE)]
## Justification
[Conclude with a brief explanation as to why you believe the student's answer is correct or incorrect.]


# ATTENTION #
- The reference answer is ALWAYS correct. You should carefully judge whether the student gives the same answer as the reference answer.
- The Equivalence Judgement is only TRUE or FALSE. The answer is FALSE even if the student's final answer is almost correct with a minor mistake.
- The answer is contained within the "boxed" section, so you can focus solely on comparing the content in the student's answer box with the reference answer, without needing to consider the intermediate steps.
- Add "=== report over ===" at the end of the report.

<example math solution>
**Question**:
The numbers $1447$ , $1005$ and $1231$ have something in common: each is a $4$ -digit number beginning with $1$ that has exactly two identical digits. How many such numbers are there?

**Reference Answer**
432

**Student Solution**:
To determine how many 4-digit numbers beginning with 1 have exactly two identical digits, we need to consider the structure of these numbers. A number of this form can be represented as \(1abc\), where \(a\), \(b\), and \(c\) are digits from 0 to 9, and exactly one of the digits \(a\), \(b\), or \(c\) is 1, while the other two are different from each other and different from 1.

We can break down the problem into the following steps:

1. **Choose the digit that is 1**: There are 3 positions (a, b, c) where the digit 1 can be placed. So, there are 3 choices for the position of the digit 1.

2. **Choose the other two digits**: The remaining two digits must be different from each other and different from 1. Therefore, we have 9 choices for the first of these two digits (0-9 except 1) and 8 choices for the second digit (0-9 except 1 and the first digit). This gives us \(9 \times 8 = 72\) choices for the other two digits.

Combining these choices, the total number of 4-digit numbers beginning with 1 that have exactly two identical digits is:
\[3 \times 72 = 216.\]

Thus, the number of such numbers is \(\boxed{216}\).

</example math solution>

## Student Final Answer
216

## Equivalence Judgement
FALSE

## Justification
The student's answer of 216 is incorrect in the context of the problem, which asks for the total count of 4-digit numbers beginning with 1 that have exactly two identical digits. The reference answer is 432.
In the student's solution, they consider only cases where the digit '1' is one of the identical digits. However, the problem also includes the scenario where the identical digits could be different from '1'. Thus, the student's calculation does not account for all valid configurations. The discrepancy in figures indicates that the student's answer does not share the same meaning as the reference answer.

=== report over ===


<example math solution>
**Question**:
Susana and Brenda play a game writing polynomials on the board. Susana starts and they play taking turns.
1) On the preparatory turn (turn 0), Susana chooses a positive integer $n_0$ and writes the polynomial $P_0(x)=n_0$.
2) On turn 1, Brenda chooses a positive integer $n_1$, different from $n_0$, and either writes the polynomial

$$P_1(x)=n_1x+P_0(x) \textup{ or } P_1(x)=n_1x-P_0(x)$$

3) In general, on turn $k$, the respective player chooses an integer $n_k$, different from $n_0, n_1, \ldots, n_{k-1}$, and either writes the polynomial
$$P_k(x)=n_kx^k+P_{k-1}(x) \textup{ or } P_k(x)=n_kx^k-P_{k-1}(x)$$
The first player to write a polynomial with at least one whole number root wins. Find and describe a winning strategy.

**Reference Answer**
\\text{No winning strategy; the game can last indefinitely.}

**Student Solution**:
To determine a winning strategy for Susana, we need to analyze the game and the conditions under which a player can win. The goal is to write a polynomial that has at least one integer root.

Let's start by examining the polynomials that are written on the board after each turn.

1. On turn 0, Susana writes the polynomial \( P_0(x) = n_0 \), where \( n_0 \) is a positive integer. This polynomial is a constant polynomial with no integer roots unless \( n_0 = 0 \), which is not allowed since \( n_0 \) is a positive integer.

2. On turn 1, Brenda chooses a positive integer \( n_1 \) different from \( n_0 \) and writes either \( P_1(x) = n_1 x + P_0(x) = n_1 x + n_0 \) or \( P_1(x) = n_1 x - P_0(x) = n_1 x - n_0 \). In both cases, \( P_1(x) \) is a linear polynomial. A linear polynomial \( ax + b \) has an integer root if and only if \( b/a \) is an integer. Therefore, Brenda can choose \( n_1 \) and \( n_0 \) such that \( n_1 n_0 \) is divisible by \( n_1 \), which is always possible since \( n_1 \) and \( n_0 \) are different positive integers. For example, if \( n_1 = 2 \) and \( n_0 = 2 \), then \( P_1(x) = 2x - 2 \) has the integer root \( x = 1 \). However, since \( n_1 \) must be different from \( n_0 \), we need to consider the next turn.

3. On turn 2, Susana chooses a positive integer \( n_2 \) different from \( n_0 \) and \( n_1 \) and writes either \( P_2(x) = n_2 x^2 + P_1(x) \) or \( P_2(x) = n_2 x^2 - P_1(x) \). To ensure a win, Susana should choose \( n_2 \) such that the resulting polynomial has an integer root. Notice that if \( P_1(x) = n_1 x + n_0 \) or \( P_1(x) = n_1 x - n_0 \), then \( P_2(x) = n_2 x^2 + P_1(x) \) or \( P_2(x) = n_2 x^2 - P_1(x) \) will have an integer root if \( n_2 \) is chosen appropriately. For example, if \( P_1(x) = 2x - 2 \), then \( P_2(x) = n_2 x^2 + 2x - 2 \) or \( P_2(x) = n_2 x^2 - 2x + 2 \) will have an integer root if \( n_2 \) is chosen such that the polynomial can be factored to have an integer root.

By following this strategy, Susana can always choose \( n_2 \) such that the resulting polynomial has an integer root. Therefore, Susana has a winning strategy.

The winning strategy for Susana is to choose \( n_2 \) such that the resulting polynomial has an integer root. Specifically, if \( P_1(x) = n_1 x + n_0 \) or \( P_1(x) = n_1 x - n_0 \), then Susana should choose \( n_2 \) such that \( P_2(x) = n_2 x^2 + P_1(x) \) or \( P_2(x) = n_2 x^2 - P_1(x) \) has an integer root.

The answer is: \(\boxed{\text{Susana has a winning strategy.}}\)

</example math solution>


## Student Final Answer
\text{Susana has a winning strategy.}

## Equivalence Judgement
FALSE

## Justification
The student's answer states that "Susana has a winning strategy," which is not in agreement with the reference answer indicating "No winning strategy; the game can last indefinitely." The essence of the problem is that neither player can guarantee a win because the game can go on indefinitely without either player being able to write a polynomial with an integer root. The student's conclusion contradicts this, suggesting that a specific winning strategy exists for Susana, which is fundamentally incorrect according to the problem's conditions. Thus, the student's answer does not capture the same meaning as the reference answer.


=== report over ===

<example math solution>
**Question**:
Determine if there exists a (three-variable) polynomial $P(x,y,z)$ with integer coefficients satisfying the following property: a positive integer $n$ is [i]not[/i] a perfect square if and only if there is a triple $(x,y,z)$ of positive integers such that $P(x,y,z) = n$.

**Reference Answer**
P(x,y,z) = x^2 + y^2 + z^2 + 2xyz

**Student Solution**:
**Claim:** There exists a three-variable polynomial $P(x, y

</example math solution>

## Student Final Answer
None

## Equivalence Judgement
FALSE

## Justification
The student's answer does not provide a final conclusion and the final answer is missing, which indicates a lack of conclusion.

=== report over ===

<math solution>
**Question**:
{{Problem}}

**Reference Answer**
{{Reference Answer}}

**Student Solution**:
{{Solution}}

</math solution>
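The template ends with three placeholders (`{{Problem}}`, `{{Reference Answer}}`, `{{Solution}}`) and asks the judge to emit a `## Equivalence Judgement` section containing TRUE or FALSE. The diff shown here does not include the code that fills the template or reads the verdict, so the following is only a sketch under that assumption: `fill_template` and `parse_judgement` are hypothetical helpers, and the sample problem and report are invented.

```python
import re
from typing import Optional

PLACEHOLDER = re.compile(r"\{\{(Problem|Reference Answer|Solution)\}\}")
JUDGEMENT = re.compile(r"## Equivalence Judgement\s*\n\s*(TRUE|FALSE)")


def fill_template(template: str, problem: str, reference: str, solution: str) -> str:
    """Substitute all three placeholders in a single pass."""
    values = {"Problem": problem, "Reference Answer": reference, "Solution": solution}
    # A function replacement inserts text literally, so LaTeX backslashes are not
    # interpreted as regex escapes, and inserted text is never re-scanned.
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


def parse_judgement(report: str) -> Optional[bool]:
    """Return True/False for a TRUE/FALSE verdict, or None if no verdict is found."""
    match = JUDGEMENT.search(report)
    if match is None:
        return None
    return match.group(1) == "TRUE"


# Invented example covering only the tail of the template.
template_tail = (
    "<math solution>\n**Question**:\n{{Problem}}\n\n"
    "**Reference Answer**\n{{Reference Answer}}\n\n"
    "**Student Solution**:\n{{Solution}}\n\n</math solution>"
)
prompt = fill_template(template_tail, "Compute $1+1$.", "2", "The answer is \\boxed{2}.")

sample_report = (
    "## Student Final Answer\n2\n\n"
    "## Equivalence Judgement\nTRUE\n\n"
    "## Justification\nSame value.\n\n=== report over ==="
)
print(parse_judgement(sample_report))
```

Returning `None` when no verdict is present lets a caller distinguish a malformed judge response from an explicit FALSE, which matters if such responses should be retried or excluded rather than counted as wrong.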