custom prompt engineering for various models #2130

percyliang · 2023-12-10T02:49:35Z

Created different RunExpanders for Google, OpenAI, and Anthropic models. These models are instruction-following, so they need more explicit instructions to follow the expected format and not ramble on and on.
Also changed the default metric for GSM8K to test whether the final number in the response matches the reference, rather than the final word.
For HumanEval, which is a completion task (rather than in-context learning), we need a different prompt, which only sort of works (there are some annoyances: GPT-4 works better without instructions, but GPT-4 Turbo works better with instructions).

percyliang · 2023-12-10T02:49:54Z

src/helm/benchmark/model_metadata_registry.py

@@ -22,12 +22,14 @@
 CHATML_MODEL_TAG: str = "CHATML_MODEL_TAG"

 # OpenAI Chat format
-OPENAI_CHATGPT_MODEL_TAG: str = "openai_chatgpt"
+OPENAI_CHATGPT_MODEL_TAG: str = "OPENAI_CHATGPT_MODEL_TAG"


percyliang · 2023-12-10T02:50:34Z

src/helm/benchmark/run_specs.py

        metric_specs=get_metric_specs(big_bench_task["metrics"]),
-        groups=["BIG-bench"],
+        groups=[f"big_bench_" + task],


Drive-by fix from when I was trying out some BIG-bench scenarios.

percyliang · 2023-12-10T02:50:47Z

src/helm/benchmark/static/schema_classic.yaml

@@ -1317,7 +1317,12 @@ metrics:
    description: Fraction of model outputs that are mathematically equivalent to the correct reference when using chain-of-thought prompting.
    lower_is_better: false
  - name: exact_match_indicator
-    display_name: Exact match (up to specified indicator)
+    display_name: Exact match (final)


"indicator" is confusing

percyliang · 2023-12-10T02:51:01Z

src/helm/benchmark/static/schema_lite.yaml

@@ -884,14 +889,14 @@ run_groups:
      - efficiency
      - general_information
    environment:
-      main_name: exact_match_indicator
+      main_name: final_number_exact_match


Note we change the main metric for GSM8K

yifanmai · 2023-12-11T18:32:03Z

src/helm/benchmark/metrics/basic_metrics.py

+        # Note: case that's not handled is "2,300" is parsed as "300"
+        x = re.sub(",", "", x)  # To handle numbers like "2,300"
+        x = re.sub(r"[^0-9\.]", " ", x)  # Replace non-digit, non-'.'


FYI: the handling of the `.` character will have to change to support math scenarios in non-English European languages, where `,` is the decimal separator and `.` groups thousands.
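One way the separator could be made configurable, as a rough sketch only. `normalize_number_text` and its `decimal_separator` parameter are hypothetical names made up for this example and are not part of the PR; the sketch also assumes that the other of `.` and `,` is always the thousands separator.

```python
import re


def normalize_number_text(x: str, decimal_separator: str = ".") -> str:
    """Hypothetical helper: drop thousands separators, then blank out non-numeric text."""
    # Assumption: whichever of "." and "," is not the decimal separator groups thousands.
    thousands_separator = "," if decimal_separator == "." else "."
    x = x.replace(thousands_separator, "")
    # Canonicalize the decimal separator to "." so later parsing is locale-independent.
    x = x.replace(decimal_separator, ".")
    # Replace every character that is not a digit or "." with a space.
    return re.sub(r"[^0-9.]", " ", x)
```

Splitting the result on whitespace then yields the numeric tokens, for example `normalize_number_text("2.300,5 km", decimal_separator=",").split()`. A sentence-final period stays attached to the preceding number in this sketch, so a real implementation would need to strip it.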

yifanmai · 2023-12-11T18:53:25Z

src/helm/benchmark/metrics/basic_metrics.py

+    - Returns 1
+    """
+
+    def get_final_number(x: str) -> str:


I think a regex is more straightforward here.

    def get_final_number(x: str) -> str:
        match = re.search(r"(-?[\d,]+(?:.\d+)?)\D*$", x)
        if match is None:
            return ""
        return match.group(1).replace(",", "")

This doesn't work for fractions (the original code does not either), but I checked the GSM8K test set and it looks like the answers are non-negative integers only.

Explanation of what this does: https://regexr.com/7ou5g
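For anyone who wants to try the suggestion outside the regexr link, here is a standalone sketch. It uses the pattern from the comment above with one change that is an assumption about the intent: the decimal point is escaped (`\.`) so that it only matches a literal dot. This is not necessarily the regex that was merged.

```python
import re

# Pattern from the review comment, with the decimal point escaped
# (an assumption; the comment writes it as a bare ".").
FINAL_NUMBER_PATTERN = re.compile(r"(-?[\d,]+(?:\.\d+)?)\D*$")


def get_final_number(x: str) -> str:
    """Return the last number in x with thousands separators removed, or ""."""
    match = FINAL_NUMBER_PATTERN.search(x)
    if match is None:
        return ""
    return match.group(1).replace(",", "")
```

The trailing `\D*$` is what anchors the match to the last number: any candidate followed by more digits fails and the search moves on. As noted above, fractions are not handled: `get_final_number("1/2")` returns only the denominator.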


yifanmai · 2023-12-11T18:56:11Z

src/helm/benchmark/metrics/test_basic_metrics.py

+    assert final_number_exact_match("33", "33") == 1
+    assert final_number_exact_match("33", "33 eggs.") == 1
+    assert final_number_exact_match("The answer is 33", "\\boxed{33}") == 1
+    assert final_number_exact_match("The answer is 33", "\\boxed{33} and 34") == 0


Optional: add a test for negative numbers.
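A sketch of what such a test could look like. The `final_number_exact_match` below is a stand-in written for this example, not the function in basic_metrics.py; it assumes the `(gold, pred)` argument order used in the assertions above and treats a leading `-` as part of the number.

```python
import re


def final_number_exact_match(gold: str, pred: str) -> int:
    """Stand-in for illustration: 1 if the final numbers in gold and pred match, else 0."""

    def get_final_number(x: str) -> str:
        # Remove thousands separators, then take the last optionally signed number.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", x.replace(",", ""))
        return numbers[-1] if numbers else ""

    gold_number = get_final_number(gold)
    # A gold string without any number never counts as a match.
    return 1 if gold_number and gold_number == get_final_number(pred) else 0


def test_final_number_exact_match_negative_numbers() -> None:
    assert final_number_exact_match("The answer is -7", "\\boxed{-7}") == 1
    assert final_number_exact_match("The answer is -7", "7") == 0
    assert final_number_exact_match("-2,300", "so the result is -2300.") == 1
```

One caveat with this stand-in: a hyphen directly before a digit is always read as a minus sign, so a response ending in "10-7" would be scored as -7.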

yifanmai · 2023-12-11T18:58:23Z

src/helm/benchmark/run_expander.py

@@ -291,6 +307,65 @@ def expand(self, run_spec: RunSpec) -> List[RunSpec]:
        ]


+class OpenAIRunExpander(RunExpander):


How about just one InContextLearningInstructionsRunExpander since the OpenAI and Google run expanders are identical? That class name would also be more descriptive.

I originally had separate OpenAI and Google run expanders because I had different prompts at some point. How about having an InContextLearningInstructionsRunExpander class and then just having OpenAI and Google inherit from it? In the future, we might tweak things.


yifanmai · 2023-12-12T05:20:56Z

src/helm/benchmark/metrics/basic_metrics.py

+    - Returns 1
+    """
+
+    def get_final_number(x: str) -> str:


Ended up going with a slightly different and simpler regex.

Also noting that my regex has slightly different behavior from this code when the last numeric digits are part of a malformed number.

yifanmai · 2023-12-12T05:21:07Z

src/helm/benchmark/run_expander.py

@@ -291,6 +307,65 @@ def expand(self, run_spec: RunSpec) -> List[RunSpec]:
        ]


+class OpenAIRunExpander(RunExpander):


Added a todo to deal with this later.
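For reference when the todo is picked up, the shared-base-class idea discussed for these run expanders could look roughly like this. `RunSpec` and `RunExpander` here are simplified stand-ins so the snippet runs on its own (the real classes have more fields and live in HELM), and the instruction wording is hypothetical since the PR's prompts are not quoted in this thread.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class RunSpec:
    """Simplified stand-in; the real RunSpec has more fields."""

    name: str
    instructions: str


class RunExpander:
    """Simplified stand-in for the base class in run_expander.py."""

    def expand(self, run_spec: RunSpec) -> List[RunSpec]:
        raise NotImplementedError


class InContextLearningInstructionsRunExpander(RunExpander):
    """Shared behavior: prepend explicit format instructions to the run's instructions."""

    # Hypothetical wording, for illustration only.
    format_instructions: str = "Follow the format of the examples. Answer concisely."

    def expand(self, run_spec: RunSpec) -> List[RunSpec]:
        combined = f"{self.format_instructions}\n\n{run_spec.instructions}".strip()
        # Return a modified copy and leave the input run spec untouched.
        return [replace(run_spec, instructions=combined)]


class OpenAIRunExpander(InContextLearningInstructionsRunExpander):
    """Inherits the shared prompt for now; can override format_instructions later."""


class GoogleRunExpander(InContextLearningInstructionsRunExpander):
    """Inherits the shared prompt for now; can override format_instructions later."""
```

With this layout the two provider classes stay identical until one of them needs a different prompt, at which point only `format_instructions` (or `expand`) has to be overridden in that subclass.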

percyliang added 8 commits December 8, 2023 11:07

anthropic completions

58c3f75

use XML

7149ae0

Merge branch 'main' into pliang-prompt-eng

11f0218

final_number_exact_match, add prompts

661704f

Merge branch 'main' into pliang-prompt-eng

480d035

custom Google prompts

805f806

customize prompts

244c539

format

3938913

percyliang requested review from yifanmai and JosselinSomervilleRoberts December 10, 2023 02:49


percyliang added 14 commits December 9, 2023 20:09

fix

da16f51

fix

3df83f5

update Cohere metadata

720ba23

simpify metrics

b19cae2

shorten descriptions

82cf2be

delete HumanEval

20ca404

fix

4955687

don't add more tokens

6590f8e

fix

1ea97ec

Anthropic always needs prompt

277baba

improve data descriptions

467fb29

fix

bccdb97

add mixtral

a86ea34

fix name

578e567

yifanmai requested changes Dec 11, 2023


Fix math metrics

750e1df

Add todos

b197cb1

yifanmai approved these changes Dec 12, 2023


yifanmai merged commit 6dd3a5b into main Dec 12, 2023
6 checks passed

yifanmai deleted the pliang-prompt-eng branch December 12, 2023 05:46

yifanmai mentioned this pull request Dec 21, 2023

Prompt format for GPT-3.5 breaks some DecodingTrust scenarios #2174

Closed
