Adding WildBench #3150
Conversation
Maybe drop this file if we don't need it? Or do you intend to implement a pairwise annotator as well?
I do plan to implement this. However, the original repo contains an ambiguity in the implementation of this metric, and I have raised an issue about it there.
Basically, they describe a mechanism to de-bias for length when comparing models, yet the released codebase does not seem to implement it.
try:
    is_following = instruction.check_following(response)
except Exception as e:
    print(f"Instruction following checking failed with error message {e}")
1. use hlog() instead of print() (see the sketch below)
2. start the message with "WARNING: "
3. does this fail frequently? this basically means that if the judge model fails, the model under evaluation gets penalized (score defaults to 0)
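A minimal sketch of points 1 and 2, assuming hlog is imported from helm.common.hierarchical_logger as elsewhere in HELM:

from helm.common.hierarchical_logger import hlog

try:
    is_following = instruction.check_following(response)
except Exception as e:
    # Route the failure through HELM's logger and surface it as a warning
    hlog(f"WARNING: Instruction following checking failed with error message {e}")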
Regarding 3., this is very rare, and the only type of exception I have observed so far was due to langdetect failing to recognize languages. In that case the original codebase considers it a successful following case, so I followed that for now.
1. and 2. have been updated in the latest change.
- name: wildbench_score
  display_name: WildBench Score
  short_display_name: WB Score
  description: Score of the AI output judged by GPT-4.
GPT-4o?
Yes. Changed.
taxonomy:
  task: "?"
  what: "?"
  who: "?"
  when: "?"
Fill these out (to the best of your knowledge)
Added in the latest change
dataset = datasets.load_dataset(
    "allenai/WildBench",
    self.subset,
    trust_remote_code=True,
is trust_remote_code needed? :(
Not needed. This is now removed
baseline_outputs = {
    f"{model}": datasets.load_dataset(
        "allenai/WildBench-V2-Model-Outputs",
        model,
        trust_remote_code=True,
        cache_dir=cache_dir,
        split="train",
    )
    for model in REFERENCE_MODELS
}
It seems like pairwise is half-implemented - I'd suggest finishing the implementation (only the annotator is missing) or removing all the pairwise components.
Similar to the previous comment on the pairwise metric, I'm waiting on the authors' response, but I can also remove it for now to keep the codebase clean.
OK, I'm fine with keeping it in.
history.append(noun + round["content"])
history_text = "\n\n".join(history)
user_query_text = row["conversation_input"][-1]["content"]
checklist_text = "\n".join([f"- {checklist_item}" for checklist_item in row["checklist"]])
why not just keep this as a list?
That's a good catch. The original code also initially kept the checklist items as a list, but they later merged the checklist, so I thought we could just store the processed text.
I have a slight preference for doing the merging in the annotator and keeping this as a list, but either is fine (up to you).
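For illustration, a rough sketch of that split; PROMPT_TEMPLATE and the {$checklist} placeholder below are hypothetical names, not necessarily the ones in this PR:

# Scenario side (sketch): keep the raw checklist items as a list
checklist = list(row["checklist"])

# Annotator side (sketch): merge only when building the judge prompt
checklist_text = "\n".join(f"- {item}" for item in checklist)
annotator_prompt = PROMPT_TEMPLATE.replace("{$checklist}", checklist_text)  # hypothetical template and placeholder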
src/helm/clients/vertexai_client.py
contents.append(
    Content(role=role_mapping.get(msg["role"], "user"), parts=[Part.from_text(msg["content"])])
)
content_key = "\n".join([msg["content"] for msg in request.messages])
don't need this - can just rely on the existing content field (the cache key can be a nested dict).
Just to confirm I understand correctly: do you mean we can use the Content objects as the cache key? But here we have a list of messages, which essentially gives a list of Content objects. Does a list of such nested dicts also work? If it does, I'm curious how that works - any quick pointers would be much appreciated.
Yes, basically any nested dict that can be serialized to JSON can be used as a cache key. In terms of the implementation:
- MongoDB natively supports using a JSON object
- Other caches like SQLite serialize the JSON object to a string in a "canonical" way, which happens here (helm/src/helm/common/key_value_store.py, lines 9 to 11 in 40b3d23):

def request_to_key(request: Mapping) -> str:
    """Normalize a `request` into a `key` so that we can hash using it."""
    return json.dumps(request, sort_keys=True)
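For instance (illustrative values), a list of nested message dicts serializes deterministically:

key = request_to_key({"messages": [{"role": "user", "content": "hi"}], "model": "gemini-1.5-pro"})
# -> '{"messages": [{"content": "hi", "role": "user"}], "model": "gemini-1.5-pro"}'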
In terms of this PR, you could do something like

if request.messages:
    cache_key["messages"] = request.messages
else:
    cache_key["prompt"] = request.prompt

or

{"prompt": request.messages or request.prompt}

or something similar
)
assert isinstance(dataset, datasets.Dataset)
baseline_outputs = {
    f"{model}": datasets.load_dataset(
Since this is quite expensive and also takes up quite a bit of space, I would prefer for the baseline_outputs to only be loaded if we actually need them, i.e. if we're in the pairwise comparison case. You can make this configurable by adding an argument in the constructor (and passing it from the run spec function via ScenarioSpec args), setting an instance variable, and then checking it here.
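A rough sketch of that wiring; the flag name use_model_outputs and the exact signatures below are assumptions, not the code in this PR:

class WildBenchScenario(Scenario):
    def __init__(self, subset: str, use_model_outputs: bool = False):
        super().__init__()
        self.subset = subset
        self.use_model_outputs = use_model_outputs  # hypothetical flag name

    def get_instances(self, output_path: str):
        ...
        baseline_outputs = {}
        if self.use_model_outputs:
            # Only pay the download/disk cost in the pairwise comparison case
            baseline_outputs = {
                model: datasets.load_dataset(
                    "allenai/WildBench-V2-Model-Outputs",
                    model,
                    cache_dir=cache_dir,
                    split="train",
                )
                for model in REFERENCE_MODELS
            }
        ...

The run spec function would then pass something like args={"subset": subset, "use_model_outputs": True} into the ScenarioSpec.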
Mostly looks good. Feel free to merge after addressing the remaining comments. Also, you need to resolve the conflict with main.
Added WildBench scenario, adapter, run specs, annotator, and metrics.
TODO:
- Add a customized adapter that applies chat template for model inference
- Align with original repo on the prompt format for GPT-as-a-judge

Comment: added a ChatAdapter that uses chat messages in the Request initialization, but it's most likely optimizable. Suggestions on this would be helpful and are very welcome.