
Adding WildBench #3150

Merged
merged 31 commits into main from jialiang/wildbench
Dec 13, 2024

Conversation

Contributor

@liamjxu liamjxu commented Nov 12, 2024

Added WildBench scenario, adapter, run specs, annotator, and metrics.

TODO:

  • Add a customized adapter that applies the chat template for model inference
  • Align with the original repo on the prompt format for GPT-as-a-judge

Comment:

  • Currently a new adapter, ChatAdapter, was created to use chat messages in the Request initialization, but it is most likely optimizable; suggestions on this would be very welcome (see the sketch after this list).
  • Right now we only include the WB Score in the schema; we could also include the WB Reward.
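
For context, here is a minimal sketch of what building a chat-style Request could look like. The messages field, model name, and conversation content below are assumptions for illustration, not necessarily the final ChatAdapter interface.

# Hedged sketch: constructing a multi-turn Request from WildBench conversation input.
# Assumes HELM's Request accepts a `messages` list of role/content dicts; the model
# name and conversation are only examples.
from helm.common.request import Request

conversation_input = [
    {"role": "user", "content": "Write a haiku about autumn."},
    {"role": "assistant", "content": "Leaves drift on cool wind..."},
    {"role": "user", "content": "Now make it about winter."},
]

request = Request(
    model="openai/gpt-4o-2024-05-13",
    messages=conversation_input,  # chat messages instead of a single flat prompt string
    max_tokens=512,
)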

@liamjxu liamjxu force-pushed the jialiang/wildbench branch 2 times, most recently from 8007c43 to 46f7a7c on November 12, 2024 04:45
@liamjxu liamjxu requested a review from yifanmai November 12, 2024 16:45
@liamjxu liamjxu self-assigned this Nov 20, 2024
Collaborator

Maybe drop this file if we don't need it? Or do you intend to implement a pairwise annotator as well?

Contributor Author

I do plan to implement this. However, the original repo contains an ambiguity in the implementation of this metric; I have raised an issue about it in the original repo.

Contributor Author

@liamjxu liamjxu Nov 28, 2024

Basically, they mention a mechanism for length debiasing when comparing models, yet the released codebase does not seem to implement it.

try:
    is_following = instruction.check_following(response)
except Exception as e:
    print(f"Instruction following checking failed with error message {e}")
Collaborator

  1. Use hlog() instead of print().
  2. Start the message with "WARNING: " (see the sketch after this list).
  3. Does this fail frequently? This basically means that if the judge model fails, the model under evaluation gets penalized (score defaults to 0).
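
For example, a minimal sketch of the suggested change, assuming hlog is importable from helm.common.hierarchical_logger as elsewhere in the codebase:

from helm.common.hierarchical_logger import hlog

try:
    is_following = instruction.check_following(response)
except Exception as e:
    # Suggested form: hlog() with a "WARNING: " prefix instead of print()
    hlog(f"WARNING: Instruction following checking failed with error message {e}")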

Contributor Author

Regarding 3, this is very rare; the only type of exception I have observed so far was langdetect failing to recognize the language. In that case the original codebase considers it a successful following case, so I followed that for now.
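
As an illustration only, a hedged sketch of that fallback, assuming the failure surfaces as langdetect's LangDetectException (the surrounding names follow the snippet above):

from langdetect.lang_detect_exception import LangDetectException

from helm.common.hierarchical_logger import hlog

try:
    is_following = instruction.check_following(response)
except LangDetectException as e:
    # Mirror the behavior described above: if language detection fails,
    # count the instruction as followed rather than penalizing the model.
    hlog(f"WARNING: langdetect failed ({e}); treating the instruction as followed.")
    is_following = True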

Contributor Author

Points 1 and 2 have been updated in the latest change.

- name: wildbench_score
  display_name: WildBench Score
  short_display_name: WB Score
  description: Score of the AI output judged by GPT-4.
Collaborator

GPT-4o?

Contributor Author

Yes. Changed.

Comment on lines 207 to 211
taxonomy:
  task: "?"
  what: "?"
  who: "?"
  when: "?"
Collaborator

Fill these out (to the best of your knowledge)

Contributor Author

Added in the latest change

dataset = datasets.load_dataset(
    "allenai/WildBench",
    self.subset,
    trust_remote_code=True,
Collaborator

is trust_remote_code needed? :(

Contributor Author

@liamjxu liamjxu Nov 28, 2024

Not needed. This is now removed

Comment on lines 45 to 54
baseline_outputs = {
    f"{model}": datasets.load_dataset(
        "allenai/WildBench-V2-Model-Outputs",
        model,
        trust_remote_code=True,
        cache_dir=cache_dir,
        split="train",
    )
    for model in REFERENCE_MODELS
}
Collaborator

It seems like pairwise is half-implemented - I'd suggest finishing the implementation (only the annotator is missing) or removing all the pairwise components.

Contributor Author

Similar to the previous comment on the pairwise metric, I'm waiting on the authors' response, but I can also remove it first to keep the codebase clean.

Collaborator

OK, I'm fine with keeping it in.

history.append(noun + round["content"])
history_text = "\n\n".join(history)
user_query_text = row["conversation_input"][-1]["content"]
checklist_text = "\n".join([f"- {checklist_item}" for checklist_item in row["checklist"]])
Collaborator

why not just keep this as a list?

Contributor Author

That's a good catch. The original code also initially kept the checklist items as a list, but they later merged the checklist, so I thought we could store only the processed text.

Collaborator

I have a slight preference for doing the merging in the annotator and keeping this as a list, but either is fine (up to you).
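
A minimal sketch of that alternative, with hypothetical variable and placeholder names; the scenario keeps the raw list and the annotator joins it only when building the judge prompt:

# Scenario side: keep the checklist items as a plain list (hypothetical names).
checklist_items = [str(item) for item in row["checklist"]]

# Annotator side: merge only when constructing the judge prompt; the template
# placeholder name here is illustrative.
checklist_text = "\n".join(f"- {item}" for item in checklist_items)
annotator_prompt = prompt_template.replace("{$checklist}", checklist_text)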

contents.append(
    Content(role=role_mapping.get(msg["role"], "user"), parts=[Part.from_text(msg["content"])])
)
content_key = "\n".join([msg["content"] for msg in request.messages])
Collaborator

don't need this - can just rely on the existing content field (the cache key can be a nested dict).

Contributor Author

Just to confirm I understand correctly: do you mean we can use the Content objects as the cache key? Here we have a list of messages, which essentially gives a list of Content objects; does a list of such nested dicts also work? If so, I'm curious how that works - any quick pointers would be much appreciated.

Collaborator

Yes, basically any nested dict that can be serialized to JSON can be used as a cache key. In terms of the implementation:

  1. MongoDB natively supports using a JSON object
  2. Other caches like SQLite serialize the JSON object to a string in a "canonical" way, which happens here:

def request_to_key(request: Mapping) -> str:
    """Normalize a `request` into a `key` so that we can hash using it."""
    return json.dumps(request, sort_keys=True)

In terms of this PR, you could do something like

if request.messages:
    cache_key["messages"] = request.messages
else:
    cache_key["prompt"] = request.prompt

or

{
    "prompt": request.messages or request.prompt
}

or something similar
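
For example, a quick illustration of how a nested cache key serializes under the request_to_key function quoted above (the model name and messages are made up):

import json
from typing import Mapping


def request_to_key(request: Mapping) -> str:
    """Normalize a `request` into a `key` so that we can hash using it."""
    return json.dumps(request, sort_keys=True)


cache_key = {
    "model": "openai/gpt-4o-2024-05-13",
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there"},
    ],
}
print(request_to_key(cache_key))
# {"messages": [{"content": "Hello", "role": "user"}, ...], "model": "openai/gpt-4o-2024-05-13"}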

)
assert isinstance(dataset, datasets.Dataset)
baseline_outputs = {
    f"{model}": datasets.load_dataset(
Collaborator

Since this is quite expensive and also takes up quite a bit of space, I would prefer for the baseline_outputs to only be loaded if we actually need them, i.e. if we're in the pairwise comparison case. You can make this configurable by adding an argument to the constructor (and passing it from the run spec function via ScenarioSpec args), setting an instance variable, and then checking it here.
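
A hedged sketch of that suggestion; the flag name use_model_outputs, the reference model list, and the class/module paths are illustrative rather than the final API:

import datasets
from helm.benchmark.scenarios.scenario import Scenario, ScenarioSpec

# Illustrative only; the real list lives in the scenario module.
REFERENCE_MODELS = ["gpt-4-turbo-2024-04-09", "claude-3-haiku-20240307", "Llama-2-70b-chat-hf"]


class WildBenchScenario(Scenario):
    """Sketch: load the baseline model outputs lazily, only for pairwise evaluation."""

    name = "wildbench"
    description = "Sketch of the WildBench scenario."
    tags = ["instruction_following"]

    def __init__(self, subset: str, use_model_outputs: bool = False):
        super().__init__()
        self.subset = subset
        # Only the pairwise comparison case needs the (large) baseline outputs.
        self.use_model_outputs = use_model_outputs

    def get_instances(self, output_path: str):
        # ... load the main WildBench dataset as in the existing code ...
        baseline_outputs = {}
        if self.use_model_outputs:
            baseline_outputs = {
                model: datasets.load_dataset(
                    "allenai/WildBench-V2-Model-Outputs",
                    model,
                    cache_dir=output_path,  # cache location is illustrative
                    split="train",
                )
                for model in REFERENCE_MODELS
            }
        # ... build and return instances, attaching baseline outputs if present ...
        return []


# Run spec side: pass the flag through ScenarioSpec args.
scenario_spec = ScenarioSpec(
    class_name="helm.benchmark.scenarios.wildbench_scenario.WildBenchScenario",
    args={"subset": "v2", "use_model_outputs": False},
)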

@yifanmai
Collaborator

Mostly looks good. Feel free to merge after addressing the remaining comments. Also, you need to resolve the conflict with main.

@liamjxu liamjxu merged commit 1e55710 into main Dec 13, 2024
12 checks passed
@liamjxu liamjxu deleted the jialiang/wildbench branch December 13, 2024 23:54