Add verifiability judgment scenario #1518
Conversation
src/helm/benchmark/scenarios/verifiability_judgment_scenario.py
Some of the instances here are particularly long, since webpages can be lengthy. Is HELM smart about automatically truncating things, or is there something else I need to do on the scenario side?
As discussed offline: perhaps we can have a max_num_words argument in the scenario, which filters out instances over the limit.
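The filtering suggested above could be sketched roughly like this. This is an illustrative sketch only: the `Instance` dataclass and the `max_num_words` parameter name are assumptions for demonstration, not HELM's actual API.

```python
# Hypothetical sketch of filtering out instances over a word limit.
# The Instance class and max_num_words name are illustrative, not HELM's API.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    text: str  # statement plus the cited source passage


def filter_by_length(instances: List[Instance], max_num_words: int) -> List[Instance]:
    """Keep only instances whose text is at or under the word limit."""
    return [inst for inst in instances if len(inst.text.split()) <= max_num_words]


instances = [Instance("a short claim"), Instance("word " * 5000)]
kept = filter_by_length(instances, max_num_words=100)
print(len(kept))  # → 1
```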
"complete_support": "fully supports",
"partial_support": "partially supports",
"no_support": "does not support",
What's the rationale of the aliasing, as opposed to using the original word forms?
Oh, I thought it'd be more natural for the LM to generate "fully supports" as opposed to "complete_support".
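The aliasing under discussion amounts to a small mapping from dataset label keys to natural-language phrases, so the LM generates fluent text rather than snake_case tokens. A minimal sketch (the helper function name is hypothetical):

```python
# The label aliases from the diff above: dataset keys → natural phrases.
LABEL_ALIASES = {
    "complete_support": "fully supports",
    "partial_support": "partially supports",
    "no_support": "does not support",
}


def to_natural_label(raw_label: str) -> str:
    """Map a raw dataset label to the phrase the LM is asked to generate."""
    return LABEL_ALIASES[raw_label]


print(to_natural_label("partial_support"))  # → partially supports
```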
@nelson-liu would you have time to get the checks working and merge this? I think you just need to run …
Compare: fcb7acb to 57fcb32
Added a scenario for verifiability judgment: given a generated statement and a cited source, predict whether the source fully supports, partially supports, or does not support the statement.
Running:
helm-run -r verifiability_judgment:model=openai/gpt-4-0314 --max-eval-instances 1 --suite 1
I'll add the metrics once I have them, but figured I'd open this PR for now.
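For readers unfamiliar with the task, the input the model sees could be assembled along these lines. This is a hedged sketch of a plausible prompt format, not the template the scenario actually uses; the `build_prompt` helper and its wording are assumptions.

```python
# Illustrative sketch (not the scenario's actual prompt template) of a
# verifiability-judgment input: a statement, its cited source, and a
# three-way question matching the labels discussed above.
def build_prompt(statement: str, source: str) -> str:
    return (
        f"Statement: {statement}\n"
        f"Source: {source}\n"
        "Does the source fully support, partially support, "
        "or not support the statement?"
    )


prompt = build_prompt(
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is a landmark in Paris, France.",
)
print(prompt.startswith("Statement:"))  # → True
```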