CLEVA Scenarios, Perturbations and Metrics #1824
Conversation
2. Use chinese_bleu_1 in Pinyin Transliteration Task
Add Pinyin Transliteration Scenario
Add Scenarios: Instruction Following, Classical Chinese Understanding, and Sentiment Analysis
CLEVA Robustness Perturbations
Fix Python style
2. Change disinformation to fact_check
Load Prompt Setting From File
0914 Review Adjustment
I tried running this on the debugging model `simple/model1`. Most runs work, but some are failing inside template rendering:
Failing runs:

```
cleva:task=dialogue_generation,version=v1,prompt_id=0,subtask=task_oriented,model=simple/model1,data_augmentation=cleva
cleva:task=paraphrase_identification,version=v1,prompt_id=0,subtask=short_utterance,model=simple/model1,data_augmentation=cleva
cleva:task=paraphrase_identification,version=v1,prompt_id=0,subtask=financial_question,model=simple/model1,data_augmentation=cleva
cleva:task=summarization,version=v1,prompt_id=0,subtask=dialogue_summarization,model=simple/model1,data_augmentation=cleva
cleva:task=closed_book_question_answering,version=v1,prompt_id=0,subtask=medical_question_answering,model=simple/model1,data_augmentation=cleva
cleva:task=text_classification,version=v1,prompt_id=0,subtask=news,model=simple/model1,data_augmentation=cleva
cleva:task=text_classification,version=v1,prompt_id=0,subtask=humor,model=simple/model1,data_augmentation=cleva
cleva:task=sentiment_analysis,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=coreference_resolution,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=commonsense_reasoning,version=v1,prompt_id=0,subtask=textual_entailment,model=simple/model1,data_augmentation=cleva
cleva:task=code_synthesis,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=toxicity_detection,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_gender_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_occupation_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_race_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_region_bias,model=simple/model1,data_augmentation=cleva
cleva:task=fact_checking,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
```
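Each failing entry above is a HELM run spec of the form `scenario:key=value,...`. As an aside, a small hypothetical parser makes the structure explicit (HELM has its own run-spec machinery; this helper is only for illustration):

```python
def parse_run_spec(spec: str) -> dict:
    """Split a run spec like "cleva:task=...,model=..." into a field dict.

    Hypothetical helper for illustration only; not HELM's actual parser.
    """
    scenario, _, args = spec.partition(":")
    # Each comma-separated pair is "key=value"; split on the first "=" only,
    # since values such as model names may contain "/" but not "=".
    fields = dict(pair.split("=", 1) for pair in args.split(","))
    fields["scenario"] = scenario
    return fields
```

For example, `parse_run_spec("cleva:task=sentiment_analysis,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva")` yields a dict with `task`, `version`, `prompt_id`, `model`, and `data_augmentation` keys.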
Error traceback:

```
Traceback (most recent call last):
  File "/home/yifanmai/oss/helm/src/helm/benchmark/runner.py", line 173, in run_all
    self.run_one(run_spec)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/runner.py", line 221, in run_one
    instances = scenario.get_instances()
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 942, in get_instances
    instances.extend(self.process_dialogue_instance(row, self.splits[split]))
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 957, in process_dialogue_instance
    self.process_instance(
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 463, in process_instance
    instance = self.converter.transform(row, self.prompt_template, split)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 70, in transform
    transformed_data = self._apply_all(copy.deepcopy(data), templates)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 203, in _apply_all
    assert isinstance(v, str)
AssertionError
```
Could you take a look please?
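For context on where the failure sits: the traceback shows `_apply_all` asserting that every value substituted into a prompt template is a string, so any non-string field raises. A minimal sketch of that pattern, with hypothetical names (the real `cleva_scenario.py` logic differs):

```python
import copy

def render_template(data: dict, template: str) -> str:
    # Hypothetical sketch of the failing pattern: substitute each field
    # into the template, asserting (as the traceback does) that every
    # value is a string.
    data = copy.deepcopy(data)
    for key, value in data.items():
        assert isinstance(value, str)  # AssertionError for lists/dicts
        template = template.replace("{" + key + "}", value)
    return template
```

A dialogue instance whose field is, say, a list of turns rather than a joined string would trip the assertion in the same way as the runs above.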
Thanks for letting us know about these failed runs. We have fixed these issues, which mostly came from inappropriate assertions, and the runs now work fine on our side. We also noticed that fairness perturbations do not work with the code scenarios, including ours and HumanEval, which does not seem to be a problem with our code. An example config we ran into an error with is:
Traceback:
Currently we use
Thank you, I was able to run it locally now! We're very close to merging. I tried running it end-to-end again and found a few issues; we should be good after this round.
Also, please review #1845 for the schema changes... for some reason GitHub is not letting me add you as a reviewer.
Minor update for clearer presentation
Thanks for your suggestions! We have made the necessary changes according to your comments and tested them.
We have reviewed #1845, and @Jianqiao-Zhao left a comment saying that we have no problem with that template. We will open a new PR soon to fill
Sorry, it looks like I just caused some merge conflicts by landing #1844. Could you resolve these please? The required changes should be to rename your
Thank you!
Resolve upstream conflicts
I think we have resolved the conflicts, but GitHub still reports some weird conflict cases in
The easiest way to do this would be through the web editor. If that doesn't work, you can do the following on the command line (assuming your local `main` branch is set up to track your fork):

```
git checkout main
git pull https://github.com/stanford-crfm/helm.git main
# TODO: Manually edit setup.cfg to resolve the conflict,
# i.e. add the CLEVA dependencies
git add setup.cfg
git commit
git log
# TODO: Check that the latest commit is called
# "Merge branch 'main' of https://github.com/stanford-crfm/helm into main"
git push
```
I forgot to say: to use the web editor, click the "Resolve conflicts" button next to "This branch has conflicts that must be resolved". Sorry again for the trouble.
Thanks for your detailed comment! The conflicts should all be resolved now.
Summary
This PR implements major functionalities of CLEVA:
Command
An example of running a CLEVA scenario (Chinese-to-English translation) is as follows. (We need to specify `prompt_id` since CLEVA provides multiple prompt templates; it starts from 0 and is 0 by default.)
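The example command itself is not shown here. Purely as an illustration, a hypothetical `helm-run` invocation built from the run-spec format seen in the failing runs above; the task name and flags are assumptions (recent HELM versions use `--run-entries`, older ones used `--run-specs`), not the PR's actual command:

```shell
# Hypothetical sketch: run a CLEVA scenario on a small sample.
# "task=translation" and the flag names are assumptions for illustration.
helm-run \
  --run-entries "cleva:task=translation,version=v1,prompt_id=0,model=openai/gpt-3.5-turbo-0613" \
  --suite cleva-test \
  --max-eval-instances 10
```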
Results

Below is the result comparison of some CLEVA scenarios using `openai/gpt-3.5-turbo-0613`: