
CLEVA Scenarios, Perturbations and Metrics #1824

Merged: 198 commits merged into stanford-crfm:main on Sep 20, 2023

Conversation

@lyy1994 (Contributor) commented Sep 6, 2023

Summary

This PR implements the major functionality of CLEVA:

  • All Chinese scenarios, with support for multiple prompt templates.
  • All Chinese perturbation strategies.
  • Metrics tailored to Chinese evaluation (e.g., Chinese word segmentation, Chinese word lists); a brief illustrative sketch follows this list.
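As a rough illustration of the metric-tailoring point above: word-overlap metrics need Chinese word segmentation before scoring, because Chinese text has no whitespace word boundaries. The snippet below is a minimal sketch, not the code in this PR; it assumes jieba as the segmenter, and the function name chinese_rouge_1 is illustrative only.

import jieba  # assumed segmenter for this sketch; the PR's metrics may tokenize differently

def chinese_rouge_1(pred: str, ref: str) -> float:
    """Toy unigram-recall over segmented Chinese words (illustrative only)."""
    pred_tokens = jieba.lcut(pred)
    ref_tokens = jieba.lcut(ref)
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for token in ref_tokens if token in pred_tokens)
    return overlap / len(ref_tokens)

print(chinese_rouge_1("今天天气很好", "今天的天气非常好"))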

Command

An example of running a CLEVA scenario (Chinese-to-English translation) is as follows. We need to specify prompt_id because CLEVA provides multiple prompt templates; it starts from 0 and defaults to 0:

helm-run \
-r "cleva:model=openai/gpt-3.5-turbo-0613,task=translation,subtask=zh2en,prompt_id=0,version=v1,data_augmentation=cleva" \
--num-train-trials <num_trials> \
--max-eval-instances <max_eval_instances> \
--suite <suite_id>

Results

Below is a comparison between our reproduced results and the original CLEVA results on some scenarios, using openai/gpt-3.5-turbo-0613:

Scenario | Metric | Reproduced | CLEVA
task=summarization,subtask=dialogue_summarization | ROUGE-2 | 0.3045 | 0.3065
task=translation,subtask=en2zh | SacreBLEU | 60.48 | 59.23
task=fact_checking | Exact Match | 0.4595 | 0.4528
task=bias,subtask=dialogue_region_bias | Micro F1 | 0.5656 | 0.5589

lyy1994 and others added 30 commits (August 18, 2023), including:
  • Use chinese_bleu_1 in the Pinyin Transliteration task
  • Add Pinyin Transliteration Scenario
  • Add scenarios: Instruction Following, Classical Chinese Understanding, and Sentiment Analysis
  • Change disinformation to fact_check
@lyy1994 requested a review from yifanmai on September 15, 2023.
@yifanmai (Collaborator) left a comment

I tried running this on the debugging model simple/model1. Most runs work, but some are failing inside template rendering:

Failing runs:

cleva:task=dialogue_generation,version=v1,prompt_id=0,subtask=task_oriented,model=simple/model1,data_augmentation=cleva
cleva:task=paraphrase_identification,version=v1,prompt_id=0,subtask=short_utterance,model=simple/model1,data_augmentation=cleva
cleva:task=paraphrase_identification,version=v1,prompt_id=0,subtask=financial_question,model=simple/model1,data_augmentation=cleva
cleva:task=summarization,version=v1,prompt_id=0,subtask=dialogue_summarization,model=simple/model1,data_augmentation=cleva
cleva:task=closed_book_question_answering,version=v1,prompt_id=0,subtask=medical_question_answering,model=simple/model1,data_augmentation=cleva
cleva:task=text_classification,version=v1,prompt_id=0,subtask=news,model=simple/model1,data_augmentation=cleva
cleva:task=text_classification,version=v1,prompt_id=0,subtask=humor,model=simple/model1,data_augmentation=cleva
cleva:task=sentiment_analysis,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=coreference_resolution,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=commonsense_reasoning,version=v1,prompt_id=0,subtask=textual_entailment,model=simple/model1,data_augmentation=cleva
cleva:task=code_synthesis,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=toxicity_detection,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_gender_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_occupation_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_race_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_region_bias,model=simple/model1,data_augmentation=cleva
cleva:task=fact_checking,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva

Error traceback:

Traceback (most recent call last):
  File "/home/yifanmai/oss/helm/src/helm/benchmark/runner.py", line 173, in run_all
    self.run_one(run_spec)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/runner.py", line 221, in run_one
    instances = scenario.get_instances()
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 942, in get_instances
    instances.extend(self.process_dialogue_instance(row, self.splits[split]))
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 957, in process_dialogue_instance
    self.process_instance(
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 463, in process_instance
    instance = self.converter.transform(row, self.prompt_template, split)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 70, in transform
    transformed_data = self._apply_all(copy.deepcopy(data), templates)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 203, in _apply_all
    assert isinstance(v, str)
AssertionError
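For context, here is a minimal, hypothetical sketch of how a string-only assertion like this can fail during template filling when a field holds structured data (e.g., a list of dialogue turns). The apply_template helper and field names below are illustrative, not the actual converter in cleva_scenario.py:

import copy

def apply_template(template: str, row: dict) -> str:
    # Assumes every field value is a plain string, mirroring the failing assertion above.
    filled = template
    for key, value in copy.deepcopy(row).items():
        assert isinstance(value, str)  # raises AssertionError for non-string fields
        filled = filled.replace("{" + key + "}", value)
    return filled

print(apply_template("问题:{question}", {"question": "天空为什么是蓝色的?"}))  # works
apply_template("对话:{dialogue}", {"dialogue": ["你好", "请问有什么可以帮您?"]})  # AssertionError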

Could you take a look please?

@lyy1994 (Contributor, Author) commented Sep 16, 2023

> I tried running this on the debugging model simple/model1. Most runs work, but some are failing inside template rendering: […]
>
> Could you take a look please?

Thanks for letting us know about these failed runs. We have fixed the issues, which mostly came from inappropriate assertions, and the runs now work fine on our side. We also noticed that fairness perturbations do not work with code scenarios, including ours and HumanEval, which does not appear to be a problem with our code. An example config that hits this error is code:model=simple/model1,dataset=humaneval,data_augmentation=fairness.

Traceback:

Traceback (most recent call last):
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/runner.py", line 173, in run_all
    self.run_one(run_spec)
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/runner.py", line 260, in run_one
    metric_result: MetricResult = metric.evaluate(
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/metric.py", line 148, in evaluate
    results: List[List[Stat]] = parallel_map(
  File "/home/ubuntu/liyanyang/helm/src/helm/common/general.py", line 232, in parallel_map
    results = list(tqdm(executor.map(process, items), total=len(items), disable=None))
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/metric.py", line 77, in process
    self.metric.evaluate_generation(
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 745, in evaluate_generation
    stats.extend(self.compute_reference_metrics(adapter_spec, request_state, metric_service))
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 551, in compute_reference_metrics
    stats.extend(compute_metrics_helper(MetricName(metric_name), metric_fn_mapping[metric_name]))
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 479, in compute_metrics_helper
    score_1 = max(score_func((gold.output.text, gold.test_cases), preds[0]) for gold in code_golds)
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 479, in <genexpr>
    score_1 = max(score_func((gold.output.text, gold.test_cases), preds[0]) for gold in code_golds)
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 366, in code_eval
    assert gold[1] is not None  # gold[1]["canonical_solution"]
AssertionError
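As a rough, hypothetical illustration of why a perturbed reference can trip the gold[1] is not None check above: if an augmentation rebuilds a reference from its text alone, any structured payload (here, the test cases that code_eval needs) is silently dropped. The CodeReference class and perturb_reference helper below are illustrative, not HELM's actual data classes:

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CodeReference:
    text: str
    test_cases: Optional[dict] = None  # structured payload that code_eval expects

def perturb_reference(ref: CodeReference, new_text: str) -> CodeReference:
    # Bug pattern: rebuilding the reference from text alone drops test_cases.
    return CodeReference(text=new_text)

gold = CodeReference(text="def add(a, b):\n    return a + b", test_cases={"test": "assert add(1, 2) == 3"})
perturbed = perturb_reference(gold, gold.text.replace("add", "sum_two"))
print(perturbed.test_cases is None)  # True: mirrors the failing gold[1] is not None assertion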

Currently we use cleva_robustness for code scenarios in run_specs_cleva_v1.conf to circumvent this issue.

@lyy1994 requested a review from yifanmai on September 16, 2023.
@yifanmai (Collaborator) left a comment

Thank you, I was able to run it locally now! We're very close to merging. I tried running it end-to-end again and found a few issues; we should be good after this round.

@yifanmai (Collaborator) commented

Also, please review #1845 for the schema changes... for some reason GitHub is not letting me add you as a reviewer.

@lyy1994 (Contributor, Author) commented Sep 19, 2023

> Thank you, I was able to run it locally now! We're very close to merging. I tried running it end-to-end again and found a few issues; we should be good after this round.

Thanks for your suggestions! We have made the necessary changes according to your comments and successfully tested run_specs_cleva_v1.conf with simple/model1 on our side.

> Also, please review #1845 for the schema changes... for some reason GitHub is not letting me add you as a reviewer.

We have reviewed #1845, and @Jianqiao-Zhao left a comment confirming that we have no problem with that template. We will open a new PR to fill in schema.yaml soon after this PR is merged. Thanks for your guidance over the past few weeks!

@lyy1994 requested a review from yifanmai on September 19, 2023.
@yifanmai (Collaborator) commented

Sorry, it looks like I just caused some merge conflicts by landing #1844. Could you resolve these please?

The required changes should be to rename your requirements-freeze.txt file to requirements.txt (use your latest contents), deconflict the requirements extras lists in extras.cfg, and redo the imports in run.py

Thank you!

@lyy1994 (Contributor, Author) commented Sep 20, 2023

> Sorry, it looks like I just caused some merge conflicts by landing #1844. Could you resolve these please? […]

I think we have resolved the conflicts, but GitHub still reports some odd conflicts in requirements.txt and setup.cfg... Another issue: we only handled conflicts in src/helm/benchmark/run.py, requirements.txt, and setup.cfg, but #1844 also changes other files, which makes the GitHub Actions tests fail. Please let us know if we need to take further action.

@yifanmai (Collaborator) commented

The easiest way to do this would be through the web editor.

If that doesn't work, you can do the following on the command line (assuming your local main branch is set to track lyy1994:main):

git checkout main
git pull https://github.com/stanford-crfm/helm.git main
# TODO: Manually edit setup.cfg to resolve the conflict
# i.e. add the CLEVA dependencies
git add setup.cfg
git commit
git log
# TODO: Check that the latest commit is called
# "Merge branch 'main' of https://github.com/stanford-crfm/helm into main"
git push

@yifanmai (Collaborator) commented

I forgot to say: to use the web editor, click the "Resolve conflicts" button next to "This branch has conflicts that must be resolved".

Sorry again for the trouble.

@lyy1994 (Contributor, Author) commented Sep 20, 2023

> The easiest way to do this would be through the web editor.
>
> If that doesn't work, you can do the following on the command line […]

Thanks for your detailed comment! The conflicts should all be resolved now.

@yifanmai merged commit 3288e25 into stanford-crfm:main on Sep 20, 2023
3 checks passed