
CLEVA Scenarios, Perturbations and Metrics #1824

Merged: 198 commits merged into stanford-crfm:main on Sep 20, 2023

Conversation

@lyy1994 (Contributor) commented Sep 6, 2023

Summary

This PR implements the major functionality of CLEVA:

  • All Chinese scenarios, with support for multiple prompt templates.
  • All Chinese perturbation strategies.
  • Metrics tailored to Chinese evaluation (e.g., Chinese word segmentation, Chinese word lists); a brief illustrative sketch follows this list.
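As a rough illustration of the metric-tailoring point above: word-overlap metrics need Chinese word segmentation before scoring, because Chinese text has no whitespace word boundaries. The snippet below is a minimal sketch, not the code in this PR; it assumes jieba as the segmenter, and the function name chinese_rouge_1 is illustrative only.

import jieba  # assumed segmenter for this sketch; the PR's metrics may tokenize differently

def chinese_rouge_1(pred: str, ref: str) -> float:
    """Toy unigram-recall over segmented Chinese words (illustrative only)."""
    pred_tokens = jieba.lcut(pred)
    ref_tokens = jieba.lcut(ref)
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for token in ref_tokens if token in pred_tokens)
    return overlap / len(ref_tokens)

print(chinese_rouge_1("今天天气很好", "今天的天气非常好"))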

Command

An example of running a CLEVA scenario (Chinese-to-English translation) is as follows. We need to specify prompt_id because CLEVA provides multiple prompt templates; it starts from 0 and defaults to 0:

helm-run \
-r "cleva:model=openai/gpt-3.5-turbo-0613,task=translation,subtask=zh2en,prompt_id=0,version=v1,data_augmentation=cleva" \
--num-train-trials <num_trials> \
--max-eval-instances <max_eval_instances> \
--suite <suite_id>

Results

Below is a comparison between our reproduced results and the original CLEVA results on some scenarios, using openai/gpt-3.5-turbo-0613:

Scenario | Metric | Reproduced | CLEVA
task=summarization,subtask=dialogue_summarization | ROUGE-2 | 0.3045 | 0.3065
task=translation,subtask=en2zh | SacreBLEU | 60.48 | 59.23
task=fact_checking | Exact Match | 0.4595 | 0.4528
task=bias,subtask=dialogue_region_bias | Micro F1 | 0.5656 | 0.5589

lyy1994 and others added 30 commits (August 18, 2023), including:
  • Use chinese_bleu_1 in the Pinyin Transliteration task
  • Add Pinyin Transliteration Scenario
  • Add scenarios: Instruction Following, Classical Chinese Understanding, and Sentiment Analysis
  • Change disinformation to fact_check
@lyy1994 requested a review from yifanmai on September 15, 2023.
@yifanmai (Collaborator) left a comment

I tried running this on the debugging model simple/model1. Most runs work, but some are failing inside template rendering:

Failing runs:

cleva:task=dialogue_generation,version=v1,prompt_id=0,subtask=task_oriented,model=simple/model1,data_augmentation=cleva
cleva:task=paraphrase_identification,version=v1,prompt_id=0,subtask=short_utterance,model=simple/model1,data_augmentation=cleva
cleva:task=paraphrase_identification,version=v1,prompt_id=0,subtask=financial_question,model=simple/model1,data_augmentation=cleva
cleva:task=summarization,version=v1,prompt_id=0,subtask=dialogue_summarization,model=simple/model1,data_augmentation=cleva
cleva:task=closed_book_question_answering,version=v1,prompt_id=0,subtask=medical_question_answering,model=simple/model1,data_augmentation=cleva
cleva:task=text_classification,version=v1,prompt_id=0,subtask=news,model=simple/model1,data_augmentation=cleva
cleva:task=text_classification,version=v1,prompt_id=0,subtask=humor,model=simple/model1,data_augmentation=cleva
cleva:task=sentiment_analysis,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=coreference_resolution,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=commonsense_reasoning,version=v1,prompt_id=0,subtask=textual_entailment,model=simple/model1,data_augmentation=cleva
cleva:task=code_synthesis,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=toxicity_detection,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_gender_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_occupation_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_race_bias,model=simple/model1,data_augmentation=cleva
cleva:task=bias,version=v1,prompt_id=0,subtask=dialogue_region_bias,model=simple/model1,data_augmentation=cleva
cleva:task=fact_checking,version=v1,prompt_id=0,model=simple/model1,data_augmentation=cleva

Error traceback:

Traceback (most recent call last):
  File "/home/yifanmai/oss/helm/src/helm/benchmark/runner.py", line 173, in run_all
    self.run_one(run_spec)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/runner.py", line 221, in run_one
    instances = scenario.get_instances()
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 942, in get_instances
    instances.extend(self.process_dialogue_instance(row, self.splits[split]))
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 957, in process_dialogue_instance
    self.process_instance(
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 463, in process_instance
    instance = self.converter.transform(row, self.prompt_template, split)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 70, in transform
    transformed_data = self._apply_all(copy.deepcopy(data), templates)
  File "/home/yifanmai/oss/helm/src/helm/benchmark/scenarios/cleva_scenario.py", line 203, in _apply_all
    assert isinstance(v, str)
AssertionError
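For context, here is a minimal, hypothetical sketch of how a string-only assertion like this can fail during template filling when a field holds structured data (e.g., a list of dialogue turns). The apply_template helper and field names below are illustrative, not the actual converter in cleva_scenario.py:

import copy

def apply_template(template: str, row: dict) -> str:
    # Assumes every field value is a plain string, mirroring the failing assertion above.
    filled = template
    for key, value in copy.deepcopy(row).items():
        assert isinstance(value, str)  # raises AssertionError for non-string fields
        filled = filled.replace("{" + key + "}", value)
    return filled

print(apply_template("问题:{question}", {"question": "天空为什么是蓝色的?"}))  # works
apply_template("对话:{dialogue}", {"dialogue": ["你好", "请问有什么可以帮您?"]})  # AssertionError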

Could you take a look please?

@lyy1994 (Contributor, Author) commented Sep 16, 2023

> I tried running this on the debugging model simple/model1. Most runs work, but some are failing inside template rendering: […]
>
> Could you take a look please?

Thanks for letting us know about these failed runs. We have fixed the issues, which mostly came from inappropriate assertions, and the runs now work fine on our side. We also noticed that fairness perturbations do not work with code scenarios, including ours and HumanEval, which does not appear to be a problem with our code. An example config that hits this error is code:model=simple/model1,dataset=humaneval,data_augmentation=fairness.

Traceback:

Traceback (most recent call last):
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/runner.py", line 173, in run_all
    self.run_one(run_spec)
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/runner.py", line 260, in run_one
    metric_result: MetricResult = metric.evaluate(
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/metric.py", line 148, in evaluate
    results: List[List[Stat]] = parallel_map(
  File "/home/ubuntu/liyanyang/helm/src/helm/common/general.py", line 232, in parallel_map
    results = list(tqdm(executor.map(process, items), total=len(items), disable=None))
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/mnt/data/huangyongfeng/AI4Protein/anaconda3/envs/crfm-helm/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/metric.py", line 77, in process
    self.metric.evaluate_generation(
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 745, in evaluate_generation
    stats.extend(self.compute_reference_metrics(adapter_spec, request_state, metric_service))
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 551, in compute_reference_metrics
    stats.extend(compute_metrics_helper(MetricName(metric_name), metric_fn_mapping[metric_name]))
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 479, in compute_metrics_helper
    score_1 = max(score_func((gold.output.text, gold.test_cases), preds[0]) for gold in code_golds)
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 479, in <genexpr>
    score_1 = max(score_func((gold.output.text, gold.test_cases), preds[0]) for gold in code_golds)
  File "/home/ubuntu/liyanyang/helm/src/helm/benchmark/metrics/basic_metrics.py", line 366, in code_eval
    assert gold[1] is not None  # gold[1]["canonical_solution"]
AssertionError
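As a rough, hypothetical illustration of why a perturbed reference can trip the gold[1] is not None check above: if an augmentation rebuilds a reference from its text alone, any structured payload (here, the test cases that code_eval needs) is silently dropped. The CodeReference class and perturb_reference helper below are illustrative, not HELM's actual data classes:

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CodeReference:
    text: str
    test_cases: Optional[dict] = None  # structured payload that code_eval expects

def perturb_reference(ref: CodeReference, new_text: str) -> CodeReference:
    # Bug pattern: rebuilding the reference from text alone drops test_cases.
    return CodeReference(text=new_text)

gold = CodeReference(text="def add(a, b):\n    return a + b", test_cases={"test": "assert add(1, 2) == 3"})
perturbed = perturb_reference(gold, gold.text.replace("add", "sum_two"))
print(perturbed.test_cases is None)  # True: mirrors the failing gold[1] is not None assertion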

Currently we use cleva_robustness for code scenarios in run_specs_cleva_v1.conf to circumvent this issue.

@lyy1994 requested a review from yifanmai on September 16, 2023.
@yifanmai (Collaborator) left a comment

Thank you, I was able to run it locally now! We're very close to merging. I tried running it end-to-end again and found a few issues; we should be good after this round.

@yifanmai (Collaborator) commented

Also, please review #1845 for the schema changes... for some reason GitHub is not letting me add you as a reviewer.

@lyy1994 (Contributor, Author) commented Sep 19, 2023

> Thank you, I was able to run it locally now! We're very close to merging. I tried running it end-to-end again and found a few issues; we should be good after this round.

Thanks for your suggestions! We have made the necessary changes according to your comments and successfully tested run_specs_cleva_v1.conf with simple/model1 on our side.

> Also, please review #1845 for the schema changes... for some reason GitHub is not letting me add you as a reviewer.

We have reviewed #1845, and @Jianqiao-Zhao left a comment confirming that we have no problem with that template. We will open a new PR to fill in schema.yaml soon after this PR is merged. Thanks for your guidance over the past few weeks!

@lyy1994 requested a review from yifanmai on September 19, 2023.
@yifanmai (Collaborator) commented

Sorry, it looks like I just caused some merge conflicts by landing #1844. Could you resolve these please?

The required changes should be to rename your requirements-freeze.txt file to requirements.txt (use your latest contents), deconflict the requirements extras lists in extras.cfg, and redo the imports in run.py

Thank you!

@lyy1994 (Contributor, Author) commented Sep 20, 2023

> Sorry, it looks like I just caused some merge conflicts by landing #1844. Could you resolve these please? […]

I think we have resolved the conflicts, but GitHub still reports some odd conflicts in requirements.txt and setup.cfg... Another issue: we only handled conflicts in src/helm/benchmark/run.py, requirements.txt, and setup.cfg, but #1844 also changes other files, which makes the GitHub Actions tests fail. Please let us know if we need to take further action.

@yifanmai (Collaborator) commented

The easiest way to do this would be through the web editor.

If that doesn't work, you can do the following on the command line (assuming your local main branch is set to track lyy1994:main):

git checkout main
git pull https://github.com/stanford-crfm/helm.git main
# TODO: Manually edit setup.cfg to resolve the conflict
# i.e. add the CLEVA dependencies
git add setup.cfg
git commit
git log
# TODO: Check that the latest commit is called
# "Merge branch 'main' of https://github.com/stanford-crfm/helm into main"
git push

@yifanmai (Collaborator) commented

I forgot to say: to use the web editor, click the "Resolve conflicts" button next to "This branch has conflicts that must be resolved".

Sorry again for the trouble.

@lyy1994 (Contributor, Author) commented Sep 20, 2023

> The easiest way to do this would be through the web editor.
>
> If that doesn't work, you can do the following on the command line […]

Thanks for your detailed comment! The conflicts should all be resolved now.

@yifanmai merged commit 3288e25 into stanford-crfm:main on Sep 20, 2023
3 checks passed