
Add BHASA scenarios #2648

Merged 52 commits into stanford-crfm:main on Jun 21, 2024

Conversation

@raileymontalan (Contributor) commented May 14, 2024

Add the following for BHASA:

  • Setup dependencies
  • Run entries (few-shot and zero-shot settings)
  • Custom metrics for machine translation (chrF++), summarization (XL-Sum ROUGE-L), and question answering (SQuAD exact match and SQuAD F1); a sketch of the SQuAD-style metrics appears after this description
  • Run specs
  • Scenarios (Indonesian, Tamil, Thai, Vietnamese):
    • Natural Language Understanding: question answering, sentiment analysis, toxicity detection/classification
    • Natural Language Generation: machine translation, abstractive summarization
    • Natural Language Reasoning: natural language inference, causal reasoning
  • Presentation schemas

Moved to a different PR:

  • Linguistic Diagnostics: syntax minimal pairs, pragmatic reasoning
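
For readers unfamiliar with the question-answering metrics listed above, here is a minimal sketch of SQuAD-style exact match and F1, following the standard SQuAD evaluation script. The function names are ours, and the BHASA implementation may differ, for example in how it normalizes non-English answers:

import re
import string
from collections import Counter

def normalize_answer(text: str) -> str:
    # Standard SQuAD normalization: lowercase, drop punctuation and
    # English articles, collapse whitespace. A multilingual variant
    # would likely adjust or skip the article-stripping step.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def squad_exact_match(prediction: str, reference: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(reference))

def squad_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)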

@yifanmai (Collaborator) left a comment:

Thanks! Left some initial comments. This may take me a while to review - I'll try to finish this early next week.

Collaborator:

Could we use the rouge-score package from PyPI (https://pypi.org/project/rouge-score/)? Or are there divergences from the PyPI package?

Contributor:

The rouge-score package from PyPI should be the codebase that the XL-Sum multilingual ROUGE codebase was based on, but there are divergences in the use of multilingual tokenizers and stemmers. We also added a fixed random seed in the bootstrap aggregation step to ensure reproducibility.
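
To make the reproducibility point concrete: the fix amounts to seeding the RNG used for resampling, so that bootstrap aggregates come out identical across runs. A minimal sketch of the idea in plain numpy (our own code, not the fork's; the function name and seed value are illustrative):

import numpy as np

def bootstrap_interval(scores, n_resamples=1000, seed=123):
    # With a fixed seed, repeated runs over the same per-instance scores
    # produce identical confidence intervals.
    rng = np.random.default_rng(seed)
    values = np.asarray(scores, dtype=float)
    means = [
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_resamples)
    ]
    return np.percentile(means, 2.5), np.percentile(means, 97.5)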

(Resolved review thread on src/helm/benchmark/run_specs/bhasa_run_specs.py.)
@yifanmai (Collaborator) commented Jun 8, 2024

Should I review this now or should we try to merge #2694 first?

@weiqipedia (Contributor):

Let's review this first as #2694 requires further discussion on how to handle the aggregation of scores across categories!

(@raileymontalan mentioned this pull request on Jun 13, 2024.)
@raileymontalan (Contributor, Author):

Hi @yifanmai, addressing your comments from the other PR here:

# Sample 100 examples for test
data = df.sample(n=100, random_state=5018)

# Sample 565 examples for test
data = df.sample(n=565, random_state=5018)
Contributor:

No need to sample since we are taking the entire validation set as the test set!
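
For concreteness, the change amounts to dropping the df.sample call and keeping the whole split. A hedged sketch (the dataset name and config below are illustrative, not necessarily what the scenario loads):

from datasets import load_dataset

# Illustrative source; the BHASA scenario's actual dataset may differ.
dataset = load_dataset("tydiqa", "secondary_task", split="validation")
df = dataset.to_pandas()

# Before: data = df.sample(n=565, random_state=5018)
# After: use the entire validation set as the test set.
data = df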

metrics['chr_f_plus_plus'] = self.chrf_scorer.sentence_score(pred, refs).score
return metrics

def remove_braces(self, text: str) -> str:
Contributor:

@raileymontalan There's no need for this function, actually. If I'm not wrong, this is something that was found in SummarizationMetric (https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/metrics/summarization_metrics.py), because it used to contain the summary in braces and used } as a stop token. It seems the separator between few-shot instances for summarization has since been changed to ###. Since we do not include any instructions in our prompt to provide the translation within braces, there is no need for this cleaning step.

@yifanmai This also answers your comment from #2694 (comment).
We might want to remove this from SummarizationMetric as well, or is it perhaps there for backward compatibility?
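
For context, the chr_f_plus_plus value above appears to come from sacrebleu's chrF metric, where word_order=2 is what turns plain chrF into chrF++. A minimal standalone sketch (the example sentences are ours):

from sacrebleu.metrics import CHRF

# word_order=2 adds word unigrams and bigrams on top of character
# n-grams, i.e. chrF++ rather than plain chrF.
chrf_scorer = CHRF(word_order=2)

pred = "Kucing itu duduk di atas tikar."
refs = ["Kucing duduk di atas tikar."]

print(chrf_scorer.sentence_score(pred, refs).score)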

Collaborator:

Let's leave it there for backwards compatibility for now.

rel_tol: float = 0.01


def check_test_cases(test_cases: List[TestCase], bias_func: Callable[[List[str]], Optional[float]]):
Contributor:

@raileymontalan There's no need for test_bhasa_scenario.py and test_bhasa_metrics.py, since they seem to be copies of existing examples and do not actually target the BHASA scenarios/metrics.

@yifanmai (Collaborator):

Looks good, thanks! Could you also run the linter on this?

pip install -e '.[dev]'
./pre-commit.sh

@raileymontalan (Contributor, Author):

> Looks good, thanks! Could you also run the linter on this?
>
> pip install -e '.[dev]'
> ./pre-commit.sh
  • The linter has been run and files have been reformatted.
  • The test files have been removed.
  • Sampling for the TyDiQA test set has been removed.
  • The remove_braces function has been removed from the chrF++ metric.

setup.cfg (outdated diff):
@@ -271,6 +276,7 @@ all =
crfm-helm[mongo]
crfm-helm[heim]
crfm-helm[vlm]
cfrm-helm[bhasa]
Collaborator:

Typo: cfrm-helm[bhasa] should be crfm-helm[bhasa]

@raileymontalan (Contributor, Author):

@yifanmai there's a peculiar issue I'm running into when executing pre-commit.sh.

There are some lines in our code that are >120 characters, in particular some instructions in Tamil ("உங்களுக்கு ஒரு பத்தியும் ஒரு கேள்வியும் தரப்படும். தரப்பட்ட பத்தியிலிருந்து கேள்விக்கான பதிலைக் கண்டறியவும்.", roughly "You will be given a passage and a question. Find the answer to the question from the given passage.") and Thai ("อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร?\nกรุณาตอบโดยใช้คำเดียวเท่านั้น:\n- แง่บวก\n- แง่ลบ\n- เฉยๆ", roughly "What is the sentiment of the following text? Please answer using only one word: positive / negative / neutral") in bhasa_run_specs.py. Because they are >120 characters, I tried to split them into multi-line strings. However, pre-commit.sh treats them as <120 characters and re-merges the multi-line strings into a single line, causing the failures in the Git checks.

To remedy this, I had to assign these strings to a temporary variable to ensure the whole statement is <120 characters. Please advise.

@yifanmai (Collaborator):

What you did looks reasonable. We have two different tools, flake8 and black, and it looks like they conflict in how they measure string length (this doesn't happen often). In the future, if you run into this again, you can add # noqa to the end of the line, which should skip the length check for that line.
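
To make both workarounds concrete, a short sketch (the variable names are ours):

# Workaround 1: bind the long literal to a variable so the enclosing
# statement stays under the 120-character limit.
THAI_SENTIMENT_INSTRUCTION = (
    "อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร?\n"
    "กรุณาตอบโดยใช้คำเดียวเท่านั้น:\n- แง่บวก\n- แง่ลบ\n- เฉยๆ"
)

# Workaround 2: keep the literal on one line and silence flake8's
# line-length check (E501) for that line only (string shortened here).
instructions = "อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร?"  # noqa: E501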

@yifanmai (Collaborator):

Thanks for your help!

@yifanmai merged commit 99ee768 into stanford-crfm:main on Jun 21, 2024. 9 checks passed.