Add BHASA LINDSEA scenarios #2694

Merged 11 commits into stanford-crfm:main on Jun 26, 2024

Conversation

raileymontalan (Contributor)

Add the following for BHASA:

  • Linguistic Diagnostics: syntax minimal pairs, pragmatic reasoning

@raileymontalan mentioned this pull request on May 31, 2024
@yifanmai (Collaborator) left a comment


This looks amazing! I mostly just have minor comments. It is a little long so I did a partial review (not quite done with the scenarios, schema and run entries yet) - will review the rest tomorrow.


    def _compute_chrf(self, refs: List[str], pred: str) -> Dict[str, float]:
        metrics: Dict[str, float] = {}
        metrics["ChrF++"] = self.chrf_scorer.sentence_score(pred, refs).score

Generally our convention for metrics is camel case - could you make this "chr_f_plus_plus" instead?


Correction - convention is snake case, not camel case.
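(For concreteness, the rename under the snake-case convention would be a one-line change, sketched here with the key taken from the reviewer's suggestion:)

        metrics["chr_f_plus_plus"] = self.chrf_scorer.sentence_score(pred, refs).score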

aggregator.add_scores(self.rouge_scorer.score(ref, pred))
aggregates = aggregator.aggregate()
for key, value in self.rouge_metrics.items():
metrics[value] = aggregates[key].mid.fmeasure * 100

Is this to make the range 0 to 100 instead of 0 to 1? We generally prefer the 0 to 1 range.
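(A minimal sketch of the suggested change, assuming the only adjustment is dropping the scaling factor so the score stays in the 0-to-1 range:)

            metrics[value] = aggregates[key].mid.fmeasure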

            metrics[value] = aggregates[key].mid.fmeasure * 100
        return metrics

    def _remove_braces(self, text: str) -> str:

Optional: Don't know if this is important, but if you want to ensure that braces are removed in a balanced way, then you should do

        if text.startswith("{") and text.endswith("}"):
            text = text[1:-1]

otherwise you might strip the brace from only the start or only the end. Likewise for the other occurrence of this function.

Also, for my education, why do we need to remove braces?
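(Putting the balanced-removal suggestion together, the whole helper might read as in the sketch below; this is illustrative, not the PR's final code:)

    def _remove_braces(self, text: str) -> str:
        # Strip one pair of surrounding braces only when both are present,
        # so a brace at only one end is left untouched.
        if text.startswith("{") and text.endswith("}"):
            text = text[1:-1]
        return text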

            text = text[:-1]
        return text

    def evaluate(

nit: can omit this definition, if all you are doing is calling the super.


        return result

def get_bhasa_machine_translation_metric_specs() -> List[MetricSpec]:

Move these to a different file bhasa_metric_specs.py (follow the conventions in common_metric_specs.py).

The reason for doing this is that we don't want the bhasa_run_specs file to import bhasa_metrics, which transitively imports optional dependencies, otherwise this would cause helm-run to fail for someone who doesn't have the optional dependencies installed.
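(A sketch of what the separate specs module could contain; the file name and function name come from this thread, while the import path and the metric class name are illustrative assumptions, following the class-name-string pattern of common_metric_specs.py:)

    # bhasa_metric_specs.py
    from typing import List

    from helm.benchmark.metrics.metric import MetricSpec


    def get_bhasa_machine_translation_metric_specs() -> List[MetricSpec]:
        # Referencing the metric by its fully-qualified class name string keeps
        # bhasa_metrics (and its optional dependencies) out of the import graph
        # of bhasa_run_specs.
        return [
            MetricSpec(
                class_name="helm.benchmark.metrics.bhasa_metrics.BhasaMachineTranslationMetric",
                args={},
            )
        ]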

},
}

def generate_xquad_run_spec(language="th"):

optional: can inline this into get_xquad_spec() since it isn't used elsewhere

        name=name,
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=get_f1_metric_specs(),

For all the sentiment analysis scenarios, you probably want get_exact_match_metric_specs() + get_classification_metric_specs() - the confusingly named get_f1_metric_specs() gives you word-overlap F1 instead of classification F1 and is intended for open-ended generation evaluation.
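(Concretely, the suggested swap for the sentiment analysis run specs would be roughly this one-argument change, sketched here:)

        metric_specs=get_exact_match_metric_specs() + get_classification_metric_specs(),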

        name=name,
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=get_f1_metric_specs(),

Same comment regarding metric_specs as above.

}
}

def generate_xlsum_run_spec(language="id"):

optional: inline

same for the other inline-able functions below


Could you clarify:

  • Is this subfolder a redistributed copy of another code repository? If so:
    • What is the URL for the original source?
    • Has it been modified from the original source?

I see some URLs in README.md and source code, but it is hard for me to tell if these URLs refer to the original source, or to other redistributed code contained in the original source.

@raileymontalan (Contributor, Author)

Hi @yifanmai, apologies for the mixup. We've decided to keep only the LINDSEA-related code here. All other code has been moved to this PR. I will be compiling and addressing your comments there.

@yifanmai (Collaborator) left a comment


Mostly looks good, just some minor stuff left.


        if split == "train":
            # Select only bottom 20th percentile by length for in-context examples as examples are very long
            data = df[df["passage_text"].apply(len) < df["passage_text"].apply(len).quantile(0.2)]
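(As an aside, an equivalent formulation of this filter that computes the lengths only once, sketched for readability; df and the column name are taken from the snippet above:)

            lengths = df["passage_text"].str.len()
            data = df[lengths < lengths.quantile(0.2)]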

Great, thanks for the clarification.

@run_spec_function("lindsea_syntax_minimal_pairs")
def get_lindsea_syntax_minimal_pairs_spec(language: str = "id", method: str = "mcq") -> RunSpec:
    name = f"lindsea_syntax_minimal_pairs_{language}"
    if method == "mcq":

What's the reason for doing the "A" and "B" formatting yourself, instead of using get_multiple_choice_joint_adapter_spec() which will do it for you? get_multiple_choice_joint_adapter_spec() has the added advantage of being more similar to get_multiple_choice_separate_adapter_spec() except for logprobs vs generation.
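(For reference, a call to that helper might look roughly like the sketch below; the instructions string is purely illustrative, and the parameter names are assumptions based on how the helper is called elsewhere in HELM, so the actual signature should be checked:)

    adapter_spec = get_multiple_choice_joint_adapter_spec(
        instructions="Which sentence is more acceptable?",  # illustrative instruction text
        input_noun=None,
        output_noun="Answer",
    )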

      - efficiency
      - general_information
    environment:
      main_name: accuracy

This should be exact_match or quasi_exact_match.

@@ -0,0 +1,167 @@
---
############################################################
metrics:

Add exact_match or quasi_exact_match to this list.

@yifanmai (Collaborator)

Please run the linter. At the root of the repo, run:

pip install -e '.[dev]'
./pre-commit.sh

@yifanmai (Collaborator) left a comment


Could you merge / rebase the changes from #2648?

@raileymontalan (Contributor, Author)

Hi @yifanmai, I saw that a lot of PR checks (including this one) were failing with the same error message (Failed to build backports.zoneinfo). Please advise.

@yifanmai (Collaborator)

@raileymontalan sorry, the main branch was broken. Could you merge main again? That should pick up the fix.

@raileymontalan (Contributor, Author)

@yifanmai Thanks! This PR has now passed all the checks :)

@weiqipedia (Contributor)

As a reminder, we still need to discuss and revamp the metric calculations for the LINDSEA scenarios before this can be merged!

@yifanmai (Collaborator)

I'll merge this first so that I can do some prototyping on top of this.

@yifanmai merged commit 4e0d5ed into stanford-crfm:main on Jun 26, 2024
9 checks passed