Add BHASA LINDSEA scenarios #2694

Merged 11 commits into stanford-crfm:main on Jun 26, 2024

Conversation

raileymontalan (Contributor)

Add the following for BHASA:

  • Linguistic Diagnostics: syntax minimal pairs, pragmatic reasoning

@raileymontalan mentioned this pull request on May 31, 2024
@yifanmai (Collaborator) left a comment


This looks amazing! I mostly just have minor comments. It is a little long so I did a partial review (not quite done with the scenarios, schema and run entries yet) - will review the rest tomorrow.


    def _compute_chrf(self, refs: List[str], pred: str) -> Dict[str, float]:
        metrics: Dict[str, float] = {}
        metrics["ChrF++"] = self.chrf_scorer.sentence_score(pred, refs).score

Generally our convention for metrics is camel case - could you make this "chr_f_plus_plus" instead?


Correction - convention is snake case, not camel case.
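(For concreteness, the rename under the snake-case convention would be a one-line change, sketched here with the key taken from the reviewer's suggestion:)

        metrics["chr_f_plus_plus"] = self.chrf_scorer.sentence_score(pred, refs).score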

aggregator.add_scores(self.rouge_scorer.score(ref, pred))
aggregates = aggregator.aggregate()
for key, value in self.rouge_metrics.items():
metrics[value] = aggregates[key].mid.fmeasure * 100

Is this to make the range 0 to 100 instead of 0 to 1? We generally prefer the 0 to 1 range.
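(A minimal sketch of the suggested change, assuming the only adjustment is dropping the scaling factor so the score stays in the 0-to-1 range:)

            metrics[value] = aggregates[key].mid.fmeasure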

            metrics[value] = aggregates[key].mid.fmeasure * 100
        return metrics

    def _remove_braces(self, text: str) -> str:

Optional: Don't know if this is important, but if you want to ensure that braces are removed in a balanced way, then you should do

        if text.startswith("{") and text.endswith("}"):
            text = text[1:-1]

otherwise you might strip the brace from only the start or only the end. Likewise for the other occurrence of this function.

Also, for my education, why do we need to remove braces?
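(Putting the balanced-removal suggestion together, the whole helper might read as in the sketch below; this is illustrative, not the PR's final code:)

    def _remove_braces(self, text: str) -> str:
        # Strip one pair of surrounding braces only when both are present,
        # so a brace at only one end is left untouched.
        if text.startswith("{") and text.endswith("}"):
            text = text[1:-1]
        return text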

            text = text[:-1]
        return text

    def evaluate(

nit: can omit this definition, if all you are doing is calling the super.


        return result

def get_bhasa_machine_translation_metric_specs() -> List[MetricSpec]:

Move these to a different file bhasa_metric_specs.py (follow the conventions in common_metric_specs.py).

The reason for doing this is that we don't want the bhasa_run_specs file to import bhasa_metrics, which transitively imports optional dependencies, otherwise this would cause helm-run to fail for someone who doesn't have the optional dependencies installed.
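(A sketch of what the separate specs module could contain; the file name and function name come from this thread, while the import path and the metric class name are illustrative assumptions, following the class-name-string pattern of common_metric_specs.py:)

    # bhasa_metric_specs.py
    from typing import List

    from helm.benchmark.metrics.metric import MetricSpec


    def get_bhasa_machine_translation_metric_specs() -> List[MetricSpec]:
        # Referencing the metric by its fully-qualified class name string keeps
        # bhasa_metrics (and its optional dependencies) out of the import graph
        # of bhasa_run_specs.
        return [
            MetricSpec(
                class_name="helm.benchmark.metrics.bhasa_metrics.BhasaMachineTranslationMetric",
                args={},
            )
        ]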

},
}

def generate_xquad_run_spec(language="th"):

optional: can inline this into get_xquad_spec() since it isn't used elsewhere

        name=name,
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=get_f1_metric_specs(),

For all the sentiment analysis scenarios, you probably want get_exact_match_metric_specs() + get_classification_metric_specs() - the confusingly named get_f1_metric_specs() gives you word-overlap F1 instead of classification F1 and is intended for open-ended generation evaluation.
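(Concretely, the suggested swap for the sentiment analysis run specs would be roughly this one-argument change, sketched here:)

        metric_specs=get_exact_match_metric_specs() + get_classification_metric_specs(),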

        name=name,
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=get_f1_metric_specs(),

Same comment regarding metric_specs as above.

}
}

def generate_xlsum_run_spec(language="id"):

optional: inline

same for the other inline-able functions below


Could you clarify:

  • Is this subfolder a redistributed copy of another code repository? If so:
    • What is the URL for the original source?
    • Has it been modified from the original source?

I see some URLs in README.md and source code, but it is hard for me to tell if these URLs refer to the original source, or to other redistributed code contained in the original source.

@raileymontalan (Contributor, Author)

Hi @yifanmai, apologies for the mixup. We've decided to keep only the LINDSEA-related code here. All other code has been moved to this PR. I will be compiling and addressing your comments there.

@yifanmai (Collaborator) left a comment


Mostly looks good, just some minor stuff left.


        if split == "train":
            # Select only bottom 20th percentile by length for in-context examples as examples are very long
            data = df[df["passage_text"].apply(len) < df["passage_text"].apply(len).quantile(0.2)]
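(As an aside, an equivalent formulation of this filter that computes the lengths only once, sketched for readability; df and the column name are taken from the snippet above:)

            lengths = df["passage_text"].str.len()
            data = df[lengths < lengths.quantile(0.2)]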

Great, thanks for the clarification.

@run_spec_function("lindsea_syntax_minimal_pairs")
def get_lindsea_syntax_minimal_pairs_spec(language: str = "id", method: str = "mcq") -> RunSpec:
    name = f"lindsea_syntax_minimal_pairs_{language}"
    if method == "mcq":

What's the reason for doing the "A" and "B" formatting yourself, instead of using get_multiple_choice_joint_adapter_spec() which will do it for you? get_multiple_choice_joint_adapter_spec() has the added advantage of being more similar to get_multiple_choice_separate_adapter_spec() except for logprobs vs generation.
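(For reference, a call to that helper might look roughly like the sketch below; the instructions string is purely illustrative, and the parameter names are assumptions based on how the helper is called elsewhere in HELM, so the actual signature should be checked:)

    adapter_spec = get_multiple_choice_joint_adapter_spec(
        instructions="Which sentence is more acceptable?",  # illustrative instruction text
        input_noun=None,
        output_noun="Answer",
    )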

      - efficiency
      - general_information
    environment:
      main_name: accuracy

This should be exact_match or quasi_exact_match.

@@ -0,0 +1,167 @@
---
############################################################
metrics:

Add exact_match or quasi_exact_match to this list.

@yifanmai (Collaborator)

Please run the linter. At the root of the repo, run:

pip install -e '.[dev]'
./pre-commit.sh

@yifanmai (Collaborator) left a comment


Could you merge / rebase the changes from #2648?

@raileymontalan (Contributor, Author)

Hi @yifanmai, I saw that a lot of PR checks (including this one) were failing with the same error message (Failed to build backports.zoneinfo). Please advise.

@yifanmai (Collaborator)

@raileymontalan sorry, the main branch was broken. Could you merge main again? That should pick up the fix.

@raileymontalan (Contributor, Author)

@yifanmai Thanks! This PR has now passed all the checks :)

@weiqipedia (Contributor)

As a reminder, we still need to discuss and revamp the metric calculations for the LINDSEA scenarios before this can be merged!

@yifanmai (Collaborator)

I'll merge this first so that I can do some prototyping on top of this.

@yifanmai merged commit 4e0d5ed into stanford-crfm:main on Jun 26, 2024
9 checks passed