
Add BHASA scenarios #2648

Merged 52 commits into stanford-crfm:main on Jun 21, 2024

Conversation

@raileymontalan (Contributor) commented May 14, 2024

Add the following for BHASA:

  • Setup dependencies
  • Run entries (few-shot and zero-shot settings)
  • Custom metrics for machine translation (chrF++), summarization (XL-Sum ROUGE-L), and question answering (SQuAD exact match and SQuAD F1); a sketch of the SQuAD-style metrics appears after this description
  • Run specs
  • Scenarios (Indonesian, Tamil, Thai, Vietnamese):
    • Natural Language Understanding: question answering, sentiment analysis, toxicity detection/classification
    • Natural Language Generation: machine translation, abstractive summarization
    • Natural Language Reasoning: natural language inference, causal reasoning
  • Presentation schemas

Moved to a different PR:

  • Linguistic Diagnostics: syntax minimal pairs, pragmatic reasoning
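
For readers unfamiliar with the question-answering metrics listed above, here is a minimal sketch of SQuAD-style exact match and F1, following the standard SQuAD evaluation script. The function names are ours, and the BHASA implementation may differ, for example in how it normalizes non-English answers:

import re
import string
from collections import Counter

def normalize_answer(text: str) -> str:
    # Standard SQuAD normalization: lowercase, drop punctuation and
    # English articles, collapse whitespace. A multilingual variant
    # would likely adjust or skip the article-stripping step.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def squad_exact_match(prediction: str, reference: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(reference))

def squad_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)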

@yifanmai (Collaborator) left a comment:

Thanks! Left some initial comments. This may take me a while to review - I'll try to finish this early next week.

Collaborator:

Could we use the rouge-score package from PyPI (https://pypi.org/project/rouge-score/)? Or are there divergences from the PyPI package?

Contributor:

The rouge-score package from PyPI should be the codebase that the XL-Sum multilingual ROUGE codebase was based on, but there are divergences in the use of multilingual tokenizers and stemmers. We also added a fixed random seed in the bootstrap aggregation step to ensure reproducibility.
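
To make the reproducibility point concrete: the fix amounts to seeding the RNG used for resampling, so that bootstrap aggregates come out identical across runs. A minimal sketch of the idea in plain numpy (our own code, not the fork's; the function name and seed value are illustrative):

import numpy as np

def bootstrap_interval(scores, n_resamples=1000, seed=123):
    # With a fixed seed, repeated runs over the same per-instance scores
    # produce identical confidence intervals.
    rng = np.random.default_rng(seed)
    values = np.asarray(scores, dtype=float)
    means = [
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_resamples)
    ]
    return np.percentile(means, 2.5), np.percentile(means, 97.5)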

(Resolved review thread on src/helm/benchmark/run_specs/bhasa_run_specs.py.)
@yifanmai (Collaborator) commented Jun 8, 2024

Should I review this now or should we try to merge #2694 first?

@weiqipedia (Contributor):

Let's review this first as #2694 requires further discussion on how to handle the aggregation of scores across categories!

(@raileymontalan mentioned this pull request on Jun 13, 2024.)
@raileymontalan (Contributor, Author):

Hi @yifanmai, addressing your comments from the other PR here:

# Sample 100 examples for test
data = df.sample(n=100, random_state=5018)

# Sample 565 examples for test
data = df.sample(n=565, random_state=5018)
Contributor:

No need to sample since we are taking the entire validation set as the test set!
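
For concreteness, the change amounts to dropping the df.sample call and keeping the whole split. A hedged sketch (the dataset name and config below are illustrative, not necessarily what the scenario loads):

from datasets import load_dataset

# Illustrative source; the BHASA scenario's actual dataset may differ.
dataset = load_dataset("tydiqa", "secondary_task", split="validation")
df = dataset.to_pandas()

# Before: data = df.sample(n=565, random_state=5018)
# After: use the entire validation set as the test set.
data = df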

metrics['chr_f_plus_plus'] = self.chrf_scorer.sentence_score(pred, refs).score
return metrics

def remove_braces(self, text: str) -> str:
Contributor:

@raileymontalan There's no need for this function, actually. If I'm not wrong, this is something that was found in SummarizationMetric (https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/metrics/summarization_metrics.py), because it used to contain the summary in braces and used } as a stop token. It seems the separator between few-shot instances for summarization has since been changed to ###. Since we do not include any instructions in our prompt to provide the translation within braces, there is no need for this cleaning step.

@yifanmai This also answers your comment from #2694 (comment).
We might want to remove this from SummarizationMetric as well, or is it perhaps there for backward compatibility?
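
For context, the chr_f_plus_plus value above appears to come from sacrebleu's chrF metric, where word_order=2 is what turns plain chrF into chrF++. A minimal standalone sketch (the example sentences are ours):

from sacrebleu.metrics import CHRF

# word_order=2 adds word unigrams and bigrams on top of character
# n-grams, i.e. chrF++ rather than plain chrF.
chrf_scorer = CHRF(word_order=2)

pred = "Kucing itu duduk di atas tikar."
refs = ["Kucing duduk di atas tikar."]

print(chrf_scorer.sentence_score(pred, refs).score)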

Collaborator:

Let's leave it there for backwards compatibility for now.

rel_tol: float = 0.01


def check_test_cases(test_cases: List[TestCase], bias_func: Callable[[List[str]], Optional[float]]):
Contributor:

@raileymontalan There's no need for test_bhasa_scenario.py and test_bhasa_metrics.py, since they seem to be copies of existing examples and do not actually target the BHASA scenarios/metrics.

@yifanmai (Collaborator):

Looks good, thanks! Could you also run the linter on this?

pip install -e '.[dev]'
./pre-commit.sh

@raileymontalan (Contributor, Author):

> Looks good, thanks! Could you also run the linter on this?
>
> pip install -e '.[dev]'
> ./pre-commit.sh
  • The linter has been run and files have been reformatted.
  • The test files have been removed.
  • Sampling for the TyDiQA test set has been removed.
  • The remove_braces function has been removed from the chrF++ metric.

setup.cfg (outdated diff):
@@ -271,6 +276,7 @@ all =
crfm-helm[mongo]
crfm-helm[heim]
crfm-helm[vlm]
cfrm-helm[bhasa]
Collaborator:

Typo: cfrm-helm[bhasa] should be crfm-helm[bhasa]

@raileymontalan (Contributor, Author):

@yifanmai there's a peculiar issue I'm running into when executing pre-commit.sh.

There are some lines in our code that are >120 characters, in particular some instructions in Tamil ("உங்களுக்கு ஒரு பத்தியும் ஒரு கேள்வியும் தரப்படும். தரப்பட்ட பத்தியிலிருந்து கேள்விக்கான பதிலைக் கண்டறியவும்.", roughly "You will be given a passage and a question. Find the answer to the question from the given passage.") and Thai ("อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร?\nกรุณาตอบโดยใช้คำเดียวเท่านั้น:\n- แง่บวก\n- แง่ลบ\n- เฉยๆ", roughly "What is the sentiment of the following text? Please answer using only one word: positive / negative / neutral") in bhasa_run_specs.py. Because they are >120 characters, I tried to split them into multi-line strings. However, pre-commit.sh treats them as <120 characters and re-merges the multi-line strings into a single line, causing the failures in the Git checks.

To remedy this, I had to assign these strings to a temporary variable to ensure the whole statement is <120 characters. Please advise.

@yifanmai (Collaborator):

What you did looks reasonable. We have two different tools, flake8 and black, and it looks like they conflict in how they measure string length (this doesn't happen often). In the future, if you run into this again, you can add # noqa to the end of the line, which should skip the length check for that line.
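
To make both workarounds concrete, a short sketch (the variable names are ours):

# Workaround 1: bind the long literal to a variable so the enclosing
# statement stays under the 120-character limit.
THAI_SENTIMENT_INSTRUCTION = (
    "อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร?\n"
    "กรุณาตอบโดยใช้คำเดียวเท่านั้น:\n- แง่บวก\n- แง่ลบ\n- เฉยๆ"
)

# Workaround 2: keep the literal on one line and silence flake8's
# line-length check (E501) for that line only (string shortened here).
instructions = "อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร?"  # noqa: E501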

@yifanmai (Collaborator):

Thanks for your help!

@yifanmai merged commit 99ee768 into stanford-crfm:main on Jun 21, 2024. 9 checks passed.