Add BHASA scenarios #2648
Conversation
Thanks! Left some initial comments. This may take me a while to review - I'll try to finish this early next week.
Could we use `rouge-score` from PyPI? https://pypi.org/project/rouge-score/ Or are there divergences from the PyPI package?
The `rouge-score` package from PyPI should be the codebase that the xlsum multilingual rouge score codebase was based on, but there are divergences in the use of multilingual tokenizers and stemmers, and we also added a fixed random seed in the bootstrap aggregation step in order to ensure reproducibility.
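For illustration, a minimal sketch of fixed-seed bootstrap aggregation (not the actual BHASA or xlsum code; the function name, sample count, and confidence bounds are assumptions):

```python
import random
from typing import Dict, List


def bootstrap_aggregate(scores: List[float], n_samples: int = 1000, seed: int = 123) -> Dict[str, float]:
    """Aggregate per-example scores with a fixed-seed bootstrap so repeated runs match exactly."""
    rng = random.Random(seed)  # fixed seed -> identical resamples on every run
    means = []
    for _ in range(n_samples):
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    return {
        "mean": sum(scores) / len(scores),
        "ci_low": means[int(0.025 * n_samples)],
        "ci_high": means[int(0.975 * n_samples)],
    }
```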
Should I review this now or should we try to merge #2694 first?
Let's review this first as #2694 requires further discussion on how to handle the aggregation of scores across categories!
Hi @yifanmai, addressing your comments from the other PR here:
# Sample 100 examples for test
data = df.sample(n=100, random_state=5018)
# Sample 565 examples for test
data = df.sample(n=565, random_state=5018)
No need to sample since we are taking the entire validation set as the test set!
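Purely illustrative (the DataFrame and column names are placeholders, not the BHASA scenario code), the suggested simplification amounts to:

```python
import pandas as pd

# Placeholder validation split; in the scenario this would be the loaded dataset.
df = pd.DataFrame({"text": ["example"] * 565, "label": [0] * 565})

# Before: a fixed-size sample was drawn for the test split.
# data = df.sample(n=565, random_state=5018)

# After: use the entire validation set as the test set, so no sampling is needed.
data = df
```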
metrics['chr_f_plus_plus'] = self.chrf_scorer.sentence_score(pred, refs).score
return metrics

def remove_braces(self, text: str) -> str:
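For context, here is a minimal sentence-level chrF++ sketch assuming sacrebleu is the backend behind `chrf_scorer` (an assumption based on the `sentence_score(...).score` call above):

```python
from sacrebleu.metrics import CHRF

# word_order=2 turns plain chrF into chrF++ (adds word bigrams to the character n-grams).
chrf_scorer = CHRF(word_order=2)

pred = "The cat sits on the mat."
refs = ["The cat is sitting on the mat."]

score = chrf_scorer.sentence_score(pred, refs).score
print(f"chrF++: {score:.2f}")
```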
@raileymontalan There's no need for this function actually. If I'm not wrong, this is something that was found in the SummarizationMetric (https://github.com/stanford-crfm/helm/blob/main/src/helm/benchmark/metrics/summarization_metrics.py) because they used to use braces to contain the summary, and they used `}` as a stop token. But it seems like the separator between few-shot instances for summarization has been changed to `###`. Since we do not actually include any instructions in our prompt to provide the translation within braces, there is no need for this cleaning step.
@yifanmai This also answers your comment from (#2694 (comment)). We might want to remove this from SummarizationMetric as well? Or is that perhaps there for backward compatibility?
Let's leave it there for backwards compatibility for now.
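For reference, a brace-cleaning step of this kind typically looks like the sketch below (a hypothetical reconstruction, not the actual SummarizationMetric code); since the BHASA prompts never ask for output inside braces, there is nothing for it to strip here:

```python
def remove_braces(text: str) -> str:
    """Strip a leading '{' and trailing '}' left over from brace-delimited outputs."""
    if text.startswith("{"):
        text = text[1:]
    if text.endswith("}"):
        text = text[:-1]
    return text


assert remove_braces("{a translated sentence}") == "a translated sentence"
assert remove_braces("a translated sentence") == "a translated sentence"
```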
rel_tol: float = 0.01

def check_test_cases(test_cases: List[TestCase], bias_func: Callable[[List[str]], Optional[float]]):
@raileymontalan There's no need for test_bhasa_scenario.py and test_bhasa_metrics.py since they seem to be copies of existing examples and not actually targeting BHASA scenarios/metrics?
Looks good, thanks! Could you also run the linter on this?
`pip install -e '.[dev]'`
`./pre-commit.sh`
setup.cfg
@@ -271,6 +276,7 @@ all =
crfm-helm[mongo]
crfm-helm[heim]
crfm-helm[vlm]
cfrm-helm[bhasa]
Typo: `cfrm-helm[bhasa]` should be `crfm-helm[bhasa]`.
@yifanmai there's a peculiar issue I'm running into when executing `./pre-commit.sh`. There are some lines in our code that are >120 characters, in particular some instructions that are in Tamil ("உங்களுக்கு ஒரு பத்தியும் ஒரு கேள்வியும் தரப்படும். தரப்பட்ட பத்தியிலிருந்து கேள்விக்கான பதிலைக் கண்டறியவும்.") and Thai ("อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร?\nกรุณาตอบโดยใช้คำเดียวเท่านั้น:\n- แง่บวก\n- แง่ลบ\n- เฉยๆ"). To remedy this, I had to assign these strings to a temporary variable to ensure the whole statement is <120 characters. Please advise.
What you did looks reasonable. We have two different tools, flake8 and black. It looks like they conflict in how they measure string length (this doesn't happen often). In the future, if you run into this again, you can add
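One standard flake8 escape hatch for an over-length line is a `# noqa: E501` comment (mentioning it here as an assumption about what is being suggested, since the exact directive isn't quoted). Below is a sketch of both workarounds, the temporary-variable approach described above and the per-line suppression; the variable names are illustrative, not the actual BHASA code:

```python
# Workaround 1: pull the long string out into its own variable (split across
# adjacent literals) so the statement that uses it stays under the line limit.
TAMIL_QA_INSTRUCTION = (
    "உங்களுக்கு ஒரு பத்தியும் ஒரு கேள்வியும் தரப்படும். "
    "தரப்பட்ட பத்தியிலிருந்து கேள்விக்கான பதிலைக் கண்டறியவும்."
)
prompt = TAMIL_QA_INSTRUCTION + "\n\n" + "passage and question go here"

# Workaround 2: keep the long literal in place and silence flake8's E501
# (line too long) check for that single line.
thai_sentiment_instruction = "อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร?\nกรุณาตอบโดยใช้คำเดียวเท่านั้น:\n- แง่บวก\n- แง่ลบ\n- เฉยๆ"  # noqa: E501
```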
Thanks for your help!
Add the following for BHASA:
Moved to a different PR: