
Add multilabel classification metrics #1408

Merged

Conversation

JoelNiklaus (Contributor)

This is a WIP for adding population-level classification metrics for multilabel text classification tasks. Any feedback is welcome, especially on how to test it.
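For readers skimming the thread: "population-level" here means aggregate scores over the whole evaluation set rather than per-instance scores. A minimal sketch with scikit-learn (the label names and the use of `MultiLabelBinarizer` are illustrative assumptions, not this PR's actual implementation):

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative gold and predicted label sets (hypothetical data).
y_true = [{"law", "tax"}, {"health"}, {"law"}]
y_pred = [{"law"}, {"health", "tax"}, {"law"}]

# Binarize the label sets into indicator matrices over the full label space.
mlb = MultiLabelBinarizer()
mlb.fit(y_true + y_pred)
true_matrix = mlb.transform(y_true)
pred_matrix = mlb.transform(y_pred)

# Population-level (aggregate) scores rather than per-instance scores.
print("micro F1:", f1_score(true_matrix, pred_matrix, average="micro"))
print("macro F1:", f1_score(true_matrix, pred_matrix, average="macro"))
```

Micro-F1 pools every label decision into a single contingency table; macro-F1 averages the per-label F1 scores, weighting rare labels equally.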

src/helm/benchmark/metrics/classification_metrics.py (outdated review thread, resolved)
@@ -218,10 +218,15 @@ def construct_example_prompt(self, instance: Instance, include_output: bool, ref

# References (optionally) and output
output: str

delimiter = ","
Collaborator

Might want to try with and without space
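To illustrate the suggestion (a toy comparison, not PR code): the delimiter choice affects both how reference texts are joined into the prompt and what the model is likely to echo back:

```python
correct = ["law", "tax"]

print(",".join(correct))   # "law,tax"
print(", ".join(correct))  # "law, tax" -- closer to natural model output

# Stripping each piece when parsing tolerates either variant:
print([label.strip() for label in "law, tax".split(",")])  # ['law', 'tax']
```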

if not correct_references:
output = "n/a"
else:
output = delimiter.join([correct_reference.output.text for correct_reference in correct_references])
Collaborator

Note to self: This might change instances if we have scenarios with multiple correct references somehow (which should not happen).
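A hedged sketch of how that invariant could be made explicit (`join_references`, its arguments, and the guard are hypothetical, not part of this PR):

```python
from typing import List

def join_references(reference_texts: List[str], delimiter: str = ",",
                    multilabel: bool = False) -> str:
    # Hypothetical guard: fail fast instead of silently changing the
    # instance text when a single-label scenario carries several
    # correct references.
    if not multilabel and len(reference_texts) > 1:
        raise ValueError("Multiple correct references in a single-label scenario")
    # Mirrors the diff above: join the texts, or fall back to "n/a".
    return delimiter.join(reference_texts) if reference_texts else "n/a"

print(join_references(["tax"]))                          # tax
print(join_references(["tax", "law"], multilabel=True))  # tax,law
```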

src/helm/benchmark/metrics/classification_metrics.py (outdated review thread, resolved)
src/helm/benchmark/run_specs.py (outdated review thread, resolved)
@yifanmai (Collaborator) left a comment

Also modify test_classification_metrics.py; the existing tests should check that the single-label case continues to work.
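A sketch of the kind of regression test meant here (pytest-style; `split_prediction` is a hypothetical stand-in for however the metric tokenizes predictions, with `delimiter=None` modeling the original single-label behavior):

```python
from typing import List, Optional

def split_prediction(prediction: str, delimiter: Optional[str]) -> List[str]:
    # Hypothetical stand-in: None means single-label mode, never split.
    if delimiter is None:
        return [prediction]
    return [label.strip() for label in prediction.split(delimiter)]

def test_single_label_case_unchanged():
    # Pre-existing behavior: a prediction containing a comma
    # must remain a single label.
    assert split_prediction("yes, definitely", None) == ["yes, definitely"]

def test_multilabel_case_splits():
    assert split_prediction("law, tax", ",") == ["law", "tax"]
```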

@JoelNiklaus (Contributor, Author)

I updated the PR. @yifanmai, would you mind taking a look?

@yifanmai (Collaborator) left a comment

Looks mostly good; we just need to make sure we don't accidentally split predictions in the single-label classification case.
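To make the risk concrete (a toy illustration, not HELM code): unconditionally splitting on the delimiter corrupts single-label predictions that legitimately contain commas:

```python
# A single-label prediction whose class name happens to contain commas.
prediction = "1,000-5,000 employees"

print(prediction.split(","))  # ['1', '000-5', '000 employees'] -- corrupted

# Gating the split on a multilabel flag keeps the single-label path
# intact (`multilabel` is a hypothetical configuration knob):
multilabel = False
labels = prediction.split(",") if multilabel else [prediction]
print(labels)                 # ['1,000-5,000 employees']
```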

src/helm/benchmark/metrics/classification_metrics.py (four outdated review threads, resolved)
@JoelNiklaus (Contributor, Author)

Thank you so much for the review! I addressed the requested changes.

@yifanmai (Collaborator)

Looks good. Thanks!

@yifanmai yifanmai merged commit eecea63 into stanford-crfm:main Mar 22, 2023