
The f-measure is ill-defined when there are no true positives or no positive predictions #72

Open
timokau opened this issue Nov 14, 2019 · 3 comments
Labels: enhancement (New feature or request), Priority: Medium

Comments

timokau (Collaborator) commented Nov 14, 2019

sklearn issues a warning during the tests:

sklearn.exceptions.UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in samples with no predicted labels.

This is because

  • some of the test samples generated in csrank/tests/test_choice_functions.py:trivial_choice_problem have no true positives
  • some of the learners predict no positives for some of the generated problems

In both of those cases the f-measure is not properly defined. sklearn assigns 0 and 1 respectively.

How should we deal with this? A metric should be defined for these possibilities. 0 and 1 in those cases seem somewhat reasonable, so maybe we should just silence the warning?
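For context, a minimal sketch (not taken from csrank's test suite) of how the warning can be reproduced with scikit-learn's multilabel F1 in the average="samples" setting; the exact fallback values depend on the installed version:

    # Minimal reproduction sketch (not csrank's actual test code). With a
    # multilabel target and average="samples", scikit-learn warns and falls
    # back to 0.0 whenever a sample has no predicted or no true positive labels.
    import numpy as np
    from sklearn.metrics import f1_score

    y_true = np.array([[1, 1, 0]])  # sample with true positives
    y_pred = np.array([[0, 0, 0]])  # learner predicts no positives
    print(f1_score(y_true, y_pred, average="samples"))  # UndefinedMetricWarning, 0.0

    y_true = np.array([[0, 0, 0]])  # sample with no true positives
    y_pred = np.array([[1, 0, 0]])
    print(f1_score(y_true, y_pred, average="samples"))  # UndefinedMetricWarning, 0.0

Newer scikit-learn releases (0.22 and later) also accept a zero_division argument to f1_score that controls the fallback value and silences the warning.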

kiudee (Owner) commented Nov 18, 2019

We should avoid the first problem by generating test samples that cannot consist of only negatives. In general, assigning a 1 in these cases would be sensible.

Regarding the second case: Assigning 0 here is sensible, since the learner achieved no true positives.

Note: My version of sklearn (0.20.2) returns 0.0 for both cases.

timokau (Collaborator, Author) commented Nov 18, 2019

You're right, sklearn returns 0.0 for both cases. The more I think about this, the less sure I am that defining values for these cases is a good idea. The implementation is also not straightforward, since we would have to do some of the work that we currently outsource to scipy.

Here are the tests I came up with:

    (1) There are no true positives but some predicted positives, i.e. "infinite recall":
    >>> f1_measure([[False, False]], [[True, True]])
    0.0

    (2) There are no predicted positives but some true positives, i.e. 0 recall, 0 precision:
    >>> f1_measure([[True, True]], [[False, False]])
    0.0

    (3) There are neither true nor predicted positives, i.e. all predictions are correct:
    >>> f1_measure([[False, False]], [[False, False]])
    1.0

Cases (2) and (3) seem pretty clear-cut, but (1) should really depend on how many labels were predicted positive. Should we sidestep the issue by just defining cases (2) and (3) and continuing to throw a warning for (1)?

kiudee (Owner) commented Nov 22, 2019

Of those three cases, (2) is an obvious 0.0.
For (3) the value 1.0 is sensible, but I would still throw a warning, since having no positives in an instance might hint at a problem in the dataset.
Similarly, I would return 0.0 for (1) and raise a warning.
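A possible sketch of that behaviour (f1_measure here is an illustrative stand-in, not csrank's actual implementation):

    # Hypothetical sketch of the behaviour proposed above; f1_measure is an
    # illustrative helper, not csrank's actual metric implementation.
    import warnings
    import numpy as np

    def f1_measure(y_true, y_pred):
        """Mean per-sample F1 with explicit handling of the ill-defined cases."""
        y_true = np.asarray(y_true, dtype=bool)
        y_pred = np.asarray(y_pred, dtype=bool)
        scores = []
        for true_row, pred_row in zip(y_true, y_pred):
            n_true = true_row.sum()
            n_pred = pred_row.sum()
            if n_true == 0 and n_pred == 0:
                # Case (3): nothing to predict and nothing predicted -> 1.0, with a warning.
                warnings.warn("Sample has no true positives; F1 set to 1.0.")
                scores.append(1.0)
            elif n_true == 0:
                # Case (1): no true positives, but some predicted -> 0.0, with a warning.
                warnings.warn("Sample has no true positives; F1 set to 0.0.")
                scores.append(0.0)
            elif n_pred == 0:
                # Case (2): true positives exist, but nothing was predicted -> 0.0.
                scores.append(0.0)
            else:
                tp = np.logical_and(true_row, pred_row).sum()
                precision = tp / n_pred
                recall = tp / n_true
                scores.append(0.0 if tp == 0 else 2 * precision * recall / (precision + recall))
        return float(np.mean(scores))

With that definition, the three doctests above evaluate to 0.0, 0.0 and 1.0, and cases (1) and (3) emit a warning.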

kiudee added the labels enhancement (New feature or request) and Priority: Medium on Jun 11, 2020