Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GreaterThan constraint crashing with NaN values #592

Closed
fealho opened this issue Sep 16, 2021 · 4 comments · Fixed by #600
Closed

GreaterThan constraint crashing with NaN values #592

fealho opened this issue Sep 16, 2021 · 4 comments · Fixed by #600
Assignees
Labels
bug Something isn't working feature:constraints Related to inputting rules or business logic
Milestone

Comments

@fealho
Copy link
Member

fealho commented Sep 16, 2021

As reported in issue #590 and #589, the GreaterThan seems to be failing in the presence of NaN values. The cause seems to be the following line:

if not constraint.is_valid(data).all():

Basically, since NaN values are not comparable, this line is returning false for the comparison between NaN's, which causes the code to crash.

We should work on a fix, seeing as this error happens with our demo dataset (student_placements).

@fealho fealho added the bug Something isn't working label Sep 16, 2021
@npatki
Copy link
Contributor

npatki commented Sep 17, 2021

FYI same thing is happening with ColumnFormula -- raised in #593

Under the hood, the is_valid methods for each constraint check for validity. In this case, GreaterThan seems to be using np.greater or np.greater_equal. These will output False whenever they encounter NaN values.

I think we should just ignore NaN values since they clearly don't apply to a GreaterThan constraint.

@kveerama kveerama added the feature:constraints Related to inputting rules or business logic label Sep 19, 2021
@katxiao katxiao self-assigned this Sep 21, 2021
@katxiao
Copy link
Contributor

katxiao commented Sep 21, 2021

@npatki I have a follow-up to the proposal of ignoring NaN values.

If we have a GreaterThan constraint on two columns a and b (a > b), I think it makes sense to ignore a row where both a and b are NaN, and return True for that row. However, if a is non-null and b is null, that row seems invalid and I think we should return False in that case.

What do you think?

@npatki
Copy link
Contributor

npatki commented Sep 22, 2021

@katxiao To me, the most straightforward interpretation of GreaterThan is: a > b if and only if a is present and b is present. We should return True whenever the > operator doesn't really apply.

  1. Both values are missing or
  2. One of the values is missing

Otherwise, I'm not sure how the user would handle the situation where certain values may be missing randomly.

@npatki
Copy link
Contributor

npatki commented Sep 22, 2021

One motivating example to the above: In the Loan dataset on Kaggle, there is a constraint that the loans paid_off_time > effective_date. However, not all loans are paid off. So if paid_off_time is missing, then I don't expect the > constraint to apply (return True)

@katxiao katxiao added this to the 0.12.1 milestone Oct 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature:constraints Related to inputting rules or business logic
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants