Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN values impact binary or operations asymmetrically #6528

Closed
behzadnouri opened this issue Mar 3, 2014 · 10 comments
Closed

NaN values impact binary or operations asymmetrically #6528

behzadnouri opened this issue Mar 3, 2014 · 10 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@behzadnouri
Copy link
Contributor

I am not sure if this is on purpose, but as far as I know, or operation should be symmetric ( I mean a | b == b | a ); not in pandas:

>>> a = pd.Series( [True] )
>>> b = pd.Series( [np.nan] )
>>> a | b
0    True
dtype: bool
>>> b | a
0    False
dtype: bool

is there any reason pandas is treating NaN values asymmetrically in here?

also, this is a weird outcome as well:

>>> ( b == True ) | ( a == True )
0    True
dtype: bool

and this:

 >>> b.astype( bool )
 0    True
 dtype: bool

this even has internal inconsistency; so before in b | a, NaN was treated as False, but now here it is True;

@jreback
Copy link
Contributor

jreback commented Mar 3, 2014

this could be a bug, but you boolean comparison of True == np.nan is very odd in the first place
should the result be np.nan because its missing, False because True is equivalent of np.nan?

See the long and compliacted tests here:
https://github.com/pydata/pandas/blob/master/pandas/tests/test_series.py#L3142

if you see a bug, pls submit a PR

will mark it as a bug, but come back to us with your thoughts

@jreback jreback added this to the 0.14.0 milestone Mar 3, 2014
@behzadnouri
Copy link
Contributor Author

@jreback there are 3 issues above:

  1. a | b is not equal to b | a;
  2. b.astype( bool ) converts NaN values to True; this is worst of 3 possible values, namely NaN, False, True; and also not consistent with [3] below
  3. b == True is not consistent with [2] in above;

this is really bad; because outcome of | should not depend on order; especially when NaN values are implicitly generated:

>>> ts = pd.Series( [True, False] )
>>> ts | ts.shift( 1 )
0    True
1    True
dtype: bool
>>> ts.shift( 1 ) | ts
0    False
1     True
dtype: bool

@jreback
Copy link
Contributor

jreback commented Mar 3, 2014

@behzadnouri ok.....pls make a PR to fix! Keep in mind their may be some reasons buried for this; e.g. put in tests for the 'correct' behavior and see what breaks.

  1. symmetry, I don't think it tested
  2. This is filling behavior, I am not really sure what to do here; the astype actually should fail IMHO. it is really unclear what to do with np.nan, you normally b.fillna(False).astype(bool).
  3. test as well

@jreback
Copy link
Contributor

jreback commented Mar 3, 2014

@hayd pls chime in here when you can....IIRC you and I discussed these types of things at length

@hayd
Copy link
Contributor

hayd commented Mar 3, 2014

some discussion was here: #4953.

A lot of this is as we're fighting "strange" truthiness for numpy's and python's (!) nan, so things like astype we can't/shouldn't fix... kinda a "gotcha".

IMO asymmetry is a bug but only because we're forcing a boolean return value (the reason this happens is cos in numpy and in general a|b is not symmetric). Now I think about it, I do remember this coming up...

@seth-p
Copy link
Contributor

seth-p commented Sep 28, 2014

My 2c: I would separate this out to two issues:

Issue 1: Any boolean operations should behave the same as if .astype(bool) were first invoked on the structure(s). So, for example, a | b should return the same as a.astype(bool) | b.astype(bool), and b == True should return the same as b.astype(bool) == True. If this is the case, this should address the commutativity problem (as I'm assuming the boolean operations on boolean structures are commutative).

Issue 2: Should .astype(bool) convert a NaN to True or False? Currently it converts it to True, which is consistent with the fact that bool(NaN) is True, as defined by numpy. Personally, I would have defined bool(NaN) as False, and this may be a situation where we want the Pandas behavior (of .astype(bool)) to differ from numpy's. But this should be considered an API change, not a bug fix. I also just raised this issue on https://groups.google.com/forum/#!topic/pydata/U7gQyI_gLMc.

At any rate, I view these as two separate (but obviously related) issues, which I think are being slightly conflated in this issue (and #8151). I would suggest first addressing Issue 1, which seems to me clearly to be a bug -- keeping the existing behavior of treating NaN as True (so in the first example above, a | b, b | a, and b == True would all return Series([True])); and then separately, after further discussion and reflection, consider changing the behavior of .astype(bool) (and related boolean operations) to convert NaN to False.

@behzadnouri
Copy link
Contributor Author

@seth-p

Any boolean operations should behave the same as if .astype(bool) were first invoked on the structure(s)

as long as the logical operations are bound to return dtype='bool'. that is exactly what i am testing here

and b == True should return the same as b.astype(bool) == True

that would not be achievable. for example, 2 != True and 2 != False, so you cannot type cast a series unless it is only zeros and ones.

this may be a situation where we want the Pandas behavior (of .astype(bool)) to differ from numpy's

strongly discourage that, because:

-1- it is not only numpy, but as you also mentioned, and much more importantly that is how python works:

>>> np.where([np.nan], True, False)
array([ True], dtype=bool)
>>> True if np.nan else False
True

so, it can easily break things as simple as ts.map(bool) == ts.astype(bool).

-2- the gain of breaking that consistency, if any, is less than minimal. because, if a user wants to deviate from python's convention that is more than easy: ts.fillna(False), or ts.fillna(0).

@seth-p
Copy link
Contributor

seth-p commented Sep 28, 2014

D'oh! @behzadnouri, of course you're right about == and !=. I guess I should have said that any logical operations that expect booleans should produce the same results as for .astype(bool).

Re: .astype(bool) converting NaN to True, fair enough. And you're right, of course, that what I'm suggesting is simply the same as .fillna(False).astype(bool).

@jreback jreback changed the title NaN values impact binary or operations asymmetrically NaN values impact binary or operations asymmetrically Sep 30, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@h-vetinari
Copy link
Contributor

Tried to find the most current open issue on this - the others are closed or merged, but this is open even though being milestoned since 0.14?

I asked a question about this on stackexchange, and someone took the time to identify the apparent source of the issue (as well as a work-around): https://stackoverflow.com/a/47664356/2965879

The main issue seems that in pandas.core.ops.wrapper (https://github.com/pandas-dev/pandas/blob/master/pandas/core/ops.py#L928), after aligning the series, only other is filled (transforming NaN to False), but self isn't, which introduces this symmetry break.

@TomAugspurger
Copy link
Contributor

I don't expect the behavior of np.nan in boolean operations to change, but #29842 is implementing a new BooleanArray extension type that I think addressees most things. Let us know if there are additional problems in new issues.

@TomAugspurger TomAugspurger removed this from the Contributions Welcome milestone Dec 2, 2019
@TomAugspurger TomAugspurger added this to the 1.0 milestone Dec 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants