
Adjust Bool evaluation for empty DataFrames to match PEP8 #6964

Closed
tstriker opened this issue Apr 25, 2014 · 35 comments

@tstriker

Right now, as per the gotchas section, pandas raises a ValueError on an empty frame:
http://pandas.pydata.org/pandas-docs/dev/gotchas.html#gotchas-truth

While the documentation describes ambiguity, in the Python world there is none.
PEP 8 (http://legacy.python.org/dev/peps/pep-0008/) recommends: "For sequences, (strings, lists, tuples), use the fact that empty sequences are false."

The truth test returns False for None and for all empty iterables. Further, there are the any (https://docs.python.org/2/library/functions.html#any) and all (https://docs.python.org/2/library/functions.html#all) functions for the proposed ambiguity checks.

Since the current behavior is just to raise a ValueError, fixing the truth test to reflect DataFrame.empty would be backwards compatible.

The check becomes important when working in a mixed environment with pandas and other structures - a simple if foo can become as convoluted as
empty = foo.empty if (foo is not None and isinstance(foo, pd.DataFrame)) else not foo
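The convoluted one-liner above can be wrapped in a small helper. This is only a sketch; `is_empty` is a hypothetical name, not part of the pandas API:

```python
import pandas as pd

def is_empty(foo):
    """Hypothetical helper: a uniform emptiness test across None,
    plain sequences, and pandas DataFrames."""
    if foo is None:
        return True
    if isinstance(foo, pd.DataFrame):
        return foo.empty  # note: .empty is a property, not a method
    return not foo
```

With this, `if not is_empty(foo):` behaves the same whether `foo` is a list, None, or a DataFrame.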

@jreback
Contributor

jreback commented Apr 25, 2014

You can simply do this (it would need an adjustment to handle scalars).

Handling a mix of None and even lists/scalars is a bit non-trivial in Python alone;
e.g. len(None) is not defined.

In [47]: def f(x):
   ....:     return x is not None and bool(len(x))
   ....: 

In [48]: f(None)
Out[48]: False

In [49]: f([])
Out[49]: False

In [50]: f(DataFrame())
Out[50]: False

In [51]: f([True])
Out[51]: True

In [52]: f([False])
Out[52]: True

In [53]: f(DataFrame(True,index=range(2),columns=range(2)))
Out[53]: True

I am not sure why you are confused about the ambiguity. Lists/tuples/dicts don't care about their contents, so by definition there isn't a problem. However, pandas objects (and numpy objects) are simply different; they DO care about their contents and act on them.

@tstriker
Author

I'm not confused about ambiguity - I'm saying that there is no ambiguity if you approach the problem from the perspective of Python: the default check in Python is for None/empty, while the other two (.any and .all) are to be provided explicitly.

To your point - handling a mix of None and even lists/scalars is trivial in Python - it's if foo, as [None, [], (), {}, False] all evaluate to False.

I'm arguing that providing a sane default that matches the behavior of the environment would make pandas more standard in the Pythonic sense.

Finally - it's not objects that care about data, it's the programmers. Pandas and numpy sure have special properties, but from my perspective they can all be boiled down to Python primitives and, unless there is a strong reason, at the basic level should also behave as such.

@jreback
Contributor

jreback commented Apr 25, 2014

You just made my point: these are NOT Python primitives; in fact they are much more feature-rich. Trying to treat a frame (or a numpy array) as a list is meaningless. Yes, it can sometimes act like a list, but that doesn't mean it IS one. Duck typing has its limits.

So are you also advocating that numpy change this?

@tstriker
Author

Eesh, I guess I've gotten myself into trouble.

In short - I'm using pandas for a web app, and it trips me up personally every time I do an "if foo" in the templates and pandas waves its hands and claims it doesn't know what I'm talking about. The next steps then are plugging in .empty, and then making sure that the object is always defined as a DataFrame, as the check would fail when the code branch that sets it from None does not execute, etc. Your suggested function would solve the problem, but shortly we would find ourselves following the magic of a cargo cult, calling the f function for all the things.

The benefit of allowing simple evaluation would be simplicity of code.
As for downsides - I don't see any - so maybe that's a valid point for investigation: is there really ambiguity, considering how the environment treats empty objects? Or is it inertia?

@jreback
Contributor

jreback commented Apr 25, 2014

@tbaugis you have a valid point, but I think it's a combination of inertia (back compat) and refusing to guess

Here is the change issue with lots of sub-issues and commentary
#4657

if you want to review and come back with comments. Always like to hear a view.

@immerrr
Contributor

immerrr commented Apr 25, 2014

@tbaugis considering numpy, there is some danger factor, given how it can broadcast a single False value to a full container which then loses its "falseness". It is not an ambiguity per se, but you're in for a lot of surprises if you one day decide to replace a single constant with a per-element one and don't have this particular scenario well tested (see #6966 for an example).

Pandas containers are not so malleable, so there's less to be afraid of, but consistency with numpy is a plus.

As for the magical f function, I'm sticking to using empty lists as non-existent DataFrames; then a len(self.data) > 0 check works in both cases.
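The broadcasting surprise immerrr describes can be sketched like this: code guarded by a scalar flag keeps working right up until the flag one day becomes a per-element array, at which point the truth test itself blows up (a minimal sketch in plain numpy):

```python
import numpy as np

flag = False
result = "skipped" if not flag else "ran"  # scalar flag: truth test is fine

flag = np.array([False, False])  # the flag becomes per-element one day
try:
    # the same kind of truth test now raises instead of evaluating
    result = "skipped" if not flag else "ran"
except ValueError:
    result = "ambiguous"
```

Here `result` ends up as `"ambiguous"`: numpy refuses to collapse a multi-element boolean array to a single truth value.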

@armaganthis3

The OP is about empty DataFrames. As mentioned, all empty Python sequences are false. So are empty numpy arrays:

In [320]: np.empty(shape=[0])
Out[320]: array([], dtype=float64)

In [321]: bool(np.empty(shape=[0]))
Out[321]: False

In [322]: bool(np.empty(shape=[0,0]))
Out[322]: False

Numpy will not broadcast on dimensions which have zero elements, so what @immerrr
notes doesn't seem relevant (correct me if I'm wrong). Furthermore, this was broken by a fix for #4633 about binary boolean operators between NDFrames, but none of the examples in that issue have to do with empty frames or numpy arrays. They are all about binary boolean operators between containers holding at least some data.

#4657 by @jreback, which changed this, itself broke long-existing behaviour, and did so despite objections from users (@jseabold of statsmodels was one), and actually overruled the original call Wes McKinney made on this specific issue. Especially considering the amount of breaking changes that have gone into master since 0.12.1, and for the most threadbare reasons, the "concerns" about backward compatibility seem kinda hypocritical.

What's an actual good reason why bool(pd.DataFrame()) shouldn't be false like everything else in python?

@jreback
Contributor

jreback commented Apr 25, 2014

@armaganthis3

The changes were all about consistency; @jseabold's objection had to do with (IMHO) a numpy inconsistency. Pandas actually hasn't changed AT ALL, with the exception of the single-element case.
Numpy treats the 0- and 1-element cases differently from all others:

In [18]: bool(np.array([]))
Out[18]: False

In [19]: bool(np.array([True]))
Out[19]: True

In [20]: bool(np.array([False]))
Out[20]: False

In [21]: bool(np.array([False,False]))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [22]: bool(np.array([True,True]))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Pandas is entirely consistent


In [23]: bool(pd.DataFrame())
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [24]: bool(pd.DataFrame([[True]]))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [25]: bool(pd.DataFrame([[False]]))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

@armaganthis3

... and we're back to discussing non-empty dataframes again? The OP put it in the title, discussed it in his description, and I hyper-emphasized the point again in my comment. The question is about empty dataframes.

Consistency. By which you mean doing the same thing whether it makes sense or not? Even if by doing so it's inconsistent with every other Python data type out there? That's just bizarre.

What's the point of this "consistency"? Whom does it benefit?

The OP, the discussions in previous issues, the ML discussion, and the PEP are all explicit about
the obvious fact that bool(empty) == False is convenient. It's certainly Pythonic by acclamation. How is all that trumped by some bogus definition of "consistency"? What precisely is ambiguous about the truth value of an empty dataframe, as the exception message suggests?

@jreback
Contributor

jreback commented Apr 25, 2014

So you think that bool(DataFrame()) == False, but raising if it's non-empty, is better? How is that?

If I have an if statement, I don't want it working when the test is False but raising when it is True. That makes no sense.

That would be inconsistent

@tstriker
Author

If it's non-empty, you would get True, regardless of the contents of the frame. The bool check would evaluate whether the DataFrame (or other sequences for that matter, unless they are generators) is empty or not.

@armaganthis3 raised a good point, that the issue is not with consistency, but rather with matching expectations (PEP 8).

I ran through the archive of the issue pointed out (thanks for the link!) and the discussion (https://groups.google.com/forum/#!topic/pydata/XzSHSLlTSZ8) seems to have thrown the baby (empty checks) out with the bathwater (evaluating dataframe contents to determine the bool value), not considering the option of leaving empty checks as the sane default.

@armaganthis3

Yes, that's exactly what I think. So does the OP. So did the numpy developers.
So did several users on the ML / in comments on the other issues.

I have no idea where you came up with this definition of consistency, and I sure don't know
why you're clinging to it with so very little to back it up except a personal aesthetic.

Who says a method can't return a value when it's appropriate and raise otherwise? that's
really... common, come to think of it.

a method that always raises whenever it's called regardless of circumstances?
yup, that sure is consistent. also useless.

@tstriker
Author

@armaganthis3 btw, I couldn't help noticing that you have created a GitHub burner account - while you are certainly helping with the argumentation, I'd beg you to stay constructive if possible :)

@jreback
Contributor

jreback commented Apr 25, 2014

Alright,

@armaganthis3

so let's have a PR from you that fixes the issue (pretty trivial), but also changes the documentation with an explanation

Look forward to it

@jreback
Contributor

jreback commented Apr 25, 2014

So to repeat @jseabold's points (which I think are very rational), and which were addressed - correct me if I am wrong:

Behavior and reasoning: 

1. Empty series raises. Maybe you screwed up your index? What is the 
'correct' output of this? 

if pd.isnull(pd.DataFrame([])): 
  print 'this dataframe has no missing values?' 

This seems ambiguous. You can't answer the question because there's no 
information to evaluate the statement. 

2. 1 element is fine. You know what you're doing, carry on. Also 
.all() == .any() in this case, so it's not ambiguous. 

3. Length > 1 raises . This is ambiguous. Ask for all, any, or empty. 
Maybe you screwed up your index? 

Skipper 

@armaganthis3

@jreback, you might do well to take a step back and think about what the real issue is,
in this issue and in the previous ones that prompted this code. Just look, it's right there:
it's not back-compat, it's not consistency, it's ambiguity. That's what the exception warns
of, that's what prompted the ban on bool(), and it simply doesn't apply when the dataframe
is empty.

I will not be opening a PR; I'm sure you'll have fun finding made-up problems (whitespace,
squashing, I've seen you do it) and wasting my time.

@tbaugis, I opened up a GH account just so I could comment on this. It makes me so angry
to see technical discussions mired by nonsensical arguments in a project as widely used as this.
If one good reason had been given, we'd all know why it shouldn't be done and there'd
be no need for all this. Instead you get malarkey like "it's because of the feature-richness!".

That said, you're completely right and I do sound like a troll. Going away now.

@tstriker
Author

Here's the proposed rationale (going reductionist):

  1. Empty series returns False - see PEP 8 on sequences
  2. 1 element returns True - see PEP 8 on sequences
  3. Length > 1 returns True - see PEP 8 on sequences

And here is why it's not ambiguous - it can be expressed in one sentence: the boolean check evaluates the DataFrame for emptiness. In fact, it's so obvious that it wouldn't even require a line in the documentation (because, see PEP 8).

We could run a fun dev poll on this - "What do you think an empty DataFrame should evaluate to? (a) can't evaluate, it's ambiguous; (b) False."
And if an empty dataframe is False, then a non-empty one is True.

But i'm afraid i'm running circles now, heh.

@jreback
Contributor

jreback commented Apr 25, 2014

@tbaugis that completely breaks with numpy for 2) and 3). Not that numpy is always right.

For 2), numpy evaluates Series([True]) and Series([False]) (or Frame) to True and False;
pandas provides .bool() for this purpose.

For 3), always raise the ValueError (same as numpy).

There is just too much room for screwing up your indexes in pandas, because you did an operation which happened to return a different index and thus you end up with an empty frame. This is much harder in numpy; THAT's why pandas has to err in favor of more errors rather than fewer.

I suppose 1) is ok, but I think it goes to the same exact reason above: you might have screwed something up.

Yes, PEP 8 is nice, but you want a failure to be obvious and loud.

If you have an empty DataFrame, you had better be sure that that is what you actually want and that you just didn't make a mistake (this is QUITE common).
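The failure mode jreback describes - an alignment operation silently producing an empty frame - is easy to sketch with illustrative data (a sketch using current pandas semantics):

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([3, 4], index=["p", "q"])  # oops: disjoint index

# Adding series aligns them on the union of the indexes; every label is
# missing on one side, so the result is all-NaN, and dropna() empties it.
result = (a + b).dropna()
print(result.empty)  # True - reached by accident, not by intent
```

If `bool()` quietly meant "non-empty", this mistake would silently take the "empty" branch instead of surfacing.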

@jreback
Contributor

jreback commented Apr 25, 2014

@hayd @jseabold jump in? I know we had this discussion, but @tbaugis does have a valid point

@jtratner
Contributor

@tbaugis - I agree that it's annoying that you can't do normal checks like if x: do y on dataframes, but I think the problem is that you're assuming that it's feasible to say that all non-empty NDFrames are truthy.

And if empty dataframe is False, then a non-empty one is True.

but if non-empty dataframes must raise, would you still want an empty dataframe to register as False?

I've included some explanation from numpy devs on why you need to raise on the ambiguity, which mostly centers around this:

>>> pd.Series(['a', 'b']) == pd.Series(['a', 'c'])
0     True
1    False
dtype: bool
>>> if pd.Series(['a', 'b']) == pd.Series(['c', 'd']):
...    print "This is truthy"

Except that a == b must return the elementwise comparison of a and b, and therefore the statement if a == b would always be True if a and b have more than one item (as in the above).

Every built-in Python object returns a bool on logical comparisons (with the exception of the special keywords and and or), and pandas would be much less useful if you couldn't write something like c[c >= max_value] = max_value. You definitely sacrifice some clarity regarding coercing to booleans, but in exchange you get more expressive syntax.
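The expressive syntax jtratner refers to - an elementwise comparison used as a boolean mask - looks like this (a minimal sketch with made-up data):

```python
import pandas as pd

c = pd.Series([1, 5, 10, 20])
max_value = 10

# c >= max_value yields an elementwise boolean Series ([False, False, True, True]);
# using it as an indexer clips every value above the threshold in one statement.
c[c >= max_value] = max_value
print(c.tolist())  # [1, 5, 10, 10]
```

This only works because `==`, `>=`, etc. return per-element results rather than a single bool.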


There are a few mailing list threads you might check out that explain why numpy chose to raise on non-empty arrays (links from numpy/numpy#2031). Here's a snippet from a mailing list post that I think breaks it down best:

No. Numeric used to use the any() interpretation, and it led to many,
many errors in people's code that went undetected for years. For
example, people seem to usually want "a == b" to be True iff all
elements are equal. People also seem to usually want "a != b" to be
True if any elements are unequal. These desires are inconsistent and
cannot be realized at the same time, yet people seem to hold both
mental models in their head without thoroughly thinking through the
logic or testing it. No amount of documentation or education seemed to
help, so we decided to raise an exception instead.
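The inconsistency in that quote can be made concrete: for partially equal arrays, "all elements equal" and "any elements unequal" give opposite answers to what an implicit any() truth value would, so no single coercion satisfies both mental models (a sketch in numpy):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 4])  # equal except for the last element

eq_all = (a == b).all()  # "a == b iff ALL elements equal"   -> False
ne_any = (a != b).any()  # "a != b if ANY element differs"   -> True

# An implicit any() for bool(a == b) would yield True here, contradicting
# the all() expectation above - hence numpy raises rather than guessing.
```

Raising forces the caller to spell out which interpretation they mean.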

@hayd
Contributor

hayd commented Apr 28, 2014

We can't keep everyone happy here, so we keep everyone unhappy but with correct code (that is, code without surprises)... which ought to keep everyone happy (inside).

As you've mentioned, this was brought up in #4633, originally #1073, #1069, etc., and was discussed extensively in the mailing lists: https://groups.google.com/d/msg/pydata/XzSHSLlTSZ8/QEOsT4l3RFYJ

I understand @tbaugis's sentiment that it should work pythonically; however, experience/discussion suggests this is a bad idea. The current solution works well, with a raise if it's ambiguous (and unfortunately it is ambiguous*). I stand by the team's decision that raising is the correct solution, and writing df.empty rather than df is more explicit...

The message on this exception tells you exactly and immediately how to fix your code to be non-ambiguous.

* For example, consider calling bool on the following Series:

In [1]: a = pd.Series([False])

In [2]: bool(a.values)  # numpy
Out[2]: False

In [3]: bool(list(a))  # python
Out[3]: True

Choosing either is going to surprise/confuse. (That said, if you can persuade numpy to "fix" this potentially we'd be happy to as well...)

@armaganthis3 I think you've misread the numpy examples above; checking:

if a.values:  # numpy

does not check emptiness. It is False on emptiness or when the array contains one item whose value is falsy, and if it contains more than one item it RAISES... which is fine if that's what you meant, but IMO that's rarely what the user means. Personally I can't see a good reason to write code which could, depending on the length, potentially raise... but if you want to, for your convenience, you can use .bool().

@dalejung
Contributor

As I mentioned in the ML thread, I used the

if df:
  do stuff

idiom quite a bit. It was super convenient and I never had a problem with the ambiguous cases. I had lots of legacy code break by making things more correct. If pandas were a personal project, it would work the way lists do.

That being said, there are valid reasons for the inconvenience and it's not something I even register anymore.

@tstriker
Author

@jtratner - what seems off to me is the attempt to do stuff like this:

>>> pd.Series(['a', 'b']) == pd.Series(['a', 'c'])
0     True
1    False

The equality operator is not the correct one to use, as the response is not a single boolean but rather a list of booleans. I'm having a hard time remembering any precedents like this outside of numpy/pandas. The proper thing here would be to use an explicitly named function, something like pd.Series(['a', 'b']).compare(pd.Series(['a', 'c'])) or similar.

And the equality comparison should compare for full equality - that is, whether the hashes of the objects match.

Hope i'm not repeating myself too much!

@immerrr
Contributor

immerrr commented Apr 29, 2014

I'm having hard time remembering any precedents like this outside of the numpy/pandas

Octave and MATLAB come to mind.

UPD: also, R.

@jorisvandenbossche
Member

@tbaugis It is core to numpy (and pandas) that operations work elementwise. See also, e.g., that adding two arrays adds the individual elements, while adding two lists concatenates them:

In [36]: a = np.array([1,2])
In [37]: b = np.array([3,4])
In [38]: a + b
Out[38]: array([4, 6])
In [39]: list(a) + list(b)
Out[39]: [1, 2, 3, 4]

and the equality operator behaves the same.

@tstriker
Author

@jorisvandenbossche thanks for the insight! Without a grudge, I guess pandas can be described as "MATLAB or R in Python", where the notations of the mathematical computing world clash with the Python ones.

How about pandas3k then! :)

Anyhoo, I guess you have explained all the things to me. Thank you all for your time!

@jorisvandenbossche
Member

@tbaugis It's indeed true that pandas deviates from Python in some ways, to have more convenient behaviour for numerical data analysis (what you could call MATLAB- or R-like, but more generally scientific-computing-like). But maybe this can be made clearer in the docs, for example in a section on "differences between pandas and python" or "dataframes are not lists" (because it is certainly not going to change :-)).

@jreback
Contributor

jreback commented Apr 29, 2014

@tbaugis thanks for your patience, this has been an illuminating issue!

going to close, but if anyone has comments, pls feel free to post!

@jreback jreback closed this as completed Apr 29, 2014
@jseabold
Contributor

On Tue, Apr 29, 2014 at 6:59 AM, Toms Bauģis notifications@git.luolix.topwrote:

@jratner https://github.com/jratner - what seems to be off for me is
the attempt to do stuff like this

pd.Series(['a', 'b']) == pd.Series(['a', 'c'])
0 True
1 False

equality operator is not the correct one to use as the response is not a
single Boolean, but rather a list of booleans. I'm having hard time
remembering any precedents like this outside of the numpy/pandas. The
proper thing here would be to use an explictly named function something
like pd.Series(['a', 'b']).compare(pd.Series(['a', 'c'])) or similar.

FWIW, this is what the .all method is for in numpython. 'explicit is
better than implicit'

(pd.Series(['a', 'b']) == pd.Series(['a', 'c'])).all()

or

all(pd.Series(['a', 'b']) == pd.Series(['a', 'c']))

@hayd
Contributor

hayd commented Apr 29, 2014

@tbaugis that's #1134 (a "feature").

There was some discussion of a better way to check that two pandas objects are equal... there are a couple of SO answers using assert_frame_equal :S ...hacky.

@tstriker
Author

@jseabold this has nothing to do with all (https://docs.python.org/2/library/functions.html#all), as all just iterates through a sequence and evaluates each element's truthiness. The non-standard behavior of the equality operator is still there, and is implicit.

And if you want to chime a mantra then, with respect, I'd suggest going for "rectangular circles are better than circular squares", as "explicit is better than implicit" seems to have lost its meaning entirely.

@hayd - I'm toying with the idea of creating a supplemental module for pandas that would overload the default non-pythonic behavior. I imagine it would render some pandas code incompatible and would be meant only for people like me, for whom [1, 2, 3, 4] + [1, 2] still is [1, 2, 3, 4, 1, 2] :)
Not sure if it's worth it, but then again - no harm either.
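The overloading idea could be sketched as a thin subclass that restores list-like truthiness. This is clearly a toy (`ListyFrame` is a made-up name), and anything relying on `__bool__` here diverges from stock pandas behavior:

```python
import pandas as pd

class ListyFrame(pd.DataFrame):
    """Toy subclass: the truth test means 'is non-empty', like a list."""

    @property
    def _constructor(self):
        # keep pandas operations returning ListyFrame rather than DataFrame
        return ListyFrame

    def __bool__(self):
        # override pandas' raising truth test with an emptiness check
        return not self.empty

print(bool(ListyFrame()))            # False, like bool([])
print(bool(ListyFrame({"a": [1]})))  # True, like bool([1])
```

Whether propagating such a subclass through a whole codebase is worth the divergence from upstream semantics is exactly the trade-off debated in this thread.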

@jseabold
Contributor

On Tue, Apr 29, 2014 at 9:20 AM, Toms Bauģis notifications@git.luolix.topwrote:

@jseabold https://github.com/jseabold this has nothing to do with all (
https://docs.python.org/2/library/functions.html#all) as all just
iterates through a sequence and evaluates each element's truthyness. The
non-standard behavior of the equality operator is still there, and is
implicit.

"Are all these elements equal" seems pretty explicit to me and is the
behavior you were hoping to mimic with 'compare' (there's already .equals
for the additional index check).

And if you want to chime a mantra, then, with respect, i'd suggest to go
for "rectangular circles are better than circular squares", as the
"explicit is better than implicit" seems to have lost it's meaning entirely.

Sigh ok I'm not going to rehash this and am now sorry I chimed in on this
again. There are ML discussions on this going back 5-10+ years on why this
is the 'best' explicit behavior when it's not necessarily clear what you
want in numpython (my feelings about scalar arrays aside).

@jorisvandenbossche
Member

@hayd isn't this the newly added DataFrame.equals method? (pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html, #5283)

@tbaugis this is maybe leading to another discussion, but what features then do you want from pandas/numpy instead of lists, if not all the standard elementwise operations (+, -, /, *, ==, <, ...)? This is the strength of numpy/pandas, and changing that would not render some pandas code incompatible, but almost all of it, I think.

@jreback
Contributor

jreback commented Apr 29, 2014

@hayd @jorisvandenbossche

in 0.13.1

@tbaugis df1.equals(df2) is, I think, what you are looking for; it in effect tests for equality of ALL elements (including NaNs, which normally don't compare equal). It returns a bool.

eg.

if df1.equals(df2):
    ...

@tstriker
Author

@jorisvandenbossche - In my case, I use pandas to load in data and then to analyze it - there is filtering, sorting, resampling, grouping, and pivoting going on, but I haven't yet used any of the standard math operators. In addition, I'm using TimeSeries, and then all the export functions.
It depends heavily on the use case, of course; mine can be described as softcore, where the stats aspect is touched only slightly.
But once you start using pandas, the dataframe objects trickle down from the lib level and you end up using them everywhere, interchangeably with lists - and that is my use case.

I can also see the other cases and why it might not make sense to tweak pandas to be more pythonic, as that would trip up the people coming from / using R, matplotlib and so on.
