Adjust Bool evaluation for empty DataFrames to match PEP8 #6964
You can simply do this (this would need an adjustment to handle scalars); handling a mix of None and even lists/scalars is a bit non-trivial in python alone
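(The snippet this comment refers to isn't preserved in this copy; below is a minimal sketch of the kind of helper being described. The name has_data and the exact dispatch are made up, and scalars would need the extra adjustment mentioned above.)

import pandas as pd

def has_data(obj):
    # Hypothetical helper: treat None and empty containers as falsey,
    # and ask pandas objects explicitly instead of relying on bool().
    if obj is None:
        return False
    if isinstance(obj, (pd.DataFrame, pd.Series)):
        return not obj.empty
    return bool(obj)  # lists/dicts/strings; bare scalars need the extra care noted above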
I am not sure why you are confused about the ambiguity. Lists/tuples/dicts don't care about the contents, so by definition there isn't a problem. However pandas objects (and numpy objects) are simply different; they DO care about the contents and act on them. |
I'm not confused about ambiguity - I'm saying that there is no ambiguity if you approach the problem from the perspective of python - the default check in python is for None/empty, while the other two (.any and .all) are to be provided explicitly. To your point - handling a mix of None and even lists/scalars in python is trivial. I'm arguing that providing a sane default that matches the behavior of the environment would make it more standard in the pythonic sense. Finally - it's not objects that care about data, it's the programmers. Pandas and numpy sure have special properties, but from my perspective they can all be boiled down to python primitives and, unless there is a strong reason, on the basic level should also behave as such. |
you just made my point: these are NOT python primitives, in fact they are much more feature-rich. Trying to treat a frame (or a numpy array) as a list is meaningless. Yes, it sometimes can act like a list, but that doesn't mean it IS one. Duck typing has its limits. So are you also advocating that numpy change this? |
Eesh, I guess I've gotten myself into trouble. In short - I'm using pandas for a web app, and it just trips me up every time I do an "if foo" in the templates and pandas waves its hands and claims that it doesn't know what I'm talking about. The next steps then are plugging in .empty and then making sure that the object is always defined as a DataFrame, as it would fail when the code branch that sets it from None does not execute, etc. Your suggested function would solve the problem, but shortly we would find ourselves following the magic of a cargo cult, calling the f function for all the things. The benefit of allowing simple evaluation would be the simplicity of the code. |
@tbaugis you have a valid point, but I think it's a combination of inertia (back compat) and refusing to guess. Here is the change issue, with lots of sub-issues and commentary, if you want to review and come back with comments. Always like to hear a view. |
@tbaugis considering numpy, there is some danger factor given how it can broadcast a single element. Pandas containers are not so malleable, so there's less to be afraid of, but consistency with numpy is a plus. As for the magical f function ... |
The OP is about empty dataframes. As mentioned, all empty python sequences give False:

In [320]: np.empty(shape=[0])
Out[320]: array([], dtype=float64)

In [321]: bool(np.empty(shape=[0]))
Out[321]: False

In [322]: bool(np.empty(shape=[0,0]))
Out[322]: False

Numpy will not broadcast on dimensions which have zero elements, so the broadcasting concern raised by @immerrr does not apply here. #4657 by @jreback, the change that introduced this, itself broke long-existing behaviour, and did it despite objections from users (@jseabold of statsmodels was one) and actually overruled the original call Wes McKinney made on this specific issue. Especially when considering the amount of breaking changes that have been going into master since 0.12.1, and for the most threadbare reasons, the "concerns" about backward compatibility seem kinda hypocritical. What's an actual good reason why |
The changes were all about consistency; @jseabold's objection had to do with what is, IMHO, a numpy inconsistency. Pandas actually hasn't changed AT ALL, with the exception of the single-element case.
Pandas is entirely consistent
|
... and we're back to discussing non-empty dataframes again? The OP put it in the title, discussed it in his description, and I hyper-emphasized the point again in my comment. The question is about empty dataframes. Consistency. By which you mean doing the same thing whether it makes sense or not? Even if by doing so it's inconsistent with every other python data type out there? That's just bizarre. What's the point of this "consistency"? Whom does it benefit? The OP, the discussions in the previous issue, the ML discussion and the PEP are all explicit about |
so you think that if I have an if statement, I don't want it working when the test is False, but raising when it is True? That makes no sense. That would be inconsistent. |
If it's non-empty, you would get True, regardless of the contents of the frame. The bool check would evaluate whether the DataFrame (or other sequences for that matter, unless they are generators) is empty or not. @armaganthis3 raised a good point: the issue is not really consistency, but rather matching expectations (PEP8). I ran through the archive of the issue pointed out above (thanks for the link!) and the discussion (https://groups.google.com/forum/#!topic/pydata/XzSHSLlTSZ8) seems to have flushed out the baby (empty checks) with the bathwater (evaluating dataframe contents to determine the bool value), not considering the option of leaving empty checks as the sane default. |
Yes, that's exactly what I think. So does the OP. So did the numpy developers. I have no idea where you came up with this definition of consistency, and I sure don't know ... Who says a method can't return a value when it's appropriate and raise otherwise? Or is a "consistent" method one that always raises whenever it's called, regardless of circumstances? |
@armaganthis3 btw, couldn't help but notice that you have created a github burner account - while you are certainly helping with the argumentation, I'd beg you to stay constructive if possible :) |
alright, so let's have a PR from you that fixes the issue (pretty trivial), but changes the documentation with an explanation as well. Look forward to it |
So to repeat @jseabold's points (which I think are very rational) and how they were addressed - correct me if I am wrong
|
@jreback, you might do well to take a step back and think about what the real issue is. I will not be opening a PR; I'm sure you'll have fun finding made-up problems (whitespace, ...). @tbaugis, I opened up a GH account just so I could comment on this. It makes me so angry. That said, you're completely right and I do sound like a troll. Going away now. |
Here's the proposed rationale (going reductionist):
And here's why it's not ambiguous - it can be expressed in one sentence: the boolean check evaluates the DataFrame for emptiness. In fact, it's so obvious that it wouldn't even require a line in the documentation (because, see PEP8). We could run a fun dev poll on this - "what do you think an empty DataFrame should evaluate to? (a) can't evaluate, it's ambiguous; (b) False". But I'm afraid I'm running in circles now, heh. |
@tbaugis that completely breaks with numpy for 2) and 3), not that numpy is always right. For 2) numpy evaluates ...; for 3) it will always raise. There is just too much room for screwing up your indexes in pandas because you did an operation which happened to return a different index, and thus you end up with an empty frame. This is much harder in numpy; THAT's why pandas has to err in favor of more errors rather than fewer. I suppose 1) is ok, but I think it goes to the exact same reason above: you might have screwed something up. Yes, PEP8 is nice, but you want a failure to be obvious and loud. If you have an empty DataFrame you had better be sure that is what you actually want and didn't just make a mistake (this is QUITE common). |
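(An illustrative sketch, not from the thread, of the failure mode being described: an index-based operation that silently produces an empty frame.)

import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# A typo'd set of labels (or an index mismatch after some operation)
# silently yields an empty frame rather than an error:
subset = df[df.index.isin(["x", "y"])]
print(subset.empty)  # True

# Under the proposal, `if subset:` would quietly evaluate to False and the
# branch would be skipped; the explicit check at least names the intent:
if subset.empty:
    print("no matching rows - was that what you wanted?")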
@tbaugis - I agree that it's annoying that you can't do normal checks like "if foo", but if non-empty dataframes must raise, would you still want an empty dataframe to register as False? I've included some explanation from numpy devs on why you need to raise on the ambiguity, which mostly centers around this:

>>> pd.Series(['a', 'b']) == pd.Series(['a', 'c'])
0     True
1    False
dtype: bool
>>> if pd.Series(['a', 'b']) == pd.Series(['c', 'd']):
...     print "This is truthy"

Except that it raises, because the comparison returns a Series of booleans rather than a single bool. Every built-in Python object returns a bool on logical comparisons (with the exception of the special keywords ...). There are a few mailing list threads you might check out that explain why numpy chose to raise on non-empty arrays (links from numpy/numpy#2031). Snippet from a mailing list post that I think breaks it down best:
|
We can't keep everyone happy here, so we keep everyone unhappy but with correct code (that is, code without surprises)... which ought to keep everyone happy (inside). As you've mentioned, this was brought up in #4633 (originally #1073, #1069, etc.) and was discussed extensively in the mailing lists: https://groups.google.com/d/msg/pydata/XzSHSLlTSZ8/QEOsT4l3RFYJ I understand @tbaugis's sentiment that it should work pythonically, however experience/discussion suggests this is a bad idea. The current solution works well with a raise if it's ambiguous (and unfortunately it is ambiguous*). I stand by the team's decision that raising is the correct solution, and writing ... The message on this exception tells you exactly and immediately how to fix your code to be non-ambiguous. * For example, consider calling bool on the following Series:
Choosing either is going to surprise/confuse. (That said, if you can persuade numpy to "fix" this, potentially we'd be happy to as well...) @armaganthis3 I think you've misread the numpy examples above; checking:
does not check emptiness: it checks emptiness, or whether it contains one item whose value is falsey, and if it contains more than one item it RAISES... which is fine if that's what you meant, but IMO that's rarely what the user means. Personally I can't see a good reason to write code which could, depending on the length, potentially raise... but if you want to, for your convenience, you can use |
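(For reference, a small illustrative sketch of the explicit spellings being discussed; not taken from the thread.)

import pandas as pd

df = pd.DataFrame()

# bool(df) raises ValueError ("The truth value of a DataFrame is ambiguous...")
# whether df is empty or not; the unambiguous spellings are explicit:
print(df.empty)      # True
print(len(df) == 0)  # True

s = pd.Series([True, False])
print(s.any())  # True  - at least one element is truthy
print(s.all())  # False - not every element is truthy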
As I mentioned in the ML thread, I used the

if df:
    # do stuff

idiom quite a bit. It was super convenient and I never had a problem with the ambiguous cases. I had lots of legacy code break by making things more correct. If pandas were a personal project, it would work the way lists do. That being said, there are valid reasons for the inconvenience and it's not something I even register anymore. |
@jratner - what seems to be off for me is the attempt to do stuff like this: the equality operator is not the correct one to use, as the response is not a single boolean but rather a list of booleans. I'm having a hard time remembering any precedents like this outside of numpy/pandas. The proper thing here would be to use an explicitly named function, something like ... And the equality comparison should compare for full equality - that is, if the hashes of the objects match. Hope I'm not repeating myself too much! |
@tbaugis This is the core of numpy (and pandas): operations work elementwise. See, for example, that adding two arrays adds the individual elements, while adding two lists concatenates them:
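(The snippet from this comment isn't preserved here; a stand-in illustration of the point:)

import numpy as np
import pandas as pd

print([1, 2] + [3, 4])                      # [1, 2, 3, 4]  - lists concatenate
print(np.array([1, 2]) + np.array([3, 4]))  # [4 6]         - arrays add elementwise

# == follows the same elementwise rule, returning a boolean Series
# rather than a single True/False:
print(pd.Series([1, 2]) == pd.Series([1, 3]))
# 0     True
# 1    False
# dtype: bool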
and the equality operator behaves the same. |
@jorisvandenbossche thanks for the insight! Without a grudge, I guess pandas can be described as "matlab or R in python", where the notations of the mathematical/computational world clash with the python ones. How about pandas3k then! :) Anyhoo, I guess you have explained all the things to me. Thank you all for your time! |
@tbaugis It's indeed true that pandas deviates from python in some ways, to have more convenient behaviour for numerical data analysis (what you could call matlab- or R-like, but more generally scientific-computing-like). But maybe this can be made clearer in the docs, for example with a section on "differences between pandas and python" or "dataframes are not lists" (because it is certainly not going to change :-)). |
@tbaugis thanks for your patience, this has been an illuminating issue! going to close, but if anyone has comments, pls feel free to post! |
On Tue, Apr 29, 2014 at 6:59 AM, Toms Bauģis <notifications@github.com> wrote:
FWIW, this is what the
or
|
@tbaugis that's #1134 (a "feature"). There was some discussion of a better way to check whether two pandas objects are equal... there are a couple of SO answers using assert_frame_equal :S ... hacky. |
@jseabold this has nothing to do with ... And if you want to chime in with a mantra, then, with respect, I'd suggest going for "rectangular circles are better than circular squares", as "explicit is better than implicit" seems to have lost its meaning entirely. @hayd - I'm toying with the idea of creating a supplemental module for pandas that would overload the default non-pythonic behavior. I imagine it would render some pandas code incompatible and would be meant only for people like me, for whom [1, 2, 3, 4] + [1, 2] still is [1, 2, 3, 4, 1, 2] :) |
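(A rough, hypothetical sketch of what such an overloading module could look like; the class name is invented and this is not an actual pandas API.)

import pandas as pd

class PythonicFrame(pd.DataFrame):
    # Hypothetical: make truth-testing mean "not empty", list-style.
    def __bool__(self):          # Python 3
        return not self.empty
    __nonzero__ = __bool__       # Python 2, which the thread predates py3-only pandas by

    @property
    def _constructor(self):
        return PythonicFrame     # keep the subclass through pandas operations

if not PythonicFrame({"a": []}):
    print("an empty frame is falsey here")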
On Tue, Apr 29, 2014 at 9:20 AM, Toms Bauģis <notifications@github.com> wrote:
Sigh ok I'm not going to rehash this and am now sorry I chimed in on this |
@hayd isn't this the newly added equals method? @tbaugis this is maybe leading to another discussion, but what features then do you want from pandas/numpy instead of lists? If not all the standard elementwise operations (+, -, *, /, ==, <, ...)? This is the strength of numpy/pandas, and changing that would not render some pandas code incompatible, but almost all of it, I think. |
in 0.13.1. @tbaugis, eg.:
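(The example itself wasn't preserved in this copy; presumably something along the lines of the then-new DataFrame.equals, e.g. with made-up data:)

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, None]})
df2 = pd.DataFrame({"a": [1, 2], "b": [3.0, None]})

print(df1.equals(df2))           # True  - one boolean for the whole object
                                 #         (NaNs in the same locations count as equal)
print((df1 == df2).all().all())  # False here, because NaN != NaN elementwise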
|
@jorisvandenbossche - In my case, I use pandas to load in data and then to analyze it - there is filtering, sorting, resampling, grouping and pivoting going on, but I haven't yet used any of the standard math operators. In addition, I'm using TimeSeries, and then all the export functions. I can also see the other cases and why it might not make sense to tweak pandas to be more pythonic, as that would trip up the people coming from or using R, matplotlib and so on. |
Right now, as per the gotchas section, pandas raises a ValueError on an empty frame:
http://pandas.pydata.org/pandas-docs/dev/gotchas.html#gotchas-truth
While the documentation describes ambiguity, in the python world there is none.
PEP8 (http://legacy.python.org/dev/peps/pep-0008/) recommends: "For sequences, (strings, lists, tuples), use the fact that empty sequences are false."
The check returns False for None and for empty values of all iterables. Further, there are the any (https://docs.python.org/2/library/functions.html#any) and all (https://docs.python.org/2/library/functions.html#all) functions for the proposed ambiguity checks.
As it currently just raises a ValueError, fixing the problem so that the evaluation reflects DataFrame.empty would be backwards compatible.
The check becomes important when working in a mixed environment with pandas and other structures - a simple

if foo

can become as convoluted as

empty = foo.empty if (foo is not None and isinstance(foo, pd.DataFrame)) else not foo
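(For concreteness, an illustrative, runnable version of the situation described above; not part of the original report.)

import pandas as pd

foo = pd.DataFrame()

# Current behaviour: plain truth-testing raises, even for an empty frame.
try:
    if foo:
        print("have data")
except ValueError as err:
    print(err)  # explains the ambiguity and points to .empty / .any() / .all()

# The workaround from the description, tidied up (.empty is a property, not a method):
empty = foo.empty if (foo is not None and isinstance(foo, pd.DataFrame)) else not foo
print(empty)  # True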