Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Have pivot and pivot_table take similar arguments #5505

Merged
merged 1 commit into from
Mar 14, 2014

Conversation

jsexauer
Copy link
Contributor

Right now, there is

df.pivot(index=foo, columns=bar, values=baz)

and

df.pivot_table(rows=foo, cols=bar, values=baz)

Because of dirtyness in my data I occasionally have to go back and forth between the two forms (and use df.pivot_table(..., aggfunc='count') to see who had the bad data). It would be nice if I didn't have to change the keyword arguments each time and one had an alias of the other. It would also be easier to remember, especially in the trivial case of cols vs columns. Finally, the two are related operations and having an option for consistent keyword parameters makes sense.

I could make a PR to solve this easy enough, if this is an enhancement you are interested in. I'm not really sure what your stance is on redundant keyword arguments.

@jtratner
Copy link
Contributor

If you want to propose something we'd likely be open to it (depending on
how much the API is changed), but you'd also need to maintain order of
arguments. Probably easiest to just add **kwargs to it and then validate
that only cols or columns were passed and then just convert to what
function expects

if rows is None:
rows = kwarg['index']
else:
raise TypeError("Can only specify either 'rows' or 'index'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think better way to do this is (without all this for loop stuff):

index = kwargs.get('index', None)
if index is not None:
   if rows is None:
      rows = index
   else:
      raise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could index be an ndarray?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha, yes, should be "is not None"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtratner I'm sorry, but I don't follow how index being an ndarray breaks things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsexauer if something does bool(something) under the hood and ndarrays longer than length one can't be converted to bool:

>>> bool(np.array([1, 2, 3,4])
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hayd I think the flaw with your approach is that if the user passes an incorrect keyword argument, no error is raised. It seems to me we would want to raise a type error in such an event. This is especially import I feel for pivot_table() where for example someone might pass agg_func instead of aggfunc and get frustrated wondering why the function is returning apparently incorrect results.

@jtratner The following experiment works for me:

>>> if np.ndarray([1,2]) is None:
...     print 'None'
... else:
...     print 'Not None'
... 
Not None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it a pop rather than a get then check the length of kwargs is 0. (if kwargs: raise...)

@hayd
Copy link
Contributor

hayd commented Nov 20, 2013

I guess the problem is that the API (order of args) won't be changed, so it's not really going to ever fully fix... :(

@jsexauer
Copy link
Contributor Author

@hayd hopefully my last commit addresses your concerns.

@jorisvandenbossche
Copy link
Member

A remark: would it make sense to choose which of the keywords (col/columns and rows/index) is preferential and only provide those to the other function (and not have both options in both functions)?

@jreback
Copy link
Contributor

jreback commented Dec 5, 2013

@jsexauer does this have an issue? (aside from this PR)

@jsexauer
Copy link
Contributor Author

jsexauer commented Dec 6, 2013

Not sure what you're asking @jreback. The included PR, which has no issues to my knowledge, fixes the issue regarding a discrepancy between the parameters for pivot and pivot_table.

@jreback
Copy link
Contributor

jreback commented Dec 6, 2013

am asking if their is a reference issue for this (eg a prior issue); guess not

@jsexauer
Copy link
Contributor Author

jsexauer commented Dec 6, 2013

I submitted this originally as an issue ticket but changed into a PR ticket when I coded the enhancement.

msg = "Can only specify either 'cols' or 'columns'"
raise TypeError(msg)

if len(kwarg) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pythonic tip - you can just use "if kwarg" (my bad for saying otherwise)

@hayd
Copy link
Contributor

hayd commented Dec 6, 2013

This has tripped me in the past, so I think I'm +1 on this.

@jsexauer
Copy link
Contributor Author

jsexauer commented Dec 7, 2013

I've implemented @hayd 's suggestions.

@hayd
Copy link
Contributor

hayd commented Dec 7, 2013

(waiting til after 0.13 is final, but then it would need something in release notes and some tests, sorry I hadn't realised there were no tests 'til just now!)

@jorisvandenbossche
Copy link
Member

I want to repeat my remark (as there was not really a response). Isn't it possible to really make it more consistent instead of just providing both options to both functions as additional keyword arguments?

  • for cols/columns we could choose one of the two as the preferential one. E.g. choosing for columns, and only add it as a keyword argument to pivot_table (and not also add cols to pivot).
    I think it would maybe even be possible to change cols to columns in the default function call of pivot_table, and provide cols as additional one (so both argument by order as by name still works, no backwards incompatible change, but the function call itself are also more consistent).
  • for index/rows it is maybe less clear wich is the preferential?

@jreback
Copy link
Contributor

jreback commented Dec 8, 2013

+1 on joris suggestion

maybe attach a depr warning to use of cols?

columns and index would be the most consistent with the rest of pandas, though rows is ok 2 (definitely not cols though)

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

I think lets break pivot_table and just go with pivot args; you can for 0.14 provide a nice warning message, but I would just change the signature to using index/columns. This is used everywhere else; rows/cols not used anywhere.

@hayd
Copy link
Contributor

hayd commented Feb 18, 2014

should probably deprecate old args too.

@jreback
Copy link
Contributor

jreback commented Mar 9, 2014

@jsexauer can you respond to @hayd comments?

@jsexauer
Copy link
Contributor Author

jsexauer commented Mar 9, 2014

Assuming this passes CI, I believe this addresses everyone's concerns. Sorry for the delay.

self.assertIsInstance(w[0], FutureWarning)
self.assertIsInstance(w[1], FutureWarning)

with warnings.catch_warnings(record=True) as w2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tm.assert_produces_warning here (same thing just more consistent)

@jsexauer
Copy link
Contributor Author

Ok good to know. Sorry you're going to get my full commit history then. I will do this in the future.

@jorisvandenbossche
Copy link
Member

But you can do it now, that is no problem.
If you just execute the lines above, you should be able to meld all the commits into one or a few.

@jsexauer
Copy link
Contributor Author

God I hope I did this right...

@jorisvandenbossche
Copy link
Member

Hmm, you have one commit (the third last one) that seems a squashed one, but the original commits are not gone, so clearly something went wrong, but I don't know what.

If git rebase -i does not work, another option is to create a new branch and cherry-pick that one commit that is the good one, something like:

git checkout master
git rebase upstream/master
git checkout -b new_branch
git cherry-pick 1e1caf6e367d1f04cfce43cdff062e80bb99ebaf

and then push this branch (to a new PR)

@jsexauer
Copy link
Contributor Author

So I deleted branch fix5505 on my local repo, and created a new version which was a squashed rebase of commit e3dad66 onto upstream/master (which is commit 0d74287). I then tried to push this new version of branch fix5505 onto jsexauer/pandas/fix5505 but it said I had to pull down the old version first, so I made a new empty commit [2409c0e] to make it happy. If there's a better way to do this last step, let me know, but this was the best I could figure out short of opening a new PR. If you'd prefer I did that I will, but it seems like it would be best to keep this mess in one ticket instead of spreading the wealth :)

@hayd
Copy link
Contributor

hayd commented Mar 14, 2014

You have to force push at the end (as you've changed the history), you've now done it as a merge. Best advice is to create new branch off master and cherry-pick as Joris describes. then force push.

git push origin new_branch:fix5505 --force

@jsexauer
Copy link
Contributor Author

Bam, looks good I think?

@jsexauer
Copy link
Contributor Author

I'm really at a loss, as the test it fails on works for me:

import pandas
import numpy as np
import pandas.util.testing as tm

index = ['A', 'B']
columns = 'C'
df = pandas.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                                     'bar', 'bar', 'bar', 'bar',
                                     'foo', 'foo', 'foo'],
                               'B': ['one', 'one', 'one', 'two',
                                     'one', 'one', 'one', 'two',
                                     'two', 'two', 'one'],
                               'C': ['dull', 'dull', 'shiny', 'dull',
                                     'dull', 'shiny', 'shiny', 'dull',
                                     'shiny', 'shiny', 'shiny'],
                               'D': np.random.randn(11),
                               'E': np.random.randn(11),
                               'F': np.random.randn(11)})

with tm.assert_produces_warning(FutureWarning):
    pandas.pivot_table(df, values='D', rows=index,
                                    cols=columns)

with tm.assert_produces_warning(False):
    pandas.pivot_table(df, values='D', index=index,
                                    columns=columns)

pandas.pivot_table(df, values='D', rows=index,
                                cols=columns)

@jorisvandenbossche
Copy link
Member

@jsexauer In any case, the rebase looks good now! One clean commit.

Regarding the test warning, this is because the warning is already triggered somewhere else, namely in the TestCrosstab tests in the same file (and only triggered once). You should also replace there all rows/cols to index/columns, as well as in eg the function crosstab (which is based on pivot_table).

@jsexauer
Copy link
Contributor Author

Ok I'll do that when I get home from work today. Would you like that squashed into the previous commit or as a new one?

@jorisvandenbossche
Copy link
Member

You can try to squash it at once, see if it works now .. :-) If you have further git questions, just ask!

@jsexauer
Copy link
Contributor Author

Thank you all for your patience in helping me figure out how to do this within your process. Unless someone indicates otherwise, I believe this PR is ready for merging.

@jorisvandenbossche
Copy link
Member

And you thanks fo the contribution! And travis is happy now

A last thing, this should also be added to the release notes, under api changes I think. You can add in both https://github.com/pydata/pandas/blob/master/doc/source/v0.14.0.txt and https://github.com/pydata/pandas/blob/master/doc/source/release.rst a short line mentioning this.

@jsexauer
Copy link
Contributor Author

Updated

@jreback
Copy link
Contributor

jreback commented Mar 14, 2014

looks good..

@jorisvandenbossche merge when you are ready

@jorisvandenbossche
Copy link
Member

Looks good! Thanks a lot @jsexauer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants