Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: allow dependent assignment? #14207

Closed
chris-b1 opened this issue Sep 12, 2016 · 19 comments · Fixed by #17632 or #18852
Closed

API: allow dependent assignment? #14207

chris-b1 opened this issue Sep 12, 2016 · 19 comments · Fixed by #17632 or #18852

Comments

@chris-b1
Copy link
Contributor

chris-b1 commented Sep 12, 2016

Since kwarg order will be guaranteed in python 3.6, we could allow this...though would only want to try with a 3.6+ version check, otherwise code could more or less randomly work/not work, depending on dict order.

In [44]: df = pd.DataFrame({'a': [1,2,3]})

In [45]: df.assign(b=1, c=lambda x: x['b'] * 2)
@TomAugspurger
Copy link
Contributor

xref: #9777 (comment)

This would be a great, but I want to think a bit about exposing a feature that would so easily lead to subtle differences between 3.5 and 3.6. Not sure how best to handle it :/

Either way, we could probably deprecate the sorting of keys (documentation only; can't print a warning each time someone uses .assign).

@TomAugspurger
Copy link
Contributor

FYI, here's the current docs (in a warning block):

Since the function signature of assign is **kwargs, a dictionary, the order of the new columns in the resulting DataFrame cannot be guaranteed to match the order you pass in. To make things predictable, items are inserted alphabetically (by key) at the end of the DataFrame.
All expressions are computed first, and then assigned. So you can't refer to another column being assigned in the same call to assign.

@chris-b1
Copy link
Contributor Author

chris-b1 commented Sep 12, 2016

Yeah, the back-compat is awkward, maybe this could be opt-in from an option, so it would at least raise on older pythons, although that's ugly too.
pd.set_option('ordered_assign', True) # raises on py < 3.6

@TomAugspurger
Copy link
Contributor

@mrocklin thoughts on this? IIRC you wanted this initially to make things easier for dask? How much would break for you / dask users if we just removed the sorting of keys?

@mrocklin
Copy link
Contributor

mrocklin commented Feb 9, 2017

I think that Pandas should make whatever decisions make sense for Pandas. Dask.dataframe will adapt as necessary. I suspect that this won't cause a problem. Dask.dataframe will just try the assign operation on an empty dataframe to determine the column ordering. Whatever Pandas does we'll probably do automatically.

@shoyer
Copy link
Member

shoyer commented Feb 9, 2017

Ultimately, I think this we should switch to this behavior and stop sorting.

But I don't see much of a rush. It might be better to wait until Python 3.6 is more widely used, even until pandas 2.0.

@shoyer
Copy link
Member

shoyer commented Feb 9, 2017

I would also be comfortable just changing this for Python 3.6+. My guess is that there aren't that many folks using Python 3.6 in production yet, so this will probably cause minimal issues.

@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Sep 21, 2017
@TomAugspurger
Copy link
Contributor

I think we should sort this out for 0.21. I propose

  • No change to 2.7 - 3.5; order is sorted by key
  • For 3.6+, we use the original order the user passed in
  • We don't allow dependent assignment yet

Maybe in the future, when there's less 3.5 and lower code around, we can allow for dependent assignment.

@jorisvandenbossche
Copy link
Member

+1 on your summary

@jreback
Copy link
Contributor

jreback commented Sep 22, 2017

sgtm on @TomAugspurger summary as well.

@bobhaffner
Copy link
Contributor

HI All, Tom said this was a good one for new contributors so I'm going after it.

@jorisvandenbossche
Copy link
Member

Keep this open for the second part: allow dependent assignment somewhere in the future?

@jreback
Copy link
Contributor

jreback commented Sep 25, 2017

I would create a new separate issue if you really want to. That is a long time from now though.

@jorisvandenbossche
Copy link
Member

The title and description in the top post is exactly this, so no need to create a new one I think, if we want an issue for it

@TomAugspurger
Copy link
Contributor

If you didn't see, as of python 3.7 dictionaries are ordered, as part of the language https://mail.python.org/pipermail/python-dev/2017-December/151283.html

My objection to allowing dependent assignment now was that it'd (potentially) lead to subtle bugs / confusing errors between versions of python. But since Python itself is opening the door, let's slip through too :) I think we should allow dependent assignment in 0.22.

@datajanko
Copy link
Contributor

Obviously, I am not good at looking through existing issues. Please accept my aplogogies.

As of #18797 I am also interested in letting assign be able to accept dependent kwargs. I just performed the (in my opinion) naive/direct implementation (after adapting the test cases accordingly) and it works fine - for callables where references to the "depenendet columns" are evaluated lazily. Something like df.assign(b=1, c=df['b']) will - of course - not work.

So if it's okay, that dependent assignments only work for certain callables, I'd be happy to issue a pull request in the next days.

@TomAugspurger
Copy link
Contributor

if it's okay, that dependent assignments only work for certain callables

Yes, that sounds right. You'll need to use a callable to refer to newly created columns in the .assign.

datajanko pushed a commit to datajanko/pandas that referenced this issue Dec 18, 2017
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
@datajanko
Copy link
Contributor

before issuing pull request, I obviously need to provide an entry in the what's new file. I assume other api changes is the right place, correct? Sorry for asking again, but this is may first coding contribution for an open source project.

@chris-b1
Copy link
Contributor Author

chris-b1 commented Dec 19, 2017 via email

datajanko pushed a commit to datajanko/pandas that referenced this issue Dec 19, 2017
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
datajanko pushed a commit to datajanko/pandas that referenced this issue Dec 23, 2017
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
@jreback jreback modified the milestones: Someday, 0.23.0 Dec 23, 2017
datajanko pushed a commit to datajanko/pandas that referenced this issue Jan 9, 2018
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
datajanko pushed a commit to datajanko/pandas that referenced this issue Feb 9, 2018
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.

populates dsintro and frame.py with examples and warning

- adds example to frame.py
- reworked warning in dsintro
- reworked Notes in frame.py

Remains open:

frame.py probably is responsible vor travis not passing: doc test that requires python 3.6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants