use boolean indexing via __getitem__ to trigger masking; add inplace keyword to where #2230

Merged: 5 commits, Nov 13, 2012

jreback
Contributor

@jreback jreback commented Nov 11, 2012

in core/frame.py

changed method __getitem__ to use mask directly (e.g. df.mask(df > 0) is semantically equivalent to df[df > 0])
this would be a small API change, as before df[df > 0] returned a boolean np array

added inplace keyword to where method (to update the dataframe in place, default is NOT to use inplace, and return a new dataframe)

changed method boolean_set to use where and inplace=True (this allows alignment of the passed values and is slightly less strict than the current method)

all tests pass (as well as an added test in boolean frame indexing)

if included in 0.9.1 would be great (sorry for the late addition)
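
For readers following along, here is a minimal sketch of the inplace keyword described above. It assumes a current pandas release, where `where` keeps values at positions where the condition is True; the semantics were still being settled in this thread, so 0.9.1 may behave differently. The frame contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the PR.
df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

# Default (inplace=False): a new frame is returned and df is left untouched.
masked = df.where(df > 0)

# inplace=True: the frame itself is modified and the call returns None.
df2 = df.copy()
result = df2.where(df2 > 0, inplace=True)

print(masked)
print(result)
```

The in-place call leaves `df2` looking like `masked`, while the original `df` keeps all of its values.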

changed method __getitem__ to use .mask directly (e.g. df.mask(df > 0) is equivalent semantically to df[df>0])

added inplace keyword to where method (to update the dataframe in place, default is NOT to use inplace, and return a new dataframe)

changed method _boolean_set_ to use where and inplace=True (this allows alignment of the passed values and is slightly less strict than the current method)

all tests pass (as well as an added test in boolean frame indexing)
…al sized frame

thus we now allow: df[df[:-1]<0] = 2 (essentially partial boolean indexing)

all tests continue to pass (added new test to test partial boolean indexing, removed test requiring an equal indexed frame)
@wesm
Member

wesm commented Nov 11, 2012

The files here were made executable again (notably, this causes nose to exclude them from runs). Something must be wrong with your env-- as a band-aid you can globally configure git so it doesn't change file permissions

@jreback
Contributor Author

jreback commented Nov 11, 2012

git config core.filemode false
seems to fix this....pushed a change to update modes on these 2 files

@jreback
Contributor Author

jreback commented Nov 11, 2012

I also think something like this should be added to the docs:

selecting with a single column of a df (eg df[df['A']>0])
will potentially change the shape of the frame (as entire rows can be dropped)

while masking with the entire frame (eg df[df>0])
returns an equal sized frame
(even in the case of partial masking on the rows)
eg df[df[:-1]>0]
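
A small sketch of the shape difference described in this comment, assuming a current pandas release and made-up data: selecting with a boolean Series drops rows, while indexing with a boolean frame keeps the shape and fills non-matching cells with NaN.

```python
import pandas as pd

# Illustrative data, not taken from the PR.
df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

# Boolean Series: rows where A > 0 are kept, the others are dropped.
rows = df[df["A"] > 0]

# Boolean frame: same shape as df, cells failing the condition become NaN.
masked = df[df > 0]

print(rows.shape, masked.shape)
```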

@changhiskhan
Contributor

@jseabold :)

@changhiskhan
Contributor

@wesm @jreback, what's more intuitive to you:

  1. mask where cond is True or False? Guess depends on whether you think of it as putmask-like or boolean selection.
  2. should where take values from self or other when cond is True?

@wesm
Member

wesm commented Nov 12, 2012

I'm unconvinced about the df[bool_dataframe] yielding a masked DataFrame behavior (instead of asking that people explicitly use mask). I guess the use case you're thinking of is to be able to do stuff like:

df[df > 0].mean(1)

vs

df.mask(df > 0).mean(1)

?

@jreback
Contributor Author

jreback commented Nov 12, 2012

actually no; I use mask like this:

df = start_frame
for m in masks:
    df = df.mask(m)

so I am 'applying' a series of 'filters' to df, but I want to avoid reindexing at each step (as I like to preserve the shape);
(these operations are not necessarily simple either - so they are not all 'ands' - this is also a simplification - I don't do it in a loop like this but pass it around - and other functions need to know about the whole frame)
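
The filter-chaining pattern above might look roughly like this in a current pandas release. This is a sketch under assumptions: it uses `where` (keep values where the condition is True) rather than `mask`, since the meaning of `mask` was still under discussion here, and the data and conditions are invented for illustration.

```python
import pandas as pd

# Illustrative data and conditions, not taken from the PR.
df = pd.DataFrame({"A": [1.0, -2.0, 3.0, 8.0], "B": [4.0, 5.0, -6.0, 7.0]})
conditions = [df > 0, df < 6]

# Apply each filter in turn; the frame keeps its shape at every step,
# so no reindexing is needed between steps.
filtered = df
for cond in conditions:
    filtered = filtered.where(cond)

print(filtered)
```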

I am fine with making users do df.mask explicitly; I was actually thinking that since df[df > 0] = 2 is supported, the converse operation should directly yield a frame (rather than an ndarray)

df[df > 0] (in the new version)

is equivalent to the more verbose (assume that df[df >0] returns a frame rather than an ndarray)

df[df > 0].reindex_like(df)

@jreback
Contributor Author

jreback commented Nov 12, 2012

in response to @changhiskhan

  1. I agree 'True' is more natural (as that is what masking implies).....maybe it shouldn't be called mask, but 'select' (but that's taken!)..
  2. i don't see a problem with how 'where' is done now....?

@changhiskhan
Contributor

@jreback, on #2, I was just thinking whether it's more intuitive that self.where(cond, other) is like np.where(cond, self, other) or np.where(cond, other, self). In your pull request, the cond is used in putmask where you need to invert the cond to match non-inplace DataFrame.where.
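
For reference, a sketch of how the two orderings compare in a current pandas release (not necessarily the state of the code at this point in the thread): `df.where(cond, other)` lines up with `np.where(cond, df, other)`, i.e. values come from self where the condition is True. The data is illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the PR.
df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})
cond = df > 0

# pandas: keep df where cond is True, take -df elsewhere.
via_pandas = df.where(cond, -df)

# NumPy equivalent: first choice where cond is True, second otherwise.
via_numpy = np.where(cond, df, -df)

print(via_pandas)
print(via_numpy)
```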

@jreback
Contributor Author

jreback commented Nov 13, 2012

@changhiskhan
I had just copied what was in __setitem__; here other is really the value that gets set, so it is what you want if cond is true

inplace = true is __setitem__, while inplace = false (e.g. use np.where) is __getitem__
and wanted to use the same alignment code (in theory these could actually/maybe should be separate methods)...inplace=True is kind of 'internal'; I don't think it would ever be called directly (instead you would call it indirectly via __setitem__)

here's what I would do:

  1. eliminate mask (the user can always recreate by inverting the condition / or make it call where with an inverted condition)
  2. change the where signature to: where(self, cond, other = NA, inplace = False)
  3. change getitem to use where (rather than mask)

so then syntax is pretty clean: df.where(cond) is equiv to df[cond] when cond is a Frame
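
A quick check of that equivalence, as a sketch against a current pandas release with made-up data:

```python
import pandas as pd

# Illustrative data, not taken from the PR.
df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})
cond = df > 0

# Indexing with a boolean frame and calling where give the same masked frame.
by_getitem = df[cond]
by_where = df.where(cond)

print(by_getitem.equals(by_where))
```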

I have tested and am ready to push if this looks ok

removed mask method
made other optional kw parm in where
changed __setitem__ to use where (rather than mask)
added condition testing to where that raised ValueError on an invalid condition (e.g. not an ndarray like object)
added tests for same
@wesm wesm merged commit a414346 into pandas-dev:master Nov 13, 2012
@wesm
Member

wesm commented Nov 13, 2012

I don't think the semantics for inplace=True are right-- it's doing the opposite of what you would want (leaving values in df where the condition is True otherwise making them values from the other value). Making the necessary changes...

@jreback
Contributor Author

jreback commented Nov 13, 2012

ok...makes sense.....only (minor) issue....in numpy pretty sure that negation (-) is completely equivalent to invert (~) when applied to boolean?

in mask should change ~cond to -cond ?
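
A note for anyone reading this later: more recent NumPy releases reject unary minus on boolean arrays with a TypeError, so `~` (or `np.logical_not`) is the portable spelling. The sketch below uses an illustrative array and guards the `-` case so it runs either way.

```python
import numpy as np

cond = np.array([True, False, True])

# ~ inverts a boolean array element-wise.
inverted = ~cond

# Unary minus on booleans: accepted by old NumPy, TypeError in recent releases.
try:
    negated = -cond
    minus_supported = True
except TypeError:
    minus_supported = False

print(inverted, minus_supported)
```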

@wesm
Member

wesm commented Nov 13, 2012

Yeah, no biggie either way

@jreback
Contributor Author

jreback commented Nov 14, 2012

these might be helpful examples for whatsnew and/or docs.....

import pandas
import numpy as np

df = pandas.DataFrame(np.random.randn(5, 3),columns=['A','B','C'])
df

standard frame selection (with a boolean series as the condition),
get a frame that is not necessarily the same shape as the input frame

df[df.A>0]

standard frame selection (but with boolean frame as the condition,
get a masked frame as the result, the same size as the original frame)

df[df>0]

where is the underlying mechanism

df.where(df>0)

substitute values that meet the condition

df.where(df>0,-df)

setting values

df2 = df.copy()
df2[df2>0] = -df2
df2

masking is the inverse boolean operation of where

df.mask(df>0)

df.where(df<=0)

advanced: partial selection and setting

df2 = df.copy()
crit = df2[1:2]>0
crit
df2[crit] = 3
df2
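
The relationships in these examples can also be checked programmatically. The following is a sketch against a current pandas release; the seeded generator is an assumption added here so the run is repeatable, and the data contains no NaN (with NaN present, `df > 0` and `df <= 0` are both False, so the mask/where comparison would not hold).

```python
import numpy as np
import pandas as pd

# Seeded random data so the check is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])

# mask is where with the condition inverted (for NaN-free data).
same = df.mask(df > 0).equals(df.where(df <= 0))

# Setting through a boolean frame: flip the sign of every positive value.
df2 = df.copy()
df2[df2 > 0] = -df2
all_nonpositive = bool((df2 <= 0).all().all())

print(same, all_nonpositive)
```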

@wesm
Member

wesm commented Nov 15, 2012

if someone wants to do a PR to add these to the docs/what's new that'd be great. i am doing too many other things to do it myself

@jreback
Contributor Author

jreback commented Nov 15, 2012

done....added a small pull-request with the changes for whatsnew for 0.9.1...let me know if you need anything further!

yarikoptic added a commit to neurodebian/pandas that referenced this pull request Nov 15, 2012
* commit 'v0.9.1rc1-27-ge374f0f': (52 commits)
  BUG: axes.color_cycle from mpl rcParams should not be joined as single string
  BUG: icol duplicate columns with integer sequence failure. close pandas-dev#2228
  TST: unit test for pandas-dev#2214
  BUG: coerce ndarray dtype to object when comparing series
  ENH: make vbench_suite/run_suite executable
  ENH: Use __file__ to determine REPO_PATH in vb_suite/suite.py
  BUG: 1 ** NA issue in computing new fill value in SparseSeries. close pandas-dev#2220
  BUG: make inplace semantics of DataFrame.where consistent. pandas-dev#2230
  BUG: fix internal error in constructing DataFrame.values with duplicate column names. close pandas-dev#2236
  added back mask method that does condition inversion added condition testing to where that raised ValueError on an invalid condition (e.g. not an ndarray like object) added tests for same
  in core/frame.py
  TST: getting column from and applying op to a df should commute
  TST: add dual ( x op y <-> y op x ) tests for arith operators
  BUG: Incorrect error message due to zero based levels. close pandas-dev#2226
  fixed file modes for core/frame.py, test/test_frame.py
  relaxed __setitem__ restriction on boolean indexing a frame on an equal sized frame
  in core/frame.py
  ENH: warn user when invoking to_dict() on df with non-unique columns
  BUG: modify df.iteritems to support duplicate column labels pandas-dev#2219
  TST: df.iteritems() should yield Series even with non-unique column labels
  ...