
ENH: should boolean indexing preserve input dtypes where possible? #2794

Closed
jreback opened this issue Feb 3, 2013 · 22 comments

Comments

@jreback
Contributor

jreback commented Feb 3, 2013

Should pandas preserve the input dtype when doing
boolean indexing, if possible?

It's a pretty limited case, in that all values in a column have to be non-null
and they all have to be integers (just cast as floats).

This is actually a little tricky to implement, as it has to be done column by column (and it's possible that blocks have to be split).

The question here is: would this be not what the user expects? (Currently all dtypes
on boolean output operations are cast to float64/object.)

In [24]: df = pd.DataFrame(dict(
  a = pd.Series([1]*3,dtype='int32'), 
  b = pd.Series([1]*3,dtype='float32')),
index=range(3))

In [25]: df.ix[2,1] = 0

In [26]: df
Out[26]: 
   a  b
0  1  1
1  1  1
2  1  0

In [27]: df.dtypes
Out[27]: 
a      int32
b    float32
Dtype: object

In [28]: df[df>0]
Out[28]: 
   a   b
0  1   1
1  1   1
2  1 NaN

##### if it is possible to preserve column a as the original int input dtype
##### (which in this case it is), the 'possible output' below would apply #####
In [29]: df[df>0].dtypes
Out[29]: 
a    float64
b    float32
Dtype: object

### possible output ###
Out[29]: 
a    int32
b    float32
Dtype: object
@cpcloud
Member

cpcloud commented Feb 4, 2013

I get dtype conversion on DataFrame construction with dicts; what version of pandas are you using?

@jreback
Contributor Author

jreback commented Feb 4, 2013

This is a development version (either 0.10.2 or 0.11-dev). The above code won't work on 0.10.1 (well it 'works', but it converts dtypes). Dtypes are preserved in some limited cases in 0.10.1. This is what #2708 is all about.

I am asking this:

If you input an integer dtype and perform an operation that results in integer-like numbers,
we have to round-trip it through floats (mainly because the array can contain NaN). But if the result for a particular column in a DataFrame CAN be converted back to integer - SHOULD we convert it back to the original dtype?

This is not currently the case in 0.10.1 or lower.
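(A minimal sketch of the kind of check being asked about, using a hypothetical helper rather than anything in pandas: a float column can be cast back to its original integer dtype only if it contains no NaN and every value is still integral.)

    import numpy as np
    import pandas as pd

    def maybe_restore_int_dtype(col, original_dtype):
        # hypothetical helper (not a pandas API): cast a float column back to
        # its original integer dtype only when nothing would be lost
        if not np.issubdtype(np.dtype(original_dtype), np.integer):
            return col
        values = np.asarray(col, dtype='float64')
        if np.isnan(values).any():                  # NaN from the mask forces float
            return col
        if (values == values.astype(original_dtype)).all():  # still integral
            return col.astype(original_dtype)
        return col

    df = pd.DataFrame({'a': pd.Series([1, 1, 1], dtype='int32'),
                       'b': pd.Series([1.0, 1.0, 0.0], dtype='float32')})
    masked = df[df > 0]
    restored_a = maybe_restore_int_dtype(masked['a'], 'int32')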

@stephenwlin
Contributor

@jreback -- I'm a bit confused, because I've checked out your dtypes branch and see that you have implemented a try_cast parameter for DataFrame.where that seems to do what you're talking about. Are you just asking whether it should be turned on by default from getitem/setitem?

@jreback
Contributor Author

jreback commented Feb 4, 2013

That's exactly what I am asking. It's actually not fully implemented, because it can happen that an IntBlock needs to be split into multiple dtypes (not hard, just didn't do it yet).

I turned it off because I had a few failing tests - basically the 'user' currently always expects the conversion to float64. Is it important to try to cast the dtype back to int, where possible?

@stephenwlin
Contributor

I'm too new to this to be an authority on how things should be, so don't take my opinion too seriously... but I'm personally inclined to think that try_cast should be off by default, because having it on means that the dtype of the result depends on what values happen to match a boolean condition, which is a bit odd: it makes more sense to me that the type of the result of an operation should depend only on the types of its inputs, not on the particular values within those types.

I know this rule doesn't hold true for a lot of pandas behavior right now, though, so maybe my concern isn't really apropos. (It probably also betrays my biases coming from statically typed languages.)
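(A small illustration of the value-dependence concern: with dtype-preserving casting, the same expression's result dtype depends on which values pass the mask. The dtypes in the comments are what a dtype-preserving implementation produces; without it, both come back float64.)

    import pandas as pd

    s = pd.Series([1, 2, 3], dtype='int64')

    s.where(s > 0).dtype   # every value passes: no fill needed, can stay int64
    s.where(s > 1).dtype   # a value is masked out: the NaN fill forces float64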

@cpcloud
Member

cpcloud commented Feb 4, 2013

Personally, I've found that this doesn't matter for me, but it seems like it makes sense to keep the dtype from boolean indexing if possible.

@stephenwlin
Contributor

@jreback: By the way, you mentioned the case of an IntBlock needing to be split into separate blocks in a casting post-pass. How are you thinking about doing this? If each block separately decides how to split itself into multiple blocks in the post-pass and does so independently, then I could imagine you might end up doing three copying passes on the data: once for the where op, another to split the resulting multi-column blocks into separate smaller blocks, and a final one to consolidate (in case you end up splitting an int block into an int and a float, where there was already another float column).

Instead of that, you could implement where at the BlockManager level and have it do column-by-column upcasting (as necessary), masking, and consolidation together with the minimal amount of copying. To precompute the new blocks you need, you can implement a function on each block which returns the dtype necessary (or maybe the necessary Block subclass, not entirely sure of the best semantics) to hold its existing type and the other value. Then you can allocate one uninitialized array for each new type needed (consolidating ahead of time) and putmask into each one from self/other appropriately (optimizing for the case where it's an inplace operation and the result block has the same size and location as the original block, in which case you can reuse the existing array).

That might seem like overkill, but reducing the amount of copying would make a difference with a large amount of data, so I'm willing to work on that if you think it's a good idea, unless you were planning on doing something similar yourself already.
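(A rough sketch of that BlockManager-level idea on plain NumPy arrays - not the pandas internals: precompute each column's promoted dtype, allocate one already-consolidated array per result dtype, and putmask into it, so the data is copied only once.)

    import numpy as np

    def where_consolidated(columns, cond, other, promote):
        # columns: dict of name -> 1-D ndarray (all the same length)
        # cond:    dict of name -> boolean ndarray (True = keep original value)
        # other:   scalar fill value
        # promote: callable(dtype, fill_value) -> dtype needed to hold both
        n = len(next(iter(columns.values())))

        # 1. precompute the result dtype each column needs
        result_dtype = {name: promote(col.dtype, other) for name, col in columns.items()}

        # 2. group columns by result dtype so each "block" is allocated exactly once
        groups = {}
        for name in columns:
            groups.setdefault(result_dtype[name], []).append(name)

        # 3. fill each pre-consolidated block with putmask (one copying pass)
        blocks = {}
        for dtype, names in groups.items():
            block = np.empty((len(names), n), dtype=dtype)
            for i, name in enumerate(names):
                block[i] = other
                np.putmask(block[i], cond[name], columns[name])
            blocks[dtype] = (names, block)
        return blocks

The inplace optimization mentioned above would additionally check whether a block's promoted dtype equals its existing dtype and reuse the existing array instead of allocating a new one.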

@jreback
Contributor Author

jreback commented Feb 4, 2013

I don't believe there are any copies in putmask, except with an int-to-float conversion, which does an astype (which copies). You can determine beforehand whether it will create multiple int and/or float blocks (this is in the block-level putmask, btw), so I think it will still be just 1 copy. where has at least 1, and an int-to-float conversion could add a copy. Consolidation could add a copy as well (it's a vstack, which I think copies).

In the latest commit what I did was let these routines possibly return more than 1 block, so the code is easy.
Since the amount of copying is somewhat dependent on the other blocks and/or conversions, I am not sure how much extra effort we should put in here.

But you are right about doing things at the BlockManager level; you do have more information and so can create blocks that are already consolidated - I set it up with all of the key methods doing an 'apply' on their blocks and producing new ones. Definitely could be optimized.

Certainly open to having you take a crack at it.

This PR is pretty much done.

@stephenwlin
Contributor

Yup, no copies in putmask, right. But definitely one in where, definitely one in astype when casting between different types, and most likely in vstack (it might be optimized to avoid it in the case that the two arrays already happen to be aligned appropriately in memory, but I doubt that would be the case here).

Anyway, OK, I'll work on it based off your branch, unless you're planning on doing a big commit still.

@stephenwlin
Contributor

(actually, just tested it, apparently vstack copies even if given inputs that were split from the same original array using vsplit, so definitely a copy here)
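(A quick way to confirm that, e.g. with np.shares_memory:)

    import numpy as np

    arr = np.arange(12).reshape(4, 3)
    top, bottom = np.vsplit(arr, 2)

    np.shares_memory(top, arr)                       # True: vsplit returns views
    np.shares_memory(np.vstack([top, bottom]), arr)  # False: vstack copies anyway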

@jreback
Contributor Author

jreback commented Feb 4, 2013

Great!

BTW, in theory you can pass copy=False to astype and it can avoid the copy (of course if you then putmask it, it will copy the underlying data anyway).

And the approach of trying to create an already-consolidated block should probably work well.
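(For reference, with NumPy arrays astype(copy=False) only skips the copy when the requested dtype already matches; an actual dtype change still allocates new memory:)

    import numpy as np

    a = np.array([1, 2, 3], dtype='int64')

    np.shares_memory(a, a.astype('int64', copy=False))    # True: nothing to convert
    np.shares_memory(a, a.astype('float64', copy=False))  # False: conversion copies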

@jreback
Contributor Author

jreback commented Feb 4, 2013

You could start by creating a vbench (there might be some already for blocking, not sure)
and then see if you can improve it.

@stephenwlin
Contributor

yep, will do

@jreback
Contributor Author

jreback commented Feb 4, 2013

@stephenwlin

while you are at it, I am pretty sure this is related (and might now be fixed because of the putmask changes....)

#2746

@stephenwlin
Contributor

will do

@stephenwlin
Contributor

@jreback, I saw what you did with where, putmask, etc. (with the apply). Very nice and clean!

I think it might be overkill to try to make it tricky and do everything in one step, unless it can be done generically somehow; I'll think about it. In the meantime I optimized consolidation to remove an intermediate copy step (#2819), so that'll help with all these operations.

@jreback
Contributor Author

jreback commented Feb 8, 2013

yep

I saw your PR - looks pretty awesome

I put a note on the PR - I think I did an update or 2 today - looks like you might be using a slightly older dtypes branch;
I added int16/int8 support, but it should be pretty trivial for you to merge and update.

And this should fix the upcasting issue!

As a side note - I am pretty sure that when you need to upcast an int you convert to float (well, np.float_);
I think that if it's a smaller int type then you could upcast to a smaller float,
e.g. int16 -> float32.

Any idea?

@stephenwlin
Contributor

Yeah, there's a whole chain of logic in common._maybe_promote which takes the existing dtype and a fill_value, and returns the required output dtype; I can make it even more granular if necessary...

Right now there's no special handling of different sizes, except for integers: if an integer is given as fill_value, it will check whether that fill_value overflows the existing dtype; if it does, it goes with the fill_value's dtype -- however, it's possible that it might upcast from, say, int8 to int64, even if the actual fill_value could fit in, say, int16 (since at that point it checks only the fill_value's dtype, not the actual value).

I could probably do similar logic for the other types instead of always defaulting to float64/complex128, but it might be overkill.
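(Roughly the integer rule described above, as a simplified sketch - not the actual common._maybe_promote:)

    import numpy as np

    def promote_for_fill(dtype, fill_value):
        # simplified sketch: keep an integer dtype only if the fill value fits in it
        dtype = np.dtype(dtype)
        if np.issubdtype(dtype, np.integer):
            if isinstance(fill_value, (int, np.integer)):
                info = np.iinfo(dtype)
                if info.min <= fill_value <= info.max:
                    return dtype                     # fits: keep the existing dtype
                return np.dtype(type(fill_value))    # overflows: use the fill value's dtype
            if isinstance(fill_value, (float, np.floating)):
                return np.dtype('float64')           # NaN / float fill forces a float result
        return dtype

    promote_for_fill('int8', np.int16(127))   # int8
    promote_for_fill('int8', np.int16(128))   # int16
    promote_for_fill('int32', np.nan)         # float64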

Anyway, I will rebase/resolve against your latest dtypes branch.

@jreback
Contributor Author

jreback commented Feb 8, 2013

_maybe_promote is awesome - I was dreading having to actually write it!
Thanks!

FYI - you have a leftover print statement when upcasting an int to a bigger int.

@stephenwlin
Contributor

Oops, thanks - took out the print statement. Also improved the test coverage a bit:

        _test_dtype(np.int8, np.int16(127), np.int8)
        _test_dtype(np.int8, np.int16(128), np.int16)

The arguments are (input dtype, fill_value, expected output dtype).

@jreback
Contributor Author

jreback commented Feb 9, 2013

So I finally realized the criteria for when things get upcast.
Since df[df > 0] == df.where(df > 0, np.nan), there is a float fill value;
see below - thus the upcast CAN be avoided by specifying a fill value.

The question is: say NO values are changed in the boolean selection; then the fill value dtype
doesn't matter, and the output dtype should be the same as the input - but a priori we don't know this.

Is it worth trying to fix this case? (See the sketch after the session below.)
It's a little bit non-trivial, because it could be that an item in, say, an integer block can be preserved but not the whole block, so I would have to work item-by-item in where... not a big deal... but is it worth it?

In [14]: df.where(df>=0).dtypes
Out[14]: 
float32    float32
float64    float64
int32      float64
int64      float64
Dtype: object

The normal cases

In [4]: df = pd.DataFrame(dict([ (c,pd.Series([1]*3,dtype=c)) for c in ['int64','int32','float32','float64'] ]))

In [5]: df
Out[5]: 
   float32  float64  int32  int64
0        1        1      1      1
1        1        1      1      1
2        1        1      1      1

In [6]: df.dtypes
Out[6]: 
float32    float32
float64    float64
int32        int32
int64        int64
Dtype: object

In [7]: df.ix[1,:] = 0

In [8]: df
Out[8]: 
   float32  float64  int32  int64
0        1        1      1      1
1        0        0      0      0
2        1        1      1      1

In [9]: df.where(df>0,0)
Out[9]: 
   float32  float64  int32  int64
0        1        1      1      1
1        0        0      0      0
2        1        1      1      1

# this preserves dtypes as we specified a 'fill' value that is the correct type
In [10]: df.where(df>0,0).dtypes
Out[10]: 
float32    float32
float64    float64
int32        int32
int64        int64
Dtype: object

# we implicitly specified a fill value here that is float, so the int columns are automatically upcast
In [11]: df.where(df>0)
Out[11]: 
   float32  float64  int32  int64
0        1        1      1      1
1      NaN      NaN    NaN    NaN
2        1        1      1      1

In [12]: df.where(df>0).dtypes
Out[12]: 
float32    float32
float64    float64
int32      float64
int64      float64
Dtype: object
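(The per-column check being weighed, as a hypothetical sketch - not how pandas implements where: if a column's mask never triggers the fill value, skip the upcast entirely.)

    import numpy as np
    import pandas as pd

    def where_keep_dtypes(df, cond, other=np.nan):
        # hypothetical helper: column-by-column where that skips the upcast
        # for any column whose mask is all True (the fill value is never used)
        out = {}
        for name in df.columns:
            mask = cond[name]
            if mask.all():
                out[name] = df[name].copy()               # nothing replaced: dtype preserved
            else:
                out[name] = df[name].where(mask, other)   # normal (possibly upcasting) path
        return pd.DataFrame(out, index=df.index)

    # with the df above, where_keep_dtypes(df, df >= 0).dtypes keeps int32/int64,
    # while the int columns of df.where(df >= 0) come back float64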

jreback added a commit that referenced this issue Feb 15, 2013
ENH: implement Block splitting to avoid upcasts where possible (GH #2794)
@jreback
Contributor Author

jreback commented Feb 15, 2013

closed via #2871

@jreback jreback closed this as completed Feb 15, 2013