ENH: should boolean indexing preserve input dtypes where possible? #2794
Comments
I get dtype conversion on DataFrame construction with dicts; what version of pandas are you using? |
This is a development version (either 0.10.2 or 0.11-dev). The above code won't work on 0.10.1 (well it 'works', but it converts dtypes). Dtypes are preserved in some limited cases in 0.10.1. This is what #2708 is all about. I am asking this: if you input an integer dtype and you perform an operation that results in an integer-like number, but This is not currently the case in 0.10.1 or lower |
@jreback -- i'm a bit confused, because i've checked out your dtypes branch and see that you have implemented a parameter try_cast to DataFrame.where that seems to do what you're talking about. Are you just asking whether it should be turned on by default from getitem/setitem? |
That's exactly what I am asking. It's actually not fully implemented, because it can happen that an IntBlock needs to split to multiple dtypes (not hard, just didn't do it yet). I turned it off because I had a few failing tests - basically the 'user' is expected always to convert to float64. Is it important to try to cast the dtype back to int, where possible? |
i'm too new to this to be able to be an authority of how things should be, so don't take my opinion too seriously...but i'm personally inclined to think that try_cast should be off by default, because having it on means that the dtypes of the result depends on what values happen to match a boolean condition, which is a bit odd: it makes more sense to me that the type of the result of an operation should only depend on the types of its inputs, not the types and their particular values within those types. i know this rule doesn't hold true for a lot of pandas behavior right now though, so maybe my concern isn't really apropos. (it probably also betrays my biases coming from statically typed languages) |
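The concern above can be illustrated with a small NumPy-only sketch. It is not the pandas implementation: `where_try_cast` is a hypothetical helper that mimics a `where` followed by an opportunistic cast back to the input dtype, so the result dtype ends up depending on which values the mask happens to select.

```python
import numpy as np

def where_try_cast(values, mask):
    """Hypothetical helper: NaN-fill where mask is False, then cast back if possible."""
    # The NaN fill forces a float result first.
    out = np.where(mask, values.astype(np.float64), np.nan)
    # Cast back only if nothing was actually masked out.
    # (Caveat: very large int64 values do not round-trip through float64.)
    if not np.isnan(out).any():
        return out.astype(values.dtype)
    return out

a = np.array([1, 2, 3], dtype=np.int64)
all_match = where_try_cast(a, a > 0)   # every value kept: dtype stays int64
some_match = where_try_cast(a, a > 1)  # one value dropped: dtype becomes float64
```

Same input dtype, same operation, two different output dtypes: that is the value dependence being objected to.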
Personally, I've found that this doesn't matter for me, but it seems like it makes sense to keep the dtype from boolean indexing if possible. |
@jreback: By the way, you mentioned the case of an Instead of that, you could implement That might seem like overkill, but reducing the amount of copying would make a difference in the case of a large amount of data, so I'm willing to work on that if you think it's a good idea, unless you were planning on doing something similar yourself already. |
I don't believe there are any copies in putmask, except with an int to float conversion, which does an astype (which copies). You can determine before u do this whether it will create multiple int and/or float blocks (this is in the block level putmask btw), so I think it will still be just 1 copy. where has at least 1, and an int to float conversion could add a copy. Consolidation could add a copy as well (it's a vstack, which I think copies). In the latest commit what I did was let these routines possibly return more than 1 block, so the code is easy. But u r right about doing things at the block manager level; u do have more information and so can create blocks that are already consolidated - I set it up with all of the key methods doing an 'apply' on their blocks and producing new ones. Definitely could be optimized. Certainly open to having u take a crack at it. This pr is pretty much done |
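A rough NumPy-level sketch of the copy accounting described above. It uses plain arrays rather than pandas blocks, and the variable names are only illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)
mask = np.array([False, True, False])

# np.putmask writes into the existing buffer: no copy when the dtype can stay int.
np.putmask(ints, mask, 0)

# Filling with NaN needs a float array, and the astype conversion allocates a new buffer.
floats = ints.astype(np.float64)
np.putmask(floats, mask, np.nan)

copied = not np.shares_memory(ints, floats)
```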
yup, no copies in anyway, ok, i'll work on it and based off your branch, unless you're planning on doing a big commit still. |
(actually, just tested it, apparently |
great! btw in theory u can pass copy = False to astype and it creates a view with the new dtype (of course if u then putmask it will copy the underlying data) and the approach of trying to create an already consolidated block should prob work well |
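A quick check of what `copy=False` does at the NumPy level, offered as a sketch rather than a statement about pandas internals. As far as I understand NumPy's documented behavior, `astype(..., copy=False)` only skips the copy when no conversion is needed; an int to float conversion still allocates. A true view with a different dtype comes from `ndarray.view`, which reinterprets the bytes instead of converting the values.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)

# Same dtype: copy=False hands back the input array itself.
same = ints.astype(np.int64, copy=False)

# Different dtype: a new buffer is allocated even with copy=False.
as_float = ints.astype(np.float64, copy=False)

# .view shares memory but reinterprets the raw bytes, so the values are not 1.0, 2.0, 3.0.
reinterpreted = ints.view(np.float64)
```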
u could start by creating a vbench (there might be some already for blocking, not sure) |
yep, will do |
while you are at it, I am pretty sure this is related (and might now be fixed because of the putmask changes....) |
will do |
@jreback, I saw what you did with I think it might be overkill to try to make it tricky and do everything in one step, unless it can be done generically somehow: i'll think about it....in the meantime I optimized consolidation to remove an intermediate copy step (#2819) so that'll help with all these operations |
yep I saw your PR - looks pretty awesome I put a note on the PR - I think I did an update or 2 today - looks like u might be using a slightly older dtypes branch; and this should fix the up casting issue! as a side note - I am pretty sure that say u need to upcast an int, so u convert to float (well np.float_) any idea? |
yeah, there's a whole chain of logic in right now there's no special handling of different sizes, except for integers: if an integer is given as i could probably do similar logic for the other types instead of defaulting to anyway, i will rebase/resolve to your latest dtypes |
_maybe_promote is awesome - was dreading having to actually write it! FYI - u have a print statement if up casting int to bigger int |
oops, thanks, took out the print statement. also improved the test coverage a bit:
_test_dtype(np.int8, np.int16(127), np.int8)
_test_dtype(np.int8, np.int16(128), np.int16)
arguments are (input dtype, fill_value, output dtype) |
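The two test cases above suggest a rule along these lines: keep the input integer dtype when the fill value fits in its range, and promote otherwise. The following is a minimal sketch of that rule only. `promote_for_fill` is a hypothetical name, not the `_maybe_promote` function from the PR, and the real logic handles more cases.

```python
import numpy as np

def promote_for_fill(dtype, fill_value):
    """Hypothetical: dtype needed to hold existing values of `dtype` plus `fill_value`."""
    dtype = np.dtype(dtype)
    fill_dtype = np.asarray(fill_value).dtype
    if dtype.kind in "iu" and fill_dtype.kind in "iu":
        info = np.iinfo(dtype)
        # An integer fill that fits in the input range needs no upcast.
        if info.min <= int(fill_value) <= info.max:
            return dtype
    # Otherwise fall back to NumPy's general promotion rules.
    return np.promote_types(dtype, fill_dtype)

fits = promote_for_fill(np.int8, np.int16(127))      # 127 fits in int8
overflow = promote_for_fill(np.int8, np.int16(128))  # 128 does not
```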
So I finally realized the criteria for when things get upcasted. The question is, say NO values are changed in the boolean selection, then the fill value dtype is it worth trying to fix this case?
The normal cases
|
ENH: implement Block splitting to avoid upcasts where possible (GH #2794)
closed via #2871 |
Should pandas preserve the input dtype when doing
boolean indexing, if possible?
It's a pretty limited case, in that all values in a column have to be non-null
and they all have to be integers (just cast as floats).
This is actually a little tricky to implement, as it has to be done column by column (and it's possible that blocks have to be split).
The question here is: will this not be what the user expects? (Currently all dtypes
on boolean output operations are cast to float64/object.)
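A sketch of the column-by-column idea, assuming a single integer block laid out as a plain 2D NumPy array of shape (rows, columns). `where_split_by_column` is a hypothetical helper, not pandas API. Columns left fully non-null are cast back to the input dtype, the rest stay float, so one input block can come back as pieces with different dtypes.

```python
import numpy as np

def where_split_by_column(block, mask):
    """Hypothetical: NaN-fill a 2D int block, then restore int dtype per column where possible."""
    original = block.dtype
    result = np.where(mask, block.astype(np.float64), np.nan)
    columns = []
    for j in range(result.shape[1]):
        col = result[:, j]
        # A column with no NaN holds only the original integers, so cast it back.
        # (Caveat: very large int64 values do not round-trip through float64.)
        if not np.isnan(col).any():
            col = col.astype(original)
        columns.append(col)
    return columns

block = np.array([[1, 10], [2, 20], [3, 30]], dtype=np.int64)
mask = np.array([[True, True], [True, False], [True, True]])
cols = where_split_by_column(block, mask)
# cols[0] stays int64; cols[1] has a NaN and stays float64
```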