iloc with boolean mask #3631
Comments
Normally the generated masks have the same index as what you are indexing, e.g. in your example in the SO question. I think |
Here's an example to summarise the current behaviours:

    locs = np.arange(4)
    nums = 2**locs
    reps = map(bin, nums)
    df = pd.DataFrame({'locs':locs, 'nums':nums}, reps)
    print df
    for idx in [None, 'index', 'locs']:
        mask = (df.nums>2).values
        if idx:
            mask = pd.Series(mask, list(reversed(getattr(df, idx))))
        for method in ['', '.loc', '.iloc']:
            try:
                if method:
                    accessor = getattr(df, method[1:])
                else:
                    accessor = df
                ans = bin(accessor[mask]['nums'].sum())
            except Exception, e:
                ans = str(e)
            print "{:>5s}: df{}[mask].sum()=={}".format(idx, method, ans)

with output

            locs  nums
    0b1        0     1
    0b10       1     2
    0b100      2     4
    0b1000     3     8
     None: df[mask].sum()==0b1100
     None: df.loc[mask].sum()==0b1100
     None: df.iloc[mask].sum()==0b1100
    index: df[mask].sum()==0b11
    index: df.loc[mask].sum()==0b11
    index: df.iloc[mask].sum()==0b11
     locs: df[mask].sum()==Unalignable boolean Series key provided
     locs: df.loc[mask].sum()==Unalignable boolean Series key provided
     locs: df.iloc[mask].sum()==Unalignable boolean Series key provided

If I'm understanding your discussion correctly then in the last line the output should also be 0b11. |
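The script above is Python 2 (`print` statement, `except Exception, e`). What follows is a rough Python 3 port, offered as a sketch and not part of the original comment. It assumes a reasonably recent pandas, builds the reversed labels with an explicit list, and records the exception class name instead of the message. The exception cases on current pandas will not necessarily match the 2013 output shown above.

```python
import warnings

import numpy as np
import pandas as pd

locs = np.arange(4)
nums = 2 ** locs
reps = [bin(n) for n in nums]  # Python 3: map() is lazy, so build a list
df = pd.DataFrame({"locs": locs, "nums": nums}, index=reps)

results = {}
for idx in [None, "index", "locs"]:
    mask = (df.nums > 2).values  # plain boolean ndarray, no labels
    if idx:
        # Same boolean values, attached to the chosen labels in reverse order
        labels = getattr(df, idx).tolist()[::-1]
        mask = pd.Series(mask, index=labels)
    for method in ["", ".loc", ".iloc"]:
        accessor = getattr(df, method[1:]) if method else df
        try:
            with warnings.catch_warnings():
                # df[mask] may warn that the key will be reindexed
                warnings.simplefilter("ignore")
                ans = bin(int(accessor[mask]["nums"].sum()))
        except Exception as exc:
            ans = type(exc).__name__
        results[(idx, method)] = ans
        print("{:>5}: df{}[mask].sum()=={}".format(str(idx), method, ans))
```

The ndarray rows should all report `0b1100`, and the reversed-label Series passed to `.loc` should report `0b11`, because the mask is aligned on labels before it is applied.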
Yes, that looks right. If you happen to think of a simple example where you would actually use this, please post it. Keep in mind that masks are the same length as what you are indexing, so I don't think there is ever ambiguity, but I could be wrong. |
Fair enough, I can't actually think of anything. It just seems that by symmetry the behaviour should be there. Similarly, it doesn't seem quite right that in line 6 of my example above, Given these observations, I would probably vote for the following behaviour:
It's true that I haven't thought about this for more than 10 minutes, though, so if there are many use cases where this causes a problem then never mind. |
I think it's possible there could be an ambiguity if the index is in a different order (e.g. it was taken from somewhere else, where it may well mean the location rather than the label). Also, I'm given a Correct me if I'm wrong, but isn't "realign based on mask.index" different from location? |
Still thinking about this, but I think a nice feature of boolean masking is that since we don't ever have an ambiguity about whether it's label or position based (as the mask must be the same length as what you are indexing), then it can be used in either The basic question is: do we effectively drop the index (and make it not matter) when it's the right length? e.g. should
or the first align (currently I retract my earlier statement, I don't think there is an alignment on the index itself) but this may depend, e.g. |
@hayd you make a good point, but to my knowledge that is the issue with an unlabeled index: the user has to make sure it's in the right order, and pandas cannot help (but in the case we are talking about it could help, by making sure the index is aligned) |
@jreback With regard to the alignment, see my example above. I threw in the reversed(...) to see whether it realigns based on the labels or not. The results above show that [], .loc and .iloc all perform an alignment step based on the labels when these are in a different order. Therefore I think the fact that the length is the same is lulling you into a false sense of security. Apologies if my example isn't clear. I thought the binary representation was a nice way of concisely summarising which items were selected or not. The bottom line is that the results differ between ndarray and pd.Series because in the pd.Series case the .index is used to do an alignment step first. Also, there's a mistake in my output because it should read:
|
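The alignment step described in this comment can be made explicit with a smaller sketch. The frame and the labels `a` to `d` are hypothetical, chosen only for illustration, and the behaviour shown is that of `.loc` in recent pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nums": [1, 2, 4, 8]}, index=["a", "b", "c", "d"])

values = np.array([False, False, True, True])
# Same boolean values, but attached to the labels in reverse order
mask = pd.Series(values, index=["d", "c", "b", "a"])

by_position = df.loc[values]               # ndarray: row i is kept if values[i]
by_label = df.loc[mask]                    # Series: aligned on labels first
explicit = df.loc[mask.reindex(df.index)]  # the alignment step spelled out

print(by_position["nums"].tolist())  # [4, 8]
print(by_label["nums"].tolist())     # [1, 2]
print(explicit["nums"].tolist())     # [1, 2]
```

The mask has the right length in both cases, yet the Series version selects different rows, which is the "false sense of security" point.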
@snth you are right. I was eyeballing the code, as it's a fairly tricky path. OK, so the bottom line is whether to make So from the original SO question, this should then raise?
What about this?
I there is NO alignment happening; instead it's using the values to actually index |
I think iloc should throw an error if it's not integer based, it should definitely use position. These integers needn't be in order, just like the labels needn't be in order. Putting an array to
? |
@jreback I agree that in your first example, by my reasoning, In[4] should raise an Exception rather than realign based on labels within .iloc. I don't understand your second example. There's no boolean indexing involved there and they just seem to be examples of .iloc. Is that what's happening in the source code? |
@snth I was trying to have it index with a Series that had a different index, (that is essentially what boolean masking does) |
Ok, if you had the one above, there is an easy workaround (with arrays):
But if your index was not in order:
? |
@hayd In terms of workarounds you could probably do
|
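The workaround code from the comments above did not survive extraction. One possible array-based workaround, not necessarily the one that was posted, is to strip the index from the mask before handing it to `.iloc`. The frame and labels here are hypothetical; `to_numpy()` is the current spelling, and `.values` would have been the usual choice when this thread was written.

```python
import pandas as pd

df = pd.DataFrame({"nums": [1, 2, 4, 8]}, index=["a", "b", "c", "d"])
mask = df["nums"] > 2  # boolean Series labelled like df

# Dropping the index leaves a plain ndarray, so selection is purely positional
selected = df.iloc[mask.to_numpy()]
print(selected["nums"].tolist())  # [4, 8]

# If the mask's labels are in a different order, dropping the index silently
# ignores that order; aligning first keeps labels and positions consistent
shuffled = mask.iloc[::-1]
naive = df.iloc[shuffled.to_numpy()]
aligned = df.iloc[shuffled.reindex(df.index).to_numpy()]
print(naive["nums"].tolist())    # [1, 2]
print(aligned["nums"].tolist())  # [4, 8]
```

The `naive` result shows the out-of-order concern raised in the thread: the array workaround is only safe when the mask is already in the frame's row order.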
Nothing else broke...I'll put up the PR in a minute....pls try out
|
@hayd I think if the user tries what you suggest (with an index not in order), they should be shot :) |
ha! Well, I'm sulking! Ignoring the index just seems a little dodge... I think we should either:
Surely Otherwise it's just salt for *I'm not convinced the (granted, obtuse) example is impossible to envisage. |
@hayd does the PR not do the first? (e.g. raises) |
No... at least I didn't think so, (from your pr):
At the moment ? hides |
hmm...I see what you mean, so essentially eliminate boolean masking from |
Yeah... perhaps an optimistic |
hahha...ok will fix |
@snth output from your script with my PR
|
I think the notes look clear and understandable. I'm happy. :) |
any other cases you think? |
will merge maybe in a day or 2....if anyone thinks of anything |
@y-p any thoughts? |
Some twists and turns in the discussion, not sure I got it all. Here's my take on the discussion above, does it match the PR?
|
@y-p I think that's an excellent summary. The only thing atm (in Jeff's current PR) it raises a |
So I disambiguated these 2 cases, which I think corresponds to @y-p (and @hayd) points
|
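The two cases themselves were lost in extraction. One reading, stated here as an assumption, is that a boolean Series passed to `.iloc` is rejected whether it carries a labelled index or an integer index, while a bare boolean array is accepted. The sketch below probes that on a hypothetical frame; it catches both `ValueError` and `NotImplementedError` because which one is raised depends on the mask's index type and possibly on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"nums": [1, 2, 4, 8]}, index=["a", "b", "c", "d"])
flags = [False, False, True, True]

label_mask = pd.Series(flags, index=df.index)  # labelled (string) index
int_mask = pd.Series(flags)                    # default integer index 0..3

outcomes = {}
for name, mask in [("label_mask", label_mask), ("int_mask", int_mask)]:
    try:
        df.iloc[mask]
        outcomes[name] = "selected"
    except (ValueError, NotImplementedError) as exc:
        outcomes[name] = type(exc).__name__
    print(name, "->", outcomes[name])

# A bare boolean array carries no index, so .iloc applies it by position
by_array = df.iloc[label_mask.to_numpy()]
print(by_array["nums"].tolist())  # [4, 8]
```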
@jreback Yes, that is what we should be doing! :) |
sad face |
Sad because this PR took this away?
|
I totally did not understand that that worked in your pr. I thought it gave a not implemented error! (and you'd have to change |
Those were the cases we separated
|
Sorry, this made little sense without the context, which you found: http://stackoverflow.com/questions/16729574/how-to-get-a-value-from-a-cell-of-a-data-frame So you can use integer mask with |
I am not sure I understand your last? |
in the above I'm using position in I realise that it is obviously the intent of the user to mask like that (since the df doesn't have integer index) but semantically they are using |
When you say above, you mean your example where you do
I don't see where there is possible confusion?
can you give an example of where you think there is a problem? |
I take it all back. I was sure when I did this before I needed to set the index (because somewhere along the line it had gone). Now I see |
It's nice having a skeptical eye! Thanks. |
Currently, when masking by boolean vectors, it doesn't matter which syntax you use:
are all equivalent. Should mask
df.iloc[mask]
mask by position? (This makes sense if mask has an integer index.) This SO question.