iloc with boolean mask #3631

hayd · 2013-05-17T10:20:57Z

Currently, when masking by boolean vectors, it doesn't matter which syntax you use:

df[mask]
df.iloc[mask]
df.loc[mask]

are all equivalent. Should df.iloc[mask] mask by position? (This makes sense if the mask has an integer index.)

This SO question.
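
For reference, a minimal sketch of the three spellings, using a plain NumPy boolean array as the mask so that no index alignment is involved. The frame, labels and column name are illustrative assumptions, not taken from the SO question:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the labels and values are assumptions for this sketch.
df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))

# A bare ndarray mask carries no index, so it can only be applied by position.
mask = np.array([True, False, True, False, True])

by_getitem = df[mask]
by_loc = df.loc[mask]
by_iloc = df.iloc[mask]
```

With an index-free mask like this, the three results should select the same rows; the open question in this issue is what should happen once the mask is a Series with its own index.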


jreback · 2013-05-17T11:55:24Z

Normally the generated masks have the same index as the object you are indexing, e.g. in your example in the SO question.

I think .iloc could/should do this; it does make sense. There is an alignment step for the mask in .where.

snth · 2013-05-17T13:07:12Z

Here's an example to summarise the current behaviours:

import numpy as np
import pandas as pd

locs = np.arange(4)
nums = 2**locs
reps = [bin(n) for n in nums]  # binary strings used as row labels
df = pd.DataFrame({'locs': locs, 'nums': nums}, index=reps)
print(df)
for idx in [None, 'index', 'locs']:
    # start from a plain ndarray mask (no index attached)
    mask = (df.nums > 2).values
    if idx:
        # wrap it in a Series indexed by the reversed labels / reversed locs
        mask = pd.Series(mask, list(reversed(list(getattr(df, idx)))))
    for method in ['', '.loc', '.iloc']:
        try:
            accessor = getattr(df, method[1:]) if method else df
            ans = bin(accessor[mask]['nums'].sum())
        except Exception as e:
            ans = str(e)
        print("{:>5s}: df{}[mask].sum()=={}".format(str(idx), method, ans))

with output

        locs  nums
0b1        0     1
0b10       1     2
0b100      2     4
0b1000     3     8

 None: df[mask].sum()==0b1100
 None: df.loc[mask].sum()==0b1100
 None: df.iloc[mask].sum()==0b1100
index: df[mask].sum()==0b11
index: df.loc[mask].sum()==0b11
index: df.iloc[mask].sum()==0b11
 locs: df[mask].sum()==Unalignable boolean Series key provided
 locs: df.loc[mask].sum()==Unalignable boolean Series key provided
 locs: df.iloc[mask].sum()==Unalignable boolean Series key provided

If I'm understanding your discussion correctly then in the last line the output should also be 0b11.

jreback · 2013-05-17T13:14:47Z

yes...that looks right, if you happen to think of a simple example where you would actually use this pls post that

keep in mind that masks are the same length as what you are indexing, so I don't think there is ever ambiguity, but I could be wrong

snth · 2013-05-17T13:33:28Z

Fair enough, I can't actually think of anything. It just seems that by symmetry the behaviour should be there.

Similarly, it doesn't seem quite right that in line 6 of my example above, df.iloc[mask] is actually realigning the mask based on mask.index rather than throwing an error.

Given these observations, I would probably vote for the following behaviour:

- df[mask] should only work with .values in the order they're given and do no realignment. Perhaps for performance reasons.
- df.loc[mask] should work as it currently does and realign based on mask.index if present. If mask.index is of the wrong type it should throw an Error.
- df.iloc[mask] should realign based on mask.index if present and of integer type. Otherwise throw an error.

It's true that I haven't thought about this for more than 10 minutes though so if there are many use cases where this causes a problem then nevermind.

hayd · 2013-05-17T13:37:08Z

I think it's possible there could be an ambiguity, if the index is in a different order (e.g. was taken from somewhere else where it may well mean the location rather than the label).

Also, I get a /core/frame.py:1943: UserWarning: Boolean Series key will be reindexed to match DataFrame index. somewhere along the way when messing with integer-index booleans.

Correct me if I'm wrong, but isn't "realign based on mask.index" different from location?

jreback · 2013-05-17T14:06:38Z

Still thinking about this, but I think a nice feature of boolean masking is that since we don't ever have an ambiguity about whether it's label or position based (as the mask must be the same length as what you are indexing), it can be used in either .loc or .iloc. I don't think restricting this is a good idea.

The basic question is: do we effectively drop the index (and make it not matter) when it's the right length?

e.g. should

df[mask_with_index] and df[mask_ndarray] be the same?

or should the first align? (I retract my earlier statement; I don't think there is currently an alignment on the index itself.)

but this may depend, e.g. iloc/loc don't align, but I think [] might....

jreback · 2013-05-17T14:08:17Z

@hayd you make a good point, but to my knowledge that is the issue with an unlabeled index: the user has to make sure it's in the right order, and pandas cannot help (but in the case we are talking about it could help, by making sure the index is aligned)

snth · 2013-05-17T14:13:19Z

@jreback With regards to the alignment, see my example above. I threw in the reversed(...) to see whether it realigns based on the labels or not. The results above show that [], .loc and .iloc all perform an alignment step based on the labels when these are in a different order. Therefore I think the fact that the length is the same is lulling you into a false sense of security.

Apologies if my example isn't clear. I thought the binary representation was a nice way of concisely summarising which items were selected or not. The bottom line is that the results differ between ndarray and pd.Series because in the pd.Series case the .index is used to do an alignment step first.

Also there's a mistake in my output because it should read:

df[mask]['nums'].sum()==...
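
To make the alignment step concrete, here is a reduced sketch of the same experiment, restricted to .loc and to the 'index' case. It assumes a pandas version in which .loc aligns a boolean Series on its labels; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

nums = 2 ** np.arange(4)
labels = [bin(n) for n in nums]  # '0b1', '0b10', '0b100', '0b1000'
df = pd.DataFrame({"nums": nums}, index=labels)

# ndarray mask: no index, so it is applied purely by position.
raw = (df.nums > 2).values

# Same boolean values, but attached to the labels in reverse order.
relabeled = pd.Series(raw, index=labels[::-1])

by_position = df.loc[raw]["nums"].sum()
by_label = df.loc[relabeled]["nums"].sum()
```

Here by_position adds up the last two rows, while by_label adds up the first two, because each True is matched to a row through its label rather than through its place in the mask.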

jreback · 2013-05-17T14:24:34Z

@snth you are right....I was eye-balling the code....as it's a fairly tricky path

ok....so the bottom line is whether to make iloc align on the labels or on position (for a boolean mask). That is itself somewhat of an issue: it is certainly possible, and even likely, that the index is not integer-based, in which case iloc should throw an error

So from the original SO question, this should then raise?

In [1]: df = pd.DataFrame(range(5), list('ABCDE'), columns=['a'])

In [2]: mask = (df.a%2 == 0)

In [3]: mask
Out[3]: 
A     True
B    False
C     True
D    False
E     True
Name: a, dtype: bool

In [4]: df.iloc[mask]
Out[4]: 
   a
A  0
C  2
E  4

What about this?

In [5]: mask.nonzero()
Out[5]: (array([0, 2, 4]),)

In [6]: mask.nonzero()[0]
Out[6]: array([0, 2, 4])

In [7]: df.iloc[mask.nonzero()[0]]
Out[7]: 
   a
A  0
C  2
E  4

In [8]: df.iloc[Series(mask.nonzero()[0])]
Out[8]: 
   a
A  0
C  2
E  4

In [9]: Series(mask.nonzero()[0])
Out[9]: 
0    0
1    2
2    4
dtype: int64

I think there is NO alignment happening; instead it's using the values to actually index
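
The same point can be sketched without Series.nonzero, which may not be available in every pandas version. Using np.flatnonzero on the underlying values is an assumption of this sketch, not what was run in the session above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
mask = df.a % 2 == 0

# Convert the boolean mask to integer positions, discarding its labels.
positions = np.flatnonzero(mask.values)

# Integer positions are plain positional indexers for iloc.
selected = df.iloc[positions]
```

Once the mask has been turned into integer positions there is nothing left to align on, which is why iloc simply uses the values.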

hayd · 2013-05-17T14:34:21Z

I think iloc should throw an error if it's not integer based, it should definitely use position.

These integers needn't be in order, just like the labels needn't be in order.

Putting an array to iloc should do what it does now, but a boolean Series is a different beast:

msk = pd.Series([True, True, True, False, False])
df.iloc[msk] == df.iloc[0:3]

?

snth · 2013-05-17T14:34:27Z

@jreback I agree that in your first example, by my reasoning, In[4] should raise an Exception rather than realign based on labels within .iloc.

I don't understand your second example. There's no boolean indexing involved there and they just seem to be examples of .iloc. Is that what's happening in the source code?

jreback · 2013-05-17T14:39:12Z

@snth I was trying to have it index with a Series that had a different index, (that is essentially what boolean masking does)

hayd · 2013-05-17T14:39:44Z

Ok, if you had the one above, there is an easy workaround (with arrays):

msk = pd.Series([True, True, True, False, False])
df.iloc[msk.values]  # instead of df.iloc[msk]

But if your index was not in order:

msk = pd.Series([True, True, True, False, False], index=[1, 2, 3, 4, 0])
df.iloc[msk] == df[1:4]

?
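
A small sketch of the .values workaround applied to the out-of-order mask, assuming the lettered frame from the SO example. Because .values strips the index, only the order of the booleans matters:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
msk = pd.Series([True, True, True, False, False], index=[1, 2, 3, 4, 0])

# .values drops the Series index, leaving only the order of the booleans,
# so the first three rows are picked regardless of msk.index.
picked = df.iloc[msk.values]
```

This is the purely positional reading; a reading that honoured msk.index as positions would pick rows 1 to 3 instead, which is the ambiguity being raised here.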

snth · 2013-05-17T14:47:46Z

@hayd In terms of workarounds you could probably do

df.iloc[msk.index][msk.values]

jreback · 2013-05-17T14:57:12Z

Nothing else broke...I'll put up the PR in a minute....pls try out

In [3]: df = DataFrame(range(5), list('ABCDE'), columns=['a'])

In [4]: mask = (df.a%2 == 0)

In [5]: df.iloc[mask]
ValueError: iLocation based boolean indexing can only have an integer index [string] was inferred

jreback · 2013-05-17T15:01:11Z

@hayd I think if the user tries what you suggest (with an index not in order), they should be shot :)

hayd · 2013-05-17T16:16:48Z

ha! Well, I'm sulking!

Ignoring the index just seems a little dodge... I think we should either:

- raise whatever if you pass in a (boolean) Series to iloc (maybe with a NotImplementedError :p).
- implement analogous behaviour to []/loc *

Surely iloc (for boolean masking) is only for integer location, so if mask.index isn't integer they are calling the wrong thing... and we shouldn't let them.

Otherwise it's just salt for df.iloc[msk.values], and if it is, let people do that.

*I'm not convinced the (granted, obtuse) example is impossible to envisage.

jreback · 2013-05-17T16:46:57Z

@hayd does the PR not do the first? (e.g. raises)

hayd · 2013-05-17T17:25:38Z

No... at least I didn't think so, (from your pr):

df = pd.DataFrame(range(5), columns=['a'], index=range(4, -1, -1))
msk = pd.Series([True, True, False, False, False], index=[1,2,3,4,0])

In [3]: df[msk]
pandas/core/frame.py:1999: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[3]:
   a
2  2
1  3

In [4]: df.iloc[msk]  # should raise
Out[4]:
   a
2  2
1  3

In [5]: df.iloc[1:3]  # or should be this
Out[5]:
   a
3  1
2  2

At the moment iloc isn't getting by location, but rather by label (that's true in the other examples too; it just happened to give the same results). I think it should just be disabled for boolean indexing (at least for now).

?

hides

jreback · 2013-05-17T17:59:32Z

hmm...I see what you mean, so essentially eliminate boolean masking from iloc entirely?

hayd · 2013-05-17T19:06:33Z

Yeah... perhaps an optimistic NotImplementedError. :p

jreback · 2013-05-17T19:29:51Z

hahha...ok will fix

jreback · 2013-05-17T20:26:52Z

@snth output from your script with my PR

        locs  nums
0b1        0     1
0b10       1     2
0b100      2     4
0b1000     3     8
 None: df[mask].sum()==0b1100
 None: df.loc[mask].sum()==0b1100
 None: df.iloc[mask].sum()==0b1100
/mnt/home/jreback/pandas/pandas/core/frame.py:2001: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
index: df[mask].sum()==0b11
index: df.loc[mask].sum()==0b11
index: df.iloc[mask].sum()==iLocation based boolean indexing cannot use an indexable as a mask
 locs: df[mask].sum()==Unalignable boolean Series key provided
 locs: df.loc[mask].sum()==Unalignable boolean Series key provided
 locs: df.iloc[mask].sum()==iLocation based boolean indexing cannot use an indexable as a mask

jreback · 2013-05-17T21:05:02Z

@hayd @snth I updated the release notes / v0.11.1 whatsnew. pls take a look and see if they explain what has changed and are not too confusing...thxs

hayd · 2013-05-17T21:15:12Z

I think the notes look clear and understandable. I'm happy. :)

jreback · 2013-05-17T21:18:47Z

any other cases you think?

jreback · 2013-05-17T21:19:13Z

will merge maybe in a day or 2....if anyone thinks of anything

jreback · 2013-05-17T21:19:45Z

@y-p any thoughts?

ghost · 2013-05-18T13:21:41Z

Some twists and turns in the discussion, not sure I got it all.

Here's my take on the discussion above, does it match the PR?

1. When given a labeled indexer, pandas implicitly aligns. That's the rule.
2. If the user wants the indexing to behave as if he passed in an array, he should pass in an array.
3. Then, wrt what should we align a bool indexer passed in to iloc?
- Since .loc can always be used, .iloc obviously shouldn't duplicate the behaviour by aligning wrt the underlying frame labels. Hence, it should raise when passed a labeled indexer with a non-integer index.
- An indexer with integer labels given to iloc should be realigned (re 1.); since we've established it shouldn't do that wrt labels, the only alternative is wrt position.
- Since that doesn't sound like a very common use case, raising NotImplementedError is perfectly fine until there's a demand for it.

hayd · 2013-05-18T13:57:30Z

@y-p I think that's an excellent summary.

The only thing is that atm (in Jeff's current PR) it raises a ValueError, but maybe NotImplementedError ("one day") better describes it.

jreback · 2013-05-18T22:48:35Z

So I disambiguated these 2 cases, which I think corresponds to @y-p's (and @hayd's) points

In [6]: df = DataFrame(range(5), list('ABCDE'), columns=['a'])

In [7]: mask = (df.a%2 == 0)

In [8]: df
Out[8]: 
   a
A  0
B  1
C  2
D  3
E  4

In [9]: mask
Out[9]: 
A     True
B    False
C     True
D    False
E     True
Name: a, dtype: bool

In [10]: df.iloc[mask]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask


In [11]: mask.index=range(len(mask))

In [12]: mask
Out[12]: 
0     True
1    False
2     True
3    False
4     True
Name: a, dtype: bool

In [13]: df.iloc[mask]
NotImplementedError: iLocation based boolean indexing on an integer type is not available
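
The two cases can be checked side by side with a small helper. This is a sketch only: iloc_error_name is a hypothetical helper written for this illustration, and which exception (if any) comes back depends on the installed pandas version matching the behaviour described in this thread:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
mask = df.a % 2 == 0


def iloc_error_name(frame, key):
    """Return the name of the exception raised by frame.iloc[key], or None."""
    try:
        frame.iloc[key]
    except Exception as exc:
        return type(exc).__name__
    return None


# Case 1: boolean Series carrying the frame's string labels.
labelled = iloc_error_name(df, mask)

# Case 2: the same booleans re-indexed with integers.
int_mask = mask.copy()
int_mask.index = range(len(int_mask))
integer = iloc_error_name(df, int_mask)

# Control: a bare ndarray mask has no index to object to.
unindexed = iloc_error_name(df, mask.values)
```

Keeping the exception types distinct lets callers tell "conceptually wrong" (labels passed to iloc) apart from "not supported yet" (integer-indexed mask).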

hayd · 2013-05-18T23:09:51Z

@jreback Yes, that is what we should be doing! :)

hayd · 2013-05-24T12:37:46Z

In [28]: df1
Out[28]:
              1    2    3    4
1983-02-16  512  517  510  514
1983-02-17  513  520  513  517
1983-02-18  500  500  500  500
1983-02-21  505  505  496  496

In [29]: msk = df1.apply(lambda col: df[1] != col).any(axis=1)

In [30]: df1.iloc[msk]

sad face

jreback · 2013-05-24T12:44:14Z

Sad because this PR took this away?
I thought that was the point?

In [10]: df = DataFrame(np.random.randint(490,520,size=16).reshape(4,4),index=date_range('1983-02-16',periods=4))

In [11]: df.iloc[2] = 500

In [12]: df
Out[12]: 
              0    1    2    3
1983-02-16  495  510  500  493
1983-02-17  519  517  508  504
1983-02-18  500  500  500  500
1983-02-19  514  519  498  503

In [13]: msk = df.apply(lambda col: df[1] != col).any(axis=1)

In [14]: msk
Out[14]: 
1983-02-16     True
1983-02-17     True
1983-02-18    False
1983-02-19     True
Freq: D, dtype: bool

In [15]: df.iloc[msk]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask

In [16]: df.loc[msk]
Out[16]: 
              0    1    2    3
1983-02-16  495  510  500  493
1983-02-17  519  517  508  504
1983-02-19  514  519  498  503

hayd · 2013-05-24T12:48:04Z

I totally did not understand that that worked in your pr. I thought it gave a not implemented error! (and you'd have to change msk.index). Hmmm

jreback · 2013-05-24T12:51:38Z

Those were the cases we separated above. With labels it is conceptually wrong: you can't iloc with labels. This one is ok, but we are disallowing it:

In [6]: msk.index=range(4)

In [7]: msk
Out[7]: 
0     True
1     True
2    False
3     True
dtype: bool

In [8]: df.iloc[msk]
NotImplementedError: iLocation based boolean indexing on an integer type is not available

hayd · 2013-05-24T15:19:19Z

Sorry, this made little sense without the context, which you found: http://stackoverflow.com/questions/16729574/how-to-get-a-value-from-a-cell-of-a-data-frame

So you can use an integer mask with loc if your index is not integer; I didn't realise that. Not sure where I am on the semantics (this stuff is confusing).

jreback · 2013-05-24T15:43:20Z

I am not sure I understand your last?

hayd · 2013-05-24T16:54:55Z

in the above I'm using position in loc. Compare to using df.loc[0] where I get a key error.

I realise that it is obviously the intent of the user to mask like that (since the df doesn't have an integer index), but semantically they are using iloc, and I worry that one day, if they happen to be using this mask technique with an integer-indexed df with loc, they'll get unexpected results. This feels like the ix thing...

jreback · 2013-05-24T17:23:23Z

When you say above, do you mean your example where you do sad face? Or this example?

In [1]: df = pd.DataFrame(range(5), columns=['a'], index=range(4, -1, -1))

In [2]: msk = pd.Series([True, True, False, False, False], index=[1,2,3,4,0])

In [3]: df[msk]
pandas/core/frame.py:2013: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[3]: 
   a
2  2
1  3

In [4]: df.iloc[msk]
NotImplementedError: iLocation based boolean indexing on an integer type is not available

In [5]: df.iloc[1:3]
Out[5]: 
   a
3  1
2  2

In [6]: df.loc[msk]
Out[6]: 
   a
2  2
1  3

I don't see where there is possible confusion? iloc will give an error if you try to index with a mask (either a ValueError or a NotImplementedError)

loc is by definition label based indexing, again you are indexing, it will be label based (and NEVER positional), that's the point (I think ix would fall back and that's of course why we created loc)
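
A compact sketch of the label-versus-position distinction on the reversed-index frame from the session above. The .values call is the explicit positional route suggested earlier in the thread; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=range(4, -1, -1))
msk = pd.Series([True, True, False, False, False], index=[1, 2, 3, 4, 0])

# loc aligns the mask on labels: the rows labelled 2 and 1 are selected.
by_label = df.loc[msk]

# Dropping the index makes the mask positional: the first two rows (labels 4, 3).
by_position = df.iloc[msk.values]
```

The two selections differ, so even with an integer index the choice between loc with a Series and iloc with raw values decides which rows come back.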

can you give an example of where you think there is a problem?

hayd · 2013-05-24T19:09:20Z

I take it all back. I was sure when I did this before I needed to set the index (because somewhere along the line it had gone). Now I see msk.index == df1.index anyway, so I was talking utter rubbish. Sorry!

jreback · 2013-05-24T19:23:27Z

its nice having a skeptical eye! thanks

jreback mentioned this issue May 17, 2013

API: Raise on iloc indexing with a non-integer based boolean mask (GH3631) #3635

Merged

jreback closed this as completed in #3635 May 19, 2013

jcjf mentioned this issue Jul 2, 2015

DOC: Distinguish between different types of boolean indexing #10492

Closed

jorisvandenbossche mentioned this issue Apr 27, 2016

ENH: Allow where/mask/Indexers to accept callable #12539

Closed

