Groups are passed to the applied function as slices (not copies) for performance reasons. A test is run on the first group to see whether the function mutates it; that test caught index mutation but was not catching column mutation. Thanks for the catch, fixed in #3384.
I have a dataset that is inherently hierarchical (from some experiments). For each large category 1, there are several category 2 entries, and in turn for each cat2 entry there are several cat3 entries. For each (cat1, cat2, cat3) tuple I have a measurement value, val below.
The analysis I am doing is per category 1. I first rank all values within cat1, and then I use those ranks to calculate a statistic for each cat2. The code below exemplifies what I am doing.
When I group by cat1 and apply a function for the calculations, I get an error when I assign to the group. This can be reproduced with the code below.
The error goes away when the applied function first makes a copy of the sub-DataFrame (see the commented-out line in function f below).
I am prepared to think that this should indeed raise an error, as it doesn't strike me as unreasonable that altering the groups over which one is iterating can be a bad idea. However, if this is indeed expected behavior, then it's not mentioned in the documentation (as far as I can tell).
Looking forward to any comments on this.
# example of assignment in groupby
import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    'cat1': ['a'] * 8 + ['b'] * 6,
    'cat2': ['c'] * 2 + ['d'] * 2 + ['e'] * 2 + ['f'] * 2 + ['c'] * 2 + ['d'] * 2 + ['e'] * 2,
    # list comprehension instead of map(), which is lazy on Python 3
    'cat3': ['g{}'.format(i) for i in range(1, 15)],
    'val': np.random.randint(100, size=14),
})

def f(x):
    # no error if copy is used:
    # x = x.copy()
    # the following assignment causes an error
    x['rank'] = x.val.rank(method='min')
    return x.groupby('cat2')['rank'].min()

grpby = mydf.groupby('cat1').apply(f)
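For what it's worth, the same per-cat2 statistic can probably be computed without `apply` at all, which avoids assigning into a group in the first place. The sketch below is an assumption about what the analysis needs (minimum rank per cat2 within each cat1), not a statement about how pandas handles the original code; the fixed values are illustrative stand-ins for the random `val` column.

```python
import pandas as pd

# Small hypothetical frame with the same column names as above.
df = pd.DataFrame({
    "cat1": ["a"] * 4 + ["b"] * 4,
    "cat2": ["c", "c", "d", "d", "c", "c", "e", "e"],
    "val": [30, 10, 20, 40, 5, 8, 7, 6],
})

# Rank the values within each cat1 group; the result is a Series
# aligned with df's index, so no group is modified.
ranks = df.groupby("cat1")["val"].rank(method="min")

# Minimum rank for every (cat1, cat2) pair.
min_rank = ranks.groupby([df["cat1"], df["cat2"]]).min()
print(min_rank)
```

The result is a Series indexed by (cat1, cat2), so a single value can be looked up with, for example, `min_rank.loc[("a", "d")]`.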