Pure Python GroupBy bug #618

Closed
yarikoptic opened this issue Jan 12, 2012 · 6 comments

@yarikoptic
Contributor

I tried to find a related issue but failed, so pardon me if this is a duplicate:

ATM, if grouping doesn't actually produce all possible combinations, pandas spits out a non-informative error:

/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/core/groupby.py in _aggregate_series_pure_python(self, obj, func, ngroups)
    431                     raise ValueError('function does not reduce')
    432 
--> 433             counts[label] = group.shape[0]
    434             result[label] = res
    435 

IndexError: index out of bounds
> /home/yoh/deb/gits/pkg-exppsy/pandas/pandas/core/groupby.py(433)_aggregate_series_pure_python()
    432 
--> 433             counts[label] = group.shape[0]
    434             result[label] = res

IMHO there could be an option to still handle those cases but place NaNs for the missing entries, or at least raise an informative exception, something like "combination f1='x', f2='y' doesn't have data entries in the original data".
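To make the two behaviours requested above concrete, here is a minimal pure-Python sketch. It is not pandas code: `aggregate_by_keys`, its `on_missing` parameter, and the sample rows are all hypothetical names invented for illustration. It assumes the rows are plain dicts and that "all possible combinations" means the Cartesian product of the observed values of each key.

```python
import math
from itertools import product


def aggregate_by_keys(rows, keys, value, func, on_missing="nan"):
    """Aggregate `value` for every combination of `keys` (hypothetical helper).

    on_missing="nan"   -> combinations with no rows get NaN
    on_missing="raise" -> combinations with no rows raise a descriptive ValueError
    """
    # Collect the observed values for each key combination.
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[k] for k in keys), []).append(row[value])

    # Every key's observed levels; their product is the full set of combinations.
    levels = [sorted({row[k] for row in rows}) for k in keys]

    result = {}
    for combo in product(*levels):
        if combo in groups:
            result[combo] = func(groups[combo])
        elif on_missing == "nan":
            result[combo] = math.nan
        else:
            desc = ", ".join(f"{k}={v!r}" for k, v in zip(keys, combo))
            raise ValueError(
                f"combination {desc} has no entries in the original data"
            )
    return result


def mean(values):
    return sum(values) / len(values)


# Made-up sample data: ('x', 'b') and ('y', 'a') never occur.
rows = [
    {"f1": "x", "f2": "a", "RT": 1.0},
    {"f1": "x", "f2": "a", "RT": 3.0},
    {"f1": "y", "f2": "b", "RT": 2.0},
]

filled = aggregate_by_keys(rows, ["f1", "f2"], "RT", mean)

try:
    aggregate_by_keys(rows, ["f1", "f2"], "RT", mean, on_missing="raise")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With `on_missing="nan"` the unobserved combinations appear in the result as NaN; with `on_missing="raise"` the first unobserved combination is named in the exception message instead of surfacing as a bare `IndexError`.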

@wesm
Member

wesm commented Jan 12, 2012

could you give me a self-contained test case?

related to #443

@yarikoptic
Contributor Author

Pity, but I fail to come up with a minimal example -- indeed it just inserts NAs for those, so maybe it is a different scenario... I will keep it in mind; maybe I will come up with one eventually ;)

@yarikoptic
Contributor Author

OK -- here is a non-minimal example. It seems to boil down to me somewhat abusing the index (I have a 'subject' column which is also used as part of the MultiIndex for rows). Here is some sample data (just gunzip it):
http://www.onerussian.com/tmp/data4wes.hdf5.gz and this is a snippet to demonstrate the problem:

from pandas import *
store_ = HDFStore('/tmp/data4wes.hdf5')
pivot_table(store_['d'], 'RT', rows=['subject'], cols=['condition', 'pgender', 'gaze'], margins=True)

@wesm
Member

wesm commented Jan 17, 2012

@yarikoptic there are actually a couple of bugs here. Note this works fine:


In [8]: d = store['d']

In [9]: d.groupby(['condition', 'pgender', 'gaze', 'subject'])['RT'].mean()
Out[9]: 
condition   pgender  gaze  subject  
full_ag     f        a     01jul10sc    2.312
full_ag     m        a     01jul10sc    2.507
full_dg     f        d     01jul10sc    1.905
full_dg     m        d     01jul10sc    2.137
profile_ag  f        a     01jul10sc    1.698
profile_ag  m        a     01jul10sc    2.408
profile_dg  f        d     01jul10sc    2.481
profile_dg  m        d     01jul10sc    2.912

but this does not:


In [10]: d.groupby(['condition', 'pgender', 'gaze', 'subject'])['RT'].agg(np.mean)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/wesm/Downloads/<ipython-input-10-976e1dfbc45e> in <module>()
----> 1 d.groupby(['condition', 'pgender', 'gaze', 'subject'])['RT'].agg(np.mean)

/home/wesm/code/pandas/pandas/core/groupby.pyc in agg(self, func, *args, **kwargs)
    283         See docstring for aggregate
    284         """
--> 285         return self.aggregate(func, *args, **kwargs)
    286 
    287     def _iterate_slices(self):

/home/wesm/code/pandas/pandas/core/groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
    779         else:
    780             if len(self.groupings) > 1:
--> 781                 return self._python_agg_general(func_or_funcs, *args, **kwargs)
    782 
    783             try:

/home/wesm/code/pandas/pandas/core/groupby.pyc in _python_agg_general(self, func, *args, **kwargs)
    394             try:
    395                 result, counts = self._aggregate_series(obj, agg_func,
--> 396                                                         comp_ids, max_group)
    397                 output[name] = result
    398             except TypeError:

/home/wesm/code/pandas/pandas/core/groupby.pyc in _aggregate_series(self, obj, func, group_index, ngroups)
    412             return _aggregate_series_fast(obj, func, group_index, ngroups)
    413         except Exception:
--> 414             return self._aggregate_series_pure_python(obj, func, ngroups)
    415 
    416     def _aggregate_series_pure_python(self, obj, func, ngroups):

/home/wesm/code/pandas/pandas/core/groupby.pyc in _aggregate_series_pure_python(self, obj, func, ngroups)
    432                     raise ValueError('function does not reduce')
    433 
--> 434             counts[label] = group.shape[0]
    435             result[label] = res
    436 

IndexError: index out of bounds

thanks for reproducing! This is a blocker for 0.7.0 so I will fix asap...
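The traceback only establishes that `label` was larger than the length of `counts` at `counts[label] = group.shape[0]`; the thread does not say why. One way such a mismatch can arise, sketched here purely as an assumption and not as the actual pandas code, is when group labels are flattened over the full Cartesian product of key levels while the output arrays are sized by the number of groups actually observed. `flat_label` and `compress_labels` are hypothetical helpers.

```python
def flat_label(codes, sizes):
    """Flatten per-key integer codes into a single row-major label."""
    label = 0
    for code, size in zip(codes, sizes):
        label = label * size + code
    return label


def compress_labels(labels):
    """Map arbitrary integer labels onto dense ids 0..n_observed-1."""
    dense = {}
    ids = [dense.setdefault(lab, len(dense)) for lab in labels]
    return ids, len(dense)


# Three keys with two levels each: 8 possible combinations, only 2 observed.
sizes = (2, 2, 2)
observed_codes = [(0, 0, 0), (1, 1, 1)]
labels = [flat_label(codes, sizes) for codes in observed_codes]  # [0, 7]

# Sizing the output by the number of observed groups, but indexing with the
# flattened label, goes out of bounds (assumed failure mode, for illustration).
n_observed = len(set(labels))
counts = [0] * n_observed
try:
    counts[labels[1]] = 1
    overflowed = False
except IndexError:
    overflowed = True

# Compressing the labels first keeps every index within the output array.
dense_ids, n_groups = compress_labels(labels)
safe_counts = [0] * n_groups
for group_id in dense_ids:
    safe_counts[group_id] += 1
```

Either sizing the output by the full product of levels or compressing the labels to dense ids keeps the two in agreement; which approach the fix in master took is not stated in this thread.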

@yarikoptic
Contributor Author

> @yarikoptic there are actually a couple of bugs here. Note this works fine:
> ...
> thanks for reproducing!

Glad to be of "help" ;-)

> This is a blocker for 0.7.0 so I will fix asap...

Cool -- thanks in advance

=------------------------------------------------------------------=
Keep in touch www.onerussian.com
Yaroslav Halchenko www.ohloh.net/accounts/yarikoptic

@wesm
Member

wesm commented Jan 17, 2012

Alright, this is all set and fixed in master

@wesm wesm closed this as completed Jan 17, 2012