BUG: Problem with count aggregation of a boolean column #3752

Closed
cameronhatfield opened this issue Jun 4, 2013 · 4 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@cameronhatfield

The issue I am having is that in pandas 0.10.0, a count of a boolean column with a single item in it would be 1, while a count of multiple items would of course be the number of items. In 0.11.0, the result is now True for a single item, and the count otherwise.

Result in 0.10.0:

Single item Groupby
DataFrame
   0   bool
0  0  False
Group By
   0  bool
0  0     1
Multi item Groupby
DataFrame
   0   bool
0  0  False
1  0   True
Group By
   0  bool
0  0     2

In 0.11.0, this is now the result:

Single item Groupby
DataFrame
   0   bool
0  0  False
Group By
   0  bool
0  0  True
Multi item Groupby
DataFrame
   0   bool
0  0  False
1  0   True
Group By
   0  bool
0  0     2

Test Code:

from pandas import DataFrame, Series

print('Single item Groupby')
dsd = DataFrame.from_records([(0, False)], columns=['0', 'bool'])
print('DataFrame')
print(dsd)
print('Group By')
print(dsd.groupby(['0'], as_index=False).agg({'bool':Series.count}))
print('Multi item Groupby')
dsd = DataFrame.from_records([(0, False), (0, True)], columns=['0', 'bool'])
print('DataFrame')
print(dsd)
print('Group By')
print(dsd.groupby(['0'], as_index=False).agg({'bool':Series.count}))
@jreback
Contributor

jreback commented Jun 4, 2013

This is a side effect of pandas trying to coerce the output of groupby back to the same data type as the input (if possible); since you are using a user-defined function, it is impossible to disambiguate this case.

You are after all grouping on a boolean column. I agree this is a somewhat degenerate case, and I suppose we maybe should not coerce on a boolean column at all (in a groupby).
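To illustrate the coercion described above, here is a toy model in plain Python. It is not pandas' actual implementation, and the helper name `cast_back_if_lossless` is made up for this sketch. It assumes a result is cast back to the input type only when the round trip loses nothing, which would explain why a count of 1 comes back as True while a count of 2 stays an integer.

```python
def cast_back_if_lossless(values, input_type):
    """Toy model (hypothetical helper): cast values to input_type
    only if converting back reproduces every original value."""
    cast = [input_type(v) for v in values]
    # Convert each cast value back to its original type and compare.
    if all(type(v)(c) == v for v, c in zip(values, cast)):
        return cast
    return list(values)


# A count over a single-row group is 1; bool(1) -> True -> int(True) == 1,
# so the round trip is lossless and the value is cast to bool.
single_group = cast_back_if_lossless([1], bool)

# A count over a two-row group is 2; bool(2) -> True -> int(True) == 1 != 2,
# so the cast is rejected and the integer is kept.
multi_group = cast_back_if_lossless([2], bool)

print(single_group)
print(multi_group)
```

Under this assumption the aggregation function itself is never inspected, only its output, so a count and a genuine boolean reduction look identical whenever the count happens to be 0 or 1.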

Can you give me some more context on what you are trying to do?

Is the following what you are after?

In [15]: DataFrame.from_records([(0, False)], columns=['0', 'bool']).groupby(['0']).count()
Out[15]: 
   0  bool
0         
0  1     1

@cameronhatfield
Author

The main issue I am running into is that I am doing multiple aggregations, which is similar to the following (Note: I did not compile or run this. If you want me to make sure this is a working example, let me know):

from collections import OrderedDict, namedtuple

import numpy as np
from pandas import DataFrame, Series

MyInputTuple = namedtuple('MyInputTuple', 'attr_0, attr_1, attr_2, success, value_average')
data_frame = DataFrame.from_records([MyInputTuple(0, 1, 2, True, 4.7)], columns=MyInputTuple._fields)
# The outer parentheses let the chained .agg call continue on the next line.
result = (data_frame.groupby(['attr_0', 'attr_1', 'attr_2'], as_index=False)
    .agg(OrderedDict([
        ('success', OrderedDict([
             ('num_tests', Series.count),
             ('num_failed', lambda x: x.count() - np.count_nonzero(x))
             ])),
        ('value_average', OrderedDict([
             ('min', np.min),
             ('max', np.max),
             ('avg', np.mean)
             ])),
       ])))

MyOutputTuple = namedtuple('MyOutputTuple', 'attr_1, attr_2, num_tests, num_failed, min_value_avg, max_value_avg, avg_value_avg')

for row in result.itertuples():
    attr_1 = row[1]
    output_tuple = row[2:]
    yield attr_1, MyOutputTuple._make(output_tuple)
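For readers on a recent pandas, where nested-dict renaming in `agg` is no longer accepted, the same multi-aggregation can be sketched with keyword-based named aggregation. This is an assumption about how one might port the snippet, not something proposed in this thread; the second input row is invented so that the failure count is non-trivial.

```python
import pandas as pd

# Hypothetical input: two tests in the same group, one passed and one failed.
data_frame = pd.DataFrame.from_records(
    [(0, 1, 2, True, 4.7), (0, 1, 2, False, 5.3)],
    columns=['attr_0', 'attr_1', 'attr_2', 'success', 'value_average'],
)

# Each keyword names an output column; the tuple is (input column, aggregation).
result = data_frame.groupby(['attr_0', 'attr_1', 'attr_2'], as_index=False).agg(
    num_tests=('success', 'count'),
    # Failures = non-null entries minus the number of True values.
    num_failed=('success', lambda x: x.count() - x.sum()),
    min_value_avg=('value_average', 'min'),
    max_value_avg=('value_average', 'max'),
    avg_value_avg=('value_average', 'mean'),
)

print(result)
```

The output columns come out flat and in keyword order, so the row slicing used to build the output tuple would no longer have to deal with a two-level column index.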

@jreback
Contributor

jreback commented Jun 4, 2013

I would just data_frame['success'] = data_frame['success'].astype(int) before you start

or you could easily handle this in a single function, e.g.

def function(grp):
    # do whatever here
    pass
data_frame.groupby(.....).agg(function)

easier to read/understand too, my 2c
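A minimal sketch of both suggestions combined, under some assumptions: the sample rows and the function name `summarize` are invented for illustration, and the sketch uses `groupby(...).apply` rather than `agg`, since the function needs to see several columns of each group at once.

```python
import pandas as pd

# Hypothetical data: group 0 has one pass and one failure, group 1 has one pass.
data_frame = pd.DataFrame({
    'attr_0': [0, 0, 1],
    'success': [True, False, True],
    'value_average': [4.7, 5.3, 2.0],
})

# Suggestion 1: cast the boolean column to int up front, so any coercion
# back to the input dtype yields integers rather than booleans.
data_frame['success'] = data_frame['success'].astype(int)


# Suggestion 2: compute every statistic for a group in one function.
def summarize(grp):
    num_tests = grp['success'].count()
    return pd.Series({
        'num_tests': num_tests,
        'num_failed': num_tests - grp['success'].sum(),
        'min': grp['value_average'].min(),
        'max': grp['value_average'].max(),
        'avg': grp['value_average'].mean(),
    })


# Selecting the value columns explicitly keeps the grouping key out of grp.
summary = data_frame.groupby('attr_0')[['success', 'value_average']].apply(summarize)
print(summary)
```

Because each group's result is a single Series mixing counts and floats, the summary columns may all come back as floats; cast the count columns afterwards if integer output matters.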

@jreback
Contributor

jreback commented Apr 29, 2014

closing in favor of #7001

jreback closed this as completed Apr 29, 2014