pandas.core.groupby.GroupBy.apply fails #20949

MBlistein · 2018-05-04T11:10:31Z

Code Sample:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
>>> g = df.groupby('A')
>>> g.apply(lambda x: x / x.sum())

Problem description

Applying a function to a grouped DataFrame raises a ValueError. The code above is the example code from the official pandas documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html

Output to the above code:

/usr/local/lib/python2.7/dist-packages/pandas/core/computation/check.py:17: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.4.6

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 805, in apply
    return self._python_apply_general(f)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 809, in _python_apply_general
    self.axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1969, in apply
    res = f(group)
  File "<stdin>", line 1, in <lambda>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 1262, in f
    return self._combine_series(other, na_op, fill_value, axis, level)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3944, in _combine_series
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3958, in _combine_series_infer
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3981, in _combine_match_columns
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3435, in eval
    return self.apply('eval', **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3329, in apply
    applied = getattr(b, f)(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1377, in eval
    result = get_result(other)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1346, in get_result
    result = func(values, other)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 1216, in na_op
    yrav.fill(yrav.item())
ValueError: can only convert an array of size 1 to a Python scalar

The error can be 'fixed' by applying another command to the grouped object first:

>>> g.sum()
   B   C
A       
a  3  10
b  3   5

>>> g.apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0
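As a hedged sketch (not the fix adopted for this issue), the same per-group normalization can be written so that it does not depend on which methods were called on the grouped object beforehand. It selects the value columns explicitly and uses `transform`. The names `value_cols`, `group_sums` and `result` are illustrative, and a reasonably recent pandas version is assumed.

```python
import pandas as pd

df = pd.DataFrame({"A": "a a b".split(), "B": [1, 2, 3], "C": [4, 6, 5]})

# Name the numeric columns explicitly so the grouping column "A"
# is never passed to the arithmetic.
value_cols = ["B", "C"]

# transform("sum") broadcasts each group's column sums back onto the original rows.
group_sums = df.groupby("A")[value_cols].transform("sum")

# Element-wise division gives each value's share of its group total.
result = df[value_cols] / group_sums
print(result)
```

Because the grouping column is excluded up front, the division only ever sees numeric data, which is the shape of result the report expects.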

Expected Output

>>> g.apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0

Output of `pd.show_versions()`

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-122-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 2.8.7
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2014.10
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.0.11
pymysql: 0.7.2.None
psycopg2: 2.6.1 (dt dec mx pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-05-04T13:24:00Z

Thanks for the bug report.

WillAyd · 2018-05-04T17:31:27Z

Hmm, interesting. FWIW, when I remove numexpr I can't get this to run at all, regardless of whether I run another agg function first.

WillAyd · 2018-05-04T18:03:00Z

Numexpr may be a red herring. From what I can tell the problem occurs at the following line of code:

pandas/pandas/core/groupby/groupby.py

Line 5063 in ef019fa

results, mutated = reduction.apply_frame_axis0(sdata, f, names,

When apply is run without another agg function first, sdata includes the grouping column as part of the data, so the call throws here and execution goes down another path. sdata comes from _selected_obj.

Agg functions like sum, mean, etc. call _set_group_selection, which sets the appropriate cached value for _selected_obj. I suppose a quick fix is to add a call to that at the beginning of apply, though I can't tell from the code alone why that isn't done across the board.
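The call-order dependence described above can be illustrated with a small toy model. This is only a sketch in plain Python, not pandas source code: `ToyGroupBy` and all of its attributes are hypothetical, and the method name `_set_group_selection` is borrowed purely to mirror the description.

```python
from collections import defaultdict


class ToyGroupBy:
    """Toy model of a grouped object with a lazily cached column selection."""

    def __init__(self, rows, key):
        self.rows = rows        # list of dicts, one per row
        self.key = key          # name of the grouping column
        self._selection = None  # cached non-key columns, unset until needed

    def _set_group_selection(self):
        # Cache the columns to operate on, excluding the grouping column.
        self._selection = [c for c in self.rows[0] if c != self.key]

    def _selected_columns(self):
        # Without a cached selection, every column (key included) is used.
        if self._selection is None:
            return list(self.rows[0])
        return self._selection

    def _groups(self):
        columns = self._selected_columns()
        groups = defaultdict(list)
        for row in self.rows:
            groups[row[self.key]].append({c: row[c] for c in columns})
        return groups

    def sum(self):
        # Aggregations set the selection first, as described for sum and mean.
        self._set_group_selection()
        return {k: {c: sum(r[c] for r in g) for c in self._selection}
                for k, g in self._groups().items()}

    def apply(self, func):
        # apply never sets the selection, so its input depends on call history.
        return {k: func(g) for k, g in self._groups().items()}


def columns_seen(group):
    return sorted(group[0])


rows = [{"A": "a", "B": 1, "C": 4},
        {"A": "a", "B": 2, "C": 6},
        {"A": "b", "B": 3, "C": 5}]
g = ToyGroupBy(rows, "A")

before = g.apply(columns_seen)  # grouping column "A" is still present
g.sum()                         # side effect: caches the selection
after = g.apply(columns_seen)   # "A" is now excluded
```

In this model the user function receives the grouping column only when no aggregation has run yet, which is the same kind of hidden state the comment attributes to `_selected_obj`.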

cc @jreback for any insight

Dr-Irv · 2018-05-04T20:28:47Z

Here's another example that fails with 0.23rc2 (and in 0.22.0 as well), based on code from pandas\core\indexes\datetimes.py in test_agg_timezone_round_trip:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.0rc2'

In [3]: dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz='US/Pacific')
   ...:          for i in range(1, 5)]
   ...: df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})
   ...: grouped = df.groupby('A')
   ...:

In [4]: df
Out[4]:
   A                         B
0  a 2016-01-01 12:00:00-08:00
1  b 2016-01-02 12:00:00-08:00
2  a 2016-01-03 12:00:00-08:00
3  b 2016-01-04 12:00:00-08:00

In [5]: grouped.apply(lambda x: x.iloc[0])[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5720)()
    138
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5566)()
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22442)()
   1491
-> 1492     cpdef get_item(self, object val):
   1493         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22396)()
   1499         else:
-> 1500             raise KeyError(val)
   1501

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-2b16555d6e05> in <module>()
----> 1 grouped.apply(lambda x: x.iloc[0])[0]

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\frame.py in __getitem__(self, key)
   2685             return self._getitem_multilevel(key)
   2686         else:
-> 2687             return self._getitem_column(key)
   2688
   2689     def _getitem_column(self, key):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\frame.py in _getitem_column(self, key)
   2692         # get column
   2693         if self.columns.is_unique:
-> 2694             return self._get_item_cache(key)
   2695
   2696         # duplicate columns & possible reduce dimensionality

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\generic.py in _get_item_cache(self, item)
   2485         res = cache.get(item)
   2486         if res is None:
-> 2487             values = self._data.get(item)
   2488             res = self._box_item_values(item, values)
   2489             cache[item] = res

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\internals.py in get(self, item, fastpath)
   4113
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3063                 return self._engine.get_loc(key)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066
   3067         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5720)()
    137             util.set_value_at(arr, loc, value)
    138
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):
    141             raise TypeError("'{val}' is an invalid key".format(val=val))


C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5566)()
    159
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):
    163             raise KeyError(val)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22442)()
   1490                                        sizeof(uint32_t)) # flags
   1491
-> 1492     cpdef get_item(self, object val):
   1493         cdef khiter_t k
   1494         if val != val or val is None:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22396)()
   1498             return self.table.vals[k]
   1499         else:
-> 1500             raise KeyError(val)
   1501
   1502     cpdef set_item(self, object key, Py_ssize_t val):

KeyError: 0

However, if you do the following, it works:

In [6]: grouped.nth(0)['B'].iloc[0]
Out[6]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

In [7]: grouped.apply(lambda x: x.iloc[0])[0]
Out[7]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

So doing one operation (in this case nth) prior to the apply then makes the apply work.
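A possible way to sidestep the ordering dependence in this example is to select column 'B' explicitly before calling apply, so the function only ever receives that column. This is a sketch assuming a reasonably recent pandas version; it has not been checked against 0.22.0 or 0.23.0rc2, and the names `first_per_group` and `first_value` are illustrative.

```python
import pandas as pd

dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz="US/Pacific")
         for i in range(1, 5)]
df = pd.DataFrame({"A": ["a", "b"] * 2, "B": dates})

# Selecting "B" up front means the grouping column "A" is never
# part of the data handed to the lambda.
first_per_group = df.groupby("A")["B"].apply(lambda s: s.iloc[0])

# First timestamp of group "a"
first_value = first_per_group.iloc[0]
print(first_value)
```

The result is indexed by the group labels, so positional access with `.iloc[0]` replaces the `[0]` lookup used in the failing call.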

WillAyd · 2018-05-04T23:09:20Z

@Dr-Irv seems related. Some code below illustrating what I think is going on:

>>> grouped.apply(lambda x: x.iloc[0])[0]  # KeyError as indicator
KeyError

>>> grouped._set_group_selection()
>>> grouped.apply(lambda x: x.iloc[0])[0]  # Works now, as 'A' was not part of data
Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

>>> grouped._reset_group_selection()  # Clear out the group selection
>>> grouped.apply(lambda x: x.iloc[0])[0]  # Back to failing
KeyError

Unfortunately, just adding this call before _python_apply_general broke other tests where the grouping was supposed to be part of the returned object (at least according to the tests). I'm reviewing in more detail and hope to have a PR soon.

jreback · 2018-05-05T13:14:42Z

This didn't work even in 0.20.3. Not sure how we don't have a test for it, though.

jreback · 2018-05-05T13:15:49Z

@Dr-Irv your example is a separate issue. pls make a new report for that one.


MBlistein changed the title from "pandas.core.groupby.GroupBy.apply is broken" to "pandas.core.groupby.GroupBy.apply fails" May 4, 2018

TomAugspurger added this to the 0.23.0 milestone May 4, 2018

TomAugspurger added the Groupby, Blocker, Apply, and Regression labels May 4, 2018

Dr-Irv mentioned this issue May 4, 2018

BUG: Fix Series.get() for ExtensionArray and Categorical #20885

Merged

4 tasks

Dr-Irv mentioned this issue May 5, 2018

Using apply on a grouper works only if done after another operation on grouper #20958

Closed

jreback added a commit to jreback/pandas that referenced this issue May 5, 2018

BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column closes pandas-dev#20949

cf9ee69

jreback mentioned this issue May 5, 2018

BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column #20959

Merged

jreback added a commit to jreback/pandas that referenced this issue May 7, 2018

BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column closes pandas-dev#20949

41d930b

jreback closed this as completed in #20959 May 8, 2018

WillAyd mentioned this issue May 9, 2018

Consistent Return Structure for Rolling Apply #20984

Merged

4 tasks

venatir mentioned this issue Jan 6, 2020

KeyError: 0 error on groupby apply #30731

Closed

AlexKirko mentioned this issue Jul 31, 2020

BUG: agg on groups with different sizes fails with out of bounds IndexError #35275

Open

3 tasks

panda-byte mentioned this issue Mar 11, 2022

BUG: core.groupby.GroupBy.apply unexpected behavior with TypeError raised in UDF #46324

Closed

3 tasks
