Weird behaviour of groupby with quantile #10842

pekaalto · 2015-08-18T08:32:31Z

Hi

import pandas as pd
X = pd.DataFrame(dict(g=['a','a','b','b','b'],a=range(5)))

For some reason this loses the grouping variable 'g':

X.groupby(['g'],as_index=False).quantile(0.5)

     a
0  0.5
1  3.0

I was expecting this output:

X.groupby(['g']).quantile(0.5).reset_index()

   g    a
0  a  0.5
1  b  3.0

Possibly related to the above: the following throws an error (selecting a single column is useful when I have tons of columns but need the result for only one).

X.groupby(['g'],as_index=False)['a'].quantile(0.5)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 40, in quantile
  File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 579, in wrapper
    return self._aggregate_item_by_item(name, *args, **kwargs)
  File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2789, in _aggregate_item_by_item
    raise errors
TypeError: quantile() got an unexpected keyword argument 'numeric_only'

If I replace .quantile() with another aggregate such as mean or sum, it works just fine.
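A minimal sketch of a workaround for the single-column case, assuming it is acceptable to aggregate with the default as_index=True and restore the group labels afterwards. The name median_a is only illustrative.

```python
import pandas as pd

X = pd.DataFrame(dict(g=['a', 'a', 'b', 'b', 'b'], a=range(5)))

# Select the column first, aggregate with the default as_index=True,
# then turn the group labels back into an ordinary column.
median_a = X.groupby('g')['a'].quantile(0.5).reset_index()
print(median_a)
```

This avoids passing as_index=False to quantile at all, so it sidesteps the code path that raised the TypeError above.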

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: FI

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.6
pymysql: None
psycopg2: None


jreback · 2015-08-18T10:41:10Z

This is the behavior of as_index=False by definition. You normally won't want to pass this; it drops the group relationships.

In [22]: X.groupby(['g']).quantile(0.5)
Out[22]: 
     a
g     
a  0.5
b  3.0

In [23]: X.groupby(['g'])['a'].quantile(0.5)
Out[23]: 
g
a    0.5
b    3.0
Name: a, dtype: float64

pekaalto · 2015-08-18T11:14:26Z

@jreback
But why is quantile different from other aggregates like .sum() and .mean()?

See these:

X.groupby(['g'],as_index=False).mean()
X.groupby(['g'],as_index=False).sum()

Now I get 'g' as a column.

However, when I do

X.groupby(['g'],as_index=False).quantile()

The 'g' has disappeared.

Why the difference in behaviour?
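One way to check the difference on a given installation is to list the result columns for each aggregate. This is only a sketch: the results dictionary is an illustrative name, and whether 'g' survives for quantile depends on the pandas version in use, so no output is shown.

```python
import pandas as pd

X = pd.DataFrame(dict(g=['a', 'a', 'b', 'b', 'b'], a=range(5)))

grouped = X.groupby(['g'], as_index=False)

# Run each aggregate on the same grouped object and record the result.
results = {
    'mean': grouped.mean(),
    'sum': grouped.sum(),
    'quantile': grouped.quantile(0.5),
}

# Report, per aggregate, the columns and whether the key column was kept.
for name, result in results.items():
    print(name, list(result.columns), 'g' in result.columns)
```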

jreback · 2015-08-18T11:38:55Z

See #5755; there are some inconsistencies here. It requires a bit of effort to fix.

pekaalto · 2015-08-18T11:47:13Z

Oh ok, thanks. I didn't find that issue myself. It seems to be a similar case.

pierre-haessig · 2017-06-06T13:14:29Z

I've also been bitten by the inconsistency of quantile vs. min/max/mean in the context of time series resampling: it makes it more difficult to compute the quantile over each period, because one needs to use apply(...). Here is an example:

>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('1/1/2011', periods=72, freq='H')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts.resample('1D').max() # easy

2011-01-01    2.544352
2011-01-02    1.594400
2011-01-03    1.395219
Freq: D, dtype: float64

>>> ts.resample('1D').quantile(0.999) # doesn't work as expected, like max()
/home/pierre/Programmes/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 
.resample() is now a deferred operation
You called quantile(...) on this deferred object which materialized it into a series
by implicitly taking the mean.  Use .resample(...).mean() instead
  """Entry point for launching an IPython kernel.

0.1853409719492341

>>> ts.resample('1D').apply(lambda x: x.quantile(0.999)) # works, but more typing

2011-01-01    2.530162
2011-01-02    1.588451
2011-01-03    1.388551
Freq: D, dtype: float64
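As a sketch of an alternative to apply(), the same daily quantile can probably be obtained by grouping on a pd.Grouper instead of resampling. This assumes a pandas version where SeriesGroupBy.quantile accepts a time-based grouper; the names via_apply and via_grouper are illustrative, and the index is built from a Timedelta only to avoid depending on the spelling of the hourly frequency alias.

```python
import numpy as np
import pandas as pd

# Three days of hourly data with a fixed seed for reproducibility.
rng = pd.date_range('2011-01-01', periods=72, freq=pd.Timedelta(hours=1))
ts = pd.Series(np.random.default_rng(0).standard_normal(len(rng)), index=rng)

# Workaround from the comment above: apply a per-period quantile.
via_apply = ts.resample('1D').apply(lambda x: x.quantile(0.999))

# Alternative: group by calendar day and call quantile on the groupby object.
via_grouper = ts.groupby(pd.Grouper(freq='1D')).quantile(0.999)

print(via_apply)
print(via_grouper)
```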

I'm using pandas 0.20.2.

Is this related to the as_index parameter mentioned in #5755 or is it something else? Should I file a separate issue?

Also, as a side note, the example in the API reference http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.quantile.html doesn't use groupby...

jreback · 2017-06-06T13:25:56Z

> Also, as a side note, the example in the API reference http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.quantile.html doesn't use groupby...

The docs are mostly just references to the Series methods themselves. We would certainly take a PR with a better example.
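For illustration, a docstring-style example that does use groupby might look roughly like the sketch below. The frame and the column names key and val are made up for this purpose and are not taken from the pandas documentation.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'a', 'b', 'b', 'b'],
    'val': [1, 2, 3, 1, 3, 5],
})

# Median of 'val' within each group; the group labels become the index.
medians = df.groupby('key').quantile(0.5)
print(medians)
```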

jreback · 2017-06-06T13:27:34Z

@pekaalto your other issue is this: #15023

PRs welcome on that as well! (It's easy!)

pierre-haessig · 2017-06-07T07:19:54Z

Thanks @jreback for the pointer! For now I'll stick with the apply() workaround.

jreback closed this as completed Aug 18, 2015

jreback added Groupby Usage Question labels Aug 18, 2015
