Weird behaviour of groupby with quantile #10842

Closed
pekaalto opened this issue Aug 18, 2015 · 8 comments

Comments

@pekaalto

Hi

import pandas as pd
X = pd.DataFrame(dict(g=['a','a','b','b','b'],a=range(5)))

For some reason this loses the grouping variable 'g':

X.groupby(['g'],as_index=False).quantile(0.5)

     a
0  0.5
1  3.0

I was expecting this output:

X.groupby(['g']).quantile(0.5).reset_index()

   g    a
0  a  0.5
1  b  3.0

Possibly related to the above, this throws an error (selecting one column is useful when I have tons of columns but need the result for only one).

X.groupby(['g'],as_index=False)['a'].quantile(0.5)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 40, in quantile
  File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 579, in wrapper
    return self._aggregate_item_by_item(name, *args, **kwargs)
  File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2789, in _aggregate_item_by_item
    raise errors
TypeError: quantile() got an unexpected keyword argument 'numeric_only'

If I replace .quantile() with another aggregate like mean or sum, it works just fine.


>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: FI

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.6
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Aug 18, 2015

This is the behavior of as_index=False by definition. You normally won't want to pass this, since it drops the group relationships.

In [22]: X.groupby(['g']).quantile(0.5)
Out[22]: 
     a
g     
a  0.5
b  3.0

In [23]: X.groupby(['g'])['a'].quantile(0.5)
Out[23]: 
g
a    0.5
b    3.0
Name: a, dtype: float64
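
A minimal sketch of the workaround already shown in the original report: group with the default as_index=True, then call reset_index() to turn the group labels back into a regular column. It reuses the example frame from the top of the thread; the variable name `result` is only illustrative.

```python
import pandas as pd

# Same example data as in the original report.
X = pd.DataFrame({"g": ["a", "a", "b", "b", "b"], "a": range(5)})

# Group normally (group labels become the index), select the one column
# of interest, compute the median, then move the labels back to a column.
result = X.groupby("g")["a"].quantile(0.5).reset_index()

print(result)
```

Selecting the column before calling quantile keeps the computation to that single column, which is what the failing as_index=False call was trying to achieve.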

@pekaalto
Author

@jreback
But why is quantile different from other aggregates like .sum() and .mean()?

See these:

X.groupby(['g'],as_index=False).mean()
X.groupby(['g'],as_index=False).sum()

Now I get 'g' as column.

However, when I do

X.groupby(['g'],as_index=False).quantile()

The 'g' has disappeared.

Why the difference in behaviour?

@jreback
Contributor

jreback commented Aug 18, 2015

See #5755; there are some inconsistencies here. It requires a bit of effort to fix.

@pekaalto
Author

Oh ok thanks. Didn't find that issue by myself. Seems to be similar case.

@pierre-haessig
Contributor

I've also been bitten by the inconsistency of quantile vs. min/max/mean in the context of time series resampling: it makes it harder to compute the quantile over each period (one needs to use apply(...)). Here is an example:

>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('1/1/2011', periods=72, freq='H')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts.resample('1D').max() # easy

2011-01-01    2.544352
2011-01-02    1.594400
2011-01-03    1.395219
Freq: D, dtype: float64

>>> ts.resample('1D').quantile(0.999) # doesn't work as expected, like max()
/home/pierre/Programmes/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 
.resample() is now a deferred operation
You called quantile(...) on this deferred object which materialized it into a series
by implicitly taking the mean.  Use .resample(...).mean() instead
  """Entry point for launching an IPython kernel.

0.1853409719492341

>>> ts.resample('1D').apply(lambda x: x.quantile(0.999)) # works, but more typing

2011-01-01    2.530162
2011-01-02    1.588451
2011-01-03    1.388551
Freq: D, dtype: float64

I'm using pandas 0.20.2.
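
For reference, a self-contained sketch of the apply() workaround with deterministic data, so the per-day result can be checked by hand. The values (0 to 71, one per hour) and the name `per_day` are assumptions chosen for illustration, not part of the original example, and the median is used instead of the 0.999 quantile to keep the arithmetic obvious.

```python
import numpy as np
import pandas as pd

# 72 hourly timestamps (three full days), built from a Timedelta step
# rather than a frequency alias string.
rng = pd.date_range("2011-01-01", periods=72, freq=pd.Timedelta(hours=1))

# Deterministic values 0.0 .. 71.0 instead of random noise.
ts = pd.Series(np.arange(72, dtype=float), index=rng)

# Compute a quantile per daily bin by applying Series.quantile to each bin.
per_day = ts.resample("1D").apply(lambda x: x.quantile(0.5))

print(per_day)
```

Each day holds 24 consecutive values, so the daily medians are 11.5, 35.5 and 59.5.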

Is this related to the as_index parameter mentioned in #5755 or is it something else? Should I file a separate issue?

Also, as a side note, the example in the API reference http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.quantile.html doesn't use groupby...

@jreback
Contributor

jreback commented Jun 6, 2017

Also, as a side note, the example in the API reference http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.quantile.html doesn't use groupby...

The docs are mostly just references to the Series methods themselves. We would certainly take a PR with a better example.

@jreback
Contributor

jreback commented Jun 6, 2017

@pekaalto your other issue is this: #15023

PRs welcome on that as well! (It's easy!)

@pierre-haessig
Contributor

Thanks @jreback for the pointer! For now I'll stick with the apply() workaround.
