Dropping nuisance columns in groupby is a nuisance #21664

postelrich · 2018-06-28T15:46:08Z

Code Sample, a copy-pastable example if possible

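import pandas as pd
from decimal import Decimal

# 'y' holds Decimal objects, so pandas stores the column with object dtype.
df = pd.DataFrame({'id': [1], 'x': [1], 'y': [Decimal(1)]})
df.groupby('id')[['x', 'y']].sum()

# Actual output on pandas 0.22.0 ('y' is silently dropped):
#     x
# id   
# 1   1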

Problem description

I ran into the feature described here without realizing it when running the code above. While I see how it can be useful, it's a nuisance that it happens silently and that I can't disable it. When the groupby is on explicitly selected columns, groupby(...)[COLS], I feel pandas should not drop any columns and should let whatever errors occur be raised. I also think a warning could be added, and/or an option to disable the feature.

Expected Output

#     x  y
# id   
# 1   1  1
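For readers who hit this on an affected version, one possible way to get the expected output is to aggregate the Decimal column with a plain Python sum through `apply`, so that it never goes through the numeric-only reduction path. This is only a sketch: the helper name `decimal_sum` and the extra sample rows are made up for illustration and are not part of the report.

```python
from decimal import Decimal

import pandas as pd

# Sample data (illustrative only): 'y' is an object-dtype column of Decimals.
df = pd.DataFrame({
    'id': [1, 1, 2],
    'x': [1, 2, 3],
    'y': [Decimal('1.5'), Decimal('2.5'), Decimal('4')],
})


def decimal_sum(values):
    # Hypothetical helper: built-in sum with a Decimal start value,
    # so the arithmetic stays in exact Decimal form.
    return sum(values, Decimal(0))


grouped = df.groupby('id')

# Aggregate each column separately and reassemble the frame,
# so both explicitly requested columns are present in the result.
result = pd.DataFrame({
    'x': grouped['x'].sum(),
    'y': grouped['y'].apply(decimal_sum),
})
print(result)
```

The trade-off is that `apply` with a Python function is slower than the built-in `sum`, which may matter on large frames.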

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.10.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.28.1
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

WillAyd · 2018-06-28T15:51:17Z

Generally I think this would be infeasible and would break a lot of backwards compatibility (e.g., if the value in column 'y' were 'one', wouldn't you want this behavior?).

Are you more concerned specifically with the Decimal type? If so, there has been discussion about providing that as a community-delivered Extension Array, which could help, so this would probably be a duplicate of another issue. Let me know.

postelrich · 2018-06-28T17:41:57Z

@WillAyd I would think if I was explicitly selecting columns that they should be returned (or cause an error), not silently dropped.

WillAyd · 2018-06-28T17:50:07Z

Hmm OK I see your argument there, though it would cause an inconsistency in the handling of explicit vs implicit column selection and would break backwards compatibility.

If you feel like investigating and proposing a PR it could be considered!

postelrich · 2018-06-28T19:32:49Z

@WillAyd Yeah, I'm not sure about the backwards compatibility. Maybe we could start by offering an option to toggle column dropping and adding a DeprecationWarning (or another appropriate warning) that the behavior will change.

Also, what are your thoughts about always warning when a column is dropped as opposed to being silent?
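To make the toggle and warning ideas concrete, here is a rough user-side sketch. It assumes a hypothetical helper, `check_dropped_columns`, with a made-up `on_drop` parameter; neither exists in pandas. It only compares the requested columns with the columns in a result and does not change how groupby behaves.

```python
import warnings

import pandas as pd


def check_dropped_columns(requested, result, on_drop='warn'):
    """Return requested columns that are missing from `result`.

    Hypothetical helper, not a pandas API. `on_drop` may be
    'warn', 'raise', or 'ignore'.
    """
    if on_drop not in ('warn', 'raise', 'ignore'):
        raise ValueError(f"unknown on_drop value: {on_drop!r}")
    dropped = [col for col in requested if col not in result.columns]
    if dropped and on_drop == 'raise':
        raise KeyError(f"columns dropped during aggregation: {dropped}")
    if dropped and on_drop == 'warn':
        warnings.warn(
            f"columns dropped during aggregation: {dropped}",
            UserWarning,
            stacklevel=2,
        )
    return dropped


cols = ['x', 'y']
# Stand-in for an aggregation result in which 'y' went missing.
result = pd.DataFrame({'x': [1]}, index=pd.Index([1], name='id'))
dropped = check_dropped_columns(cols, result, on_drop='ignore')
print(dropped)
```

In practice `result` would come from something like `df.groupby('id')[cols].sum()`, with the check run right after it.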

jchia · 2018-10-31T07:16:06Z

This issue forms a nice pair with #17382. When your mean aggregation involves a timedelta column, the timedelta column silently disappears. This behavior is surprising to users unaware of the limitations of timedelta.
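As a rough illustration of how one might sidestep the timedelta case, the sketch below converts the column to float seconds, takes the grouped mean, and converts back. The column names and data are invented for the example, and this is an assumed workaround, not something proposed in #17382.

```python
import pandas as pd

# Illustrative data: 'dt' is a timedelta64 column.
df = pd.DataFrame({
    'id': [1, 1, 2],
    'dt': pd.to_timedelta(['1s', '3s', '10s']),
})

# Reduce on plain floats, which every version can average,
# then convert the result back to timedeltas.
seconds = df['dt'].dt.total_seconds()
mean_seconds = seconds.groupby(df['id']).mean()
mean_dt = pd.to_timedelta(mean_seconds, unit='s')
print(mean_dt)
```

Going through float seconds can lose precision for very large or very fine-grained durations, so this is best treated as a convenience rather than an exact method.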

ghost711 · 2021-12-05T21:37:48Z

Correct me if I'm wrong, but I thought the OP's concern was that columns shouldn't be dropped if they're explicitly specified.

Otherwise, it seems to me that the automatic dropping of nuisance columns is something that most people would want by default, with extra typing required to turn it off, not to turn it on.

At an IPython prompt, for instance, where I used to be able to just type df.sum(), we now have to type df.sum(numeric_only=True) (long enough to make me question whether I really want the answer badly enough to type it).

Regardless, almost no one will ever want their string columns summed for instance, so it seems like the default should be to drop them, with the rare person that actually wants that behavior able to specify numeric_only=False.

This seems to me to be just like how NaNs are silently ignored when summing or similar, without throwing errors or warnings.
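The two behaviors being compared can be seen side by side in a small sketch. The data is invented for illustration; `skipna` and `numeric_only` are existing parameters of `sum`.

```python
import math

import pandas as pd

s = pd.Series([1.0, float('nan'), 2.0])

# NaN is skipped by default (skipna=True), with no warning or error.
skipped = s.sum()

# With skipna=False the NaN propagates into the result instead.
propagated = s.sum(skipna=False)

df = pd.DataFrame({'a': [1, 2], 'label': ['p', 'q']})

# numeric_only=True leaves the string column out of the reduction.
numeric_sum = df.sum(numeric_only=True)

print(skipped, propagated)
print(numeric_sum)
```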

WillAyd added the Needs Info (Clarification about behavior needed to assess issue) label on Jun 28, 2018

WillAyd added the Groupby, Error Reporting (Incorrect or improved errors from pandas), and Difficulty Intermediate labels and removed the Needs Info (Clarification about behavior needed to assess issue) label on Jun 28, 2018

drudd mentioned this issue Aug 11, 2018

Decimal fields dropped in group by with more than one column #22275

Closed

jbrockmendel removed the Difficulty Intermediate label Oct 21, 2019

jbrockmendel added the Reduction Operations (sum, mean, min, max, etc.) and Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply) labels on Sep 21, 2020

jbrockmendel mentioned this issue May 14, 2021

DEPR: dropping nuisance columns in DataFrameGroupby apply, agg, transform #41475

Merged

4 tasks

jreback added this to the 1.3 milestone May 17, 2021

jreback closed this as completed in #41475 May 26, 2021
