Dropping nuisance columns in groupby is a nuisance #21664

Closed
postelrich opened this issue Jun 28, 2018 · 6 comments · Fixed by #41475
Labels: Error Reporting (Incorrect or improved errors from pandas); Groupby; Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply); Reduction Operations (sum, mean, min, max, etc.)

@postelrich

Code Sample, a copy-pastable example if possible

import pandas as pd
from decimal import Decimal
df = pd.DataFrame({'id': [1], 'x': [1], 'y': [Decimal(1)]})
df.groupby('id')[['x', 'y']].sum()

#     x
# id   
# 1   1

Problem description

I unknowingly encountered the feature described here when running the above code. While I see how this can be a useful feature, it's a nuisance that it happens silently and that I can't disable it. I feel that when doing a groupby on explicitly selected columns, groupby(...)[COLS], pandas should not drop any columns and should let whatever errors occur be raised. I also think a warning could be added, and/or an option to disable the feature.
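To illustrate the behavior requested above, here is a minimal sketch of a user-side wrapper that refuses to return a result when an explicitly selected column has gone missing. The name `strict_groupby_agg` is hypothetical and not part of pandas; it only compares the requested columns with the columns that come back.

```python
import pandas as pd


def strict_groupby_agg(df, by, cols, how):
    """Aggregate explicitly selected columns; raise if any were dropped.

    Hypothetical helper, not a pandas API. `how` is an aggregation
    name such as "sum" or "mean".
    """
    result = df.groupby(by)[cols].agg(how)
    # Any requested column absent from the result was dropped silently.
    missing = [c for c in cols if c not in result.columns]
    if missing:
        raise TypeError(
            f"columns {missing} were dropped while computing {how!r}"
        )
    return result


# All-numeric selection: nothing is dropped, so the result is returned.
numeric_df = pd.DataFrame({"id": [1, 1], "x": [1, 2], "z": [3, 4]})
totals = strict_groupby_agg(numeric_df, "id", ["x", "z"], "sum")
```

On a pandas version that drops the Decimal column from the example, the wrapper would raise instead of returning a frame without y. On versions that no longer drop columns, it simply passes the result (or the library's own error) through.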

Expected Output

#     x  y
# id   
# 1   1  1
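One possible workaround for getting both columns back, sketched under the assumption that losing Decimal precision is acceptable, is to cast the Decimal column to float before aggregating. This changes the dtype of y, so it is not a fix for the behavior itself.

```python
import pandas as pd
from decimal import Decimal

df = pd.DataFrame({"id": [1], "x": [1], "y": [Decimal(1)]})

# Cast the object-dtype Decimal column to float so it is treated as numeric.
as_float = df.assign(y=df["y"].astype(float))

# Both explicitly selected columns now survive the aggregation.
result = as_float.groupby("id")[["x", "y"]].sum()
```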

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.10.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.28.1
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Jun 28, 2018

Generally I think this would be infeasible and would break a lot of backwards compatibility (e.g. if the value of column 'y' was 'one', wouldn't you want this behavior?).

Are you more concerned specifically with the Decimal type? If so, there has been discussion about that as a community-delivered Extension Array that could help, so this would probably be a duplicate of another issue. Let me know.

WillAyd added the Needs Info label (Clarification about behavior needed to assess issue) on Jun 28, 2018
@postelrich
Author

@WillAyd I would think that if I explicitly select columns, they should be returned (or cause an error), not silently dropped.

@WillAyd
Member

WillAyd commented Jun 28, 2018

Hmm OK I see your argument there, though it would cause an inconsistency in the handling of explicit vs implicit column selection and would break backwards compatibility.

If you feel like investigating and proposing a PR it could be considered!

WillAyd added the Groupby, Error Reporting (Incorrect or improved errors from pandas), and Difficulty Intermediate labels and removed the Needs Info label (Clarification about behavior needed to assess issue) on Jun 28, 2018
@postelrich
Author

@WillAyd yeah, I'm not sure about the backwards compatibility. Maybe we could start by offering an option to toggle column dropping and adding a DeprecationWarning (or another appropriate warning) that the behavior will change.

Also, what are your thoughts about always warning when a column is dropped as opposed to being silent?
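As a rough sketch of the "always warn" idea, the check can be done up front in user code: find the non-numeric columns in the selection, warn about them, and aggregate only the rest. The function name `grouped_sum_with_warning` and the choice of UserWarning are assumptions made for illustration, not anything pandas provides.

```python
import warnings

import pandas as pd


def grouped_sum_with_warning(df, by, cols):
    """Sum `cols` per group, warning about non-numeric columns it skips.

    Hypothetical helper, not a pandas API.
    """
    numeric = df[cols].select_dtypes(include="number").columns.tolist()
    skipped = [c for c in cols if c not in numeric]
    if skipped:
        # Make the drop visible instead of silent.
        warnings.warn(
            f"skipping non-numeric columns: {skipped}",
            UserWarning,
            stacklevel=2,
        )
    return df.groupby(by)[numeric].sum()
```

Because the numeric columns are selected before the aggregation runs, this behaves the same regardless of how a given pandas version treats nuisance columns.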

@jchia

jchia commented Oct 31, 2018

This issue forms a nice pair with #17382. When your mean aggregation involves a timedelta column, the timedelta column silently disappears. This behavior is surprising to users unaware of the limitations of timedelta.
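For the timedelta case, one workaround that does not depend on how a particular pandas version handles timedelta means is to average the durations as float seconds and convert back afterwards. This is only a sketch; the column names `elapsed` and `elapsed_s` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1],
    "elapsed": pd.to_timedelta(["1s", "3s"]),
})

# Represent the durations as plain floats so mean() treats them as numeric.
seconds = df.assign(elapsed_s=df["elapsed"].dt.total_seconds())
mean_s = seconds.groupby("id")["elapsed_s"].mean()

# Convert the per-group mean back to a timedelta.
mean_td = pd.to_timedelta(mean_s, unit="s")
```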

jbrockmendel added the Reduction Operations (sum, mean, min, max, etc.) and Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply) labels on Sep 21, 2020
jreback added this to the 1.3 milestone on May 17, 2021
@ghost711

ghost711 commented Dec 5, 2021

Correct me if I'm wrong, but I thought the OP's concern was that columns shouldn't be dropped if they're explicitly specified.

Otherwise, it seems to me that the automatic dropping of nuisance columns is something that most people would want by default, with extra typing required to turn it off, not to turn it on.

At an IPython prompt, for instance, where I used to be able to just type df.sum(), we now have to type df.sum(numeric_only=True) (long enough to make me question whether I really want the answer badly enough to type it).

Regardless, almost no one will ever want their string columns summed, for instance, so it seems like the default should be to drop them, with the rare person who actually wants that behavior able to specify numeric_only=False.

This seems to me to be just like how NaNs are silently ignored when summing or similar, without throwing errors or warnings.
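The two behaviors being compared can be seen side by side in a small sketch: NaN values are skipped by default in a sum (skipna=True), while non-numeric columns are excluded only when numeric_only=True is passed explicitly. The frame and column names here are made up for illustration.

```python
import pandas as pd

s = pd.Series([1.0, float("nan"), 2.0])
total_skipping_nan = s.sum()           # NaN is skipped by default
total_with_nan = s.sum(skipna=False)   # NaN propagates when skipping is off

df = pd.DataFrame({"x": [1, 2], "label": ["a", "b"]})
# Restrict the reduction to numeric columns explicitly.
numeric_totals = df.sum(numeric_only=True)
```

Both are opt-out or opt-in switches on the same reduction; the thread is about which default the column case should have.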
