Dropping nuisance columns in groupby is a nuisance #21664
Comments
Generally I think this would be infeasible and break a lot of backwards compatibility (e.g. if the value of column 'y' was 'one', wouldn't you want this behavior?). Are you just more concerned specifically with the Decimal type? If so, there has been discussion about that as a community-delivered Extension Array that could help, so this would probably be a duplicate of another issue. Let me know.
@WillAyd I would think if I was explicitly selecting columns that they should be returned (or cause an error), not silently dropped.
Hmm OK, I see your argument there, though it would cause an inconsistency in the handling of explicit vs implicit column selection and would break backwards compatibility. If you feel like investigating and proposing a PR it could be considered!
@WillAyd yeah, not sure about the backwards compatibility. Maybe we could start by offering an option to toggle column dropping and add a DeprecationWarning (or another appropriate warning) that the behavior will change. Also, what are your thoughts about always warning when a column is dropped, as opposed to being silent?
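The "warn instead of staying silent" idea could look roughly like the sketch below. `group_mean_warn` is a hypothetical helper written for illustration, not a pandas API, and the numeric-dtype check is a simplifying assumption about which columns count as nuisance columns.

```python
import warnings

import pandas as pd
from pandas.api.types import is_numeric_dtype


def group_mean_warn(df, by, cols):
    """Hypothetical helper (not a pandas API): drop non-numeric columns, but say so."""
    # Assumption: "nuisance" is approximated as "not a numeric dtype".
    dropped = [c for c in cols if not is_numeric_dtype(df[c])]
    if dropped:
        warnings.warn(
            f"dropping non-numeric columns: {dropped}", UserWarning, stacklevel=2
        )
    kept = [c for c in cols if c not in dropped]
    return df.groupby(by)[kept].mean()


df = pd.DataFrame(
    {"x": ["a", "a", "b"], "y": ["one", "two", "three"], "z": [1.0, 2.0, 4.0]}
)

# Record warnings so the dropped column is visible to the caller.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = group_mean_warn(df, "x", ["y", "z"])
```

Here `out` keeps only `z`, and `caught` holds a warning naming `y`, so the drop is no longer silent.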
This issue forms a nice pair with #17382. When your mean aggregation involves a timedelta column, the timedelta column silently disappears. This behavior is surprising to users unaware of the limitations of timedelta.
Correct me if I'm wrong, but I thought the OP's concern was that columns shouldn't be dropped if they're explicitly specified. Otherwise, it seems to me that the automatic dropping of nuisance columns is something that most people would want by default, with extra typing required to turn it off, not to turn it on. At an IPython prompt for instance, where I used to be able to just type

Regardless, almost no one will ever want their string columns summed for instance, so it seems like the default should be to drop them, with the rare person that actually wants that behavior able to specify

This seems to me to be just like how NaNs are silently ignored when summing or similar, without throwing errors or warnings.
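For reference, the NaN behavior this comparison relies on can be checked with a short snippet. The values are made up for illustration; `skipna` is the existing pandas parameter that turns the skipping off.

```python
import math

import pandas as pd

s = pd.Series([1.0, float("nan"), 2.0])

total = s.sum()                     # NaN is skipped by default, no warning
strict_total = s.sum(skipna=False)  # opting out propagates the NaN instead
```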
Code Sample, a copy-pastable example if possible
Problem description
I unknowingly encountered the feature described here when running the above code. While I see how this can be a useful feature, it's a nuisance not knowing that it happened and that I can't disable it. I feel that in the case of doing a groupby on explicitly selected columns, `groupby(...)[COLS]`, it should not drop any columns and should let whatever errors occur be raised. I also think that a warning could be added and/or an option to disable the feature.
Expected Output
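A minimal sketch of the behavior requested above, where explicitly selected columns that cannot be aggregated raise instead of being dropped. `strict_group_mean` is a hypothetical helper for illustration, not a pandas API, and the sample frame and the numeric-dtype check are assumptions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def strict_group_mean(df, by, cols):
    """Hypothetical helper (not a pandas API): group means that never drop a requested column."""
    # Assumption: a column is aggregatable by mean only if its dtype is numeric.
    bad = [c for c in cols if not is_numeric_dtype(df[c])]
    if bad:
        raise TypeError(f"cannot take the mean of non-numeric columns: {bad}")
    return df.groupby(by)[cols].mean()


df = pd.DataFrame(
    {
        "x": ["a", "a", "b"],
        "y": ["one", "two", "three"],  # string column: a typical nuisance column
        "z": [1.0, 2.0, 4.0],
    }
)

result = strict_group_mean(df, "x", ["z"])  # numeric column: aggregated normally

try:
    strict_group_mean(df, "x", ["y", "z"])  # explicitly selected non-numeric column
    error = None
except TypeError as exc:
    error = str(exc)  # surfaced to the caller instead of silently dropped
```

With this approach the caller either gets every requested column back or an error naming the columns that could not be aggregated.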
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.10.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.28.1
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None