BUG: groupby first()/last() change datatype of all NaN columns to float; nth() preserves datatype #33591

Closed
jdmarino opened this issue Apr 16, 2020 · 5 comments
Labels: Bug, Duplicate Report (duplicate issue or pull request), Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@jdmarino

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandas.
    Version 1.0.3 from the conda distro.

Code Sample, a copy-pastable example

import pandas as pd
print('pandas version is', pd.__version__, '\n')

df = pd.DataFrame({'id':['a','b','b','c'], 'sym':['ibm','msft','msft','goog'], 'idx':range(4), 'sectype':['E','E','E','E']})
df['osi'] = df.sym.where(df.sectype=='O')  # add a column of NaNs that are object/str
print(df)
print(df.info())

# .nth(-1) does the right thing
x = df.groupby('id').nth(-1)
print('\nnth(-1)')
print(x.info())

# .last() converts the osi col to floats
x = df.groupby('id').last()
print('\nlast()')
print(x.info())

# .nth(0) does the right thing
x = df.groupby('id').nth(0)
print('\nnth(0)')
print(x.info())

# .first() converts the osi col to floats
x = df.groupby('id').first()
print('\nfirst()')
print(x.info())

Problem description

Given a DataFrame with an all-NaN column of object (str) dtype, groupby().first() converts that column to float64. The same happens with .last(), but not with .nth(0) or .nth(-1), so those can serve as workarounds.

This is a problem for me because the groupby().first() call lives in general-purpose code that is invoked repeatedly, and the results are either combined with pd.concat (producing a column with mixed types) or appended to an HDF file (causing the write to fail).
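For anyone hitting the same concat/HDF problem, one possible stopgap is to cast the aggregated result back to the input dtypes. This is only a minimal sketch: `first_preserving_dtypes` is a hypothetical helper name, not a pandas API, and it assumes every result column can be cast back to its input dtype (true for the all-NaN column in this report).

```python
import pandas as pd


def first_preserving_dtypes(df, key):
    """Hypothetical helper: groupby(key).first(), then cast back to the input dtypes."""
    result = df.groupby(key).first()
    # Map every result column to the dtype it had in the input frame.
    wanted = {col: df[col].dtype for col in result.columns}
    return result.astype(wanted)


df = pd.DataFrame(
    {
        "id": ["a", "b", "b", "c"],
        "sym": ["ibm", "msft", "msft", "goog"],
        "idx": range(4),
        "sectype": ["E", "E", "E", "E"],
    }
)
df["osi"] = df.sym.where(df.sectype == "O")  # all-NaN column sharing sym's dtype

out = first_preserving_dtypes(df, "id")
print(out.dtypes)
```

The cast is a no-op on a pandas version where first() already keeps the dtype, so the helper should be safe to leave in place across upgrades.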

Expected Output

The resulting dataframe from a groupby().first()/.last() should have the same metadata (column structure and datatypes) as the input dataframe. The result of .first()/.last() should match that of .nth(0)/.nth(-1).
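The expectation above can be turned into a small check that reports which columns changed dtype between the input and the aggregated result. This is a sketch only: `changed_dtypes` is a hypothetical helper, and what it prints depends on the installed pandas version, so no output is shown.

```python
import pandas as pd


def changed_dtypes(before, after):
    """Hypothetical helper: map column -> (input dtype, output dtype) where they differ."""
    return {
        col: (before[col].dtype, after[col].dtype)
        for col in after.columns
        if col in before.columns and before[col].dtype != after[col].dtype
    }


df = pd.DataFrame(
    {
        "id": ["a", "b", "b", "c"],
        "sym": ["ibm", "msft", "msft", "goog"],
        "idx": range(4),
        "sectype": ["E", "E", "E", "E"],
    }
)
df["osi"] = df.sym.where(df.sectype == "O")  # all-NaN column sharing sym's dtype

# An empty dict means the aggregation kept every column's dtype.
print(changed_dtypes(df, df.groupby("id").first()))
print(changed_dtypes(df, df.groupby("id").last()))
```

On a version with the behaviour reported here, the osi column would be expected to show up in both dicts.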

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.48.0

@jdmarino added the Bug and Needs Triage labels Apr 16, 2020
@dsaxton
Member

dsaxton commented Apr 17, 2020

I think this is fixed on master:

[ins] In [1]: import pandas as pd
         ...:
         ...: df = pd.DataFrame(
         ...:     {
         ...:         "id": ["a", "b", "b", "c"],
         ...:         "sym": ["ibm", "msft", "msft", "goog"],
         ...:         "sectype": ["E", "E", "E", "E"],
         ...:     }
         ...: )
         ...: df["osi"] = df.sym.where(df.sectype == "O")
         ...: print(df.dtypes)
         ...: print(df.groupby("id")["osi"].first())
         ...: print(df.groupby("id")["osi"].last())
         ...: print(pd.__version__)
         ...:
id         object
sym        object
sectype    object
osi        object
dtype: object
id
a    NaN
b    NaN
c    NaN
Name: osi, dtype: object
id
a    NaN
b    NaN
c    NaN
Name: osi, dtype: object
1.1.0.dev0+1288.g3a5ae505b

@dsaxton added the Groupby label and removed the Needs Triage label Apr 17, 2020
@mroeschke added the good first issue and Needs Tests labels and removed the Bug and Groupby labels Apr 17, 2020
@simonjayhawkins
Member

> I think this is fixed on master:

The code sample is different from the OP. The issue in OP does not appear to be fixed.

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1302.ge878fdc41'
>>>
>>> df = pd.DataFrame(
...     {
...         "id": ["a", "b", "b", "c"],
...         "sym": ["ibm", "msft", "msft", "goog"],
...         "idx": range(4),
...         "sectype": ["E", "E", "E", "E"],
...     }
... )
>>> df["osi"] = df.sym.where(df.sectype == "O")  # add a column of NaNs that are object/str
>>>
>>>
>>> x = df.groupby("id").first()
>>> x.osi.dtype
dtype('float64')
>>>
>>> x = df.groupby("id").osi.first()
>>> x.dtypes
dtype('O')
>>>

@dsaxton
Member

dsaxton commented Apr 19, 2020

@simonjayhawkins Oops, yeah I think you're right

@simonjayhawkins added the Bug, Groupby, and Missing-data labels and removed the Needs Tests and good first issue labels Apr 19, 2020
@KenilMehta
Contributor

If no one has taken this, I would like to help solve this issue.

@jreback added this to the 1.1 milestone Apr 23, 2020
@TomAugspurger
Contributor

#33627 added a test for series, but the issue remains for dataframe.

This is related to groupby calling maybe_cast_result on the output. I thought we had a better issue for it, but #17035 at least describes the issue. Going to close as a duplicate of that.

@TomAugspurger added the Duplicate Report label Jun 17, 2020
@TomAugspurger removed this from the 1.1 milestone Jun 17, 2020

7 participants