BUG: groupby first()/last() change datatype of all NaN columns to float; nth() preserves datatype #33591

jdmarino · 2020-04-16T15:01:32Z

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
Version 1.0.3 from the conda distro.

Code Sample, a copy-pastable example

import pandas as pd
print('pandas version is', pd.__version__, '\n')

df = pd.DataFrame({'id':['a','b','b','c'], 'sym':['ibm','msft','msft','goog'], 'idx':range(4), 'sectype':['E','E','E','E']})
df['osi'] = df.sym.where(df.sectype=='O')  # add a column of NaNs that are object/str
print(df)
print(df.info())

# .nth(-1) does the right thing
x = df.groupby('id').nth(-1)
print('\nnth(-1)')
print(x.info())

# .last() converts the osi col to floats
x = df.groupby('id').last()
print('\nlast()')
print(x.info())

# .nth(0) does the right thing
x = df.groupby('id').nth(0)
print('\nnth(0)')
print(x.info())

# .first() converts the osi col to floats
x = df.groupby('id').first()
print('\nfirst()')
print(x.info())

Problem description

Given a dataframe with an all-NaN column of str/objects, performing a groupby().first() will convert the all-NaN column to float. This is true for .last() as well, but not for .nth(0) and .nth(-1), so these are workarounds.

This is a problem for me as the groupby().first() is in general code that is iteratively called and the results either pd.concat'd (resulting in a column with mixed types) or appended to an hdf file (causing failure on the write).
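Besides switching to .nth(), another possible stopgap is to cast the result back to the input dtypes after the aggregation. The sketch below assumes the grouping key is a single column name; `first_keep_dtypes` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def first_keep_dtypes(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Hypothetical helper: groupby(key).first(), then restore the input dtypes."""
    result = df.groupby(key).first()
    # Cast every result column back to the dtype it had in the input frame.
    return result.astype({col: df[col].dtype for col in result.columns})


df = pd.DataFrame({
    "id": ["a", "b", "b", "c"],
    "sym": ["ibm", "msft", "msft", "goog"],
    "idx": range(4),
    "sectype": ["E", "E", "E", "E"],
})
df["osi"] = df.sym.where(df.sectype == "O")  # all-NaN column

out = first_keep_dtypes(df, "id")
print(out.dtypes)
```

The cast is a no-op on versions where the dtype is already preserved, so it should be safe to leave in place in code that feeds pd.concat or an HDF append.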

Expected Output

The resulting dataframe from a groupby().first()/.last() should have the same metadata (column structure and datatypes) as the input dataframe. The result of .first()/.last() should match that of .nth(0)/.nth(-1) .
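To make the expectation checkable, here is a rough sketch that reports which columns changed dtype between the input and an aggregated result. `changed_dtypes` is a hypothetical helper name, and the small frame is only an illustration, not the data from my real code.

```python
import pandas as pd


def changed_dtypes(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Hypothetical helper: map column -> (dtype before, dtype after) where they differ."""
    shared = [c for c in after.columns if c in before.columns]
    return {
        c: (before[c].dtype, after[c].dtype)
        for c in shared
        if before[c].dtype != after[c].dtype
    }


df = pd.DataFrame({"id": ["a", "b"], "idx": [0, 1]})
df["osi"] = pd.Series([None, None], dtype=object)  # all-missing object column

# Each call prints an empty dict when the aggregation preserves dtypes.
print(changed_dtypes(df, df.groupby("id").first()))
print(changed_dtypes(df, df.groupby("id").last()))
```

With the behavior I expect, both calls would print an empty dict.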

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.48.0


dsaxton · 2020-04-17T17:17:22Z

I think this is fixed on master:

[ins] In [1]: import pandas as pd
         ...:
         ...: df = pd.DataFrame(
         ...:     {
         ...:         "id": ["a", "b", "b", "c"],
         ...:         "sym": ["ibm", "msft", "msft", "goog"],
         ...:         "sectype": ["E", "E", "E", "E"],
         ...:     }
         ...: )
         ...: df["osi"] = df.sym.where(df.sectype == "O")
         ...: print(df.dtypes)
         ...: print(df.groupby("id")["osi"].first())
         ...: print(df.groupby("id")["osi"].last())
         ...: print(pd.__version__)
         ...:
id         object
sym        object
sectype    object
osi        object
dtype: object
id
a    NaN
b    NaN
c    NaN
Name: osi, dtype: object
id
a    NaN
b    NaN
c    NaN
Name: osi, dtype: object
1.1.0.dev0+1288.g3a5ae505b

simonjayhawkins · 2020-04-19T15:38:17Z

> I think this is fixed on master:

The code sample is different from the OP. The issue in OP does not appear to be fixed.

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1302.ge878fdc41'
>>>
>>> df = pd.DataFrame(
...     {
...         "id": ["a", "b", "b", "c"],
...         "sym": ["ibm", "msft", "msft", "goog"],
...         "idx": range(4),
...         "sectype": ["E", "E", "E", "E"],
...     }
... )
>>> df["osi"] = df.sym.where(df.sectype == "O")  # add a column of NaNs that are object/str
>>>
>>>
>>> x = df.groupby("id").first()
>>> x.osi.dtype
dtype('float64')
>>>
>>> x = df.groupby("id").osi.first()
>>> x.dtypes
dtype('O')
>>>

dsaxton · 2020-04-19T16:07:49Z

@simonjayhawkins Oops, yeah I think you're right

KenilMehta · 2020-04-21T13:33:04Z

If no one has taken this, I would like to help solve this issue.

TomAugspurger · 2020-06-17T18:54:37Z

#33627 added a test for series, but the issue remains for dataframe.

This is related to groupby calling maybe_cast_result on the output. I thought we had a better issue for it, but #17035 at least describes the issue. Going to close as a duplicate of that.

jdmarino added the Bug and Needs Triage labels Apr 16, 2020

dsaxton added the Groupby label and removed the Needs Triage label Apr 17, 2020

mroeschke added the good first issue and Needs Tests labels and removed the Bug and Groupby labels Apr 17, 2020

jamescobonkerr mentioned this issue Apr 18, 2020 in "TST: Groupby first/last/nth nan column test #33627" (merged)

simonjayhawkins added the Bug, Groupby, and Missing-data labels and removed the Needs Tests and good first issue labels Apr 19, 2020

jreback added this to the 1.1 milestone Apr 23, 2020

TomAugspurger closed this as completed Jun 17, 2020

TomAugspurger added the Duplicate Report label Jun 17, 2020

TomAugspurger removed this from the 1.1 milestone Jun 17, 2020
