combine_first not retaining dtypes #7509

Closed
altaurog opened this issue Jun 19, 2014 · 15 comments · Fixed by #39051
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@altaurog

altaurog commented Jun 19, 2014

I found a number of issues that seemed related, all closed over a year ago, but there still seem to be some inconsistencies here:

In [1]: from datetime import datetime
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.13.1'
In [4]: dfa = pd.DataFrame([[datetime.now(), 2]], columns=['a','b'])
In [5]: dfb = pd.DataFrame([[4],[5]], columns=['b'])
In [6]: dfa.dtypes
Out[6]: 
a    datetime64[ns]
b             int64
dtype: object
In [7]: dfb.dtypes
Out[7]: 
b    int64
dtype: object
In [8]: # int64 becomes float64 when combining the two frames
In [9]: dfa.combine_first(dfb).dtypes
Out[9]: 
a    datetime64[ns]
b           float64
dtype: object
In [10]: # datetime64[ns] becomes float64 if the first frame is empty
In [11]: dfa.iloc[:0].combine_first(dfb).dtypes
Out[11]: 
a    float64
b      int64
dtype: object
@jreback
Contributor

jreback commented Jun 19, 2014

can you ref the issues?

@altaurog
Author

My search turned up issues #3041, #3043, #3552, and #3555

(whoa, github's auto-complete on those issue numbers seems totally unrelated...)

@jreback
Contributor

jreback commented Jun 19, 2014

none of those address empty types (and thus there are prob no tests).

care to do a pull-request to put in tests/fix?

@jreback jreback added this to the 0.15.0 milestone Jun 19, 2014
@altaurog
Author

I haven't looked under the hood; I've no clue how to go about fixing this. Also, there seem to be two separate problems here (perhaps I should have opened two issues):

  1. Specifically when the two DataFrames are both NOT empty, it changes int64 to float64.
  2. When the one with the datetime column IS empty, the datetime dtype is not preserved.

@jreback
Contributor

jreback commented Jun 19, 2014

the first is not feasible to fix, since int CANNOT hold nan. It is very tricky to convert this to float, then convert back if necessary (and hits perf).

The 2nd could be fixed (as datetime64[ns] CAN hold na (via NaT))

why don't you write up some tests then... gets you started :)
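
The distinction drawn above can be sketched in a few lines. This is only an illustration with made-up Series names, not code from the thread: reindexing an int64 Series onto a label it lacks has to introduce NaN, which forces float64, whereas a datetime column can mark the gap with NaT and keep its dtype.

```python
import numpy as np
import pandas as pd

# An int64 Series reindexed onto a missing label needs NaN, so it becomes float64.
ints = pd.Series([1, 2], index=[0, 1])
widened = ints.reindex([0, 1, 2])

# A datetime Series can represent the missing slot as NaT and keep its dtype.
stamps = pd.Series(pd.to_datetime(["2014-06-19", "2014-06-20"]), index=[0, 1])
padded = stamps.reindex([0, 1, 2])

print(ints.dtype, "->", widened.dtype)
print(stamps.dtype, "->", padded.dtype)
```
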

@altaurog
Author

Oh, of course. The float64 is needed for the nan. But if nan does not result, the int64 is retained:

In [3]: dfa = pd.DataFrame([[datetime.now(), 2]], columns=['a','b'])
In [4]: dfb = pd.DataFrame([[4]], columns=['b'])
In [5]: dfa.combine_first(dfb).dtypes
Out[5]: 
a    datetime64[ns]
b             int64
dtype: object

I should have thought of that.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@VelizarVESSELINOV

When you call combine_first, the original float32 column is converted to float64:

"""Example of Pandas bug."""
from pandas import DataFrame
from numpy import float32

print('-' * 15)
d1 = DataFrame(index=[0])
d1['A'] = [3.5]
d1['A'] = d1['A'].astype(float32)
print(d1.dtypes)
print('-' * 15)
d2 = DataFrame(index=[0])
d2['B'] = [35]  # commenting out this line gives the correct result, which makes no sense to me
d2 = d2.combine_first(d1)
print(d2.dtypes)

Current output:

---------------
A    float32
dtype: object
---------------
A    float64
B      int64
dtype: object

Expected output:

---------------
A    float32
dtype: object
---------------
A    float32
B      int64
dtype: object

jreback pushed a commit that referenced this issue Aug 12, 2016
closes #7630
closes #10567
closes #13947

xref #7509

Author: sinhrks <sinhrks@gmail.com>

Closes #13970 from sinhrks/combine_bug and squashes the following commits:

2046cb5 [sinhrks] BUG/DEPR: combine dtype fixes
@lsorber

lsorber commented Mar 23, 2020

This is still a bug in 2020:

>>> dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
>>> dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])
>>> dfa.combine_first(dfb).dtypes  # Expect to see int64
A    float64
dtype: object

Any plans on addressing this issue?

@jreback
Contributor

jreback commented Mar 23, 2020

pandas is an all volunteer project.

if you would like to submit a PR then one of the volunteers would be able to code review

@danielhrisca
Contributor

pandas is an all volunteer project.

if you would like to submit a PR then one of the volunteers would be able to code review

I think this is a serious issue. If I had the knowledge to fix it myself I would. Maybe one of the core developers can have a look.

@jreback
Contributor

jreback commented Jan 8, 2021

@danielhrisca there are many 'serious issues'; again, pandas is all volunteer and there are quite a number of open issues.
I cannot instruct anyone to look at anything in particular.

@danielhrisca
Contributor

There is a dtype promotion going on in take_nd

Does it make sense to try there to apply the older dtype or just before returning from combine_first?
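
One way to picture the second option, restoring the earlier dtype just before returning, is a thin wrapper around combine_first. This is only a sketch under stated assumptions: `combine_first_keep_dtypes` is a hypothetical helper, not pandas API and not the change that was eventually merged. It casts a column back only when every input that has the column agrees on the dtype and the combined result contains no missing values.

```python
import numpy as np
import pandas as pd


def combine_first_keep_dtypes(left, right):
    """Hypothetical helper: combine_first, then undo avoidable dtype promotion."""
    combined = left.combine_first(right)
    for col in combined.columns:
        # Dtypes of this column in whichever inputs actually contain it.
        source_dtypes = [df[col].dtype for df in (left, right) if col in df.columns]
        target = source_dtypes[0]
        same_dtype = all(dtype == target for dtype in source_dtypes)
        # Cast back only when it is lossless: no NaN/NaT left in the column.
        if same_dtype and combined[col].dtype != target and combined[col].notna().all():
            combined[col] = combined[col].astype(target)
    return combined


dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])
result = combine_first_keep_dtypes(dfa, dfb)
print(result.dtypes)
```

The cast happens after the fact, so the intermediate float64 promotion (and its cost) still occurs; the sketch only hides it from the caller.
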

@danielhrisca
Contributor

On a second look the root cause seems to be that the alignment is done with fill_value=None (here https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L6368) which promotes all numeric columns to f8.

I think that if we check that the two DataFrames have identical columns, then we can provide a better fill_value (here https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L6485) which then should be applied at the above-mentioned line. This reasoning is valid since we can be sure that in the end there will be no newly inserted nan values after combining the dataframes.
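
The promotion described above can be reproduced outside combine_first with the public `DataFrame.align` method. This is a minimal sketch, not the internal code path: the frames are the ones from the earlier report, and the fill value of 0 is an arbitrary placeholder chosen for illustration.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])

# Default alignment pads the missing labels with NaN, so int64 is promoted.
left_nan, right_nan = dfa.align(dfb, join="outer")

# An integer fill_value lets both aligned frames stay int64.
left_int, right_int = dfa.align(dfb, join="outer", fill_value=0)

print(left_nan.dtypes["A"], left_int.dtypes["A"])
```

A placeholder fill value is only safe under the condition stated above: every padded slot must later be overwritten by a real value from the other frame, otherwise the placeholder would leak into the result.
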

@jreback what do you think?

@jreback
Contributor

jreback commented Jan 8, 2021

sure can check if they have identical columns then no action, also find_common_dtype (the pandas one) would be useful here.
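
To illustrate what a common-dtype lookup provides, here is a sketch that uses NumPy's public `np.result_type` as a stand-in. The pandas-internal helper mentioned above is not called here, and it may treat extension dtypes differently; the variable names are illustrative only.

```python
import numpy as np

# Two columns with the same dtype need no promotion at all.
same = np.result_type(np.dtype("float32"), np.dtype("float32"))

# Mixing int64 with float32 needs a wider float to hold both.
mixed = np.result_type(np.dtype("int64"), np.dtype("float32"))

# Two integer widths resolve to the larger integer type, not to float.
ints = np.result_type(np.dtype("int32"), np.dtype("int64"))

print(same, mixed, ints)
```
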

@danielhrisca
Contributor

@jreback
#39051 is my first stab at trying to preserve dtypes for DataFrames with the same columns. I'll add some tests today.

@jreback jreback removed this from the Contributions Welcome milestone Jan 14, 2021
@jreback jreback added this to the 1.3 milestone Jan 14, 2021
5 participants