Fix TypeError when merging categorical dates #16986

Closed
wants to merge 10 commits into from
13 changes: 9 additions & 4 deletions pandas/core/reshape/merge.py
@@ -877,7 +877,7 @@ def _get_merge_keys(self):
return left_keys, right_keys, join_names

def _maybe_coerce_merge_keys(self):
- # we have valid mergee's but we may have to further
+ # we have valid mergees but we may have to further
# coerce these if they are originally incompatible types
#
# for example if these are categorical, but are not dtype_equal
@@ -894,6 +894,7 @@ def _maybe_coerce_merge_keys(self):
if is_categorical_dtype(lk) and is_categorical_dtype(rk):
if lk.is_dtype_equal(rk):
continue

elif is_categorical_dtype(lk) or is_categorical_dtype(rk):
pass

@@ -904,7 +905,7 @@ def _maybe_coerce_merge_keys(self):
# kinds to proceed, eg. int64 and int8
# further if we are object, but we infer to
# the same, then proceed
- if (is_numeric_dtype(lk) and is_numeric_dtype(rk)):
+ if is_numeric_dtype(lk) and is_numeric_dtype(rk):
if lk.dtype.kind == rk.dtype.kind:
continue
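
The "same kind" rule in the comment above can be illustrated with plain NumPy dtypes. This is only a sketch: `same_numeric_kind` is a hypothetical helper, not a pandas function, and the set of kind codes it treats as numeric is an assumption made for the illustration.

```python
import numpy as np

# Kind codes assumed to count as numeric for this illustration:
# b = bool, i = signed int, u = unsigned int, f = float, c = complex.
NUMERIC_KINDS = "biufc"


def same_numeric_kind(left_dtype, right_dtype):
    """Return True when both dtypes are numeric and share a dtype kind."""
    left = np.dtype(left_dtype)
    right = np.dtype(right_dtype)
    both_numeric = left.kind in NUMERIC_KINDS and right.kind in NUMERIC_KINDS
    return both_numeric and left.kind == right.kind


# int64 and int8 differ in width but share kind "i", so no coercion is needed.
ints_match = same_numeric_kind("int64", "int8")
# int64 and float64 are both numeric but have different kinds.
int_float_match = same_numeric_kind("int64", "float64")
```

Under this rule only the width may differ between the two keys; any other mismatch falls through to the later checks.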

@@ -915,11 +916,15 @@ def _maybe_coerce_merge_keys(self):
# Houston, we have a problem!
# let's coerce to object
if name in self.left.columns:
+ cat = is_categorical_dtype(lk)
+ typ = lk.categories.dtype if cat else object
Contributor

I think you can do this above L898, e.g. something like

lk_to = object
rk_to = object
....
# L898
elif is_categorical_dtype(lk) or is_categorical_dtype(rk):
    if is_categorical_dtype(lk):
        lk_to = lk.categories.dtype
    if is_categorical_dtype(rk):
        rk_to = rk.categories.dtype

then use lk_to and rk_to where you used typ

Contributor Author

Not in that elif block, but it could be done in the one above:

            if is_categorical_dtype(lk) and is_categorical_dtype(rk):
                if lk.is_dtype_equal(rk):
                    continue

                lk_to = lk.categories.dtype
                rk_to = rk.categories.dtype

but that doesn't seem cleaner to me - if we spread the lk_to/rk_to logic all over the method, then we make it much more difficult to debug compared with having all of the coercion logic in the block at the bottom where the coercion happens.

Contributor

ok, pls add an instructive comment block here explaining what is going on.

Contributor Author

Done + minor tweaks to avoid calling is_categorical_dtype repeatedly.

self.left = self.left.assign(
- **{name: self.left[name].astype(object)})
+ **{name: self.left[name].astype(typ)})
if name in self.right.columns:
+ cat = is_categorical_dtype(rk)
+ typ = rk.categories.dtype if cat else object
self.right = self.right.assign(
- **{name: self.right[name].astype(object)})
+ **{name: self.right[name].astype(typ)})
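
The coercion rule in this hunk (cast a categorical key to the dtype of its categories, and anything else to object) can be sketched outside the merge machinery. This is a minimal illustration, not the code from the PR: `coercion_target` is a hypothetical helper, and it works on a `Series` through `dtype.categories` instead of the internal key arrays that `_maybe_coerce_merge_keys` handles.

```python
from datetime import date

import pandas as pd


def coercion_target(key):
    """Pick the dtype an incompatible merge key would be cast to.

    Hypothetical helper: categorical keys fall back to the dtype of
    their categories, every other key falls back to object.
    """
    if isinstance(key.dtype, pd.CategoricalDtype):
        return key.dtype.categories.dtype
    return object


# A categorical of datetime.date values has object-dtype categories,
# so casting with the target keeps the original date objects.
dates = pd.Series([date(2001, 1, 1), date(2001, 1, 2)]).astype("category")
coerced_dates = dates.astype(coercion_target(dates))

# A categorical of integers keeps its integer dtype when coerced.
ints = pd.Series([1, 2, 3]).astype("category")
coerced_ints = ints.astype(coercion_target(ints))
```

Keeping the decision in one helper mirrors the author's argument in the thread above: the choice of target dtype sits next to the place where the cast happens.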

def _validate_specification(self):
# Hm, any way to make this logic less complicated??
29 changes: 28 additions & 1 deletion pandas/tests/reshape/test_merge.py
@@ -1,7 +1,7 @@
# pylint: disable=E1103

import pytest
- from datetime import datetime
+ from datetime import datetime, date
from numpy.random import randn
from numpy import nan
import numpy as np
@@ -1515,6 +1515,33 @@ def test_self_join_multiple_categories(self):

assert_frame_equal(result, df)

def test_categorical_dates(self):
# GH 16900
# dates should not be coerced to ints

df = pd.DataFrame(
[[date(2001, 1, 1), 1.1],
[date(2001, 1, 2), 1.3]],
columns=['date', 'num2']
)
df['date'] = df['date'].astype('category')

df2 = pd.DataFrame(
[[date(2001, 1, 1), 1.3],
Contributor

need testing on inner as well

Contributor Author

Done

Contributor

use parametrize instead of duplicating code here

Contributor

if you can do this

Contributor Author

this has been updated as per your previous comment:

construct the expected result and use tm.assert_frame_equal for both examples

did you want it changed to use parametrize instead?

[date(2001, 1, 3), 1.4]],
columns=['date', 'num4']
)
df2['date'] = df2['date'].astype('category')

result = pd.merge(df, df2, how='outer', on=['date'])
assert result.shape == (3, 3)
Contributor

construct the expected result and use tm.assert_frame_equal for both examples

assert result['date'].iloc[0] == pd.Timestamp('2001-01-01')
assert result['date'].iloc[-1] == pd.Timestamp('2001-01-03')

result_inner = pd.merge(df, df2, how='inner', on=['date'])
assert result_inner.shape == (1, 3)
assert result_inner['date'].iloc[-1] == pd.Timestamp('2001-01-01')
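
The two review requests on this test (build the expected frame and compare whole frames, and use parametrize instead of duplicating the body) could be combined roughly as follows. This is a sketch, not the test from the PR: it uses the public `pd.testing.assert_frame_equal` where the review mentions `tm.assert_frame_equal`, and the expected rows assume a pandas version in which the coerced key column holds the original `date` objects and an outer merge returns the keys in sorted order.

```python
from datetime import date

import pandas as pd
import pytest

NAN = float("nan")


@pytest.mark.parametrize(
    "how, expected_rows",
    [
        # Assumed outer result: all three dates, missing values as NaN.
        ("outer", [
            [date(2001, 1, 1), 1.1, 1.3],
            [date(2001, 1, 2), 1.3, NAN],
            [date(2001, 1, 3), NAN, 1.4],
        ]),
        # Assumed inner result: only the date present on both sides.
        ("inner", [[date(2001, 1, 1), 1.1, 1.3]]),
    ],
)
def test_categorical_dates(how, expected_rows):
    left = pd.DataFrame(
        [[date(2001, 1, 1), 1.1], [date(2001, 1, 2), 1.3]],
        columns=["date", "num2"],
    )
    left["date"] = left["date"].astype("category")

    right = pd.DataFrame(
        [[date(2001, 1, 1), 1.3], [date(2001, 1, 3), 1.4]],
        columns=["date", "num4"],
    )
    right["date"] = right["date"].astype("category")

    result = pd.merge(left, right, how=how, on=["date"])
    expected = pd.DataFrame(expected_rows, columns=["date", "num2", "num4"])
    pd.testing.assert_frame_equal(result, expected)
```

Comparing whole frames also checks the shape and the column dtypes, which the separate `shape` and `iloc` assertions only cover in part.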


@pytest.fixture
def left_df():