-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/API: .merge() and .join() on category dtype columns will now preserve category dtype #15321
Conversation
cc @amelio-vazquez-reina as you have posted issues about this in the past. |
Codecov Report
@@ Coverage Diff @@
## master #15321 +/- ##
==========================================
+ Coverage 91.01% 91.02% +<.01%
==========================================
Files 143 143
Lines 49376 49338 -38
==========================================
- Hits 44941 44909 -32
+ Misses 4435 4429 -6
Continue to review full report at Codecov.
|
@jorisvandenbossche if you'd have a look (mainly at the tests). |
Do you want to add an asv for this? Probably a separate issue, but should we be warning about the implicit conversion to |
@chris-b1 yes I should add an asv. that's easy.
hmm, we don't do this generally on merges, though maybe we should. Can you open another issue (with some examples), even things like string/int merging should be warn/raised I think. Maybe need a |
@chris-b1 added benchmark. |
c7008a0
to
4c67377
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look at the tests, and added few questions.
I have to say that the tests are a bit hard to follow due to all the asigning and manipulation of the original left/right frames
pandas/tests/test_categorical.py
Outdated
tm.assert_frame_equal(result, expected) | ||
|
||
# cat-object | ||
cleft = left.copy() | ||
cleft['b'] = cleft['b'].astype('category') | ||
result = pd.merge(cleft, cright, how='left', left_on='b', right_on='c') | ||
expected['b'] = expected['b'].astype('category') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one I am not sure about, as it is maybe a bit of a corner case. As the both columns b and c on which is merged don't have the same categories, the resulting merged column should be object?
pandas/tools/tests/test_merge.py
Outdated
assert_series_equal(result, expected) | ||
|
||
# swap the categories and ordered on one | ||
# but should still work (end return categorical) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this one should work?
For concat it returns object (as the categories are not identical)
pandas/tools/tests/test_merge.py
Outdated
np.dtype('O'), | ||
np.dtype('O')], | ||
index=['X', 'Y', 'Z']) | ||
assert_series_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a test here added with actual different categories that results in object type of the merge column X
so after looking at this there are some cases that we ought to consider. assume we are joining with
So I think these follow our rules exactly from here, on the bottom |
Ah, yes, I didn't consider the different join operations.
For inner and outer, I would follow the same rules as |
@jorisvandenbossche ok revised. If you'd play with this when you have a chance. |
89221bd
to
cfeafbb
Compare
@jorisvandenbossche if you have a chance |
7fc1084
to
80c4961
Compare
70574a7
to
16348be
Compare
@jorisvandenbossche if you'd have a look. going to merge soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I would just mention this somewhere in the merging and/or categorical docs
pandas/tests/types/test_common.py
Outdated
|
||
# strict w.r.t. datetime64 | ||
assert not is_dtype_equal(dtypes['dt_tz'], | ||
pandas_dtype('datetime64[ns, CET]')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason there are no categorical dtype in this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no will add
…erve the category dtype when possible closes pandas-dev#10409
a2d9c20
to
16e2fbe
Compare
Some feedback. I had this -- 1.5 year old -- function that I have used to compensate for the bug # NOTE: This is a temporary fix due to bug
# https://github.com/pydata/pandas/issues/10409
# Remove when that bug is fixed
import pandas.api.types as pdtypes
def preserve_categories(ref, other):
for col in ref.columns & other.columns:
if pdtypes.is_categorical_dtype(ref[col]):
other[col] = other[col].astype(
'category', categories=ref[col].cat.categories) I have removed it and my tests pass. |
will facilitate some changes in ``tools/merge`` w.r.t. pandas-dev#15321, plus these are independent anyhow. Author: Jeff Reback <jeff@reback.net> Closes pandas-dev#15358 from jreback/concat and squashes the following commits: ba34c51 [Jeff Reback] CLN: strip out and form tools/concat.py from tools/merge.py
…erve category dtype closes pandas-dev#10409 Author: Jeff Reback <jeff@reback.net> Closes pandas-dev#15321 from jreback/merge_cat and squashes the following commits: 3671dad [Jeff Reback] DOC: merge docs a4b2ee6 [Jeff Reback] BUG/API: .merge() and .join() on category dtype columns will now preserve the category dtype when possible
closes #10409