BUG/API: .merge() and .join() on category dtype columns will now preserve category dtype #15321

Closed
wants to merge 2 commits into from


jreback
Contributor

@jreback jreback commented Feb 6, 2017

closes #10409

    before     after       ratio
  [6d2293f7] [c352573f]
-  410.11ms   260.46ms      0.64  join_merge.MergeCategoricals.time_merge_cat

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback jreback added the Bug and Categorical labels Feb 6, 2017
@jreback jreback added this to the 0.20.0 milestone Feb 6, 2017
@jreback
Contributor Author

jreback commented Feb 6, 2017

cc @amelio-vazquez-reina
cc @psychemedia
cc @watercrossing

as you have posted issues about this in the past.

@codecov-io

codecov-io commented Feb 7, 2017

Codecov Report

Merging #15321 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15321      +/-   ##
==========================================
+ Coverage   91.01%   91.02%   +<.01%     
==========================================
  Files         143      143              
  Lines       49376    49338      -38     
==========================================
- Hits        44941    44909      -32     
+ Misses       4435     4429       -6

Impacted Files               Coverage Δ
pandas/tools/merge.py        93.61% <100%> (+1.5%)
pandas/core/internals.py     93.64% <100%> (ø)
pandas/io/gbq.py             25% <0%> (-58.34%)
pandas/core/frame.py         97.87% <0%> (-0.1%)
pandas/sparse/array.py       91.42% <0%> (-0.05%)
pandas/indexes/base.py       96.1% <0%> (-0.03%)
pandas/sparse/frame.py       96.69% <0%> (ø)
pandas/core/generic.py       96.25% <0%> (ø)
pandas/util/testing.py       81.11% <0%> (+0.18%)
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5dee1f1...3671dad.

@jreback
Contributor Author

jreback commented Feb 7, 2017

@jorisvandenbossche if you'd have a look (mainly at the tests).

@chris-b1
Contributor

chris-b1 commented Feb 7, 2017

Do you want to add an asv for this?

Probably a separate issue, but should we be warning about the implicit conversion to object when the categories don't match?

@jreback
Contributor Author

jreback commented Feb 7, 2017

@chris-b1 yes I should add an asv. that's easy.

Probably a separate issue, but should we be warning about the implicit conversion to object when the categories don't match?

Hmm, we don't do this generally on merges, though maybe we should. Can you open another issue (with some examples)? Even things like string/int merging should warn or raise, I think. Maybe we need an errors='warn|raise' option generally in merge.
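
To make the case under discussion concrete, here is a minimal sketch, assuming a pandas release that includes this change. The frame and column names are made up for illustration. It merges on a categorical key twice, once with identical categories on both sides and once with differing categories, and prints the dtype of the resulting key column.

```python
import pandas as pd

# Illustrative frames; the names and values are assumptions, not from the PR's tests.
left = pd.DataFrame(
    {"key": pd.Categorical(["a", "b"], categories=["a", "b"]), "x": [1, 2]}
)
right_same = pd.DataFrame(
    {"key": pd.Categorical(["a", "b"], categories=["a", "b"]), "y": [3, 4]}
)
right_other = pd.DataFrame(
    {"key": pd.Categorical(["a", "c"], categories=["a", "c"]), "y": [3, 4]}
)

# Identical categories on both sides: the key column stays categorical.
same = pd.merge(left, right_same, on="key")

# Differing categories: the key column falls back to the dtype of the
# underlying values instead of staying categorical.
mixed = pd.merge(left, right_other, on="key")

print(same["key"].dtype)
print(mixed["key"].dtype)
```

The second merge is the implicit conversion raised above: the result no longer carries a category dtype on the key.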

@jreback
Contributor Author

jreback commented Feb 7, 2017

@chris-b1 added benchmark.
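
For readers unfamiliar with asv, a benchmark of this kind might look roughly like the sketch below. The class and method name `MergeCategoricals.time_merge_cat` come from the asv output at the top of this thread; the data sizes, column names, and the `time_merge_object` comparison method are assumptions for illustration and not necessarily what was committed.

```python
import numpy as np
import pandas as pd


class MergeCategoricals:
    """asv-style benchmark: setup() builds the data, time_* methods are timed."""

    def setup(self):
        n = 2_000  # assumed size, kept small so the sketch runs quickly
        keys = [f"k{i}" for i in range(50)]
        rng = np.random.default_rng(0)

        # Plain (non-categorical) key columns.
        self.left_object = pd.DataFrame(
            {"X": rng.choice(keys, size=n), "Y": rng.integers(0, 100, size=n)}
        )
        self.right_object = pd.DataFrame(
            {"X": rng.choice(keys, size=n), "Z": rng.integers(0, 100, size=n)}
        )

        # Same data with the key cast to one shared categorical dtype,
        # so categories match exactly on both sides.
        cat_type = pd.CategoricalDtype(categories=keys)
        self.left_cat = self.left_object.astype({"X": cat_type})
        self.right_cat = self.right_object.astype({"X": cat_type})

    def time_merge_object(self):
        pd.merge(self.left_object, self.right_object, on="X")

    def time_merge_cat(self):
        pd.merge(self.left_cat, self.right_cat, on="X")
```

asv discovers the `time_*` methods and calls `setup` before timing each of them, so the frame construction is excluded from the measurement.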

@jreback jreback force-pushed the merge_cat branch 2 times, most recently from c7008a0 to 4c67377 Compare February 7, 2017 16:15
Member

@jorisvandenbossche jorisvandenbossche left a comment

Took a look at the tests, and added a few questions.
I have to say that the tests are a bit hard to follow due to all the assigning and manipulation of the original left/right frames.

tm.assert_frame_equal(result, expected)

# cat-object
cleft = left.copy()
cleft['b'] = cleft['b'].astype('category')
result = pd.merge(cleft, cright, how='left', left_on='b', right_on='c')
expected['b'] = expected['b'].astype('category')
Member

This one I am not sure about, as it is maybe a bit of a corner case. As the columns b and c being merged on don't have the same categories, shouldn't the resulting merged column be object?

assert_series_equal(result, expected)

# swap the categories and ordered on one
# but should still work (and return categorical)
Member

I don't think this one should work?
For concat it returns object (as the categories are not identical)

np.dtype('O'),
np.dtype('O')],
index=['X', 'Y', 'Z'])
assert_series_equal(result, expected)
Member

Should a test be added here with actually different categories, resulting in object dtype for the merge column X?

@jreback
Contributor Author

jreback commented Feb 7, 2017

So after looking at this, there are some cases that we ought to consider.

Assume we are joining left and right on a single key (it can be the index or not, but ultimately these all come down to joining left and right on a single column, for now).

  • how='left': will be the dtype of the left, regardless of right
  • how='right': will be the dtype of the right, regardless of left
  • how='inner': if left and right match on dtype (IOW, if categorical, then categories & ordered must match), OR right is object, then we can return a categorical of the left flavor; otherwise it must be object
  • how='outer': I think the same as inner; these must match exactly (or be object that has categories that are fully contained within left).

So I think these follow our rules exactly from here, on the bottom

@jorisvandenbossche
Member

Ah, yes, I didn't consider the different join operations.

how='inner': if left and right are match on dtype (IOW if categorical then categories & ordered must match), OR right is object, then can return categorical of left flavor, otherwise must be object

For inner and outer, I would follow the same rules as concat: this means only retaining category dtype if categories and orderedness match exactly, so for both left and right frame.
(the "be object that has categories that are fully contained within the other" rule is something we removed from concat, so I wouldn't introduce that again?)
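
As a reference point for the concat rule mentioned here, a minimal sketch (series names and values are made up for illustration): concatenating categoricals keeps the category dtype when the dtypes match exactly and drops it when the categories differ.

```python
import pandas as pd

# Two series sharing exactly the same categorical dtype.
a = pd.Series(["x", "y"], dtype=pd.CategoricalDtype(["x", "y"]))
b = pd.Series(["y", "x"], dtype=pd.CategoricalDtype(["x", "y"]))

# A series whose categories differ from those of `a`.
c = pd.Series(["x", "z"], dtype=pd.CategoricalDtype(["x", "z"]))

same = pd.concat([a, b], ignore_index=True)       # dtypes match exactly
different = pd.concat([a, c], ignore_index=True)  # categories differ

print(same.dtype)
print(different.dtype)
```

Applying the same rule to inner and outer merges would mean the key column is categorical only in the first situation.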

jreback added a commit that referenced this pull request Feb 9, 2017
will facilitate some changes in ``tools/merge`` w.r.t. #15321, plus
these are independent anyhow.

Author: Jeff Reback <jeff@reback.net>

Closes #15358 from jreback/concat and squashes the following commits:

ba34c51 [Jeff Reback] CLN: strip out and form tools/concat.py from tools/merge.py
@jreback
Contributor Author

jreback commented Feb 9, 2017

@jorisvandenbossche ok revised. If you'd play with this when you have a chance.

@jreback
Contributor Author

jreback commented Feb 17, 2017

@jorisvandenbossche if you have a chance

@jreback jreback force-pushed the merge_cat branch 2 times, most recently from 7fc1084 to 80c4961 Compare March 2, 2017 14:53
@jreback jreback force-pushed the merge_cat branch 6 times, most recently from 70574a7 to 16348be Compare March 10, 2017 14:19
@jreback
Contributor Author

jreback commented Mar 10, 2017

@jorisvandenbossche if you'd have a look. going to merge soon.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good!

I would just mention this somewhere in the merging and/or categorical docs


# strict w.r.t. datetime64
assert not is_dtype_equal(dtypes['dt_tz'],
pandas_dtype('datetime64[ns, CET]'))
Member

Is there a reason there is no categorical dtype in this test?

Contributor Author

No, will add.

@jreback jreback force-pushed the merge_cat branch 2 times, most recently from a2d9c20 to 16e2fbe Compare March 10, 2017 22:06
@has2k1
Contributor

has2k1 commented Mar 10, 2017

Some feedback. I had this 1.5-year-old function that I used to compensate for the bug:

# NOTE: This is a temporary fix due to bug
# https://github.com/pydata/pandas/issues/10409
# Remove when that bug is fixed
import pandas.api.types as pdtypes

def preserve_categories(ref, other):
    for col in ref.columns & other.columns:
        if pdtypes.is_categorical_dtype(ref[col]):
            other[col] = other[col].astype(
                'category', categories=ref[col].cat.categories)

I have removed it and my tests pass.
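
For anyone still on a pandas version without this fix, note that the `astype('category', categories=...)` call and the `&` operator on column indexes used above are no longer accepted by recent pandas releases. A rough equivalent using `pd.CategoricalDtype` is sketched below; it assumes a pandas version that provides `pd.CategoricalDtype` and keeps the behavior of the original (categories are copied, orderedness is not).

```python
import pandas as pd


def preserve_categories(ref: pd.DataFrame, other: pd.DataFrame) -> None:
    """Cast columns of `other` (in place) to the categories used in `ref`.

    Only columns present in both frames and categorical in `ref` are touched.
    """
    for col in ref.columns.intersection(other.columns):
        if isinstance(ref[col].dtype, pd.CategoricalDtype):
            # Reuse the categories from the reference frame; like the original
            # helper, this does not carry over the `ordered` flag.
            dtype = pd.CategoricalDtype(categories=ref[col].cat.categories)
            other[col] = other[col].astype(dtype)
```

Values in `other` that are not among the reference categories become missing after the cast, which matches how the original `astype` call behaved.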

@jreback jreback closed this in 026e748 Mar 10, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
will facilitate some changes in ``tools/merge`` w.r.t. pandas-dev#15321, plus
these are independent anyhow.

Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15358 from jreback/concat and squashes the following commits:

ba34c51 [Jeff Reback] CLN: strip out and form tools/concat.py from tools/merge.py
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
…erve category dtype

closes pandas-dev#10409

Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15321 from jreback/merge_cat and squashes the following commits:

3671dad [Jeff Reback] DOC: merge docs
a4b2ee6 [Jeff Reback] BUG/API: .merge() and .join() on category dtype columns will now preserve the category dtype when possible
Labels
Bug, Categorical
Development

Successfully merging this pull request may close these issues.

BUG: merge with categoricals does not preserve categories dtype
5 participants