Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Respect dups in reindexing CategoricalIndex #17355

Merged
merged 1 commit into from
Aug 29, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -358,6 +358,7 @@ Indexing
- Allow unicode empty strings as placeholders in multilevel columns in Python 2 (:issue:`17099`)
- Bug in ``.iloc`` when used with inplace addition or assignment and an int indexer on a ``MultiIndex`` causing the wrong indexes to be read from and written to (:issue:`17148`)
- Bug in ``.isin()`` in which checking membership in empty ``Series`` objects raised an error (:issue:`16991`)
- Bug in ``CategoricalIndex`` reindexing in which specified indices containing duplicates were not being respected (:issue:`17323`)

I/O
^^^
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/indexes/category.py
Original file line number Diff line number Diff line change
Expand Up @@ -487,7 +487,7 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
method = missing.clean_reindex_fill_method(method)
target = ibase._ensure_index(target)

if self.equals(target):
if self.is_unique and self.equals(target):
return np.arange(len(self), dtype='intp')

if method == 'pad' or method == 'backfill':
Expand Down
22 changes: 17 additions & 5 deletions pandas/tests/indexes/test_category.py
Original file line number Diff line number Diff line change
Expand Up @@ -365,18 +365,18 @@ def test_astype(self):
tm.assert_index_equal(result, expected)

def test_reindex_base(self):

# determined by cat ordering
idx = self.create_index()
# Determined by cat ordering.
idx = CategoricalIndex(list("cab"), categories=list("cab"))
expected = np.arange(len(idx), dtype=np.intp)

actual = idx.get_indexer(idx)
tm.assert_numpy_array_equal(expected, actual)
Copy link
Member Author

@gfyoung gfyoung Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche : This test does exactly what you both were asking for: unique elements tested against identical indexer


with tm.assert_raises_regex(ValueError, 'Invalid fill method'):
idx.get_indexer(idx, method='invalid')
with tm.assert_raises_regex(ValueError, "Invalid fill method"):
idx.get_indexer(idx, method="invalid")

def test_reindexing(self):
np.random.seed(123456789)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave this, the fact it was random before actually catched a bug

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could, but "random" tests are generally not good design-wise. The fact that it happened to catch a bug doesn't change that IMO.

Copy link
Contributor

@jreback jreback Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you add a new test for this instead (I mean just split the section that you added to a new test); I would not use random either, rather you can use the specific case that was actually causing the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung already added the specific case that was failing below without using random data
(so the remark was only about the original test case whether to keep it random or not)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"random" tests are generally not good design-wise.

that might be in certain cases, but we have a ton of random tests in our test suite, and the majority has no seed set


ci = self.create_index()
oidx = Index(np.array(ci))
Expand All @@ -388,6 +388,18 @@ def test_reindexing(self):
actual = ci.get_indexer(finder)
tm.assert_numpy_array_equal(expected, actual)

# see gh-17323
#
# Even when indexer is equal to the
# members in the index, we should
# respect duplicates instead of taking
# the fast-track path.
for finder in [list("aabbca"), list("aababca")]:
expected = oidx.get_indexer_non_unique(finder)[0]

actual = ci.get_indexer(finder)
tm.assert_numpy_array_equal(expected, actual)

def test_reindex_dtype(self):
c = CategoricalIndex(['a', 'b', 'c', 'a'])
res, indexer = c.reindex(['a', 'c'])
Expand Down