Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug issue 16819 Index.get_indexer_not_unique inconsistent return types vs get_indexer #16826

Merged
merged 8 commits into from
Jul 6, 2017
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ Backwards incompatible API changes
- :func:`read_csv` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)
- :class:`pandas.HDFStore`'s string representation is now faster and less detailed. For the previous behavior, use ``pandas.HDFStore.info()``. (:issue:`16503`).
- Compression defaults in HDF stores now follow pytable standards. Default is no compression and if ``complib`` is missing and ``complevel`` > 0 ``zlib`` is used (:issue:`15943`)
- Index.get_indexer_non_unique() now returns a ndarray indexer rather than an Index; this is consistent with Index.get_indexer() (:issue:`16819`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double-backticks on Index.get_indexer_non_unique(), Index, and Index.get_indexer()


.. _whatsnew_0210.api:

Expand Down
9 changes: 4 additions & 5 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2256,15 +2256,15 @@ def intersection(self, other):
indexer = indexer.take((indexer != -1).nonzero()[0])
except:
# duplicates
indexer = Index(other._values).get_indexer_non_unique(
self._values)[0].unique()
indexer = algos.unique1d(Index(other._values).get_indexer_non_unique(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just do

np.array(Index(other._values).get_indexer_non_unique(self._values)[0].unique())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason I avoided that was because the behaviour of np unique is slightly different to the original unique function, it returns as sorted rather than the original order for example).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also would have to do numpy.unique(Index(other._values).get_indexer_non_unique(self._values)[0]) as get_indexer_non_inque returns np.ndarray which does not have a unique method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't use numpy.unique at all. This is .unique() on an Index, then np.array turns it into an indexer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this whole change is to make get_indexer_non_unique return a ndarray rather than an Index so can no longer call this Index.unique() method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine.

self._values)[0])
indexer = indexer[indexer != -1]

taken = other.take(indexer)
if self.name != other.name:
taken.name = None
return taken

def difference(self, other):
"""
Return a new Index with elements from the index that are not in
Expand Down Expand Up @@ -2704,7 +2704,7 @@ def get_indexer_non_unique(self, target):
tgt_values = target._values

indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
return Index(indexer), missing
return indexer, missing

def get_indexer_for(self, target, **kwargs):
"""
Expand Down Expand Up @@ -2942,7 +2942,6 @@ def _reindex_non_unique(self, target):
else:

# need to retake to have the same size as the indexer
indexer = indexer.values
indexer[~check] = 0

# reset the new indexer to account for the new size
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/indexes/test_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1131,6 +1131,17 @@ def test_get_indexer_strings(self):
with pytest.raises(TypeError):
idx.get_indexer(['a', 'b', 'c', 'd'], method='pad', tolerance=2)

def test_get_indexer_consistency(self):
# See GH 16819
for name, index in self.indices.items():
indexer = index.get_indexer(index[0:2])
assert isinstance(indexer, np.ndarray)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use

expected = np.array([0, 1, 2], dtype=np.intp)
tm.assert_numpy_array_equal(indexer, expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls update to use this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue with that is that some of the indexes are empty or categorical indexes so not unique positions so can't assume that [0, 1, 2] are returned

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

assert indexer.dtype == np.intp

indexer, _ = index.get_indexer_non_unique(index[0:2])
assert isinstance(indexer, np.ndarray)
assert indexer.dtype == np.intp

def test_get_loc(self):
idx = pd.Index([0, 1, 2])
all_methods = [None, 'pad', 'backfill', 'nearest']
Expand Down
3 changes: 1 addition & 2 deletions pandas/tests/indexes/test_category.py
Original file line number Diff line number Diff line change
Expand Up @@ -386,8 +386,7 @@ def test_reindexing(self):
expected = oidx.get_indexer_non_unique(finder)[0]

actual = ci.get_indexer(finder)
tm.assert_numpy_array_equal(
expected.values, actual, check_dtype=False)
tm.assert_numpy_array_equal(expected, actual, check_dtype=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_dtype is True by default


def test_reindex_dtype(self):
c = CategoricalIndex(['a', 'b', 'c', 'a'])
Expand Down