BUG: IntervalIndex.get_loc/get_indexer wrong return value / error #25090

Closed
wants to merge 25 commits into from

Conversation

samuelsinayoko
Contributor

@samuelsinayoko samuelsinayoko commented Feb 2, 2019

@codecov

codecov bot commented Feb 2, 2019

Codecov Report

Merging #25090 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #25090   +/-   ##
=======================================
  Coverage   92.37%   92.37%           
=======================================
  Files         166      166           
  Lines       52420    52420           
=======================================
  Hits        48423    48423           
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.79% <100%> (ø) ⬆️
#single 42.88% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 96.82% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bb43726...f357101.

@codecov

codecov bot commented Feb 2, 2019

Codecov Report

Merging #25090 into master will increase coverage by 0.41%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25090      +/-   ##
==========================================
+ Coverage   91.48%   91.89%   +0.41%     
==========================================
  Files         175      175              
  Lines       52885    52495     -390     
==========================================
- Hits        48380    48241     -139     
+ Misses       4505     4254     -251
Flag Coverage Δ
#multiple 90.45% <100%> (+0.4%) ⬆️
#single 40.74% <0%> (-1.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 95.56% <100%> (+0.3%) ⬆️
pandas/io/clipboard/__init__.py 39.21% <0%> (-17.65%) ⬇️
pandas/io/clipboard/clipboards.py 18.51% <0%> (-12.07%) ⬇️
pandas/plotting/_compat.py 83.33% <0%> (-4.17%) ⬇️
pandas/core/config_init.py 96.96% <0%> (-2.24%) ⬇️
pandas/plotting/_style.py 77.17% <0%> (-0.49%) ⬇️
pandas/compat/numpy/__init__.py 92.85% <0%> (-0.48%) ⬇️
pandas/core/groupby/grouper.py 98.18% <0%> (-0.35%) ⬇️
pandas/core/computation/pytables.py 90.24% <0%> (-0.31%) ⬇️
pandas/plotting/_misc.py 38.46% <0%> (-0.23%) ⬇️
... and 93 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 324bb84...9f6b5c0.

Write tests first
When tested with a variable that has the wrong dtype, this raises an
exception instead of returning False
When supplied a variable with the wrong type, get_loc should raise a
KeyError (not a TypeError). Otherwise things like checking if a variable
is in an index will fail.
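
To illustrate the commit message above: a membership test (`key in index`) is commonly built on a lookup that signals "not present" with `KeyError`. The sketch below uses a hypothetical `ToyIndex` class, which is an assumption for illustration and not pandas code, to show how a `TypeError` leaking out of the lookup would break `in`, and how translating it to `KeyError` restores the expected `False`.

```python
class ToyIndex:
    """Hypothetical stand-in for an index over numeric bin edges (not pandas code)."""

    def __init__(self, edges):
        self.edges = list(edges)

    def get_loc_buggy(self, key):
        # Comparing a str key with a float edge raises TypeError, which leaks out.
        for i, (left, right) in enumerate(zip(self.edges, self.edges[1:])):
            if left < key <= right:
                return i
        raise KeyError(key)

    def get_loc(self, key):
        # Translate the TypeError into the KeyError that callers expect.
        try:
            return self.get_loc_buggy(key)
        except TypeError:
            raise KeyError(key)

    def __contains__(self, key):
        # Membership relies on get_loc raising KeyError for missing keys.
        try:
            self.get_loc(key)
            return True
        except KeyError:
            return False
```

With this in place, `"gg" in ToyIndex([0.0, 1.0, 2.0])` evaluates to `False` instead of raising.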
@pep8speaks

pep8speaks commented Feb 2, 2019

Hello @samuelsinayoko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-10 07:20:10 UTC

This is enough for making the test pass but it's not the right implementation
@jorisvandenbossche jorisvandenbossche changed the title 23264 BUG: IntervalIndex.get_loc/get_indexer wrong return value / error Feb 2, 2019
pandas/core/frame.py Outdated
try:
start, stop = self._find_non_overlapping_monotonic_bounds(key)
except TypeError:
# get loc should raise KeyError
Contributor

Typo: .get_loc().

# get loc should raise KeyError
# if key is hashable but
# of an incorrect type
raise KeyError
Contributor

This needs to be raise from, which is a Python 3 construct. To stay version-agnostic use six.raise_from.

Suggested change
raise KeyError
import six
...
try:
start, stop = self._find_non_overlapping_monotonic_bounds(key)
except TypeError as exc:
six.raise_from(KeyError('Key is hashable, but of an incorrect type'), exc)

Member

@jschendel jschendel Feb 2, 2019

Is six.raise_from really necessary? I don't see us doing this anywhere else in the codebase?

Contributor

we don't use python 3 only constructs yet, the existing is ok.

Contributor Author

OK, I've left the code as is but have improved the reported message as suggested by @rs2

)
except TypeError:
# This is probably wrong
# but not sure what I should do here
Contributor

¯\_(ツ)_/¯

except TypeError:
# This is probably wrong
# but not sure what I should do here
return np.array([-1])
Contributor

Please comment on the choice of -1.

Member

This needs to be the same length as target and of intp dtype: np.repeat(np.intp(-1), len(target))

@rs2
Contributor

rs2 commented Feb 2, 2019

xref: #25091, #25087

@@ -886,6 +886,13 @@ def test_symmetric_difference(self, closed, sort):
result = index.symmetric_difference(other, sort=sort)
tm.assert_index_equal(result, expected)

def test_interval_range_get_indexer_with_different_input_type(self):
Member

can you rename to test_get_indexer_errors and move to around line 618 where the other get_indexer tests are?

@@ -886,6 +886,13 @@ def test_symmetric_difference(self, closed, sort):
result = index.symmetric_difference(other, sort=sort)
tm.assert_index_equal(result, expected)

def test_interval_range_get_indexer_with_different_input_type(self):
# not sure about this one
index = pd.interval_range(0, 1)
Member

Can you parametrize over index and include a non-monotonic/overlapping IntervalIndex, e.g. pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)])

index = pd.interval_range(0, 1)
# behaviour should be the same as Int64Index and return an
# array with values of -1
assert np.all(index.get_indexer(['gg']) == np.array([-1]))
Member

use tm.assert_numpy_array_equal and make sure your expected is 'intp' dtype.

""" GH25087, test get_loc returns key error for interval indexes"""
idx = pd.interval_range(0, 1.0)
with pytest.raises(KeyError):
idx.get_loc('gg')
Member

Instead of testing here, can you add this as a test case to test_get_loc_value in pandas/tests/indexes/interval/test_interval.py:

def test_get_loc_value(self):

Contributor Author

@samuelsinayoko samuelsinayoko Feb 4, 2019

Good spot, I must say I wasn't completely clear about the distinction between indexes and indexing with regards to tests.
I've implemented your suggestion in 93f75ea.

@@ -766,8 +766,13 @@ def get_loc(self, key, method=None):
key = Interval(left, right, key.closed)
else:
key = self._maybe_cast_slice_bound(key, 'left', None)

start, stop = self._find_non_overlapping_monotonic_bounds(key)
try:
Member

You'll also need to do something similar in the else branch to cover the overlapping/non-monotonic case, e.g. I think something like pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)]).get_loc('foo') will still fail.

Member

Maybe it should be the engine that should properly raise a KeyError? (eg the int64 engine does that)

Contributor Author

I still need to look into @jorisvandenbossche's comment on raising the error in the engine itself (especially if that's the behaviour for int64), but I think I've addressed everything else.

Member

@jorisvandenbossche : There is code in place within the engine that raises a KeyError, but string queries fail before they get there, since the engine is expecting a scalar_t type (fused type consisting of numeric types) for key:

def get_loc(self, scalar_t key):

I'm not super well versed in Cython. Is there a graceful way to force this to raise a KeyError within the Cython code? Removing the scalar_t type gets a step further but still raises a TypeError as the code expects things to be comparable (probably some perf implications to removing it too).

Member

Yeah, the other engines have the key as object typed, and then afterwards do a check of that.
But for me fine as well to leave that for now, and do the check here in the level above that. But on the long term would still be good to make the behaviour consistent throughout the different engines.
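
A rough sketch of the approach described in this thread, where the lookup accepts any object and validates the key's type before comparing. `ToyEngine` is hypothetical and written in plain Python rather than Cython; it is an illustration under those assumptions, not how any pandas engine is actually implemented.

```python
import numbers


class ToyEngine:
    """Hypothetical lookup engine over right-closed numeric intervals."""

    def __init__(self, intervals):
        self.intervals = list(intervals)  # list of (left, right) tuples

    def get_loc(self, key):
        # Accept any object, then reject non-numeric keys up front so the
        # comparison below never sees an unorderable type.
        if not isinstance(key, numbers.Real):
            raise KeyError(key)
        # Simplification: return the first containing interval only, even
        # when the intervals overlap.
        for i, (left, right) in enumerate(self.intervals):
            if left < key <= right:
                return i
        raise KeyError(key)
```

Because the type check happens before any comparison, the same code path serves monotonic and overlapping inputs, so no separate `TypeError` handling is needed at the call site.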

@jschendel
Member

Thanks! To provide some additional context here: the indexing methods for IntervalIndex are currently in flux and a bit of a mess. There are specs to change the behavior (xref #16316) that I've been working through, albeit slowly due to not having much free time lately. In the process of doing so I've also been addressing stuff like this, removing the need for separate is_non_overlapping_monotonic branches, and generally cleaning things up.

That being said, this is still something that'd be viable for a 0.24.x change, since the new specs require breaking changes and would need to go in a major release (aiming for 0.25.0).

@jschendel jschendel added Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type labels Feb 2, 2019
Include some non-monotonic/overlapping IntervalIndex.
This triggers another bug, due to the fact that self.get_loc(i)
is called on an unexpected key.
pandas/tests/indexes/interval/test_interval.py Outdated
Move the test from indexing/test_loc to index/interval/test_interval.
target was missing from call to _find_non_overlapping_monotonic_bounds
try:
start, stop = self._find_non_overlapping_monotonic_bounds(key)
except TypeError:
# get_loc should raise KeyError
Member

Can you add here a comment as Tom proposed (TODO(py3): use raise from.) ?

Contributor

you can now use raise from (as we are PY3 only)
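
A minimal sketch of the `raise ... from` form mentioned here, assuming Python 3. `find_bounds` and `get_loc` below are hypothetical stand-ins rather than the pandas implementation; the point is only that the original `TypeError` is preserved as the `__cause__` of the `KeyError`.

```python
import bisect


def find_bounds(sorted_left_edges, key):
    # bisect compares key against the edges; a str key raises TypeError here.
    return bisect.bisect_left(sorted_left_edges, key)


def get_loc(sorted_left_edges, key):
    try:
        return find_bounds(sorted_left_edges, key)
    except TypeError as err:
        # Chain explicitly so the traceback shows the underlying TypeError.
        raise KeyError(key) from err
```

Callers that only catch `KeyError` keep working, while anyone debugging can still inspect `exc.__cause__` to see the comparison failure.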

start, stop = self._find_non_overlapping_monotonic_bounds(key)
except TypeError:
# get_loc should raise KeyError
raise KeyError('key is hashable but of incorrect type')
Member

I think for key errors we typically just pass the key itself as message, so might be good to be consistent with that. Or at least I would include the key in the message, something like: "Key {0} is of the incorrect type".format(key)

try:
return self._engine.get_loc(key)
except TypeError:
raise KeyError('No engine for key {!r}'.format(key))
Member

No need to mention "engine" here (that is something internal to pandas, while this error message will be visible for users)

self._find_non_overlapping_monotonic_bounds(target)
)
except TypeError:
return np.repeat(np.int(-1), len(target))
Member

@jschendel I am not fully sure this will cover all cases (but what was there before also not).

target can in principle be a mixture of valid and invalid keys. So maybe the easiest would be to fall back to the else branch that iterates through the elements separately in case this raises a TypeError.
That could be done by putting a pass here, and putting the return value of three lines below in the else part of a try/except/else.

Contributor Author

Sorry, getting a bit confused here. You're saying that we should pass on line 833 here and let the code run to the else branch starting on line 851 (non IntervalIndex), where it loops over each element and appends -1 to the list if a KeyError is raised (I would probably add TypeError too)?

Member

Yes, that is basically what I meant I think.

So that if you do idx.get_indexer(['a', 1]) (where 1 is a valid key), you get [-1, 1] as result instead of [-1, -1] (assuming that idx.get_loc(1) would return 1)
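
The fallback discussed in this thread can be sketched as follows. This is a simplified illustration in plain Python, not the code in pandas/core/indexes/interval.py: `EDGES`, `_position`, `get_loc`, and `get_indexer` are hypothetical helpers over sorted, non-overlapping, right-closed bins. The all-at-once path is tried first; if it raises `TypeError`, each element is looked up individually and invalid keys map to -1.

```python
import bisect

EDGES = [0, 1, 2, 3]  # bins (0, 1], (1, 2], (2, 3]


def _position(key):
    # bisect raises TypeError if key cannot be compared with the edges.
    pos = bisect.bisect_left(EDGES, key) - 1
    return pos if 0 <= pos < len(EDGES) - 1 else -1


def get_loc(key):
    # Scalar lookup: KeyError for wrong-typed keys and for keys in no bin.
    try:
        pos = _position(key)
    except TypeError:
        raise KeyError(key)
    if pos == -1:
        raise KeyError(key)
    return pos


def get_indexer(target):
    try:
        # Fast path: all-or-nothing, fails on the first unorderable key.
        result = [_position(key) for key in target]
    except TypeError:
        # Fall through to the element-wise path below.
        pass
    else:
        return result
    indexer = []
    for key in target:
        try:
            indexer.append(get_loc(key))
        except KeyError:
            indexer.append(-1)
    return indexer
```

With these toy bins, `get_indexer(['a', 1.5])` gives `[-1, 1]`: the valid key keeps its position and only the invalid one is marked missing.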

Contributor Author

OK I've just pushed a commit implementing this. I had to make a few tweaks to the "non IntervalIndex" else branch (starting line 849 in core/indexes/interval.py) to make my two new tests pass, but hopefully haven't broken anything. See 2c48272

pandas/tests/indexes/interval/test_interval.py Outdated
])
def test_get_indexer_errors(self, index):
expected = np.array([-1], dtype='intp')
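# tm.assert_numpy_array_equal raises on mismatch and returns None,
# so it must not be wrapped in a bare assert
tm.assert_numpy_array_equal(index.get_indexer(['gg']), expected)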
Member

Can you also test multiple values and a mixture of values here? Like get_indexer(['a', 'b']) and get_indexer([1, 'a'])

Member

Can you do this one?

Contributor Author

Sure will look into this over the weekend.

Contributor Author

see 02127ff

@@ -435,6 +435,14 @@ def test_get_loc_value(self):
idx = IntervalIndex.from_arrays([0, 2], [1, 3])
pytest.raises(KeyError, idx.get_loc, 1.5)

# GH25087, test get_loc returns key error for interval indexes
Member

Can you put this in a new test? (you can leave it in this place, but just put a def test_get_loc_invalid_key(self) above this line)
Reason is that the other test is commented to be replaced, but this new test we want to keep.

Member

Can you do this one?

Contributor Author

Sure.

Contributor Author

Have got a few broken tests to fix. Hoping to make the build green in the coming days.

@jorisvandenbossche jorisvandenbossche added this to the 0.24.2 milestone Feb 7, 2019
Instead of returning [-1, -1, -1] when the middle value is incorrect
type, return [a, -1, b].
Add mix of invalid and valid values
Fixes test_with_overlaps test
interval.get_indexer() should still raise a TypeError in cases
where the types are unorderable. This is needed for DataFrame.append
for example, which was breaking tests in test_concat.
@samuelsinayoko
Contributor Author

Any tips on making the tests pass on macOS, windows and Linux py27?

@jreback jreback modified the milestones: 0.24.2, Contributions Welcome Mar 3, 2019
@TomAugspurger
Contributor

@samuelsinayoko do those tests fail for you locally? If not, I can take a look later.

@samuelsinayoko
Contributor Author

samuelsinayoko commented Mar 8, 2019 via email

@samuelsinayoko
Contributor Author

@TomAugspurger yes the tests pass locally on my linux machine. Not sure what's happening on Windows and OSX, so would be grateful for any help on this.

@WillAyd
Member

WillAyd commented Apr 10, 2019

@samuelsinayoko can you merge master? Recently dropped Py2 support so should make things easier

@samuelsinayoko
Contributor Author

Ok will do

@jreback
Contributor

jreback commented Jun 8, 2019

@samuelsinayoko can you merge master and update?

@jreback
Contributor

jreback commented Jun 27, 2019

closing as stale, if you'd like to continue, pls ping.

@jreback jreback closed this Jun 27, 2019
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression in 0.24: TypeError exception when using dropna on dataframe with categorical index
8 participants