BUG: Fix Series.get() for ExtensionArray and Categorical #20885

Dr-Irv · 2018-04-30T16:42:33Z

- closes #20882 (BUG: Series.get() on ExtensionArray series (and Categorical) indexed by integer returns incorrect result)
- tests added / passed
  - pandas/tests/extension/base/getitem.py::test_get()
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
  - Not created - assume will go in v0.23 as part of ExtensionArray support

TomAugspurger · 2018-04-30T18:31:53Z

Taking a look now.

TomAugspurger · 2018-04-30T18:49:42Z

OK, so something confuses me here. I initially hacked this in with an if isinstance(s, (ExtensionArray, Index)) and is_scalar(key): check.

That's mainly because this line failed:

            return self._engine.get_value(s, k,
                                          tz=getattr(series.dtype, 'tz', None))

self._engine.get_value(s, k) expects an ndarray, but we don't want to convert to an ndarray here, since that's expensive. It doesn't seem easy to let self._engine.get_value take an ExtensionArray, but what if we do a bit of indirect indexing by passing np.arange(len(s)) as the values along with k? Something like

diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index 2ceec1592..b5b175c11 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -36,6 +36,7 @@ from pandas.core.dtypes.common import (
     is_period_dtype,
     is_bool,
     is_bool_dtype,
+    is_extension_array_dtype,
     is_signed_integer_dtype,
     is_unsigned_integer_dtype,
     is_integer_dtype, is_float_dtype,
@@ -3068,21 +3069,21 @@ class Index(IndexOpsMixin, PandasObject):
         # if we have something that is Index-like, then
         # use this, e.g. DatetimeIndex
         s = getattr(series, '_values', None)
-        if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
-            try:
-                return s[key]
-            except (IndexError, ValueError):
+        is_extension = is_extension_array_dtype(series)
 
-                # invalid type as an indexer
-                pass
+        if is_extension:
+            s = np.arange(len(series))
+        else:
+            s = com._values_from_object(series)
 
-        s = com._values_from_object(series)
         k = com._values_from_object(key)
-
         k = self._convert_scalar_indexer(k, kind='getitem')
         try:
-            return self._engine.get_value(s, k,
-                                          tz=getattr(series.dtype, 'tz', None))
+            result = self._engine.get_value(
+                s, k, tz=getattr(series.dtype, 'tz', None))
+            if is_extension:
+                result = series._values[result]
+            return result
         except KeyError as e1:
             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
                 raise

The basic idea is to index into the positions, and then do a secondary ExtensionArray.__getitem__ once you know the positions. Does that make any sense?
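
A minimal sketch of that two-step idea using only public pandas API rather than the internal engine: resolve the label to a position with Index.get_loc, then index the backing array. The helper name get_by_label is hypothetical, and the sketch assumes a unique index (so get_loc returns a single integer) and a pandas version that exposes Series.array.

```python
import pandas as pd


def get_by_label(series, key):
    """Hypothetical helper: label -> position -> element."""
    # Step 1: ask the index where the label lives. For a unique index
    # this is a plain integer position.
    pos = series.index.get_loc(key)
    # Step 2: index the backing array directly; for an extension dtype
    # this calls the array's own __getitem__ without an ndarray conversion.
    return series.array[pos]


s = pd.Series(pd.Categorical(list("xyzxyz")), index=list("abcdef"))
value = get_by_label(s, "c")
```

With a non-unique index get_loc may return a slice or a boolean mask instead of an integer, so a real implementation would have to handle those cases as well.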

TomAugspurger · 2018-04-30T18:51:16Z

pandas/tests/extension/base/getitem.py

+
+        s = pd.Series(data[:6], index=list('abcdef'))
+        assert s.get('c') == s.iloc[2]
+


Could you also add a test for a slice like s.get(slice(2))? That seems to be valid as far as .get is concerned.

Dr-Irv · 2018-04-30

Regarding your suggestion, I don't think you need all of those changes. (I also don't think they will work, because self._engine.get_value() returns an item from the passed ndarray and you are using that to index into the series itself.) Things work fine the way I did it when specifying a slice or multiple indices. The issue is when the key is a single value, which is what my fix takes care of.

I'll push some additional tests.

TomAugspurger · 2018-04-30T18:51:32Z

FYI, I think this can go in after the RC.

codecov · 2018-04-30T19:23:24Z

Codecov Report

Merging #20885 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20885      +/-   ##
==========================================
+ Coverage   91.81%   91.81%   +<.01%     
==========================================
  Files         153      153              
  Lines       49481    49483       +2     
==========================================
+ Hits        45430    45434       +4     
+ Misses       4051     4049       -2

Flag	Coverage Δ
#multiple	`90.21% <100%> (ø)`	⬆️
#single	`41.84% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/base.py	`96.65% <100%> (ø)`	⬆️
pandas/util/testing.py	`84.59% <0%> (+0.2%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bd4332f...65087a6.

jreback · 2018-05-01T00:24:36Z

pandas/core/indexes/base.py

@@ -3068,13 +3068,23 @@ def get_value(self, series, key):
        # if we have something that is Index-like, then
        # use this, e.g. DatetimeIndex
        s = getattr(series, '_values', None)
-        if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
-            try:
-                return s[key]


if you just use .get_loc(key) on an Index as well does this work? (perf should be similar). That way we can avoid separating this.

jreback · 2018-05-01T00:26:42Z

see also #14865; I believe this will also be solved by the same fix. If so, pls add that as a test (it's for Categorical)

Dr-Irv · 2018-05-01T14:25:53Z

@jreback with respect to #14865, as @jorisvandenbossche stated in the original issue #20882 for this PR, I think they are separate issues, so I will not add additional tests.

Dr-Irv · 2018-05-01T14:43:22Z

@jreback with respect to your suggested change of using s[self.get_loc(key)] instead of s[key] for Index, it breaks two existing tests, so I think the fix as I've implemented it should be taken as is. (cc @TomAugspurger)

If you guys want me to do anything else with this PR, let me know, but IMHO, you ought to just merge it in.

jreback · 2018-05-03T00:17:50Z

it breaks two existing tests

what does it break? I really don't like treating ExtensionArray differently from Index; if that's the case it's a bug in the API (for indexing).

Dr-Irv · 2018-05-03T21:13:12Z

@jreback The tests that fail are

pandas/tests/test_base.py in test_value_counts_unique_nunique
pandas/tests/groupby/aggregate/test_other.py in test_agg_timezone_round_trip

I've investigated the first one and here is what I have found out. (I think the second one is the same problem). If we use the way you propose, where we use get_loc(key) on the Index, the example below fails on the expression s2[0].

In [1]: import pandas as pd

In [2]: def makeDateIndex(k=10, freq='B', name=None):
   ...:     dt = pd.datetime(2000, 1, 1)
   ...:     dr = pd.bdate_range(dt, periods=k, freq=freq, name=name)
   ...:     return pd.DatetimeIndex(dr, name=name)
   ...:
   ...: dt_tz_index = makeDateIndex(10, name='a').tz_localize(tz='US/Eastern')
   ...: s1 = pd.Series([i for i in range(len(dt_tz_index))], index=dt_tz_index)
   ...:
   ...: s2 = pd.Series(dt_tz_index, index=dt_tz_index)
   ...:

In [3]: s1
Out[3]:
a
2000-01-03 00:00:00-05:00    0
2000-01-04 00:00:00-05:00    1
2000-01-05 00:00:00-05:00    2
2000-01-06 00:00:00-05:00    3
2000-01-07 00:00:00-05:00    4
2000-01-10 00:00:00-05:00    5
2000-01-11 00:00:00-05:00    6
2000-01-12 00:00:00-05:00    7
2000-01-13 00:00:00-05:00    8
2000-01-14 00:00:00-05:00    9
Freq: B, dtype: int64

In [4]: s2
Out[4]:
a
2000-01-03 00:00:00-05:00   2000-01-03 00:00:00-05:00
2000-01-04 00:00:00-05:00   2000-01-04 00:00:00-05:00
2000-01-05 00:00:00-05:00   2000-01-05 00:00:00-05:00
2000-01-06 00:00:00-05:00   2000-01-06 00:00:00-05:00
2000-01-07 00:00:00-05:00   2000-01-07 00:00:00-05:00
2000-01-10 00:00:00-05:00   2000-01-10 00:00:00-05:00
2000-01-11 00:00:00-05:00   2000-01-11 00:00:00-05:00
2000-01-12 00:00:00-05:00   2000-01-12 00:00:00-05:00
2000-01-13 00:00:00-05:00   2000-01-13 00:00:00-05:00
2000-01-14 00:00:00-05:00   2000-01-14 00:00:00-05:00
Freq: B, Name: a, dtype: datetime64[ns, US/Eastern]

In [5]: s1[0]
Out[5]: 0

In [6]: s2[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10784)()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10616)()
    465             val = maybe_datetimelike_to_i8(val)
--> 466             return self.mapping.get_item(val)
    467         except (TypeError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
   1609         try:
-> 1610             return Index.get_loc(self, key, method, tolerance)
   1611         except (KeyError, ValueError, TypeError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10784)()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10616)()
    465             val = maybe_datetimelike_to_i8(val)
--> 466             return self.mapping.get_item(val)
    467         except (TypeError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10434)()
    457         try:
--> 458             return self.mapping.get_item(val.value)
    459         except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10784)()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10521)()
    459         except KeyError:
--> 460             raise KeyError(val)
    461         except AttributeError:

KeyError: Timestamp('1969-12-31 19:00:00-0500', tz='US/Eastern')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10434)()
    457         try:
--> 458             return self.mapping.get_item(val.value)
    459         except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
   1618                 stamp = Timestamp(key, tz=self.tz)
-> 1619                 return Index.get_loc(self, stamp, method, tolerance)
   1620             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10784)()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10521)()
    459         except KeyError:
--> 460             raise KeyError(val)
    461         except AttributeError:

KeyError: Timestamp('1969-12-31 19:00:00-0500', tz='US/Eastern')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_value(self, series, key)
   1559         try:
-> 1560             return com._maybe_box(self, Index.get_value(self, series, key),
   1561                                   series, key)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_value(self, series, key)
   3087 #                    return s[key]
-> 3088                     return s[self.get_loc(key)]
   3089                 except (IndexError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
   1620             except KeyError:
-> 1621                 raise KeyError(key)
   1622             except ValueError as e:

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10434)()
    457         try:
--> 458             return self.mapping.get_item(val.value)
    459         except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_value(self, series, key)
   1569             try:
-> 1570                 return self.get_value_maybe_box(series, key)
   1571             except (TypeError, ValueError, KeyError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_value_maybe_box(self, series, key)
   1580         values = self._engine.get_value(com._values_from_object(series),

-> 1581                                         key, tz=self.tz)
   1582         return com._maybe_box(self, values, series, key)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4847)()
    104
--> 105     cpdef get_value(self, ndarray arr, object key, object tz=None):
    106         """

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4530)()
    112
--> 113         loc = self.get_loc(key)
    114         if PySlice_Check(loc) or cnp.PyArray_Check(loc):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10521)()
    459         except KeyError:
--> 460             raise KeyError(val)
    461         except AttributeError:

KeyError: Timestamp('1969-12-31 19:00:00-0500', tz='US/Eastern')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-ab7c8e26b0d3> in <module>()
----> 1 s2[0]

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\series.py in __getitem__(self, key)
    764         key = com._apply_if_callable(key, self)
    765         try:
--> 766             result = self.index.get_value(self, key)
    767
    768             if not is_scalar(result):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_value(self, series, key)
   1570                 return self.get_value_maybe_box(series, key)
   1571             except (TypeError, ValueError, KeyError):
-> 1572                 raise KeyError(key)
   1573
   1574     def get_value_maybe_box(self, series, key):

KeyError: 0

Here's an easier-to-read stack trace of calls for s2[0] in terms of where it bombs out:

pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
--> 3063                 return self._engine.get_loc(key)
pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
-->1610             return Index.get_loc(self, key, method, tolerance)
pandas\core\indexes\base.py in get_value(self, series, key)
--> 3088                     return s[self.get_loc(key)]
pandas\core\indexes\datetimes.py in get_value(self, series, key)
--> 1560             return com._maybe_box(self, Index.get_value(self, series, key),
    1561                                   series, key)
pandas\core\series.py in __getitem__(self, key)
--> 766             result = self.index.get_value(self, key)

So what may be the error here is that the values of a Series containing TZ-aware values end up being a DatetimeIndex, and get_loc(0) isn't defined for that Index. Note that in the code above, if you remove the .tz_localize() part, it works fine.

Maybe when DatetimeIndex is based on ExtensionArray, then this code as I wrote it could change so that ExtensionArray and Index are handled the same way.

Or maybe you know how to fix things for DatetimeIndex, but it's not clear what is the thing I should be testing to be sure that this issue is fixed.
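
A small sketch of the mismatch described above, assuming current public pandas behavior: get_loc is a label lookup, so the integer 0 is not a valid key for a tz-aware DatetimeIndex, while positional access does not go through get_loc at all. It uses a short daily range instead of the business-day range from the session above, purely for brevity.

```python
import pandas as pd

idx = pd.date_range("2000-01-03", periods=3, freq="D", tz="US/Eastern")
s2 = pd.Series(idx, index=idx)

# get_loc is a label lookup: the integer 0 is not one of the timestamps.
try:
    idx.get_loc(0)
    get_loc_raised = False
except KeyError:
    get_loc_raised = True

# Looking up an actual label gives back its position.
pos = idx.get_loc(idx[0])

# Positional access is unambiguous and never consults get_loc.
first = s2.iloc[0]
```

This is why routing every scalar key through get_loc changes behavior for code that relied on s2[0] being treated as a position.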

jreback · 2018-05-04T10:20:39Z

this (#20885 (comment)) is a bug and should be fixed as a precursor to this PR. @Dr-Irv pls make a separate issue (I thought we had one, but couldn't find it).

Dr-Irv · 2018-05-04T14:37:17Z

@jreback I'm not sure how to report the issue. The only way to replicate it is if we implement this PR as you suggested. Should I create the issue that way?

I can't figure out how to demonstrate the bad behavior on 0.23rc2, or, equivalently, what test I could write that would verify that the bug illustrated above exists (since the bug only occurs if the PR is implemented in a different way).

Dr-Irv · 2018-05-04T20:19:20Z

@jreback Above, I mentioned that the change you want also breaks test_agg_timezone_round_trip in tests/groupby/aggregate/test_other.py. I think that is related to the bug reported in #20949.

Dr-Irv · 2018-05-04T21:54:23Z

@jreback I've pushed a new commit here, as I found another bug in how I did things, so the new implementation looks more different for Index and ExtensionArray because of how the exceptions are handled. I added another test that captured this difference.

If the Series is backed by an ExtensionArray and the key is a scalar, we first have to try to translate that key into an integer position to pass down to ExtensionArray.__getitem__. If we can't do that, then if the scalar is an integer, we pass that integer down. If that fails, we pass back whatever exception is raised by the ExtensionArray.__getitem__() implementation.

If the Series is backed by an Index, then we need to let the existing machinery for Index return the result, and that machinery uses deep pandas internals to get the right value whether it is a loc-type key or an integer.
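
The ExtensionArray branch described above can be sketched as a toy model in plain Python. This is not the pandas implementation: lookup_scalar is a hypothetical name, a list of labels stands in for the Index (with list.index playing the role of get_loc), and a plain list stands in for the ExtensionArray.

```python
def lookup_scalar(labels, values, key):
    """Toy model of the scalar branch for an ExtensionArray-backed Series."""
    try:
        # Stand-in for Index.get_loc: translate the label into a position.
        pos = labels.index(key)
    except ValueError:
        # The key is not a label. An integer is passed down as a position,
        # and any IndexError from the array propagates to the caller.
        if isinstance(key, int) and not isinstance(key, bool):
            return values[key]
        raise KeyError(key)
    return values[pos]


labels = list("abcdef")
values = [10, 20, 30, 40, 50, 60]

by_label = lookup_scalar(labels, values, "c")
by_position = lookup_scalar(labels, values, 4)
```

The ordering matters: the label lookup is tried first, so an integer that happens to be a label is never reinterpreted as a position.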

jorisvandenbossche

this looks good to me

jreback · 2018-05-05T11:43:25Z

pandas/core/indexes/base.py

+                    iloc = self.get_loc(key)
+                    return s[iloc]
+                except KeyError:
+                    if isinstance(key, (int, np.integer)):


use is_integer

why doesn’t the pass work?

needs comments

Dr-Irv · 2018-05-05

@jreback I've just pushed a new commit that uses is_integer and adds comments.

pass doesn't work because we are catching a different exception. So the way I unified the code between Index and ExtensionArray was by handling the exception differently in the case that the key was a scalar.

jreback · 2018-05-05T14:35:01Z

pandas/core/indexes/base.py

-                # invalid type as an indexer
-                pass
+        if is_scalar(key):
+            if isinstance(s, (Index, ExtensionArray)):


this could be an and here.

jreback · 2018-05-05T14:35:21Z

pandas/core/indexes/base.py

+                try:
+                    iloc = self.get_loc(key)
+                    return s[iloc]
+                except KeyError:


what happened to the IndexError case? is that not possible now?

jreback · 2018-05-05T14:36:08Z

pandas/tests/extension/base/getitem.py

+        result = s.get(slice('b', 'd'))
+        expected = s.iloc[[1, 2, 3]]
+        self.assert_series_equal(result, expected)
+


can you add some cases with an out-of-range integer
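
A rough sketch of what such boundary checks might look like, using a Categorical as the extension-backed data. The variable names are illustrative, and the assumption here is that Series.get is expected to return its default rather than raise when the key cannot be resolved; this is not the test that was actually added to the PR.

```python
import pandas as pd

s = pd.Series(pd.Categorical(list("xyzxyz")), index=list("abcdef"))

# A label that is not in the index falls back to the default.
missing_label = s.get("q")
missing_label_default = s.get("q", default="fallback")

# An integer far outside the range of valid positions is expected
# to fall back to the default as well, rather than raise.
out_of_range = s.get(100)
```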

jreback · 2018-05-08T00:25:21Z

@Dr-Irv ok this change looks ok. can you run the indexing asv's and report if anything is changing? this is a central piece of code so need to check.

Dr-Irv · 2018-05-08T14:38:28Z

@jreback I hope I did this right. I'm doing this on my 16GB 4 core laptop on Windows. I compared upstream/master (last commit on May 5) to my latest commit. Here are the results:

      before           after         ratio
     [bd4332f4]       [65087a6d]
+        30.8±0μs         44.0±5μs     1.43  indexing.NumericSeriesIndexing.time_iloc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>)
-         516±0ms          453±0ms     0.88  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>)
-        83.3±5μs         71.7±0μs     0.86  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Int64Index'>)
-         109±2μs         93.9±0μs     0.86  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>)
-     1.09±0.04ms         934±10μs     0.85  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>)
-        170±10μs          141±0μs     0.83  indexing.IntervalIndexing.time_getitem_list
-         427±0μs         351±30μs     0.82  indexing.MultiIndexing.time_series_ix
-         516±8ms          406±0ms     0.79  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>)
-        498±50ns          382±0ns     0.77  indexing.MethodLookup.time_lookup_loc
-        97.7±6μs         70.7±0μs     0.72  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>)
-         107±0μs         75.0±0μs     0.70  indexing.NumericSeriesIndexing.time_ix_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>)
-        562±20ms          383±4ms     0.68  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>)
-         103±6μs         67.2±2μs     0.65  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>)

So many things got faster. I'm not surprised by this, as we're not falling through exceptions any more when indexing by a scalar.

I also reran just the one that got slower, and when I did that, it flipped to 62.3±4μs for master and 55.6±0μs for the new version, showing an improvement.

I then ran the benchmarks again to better understand variability:

      before           after         ratio
     [bd4332f4]       [65087a6d]
+         567±0μs         758±50μs     1.34  indexing.DataFrameNumericIndexing.time_bool_indexer
+        72.1±6μs         88.8±0μs     1.23  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>)

And once more:

      before           after         ratio
     [bd4332f4]       [65087a6d]
+        2.15±0ms         62.5±4ms    29.09  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Int64Index'>)
-        92.5±0μs         83.8±0μs     0.91  indexing.NonNumericSeriesIndexing.time_getitem_label_slice('datetime')

Given the inconsistent results (probably affected by other background processes happening on my laptop), I think we're OK.

jreback · 2018-05-09T10:59:37Z

@Dr-Irv IIRC there is a flag for affinity to help prevent this kind of variability; also I think we have some asv options to run this many times. but close enough for now.
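
For a quick sanity check of run-to-run noise outside asv, one option is to take several timing samples and look at their spread. The sketch below uses only the standard library; the workload (a dictionary lookup) is a hypothetical stand-in, not one of the pandas benchmarks.

```python
import statistics
import timeit

table = {i: i * 2 for i in range(1000)}


def lookup():
    # Stand-in workload (hypothetical): a single dictionary lookup.
    return table[500]


# Five independent samples, each timing 10,000 calls.
samples = timeit.repeat(lookup, repeat=5, number=10_000)

best = min(samples)
median = statistics.median(samples)
# Relative gap between the slowest and fastest sample.
spread = (max(samples) - best) / best if best > 0 else float("inf")
```

A large spread between samples on an otherwise unchanged workload suggests that differences of similar size between two commits are within the noise.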

jreback · 2018-05-09T11:00:08Z

thanks!

BUG: Fix Series.get() for ExtensionArray and Categorical

c269e13

TomAugspurger added the ExtensionArray label (Extending pandas with custom dtypes or arrays.) Apr 30, 2018

TomAugspurger added the Indexing label (Related to indexing on series/frames, not to indexes themselves) Apr 30, 2018

TomAugspurger reviewed Apr 30, 2018

TomAugspurger added this to the 0.23.0 milestone Apr 30, 2018

Add additional tests

b7f2a6f

jreback requested changes May 1, 2018

TomAugspurger mentioned this pull request May 1, 2018

RLS: 0.23.0 #20531

Closed

71 tasks

Merge remote-tracking branch 'upstream/master' into issue20882

a6734c4

Additional test for .get with ExtensionArray

c04f77c

Get Exception handling right

1bbaa2b

jorisvandenbossche approved these changes May 5, 2018

jreback requested changes May 5, 2018

Use isinteger and add comments

0a88cdd

Dr-Irv mentioned this pull request May 5, 2018

ENH: Support operators for ExtensionArray #20889

Closed

4 tasks

jreback requested changes May 5, 2018

add boundary tests

894edc8

Dr-Irv added 2 commits May 5, 2018 20:02

Merge remote-tracking branch 'upstream/master' into issue20882

eb9b6bc

fix if test

65087a6

Dr-Irv mentioned this pull request May 6, 2018

BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column #20959

Merged

TomAugspurger approved these changes May 8, 2018

jreback approved these changes May 9, 2018

jreback merged commit e978279 into pandas-dev:master May 9, 2018

Dr-Irv deleted the issue20882 branch May 9, 2018 13:39

Dr-Irv mentioned this pull request May 30, 2018

BUG: Series.get() on ExtensionArray with integer key not in index returns incorrect result #21257

Closed
