BUG: Fix Series.get() for ExtensionArray and Categorical #20885

Merged
merged 9 commits into from
May 9, 2018

Conversation

Dr-Irv
Contributor

@Dr-Irv Dr-Irv commented Apr 30, 2018

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 30, 2018
@TomAugspurger
Contributor

Taking a look now.

@TomAugspurger
Contributor

TomAugspurger commented Apr 30, 2018

OK, so something confuses me here. I initially hacked this in with an if isinstance(s, (ExtensionArray, Index)) and is_scalar(key): check.

That's mainly because this line failed:

            return self._engine.get_value(s, k,
                                          tz=getattr(series.dtype, 'tz', None))

self._engine.get_value(s, k) expects an ndarray, but we don't want to convert to an ndarray here, since that's expensive. It doesn't seem easy to let self._engine.get_value accept anything other than an ndarray, but what if we do a bit of indirect indexing by passing np.arange(len(s)), k? Something like:

diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index 2ceec1592..b5b175c11 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -36,6 +36,7 @@ from pandas.core.dtypes.common import (
     is_period_dtype,
     is_bool,
     is_bool_dtype,
+    is_extension_array_dtype,
     is_signed_integer_dtype,
     is_unsigned_integer_dtype,
     is_integer_dtype, is_float_dtype,
@@ -3068,21 +3069,21 @@ class Index(IndexOpsMixin, PandasObject):
         # if we have something that is Index-like, then
         # use this, e.g. DatetimeIndex
         s = getattr(series, '_values', None)
-        if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
-            try:
-                return s[key]
-            except (IndexError, ValueError):
+        is_extension = is_extension_array_dtype(series)
 
-                # invalid type as an indexer
-                pass
+        if is_extension:
+            s = np.arange(len(series))
+        else:
+            s = com._values_from_object(series)
 
-        s = com._values_from_object(series)
         k = com._values_from_object(key)
-
         k = self._convert_scalar_indexer(k, kind='getitem')
         try:
-            return self._engine.get_value(s, k,
-                                          tz=getattr(series.dtype, 'tz', None))
+            result = self._engine.get_value(
+                s, k, tz=getattr(series.dtype, 'tz', None))
+            if is_extension:
+                result = series._values[result]
+            return result
         except KeyError as e1:
             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
                 raise

The basic idea is to index into the positions, and then do a secondary ExtensionArray.__getitem__ once you know the positions. Does that make any sense?
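
A minimal sketch of that two-step lookup, written against public pandas API rather than the internal engine. It assumes a unique index (so `get_loc` returns a single integer) and uses `Series.array`, which was added in a later pandas release than the one under discussion. `get_via_positions` is a hypothetical helper name and is not part of the patch.

```python
import numpy as np
import pandas as pd


def get_via_positions(series, key):
    """Resolve `key` to a position first, then index the backing array."""
    positions = np.arange(len(series))  # cheap stand-in for the real values
    loc = series.index.get_loc(key)     # label -> integer (unique index assumed)
    pos = positions[loc]                # the "indirect" step: we get a position back
    return series.array[pos]            # secondary ExtensionArray.__getitem__


# Categorical is used here only as a convenient extension-backed example.
s = pd.Series(pd.Categorical(list("xyzxyz")), index=list("abcdef"))
result = get_via_positions(s, "c")
```

The point of the indirection is that the extension array is never converted to an ndarray; only the small integer array of positions goes through the lookup.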

@TomAugspurger TomAugspurger added the Indexing Related to indexing on series/frames, not to indexes themselves label Apr 30, 2018

s = pd.Series(data[:6], index=list('abcdef'))
assert s.get('c') == s.iloc[2]

Contributor

Could you also add a test for a slice like s.get(slice(2))? That seems to be valid as far as .get is concerned.
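
For reference, a sketch of the two cases mentioned here on an extension-backed Series. It uses the nullable `Int64` array as a stand-in for the test suite's `data` fixture, which is an assumption; the real tests run against each extension type's own fixture.

```python
import pandas as pd

# Stand-in for the `data` fixture: any ExtensionArray with six elements.
data = pd.array([10, 20, 30, 40, 50, 60], dtype="Int64")
s = pd.Series(data, index=list("abcdef"))

# Scalar label lookup should agree with positional access.
scalar_result = s.get("c")

# A slice is also a valid key for .get; an integer slice is positional here.
slice_result = s.get(slice(2))
```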

Contributor Author

Regarding your suggestion, I don't think you need all of those changes. (I also don't think they will work, because self._engine.get_value() returns an item from the passed ndarray and you are using that to index into the series itself.) Things work fine with my approach when specifying a slice or multiple indices. The issue is when the key is a single value, which is what my fix takes care of.

I'll push some additional tests.

@TomAugspurger
Contributor

FYI, I think this can go in after the RC.

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 30, 2018
@codecov

codecov bot commented Apr 30, 2018

Codecov Report

Merging #20885 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20885      +/-   ##
==========================================
+ Coverage   91.81%   91.81%   +<.01%     
==========================================
  Files         153      153              
  Lines       49481    49483       +2     
==========================================
+ Hits        45430    45434       +4     
+ Misses       4051     4049       -2
Flag        Coverage Δ
#multiple   90.21% <100%> (ø) ⬆️
#single     41.84% <0%> (-0.01%) ⬇️

Impacted Files                 Coverage Δ
pandas/core/indexes/base.py    96.65% <100%> (ø) ⬆️
pandas/util/testing.py         84.59% <0%> (+0.2%) ⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd4332f...65087a6. Read the comment docs.

@@ -3068,13 +3068,23 @@ def get_value(self, series, key):
        # if we have something that is Index-like, then
        # use this, e.g. DatetimeIndex
        s = getattr(series, '_values', None)
        if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
            try:
                return s[key]
Contributor

if you just use .get_loc(key) on an Index as well does this work? (perf should be similar). That way we can avoid separating this.

@jreback
Contributor

jreback commented May 1, 2018

See also #14865; I believe this will also be solved by the same fix. If so, pls add that as a test (it's for Categorical).

@Dr-Irv
Contributor Author

Dr-Irv commented May 1, 2018

@jreback with respect to #14865, as @jorisvandenbossche stated in the original issue #20885 for this PR, I think they are separate issues, so I will not add additional tests.

@TomAugspurger TomAugspurger mentioned this pull request May 1, 2018
71 tasks
@Dr-Irv
Contributor Author

Dr-Irv commented May 1, 2018

@jreback with respect to your suggested change of using s[self.get_loc(key)] instead of s[key] for Index, it breaks two existing tests, so I think the fix as I've implemented it should be taken as is. (cc @TomAugspurger)

If you guys want me to do anything else with this PR, let me know, but IMHO, you ought to just merge it in.

@jreback
Contributor

jreback commented May 3, 2018

it breaks two existing tests

What does it break? I really don't like treating ExtensionArray differently from Index; if that's the case, it's a bug in the API (for indexing).

@Dr-Irv
Contributor Author

Dr-Irv commented May 3, 2018

@jreback The tests that fail are

  • pandas\tests\test_base.py in test_value_counts_unique_nunique
  • tests/groupby/aggregate/test_other.py in test_agg_timezone_round_trip

I've investigated the first one and here is what I have found out. (I think the second one is the same problem). If we use the way you propose, where we use get_loc(key) on the Index, the example below fails on the expression s2[0].

In [1]: import pandas as pd

In [2]: def makeDateIndex(k=10, freq='B', name=None):
   ...:     dt = pd.datetime(2000, 1, 1)
   ...:     dr = pd.bdate_range(dt, periods=k, freq=freq, name=name)
   ...:     return pd.DatetimeIndex(dr, name=name)
   ...:
   ...: dt_tz_index = makeDateIndex(10, name='a').tz_localize(tz='US/Eastern')
   ...: s1 = pd.Series([i for i in range(len(dt_tz_index))], index=dt_tz_index)
   ...:
   ...: s2 = pd.Series(dt_tz_index, index=dt_tz_index)
   ...:

In [3]: s1
Out[3]:
a
2000-01-03 00:00:00-05:00    0
2000-01-04 00:00:00-05:00    1
2000-01-05 00:00:00-05:00    2
2000-01-06 00:00:00-05:00    3
2000-01-07 00:00:00-05:00    4
2000-01-10 00:00:00-05:00    5
2000-01-11 00:00:00-05:00    6
2000-01-12 00:00:00-05:00    7
2000-01-13 00:00:00-05:00    8
2000-01-14 00:00:00-05:00    9
Freq: B, dtype: int64

In [4]: s2
Out[4]:
a
2000-01-03 00:00:00-05:00   2000-01-03 00:00:00-05:00
2000-01-04 00:00:00-05:00   2000-01-04 00:00:00-05:00
2000-01-05 00:00:00-05:00   2000-01-05 00:00:00-05:00
2000-01-06 00:00:00-05:00   2000-01-06 00:00:00-05:00
2000-01-07 00:00:00-05:00   2000-01-07 00:00:00-05:00
2000-01-10 00:00:00-05:00   2000-01-10 00:00:00-05:00
2000-01-11 00:00:00-05:00   2000-01-11 00:00:00-05:00
2000-01-12 00:00:00-05:00   2000-01-12 00:00:00-05:00
2000-01-13 00:00:00-05:00   2000-01-13 00:00:00-05:00
2000-01-14 00:00:00-05:00   2000-01-14 00:00:00-05:00
Freq: B, Name: a, dtype: datetime64[ns, US/Eastern]

In [5]: s1[0]
Out[5]: 0

In [6]: s2[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10784)()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10616)()
    465             val = maybe_datetimelike_to_i8(val)
--> 466             return self.mapping.get_item(val)
    467         except (TypeError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
   1609         try:
-> 1610             return Index.get_loc(self, key, method, tolerance)
   1611         except (KeyError, ValueError, TypeError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10784)()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10616)()
    465             val = maybe_datetimelike_to_i8(val)
--> 466             return self.mapping.get_item(val)
    467         except (TypeError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10434)()
    457         try:
--> 458             return self.mapping.get_item(val.value)
    459         except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10784)()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10521)()
    459         except KeyError:
--> 460             raise KeyError(val)
    461         except AttributeError:

KeyError: Timestamp('1969-12-31 19:00:00-0500', tz='US/Eastern')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10434)()
    457         try:
--> 458             return self.mapping.get_item(val.value)
    459         except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
   1618                 stamp = Timestamp(key, tz=self.tz)
-> 1619                 return Index.get_loc(self, stamp, method, tolerance)
   1620             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10784)()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10521)()
    459         except KeyError:
--> 460             raise KeyError(val)
    461         except AttributeError:

KeyError: Timestamp('1969-12-31 19:00:00-0500', tz='US/Eastern')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_value(self, series, key)
   1559         try:
-> 1560             return com._maybe_box(self, Index.get_value(self, series, key),
   1561                                   series, key)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_value(self, series, key)
   3087 #                    return s[key]
-> 3088                     return s[self.get_loc(key)]
   3089                 except (IndexError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
   1620             except KeyError:
-> 1621                 raise KeyError(key)
   1622             except ValueError as e:

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10434)()
    457         try:
--> 458             return self.mapping.get_item(val.value)
    459         except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15389)()
    957
--> 958     cpdef get_item(self, int64_t val):
    959         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:15333)()
    963         else:
--> 964             raise KeyError(val)
    965

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_value(self, series, key)
   1569             try:
-> 1570                 return self.get_value_maybe_box(series, key)
   1571             except (TypeError, ValueError, KeyError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_value_maybe_box(self, series, key)
   1580         values = self._engine.get_value(com._values_from_object(series),

-> 1581                                         key, tz=self.tz)
   1582         return com._maybe_box(self, values, series, key)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4847)()
    104
--> 105     cpdef get_value(self, ndarray arr, object key, object tz=None):
    106         """

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4530)()
    112
--> 113         loc = self.get_loc(key)
    114         if PySlice_Check(loc) or cnp.PyArray_Check(loc):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc (pandas\_libs\index.c:10521)()
    459         except KeyError:
--> 460             raise KeyError(val)
    461         except AttributeError:

KeyError: Timestamp('1969-12-31 19:00:00-0500', tz='US/Eastern')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-ab7c8e26b0d3> in <module>()
----> 1 s2[0]

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\series.py in __getitem__(self, key)
    764         key = com._apply_if_callable(key, self)
    765         try:
--> 766             result = self.index.get_value(self, key)
    767
    768             if not is_scalar(result):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\datetimes.py in get_value(self, series, key)
   1570                 return self.get_value_maybe_box(series, key)
   1571             except (TypeError, ValueError, KeyError):
-> 1572                 raise KeyError(key)
   1573
   1574     def get_value_maybe_box(self, series, key):

KeyError: 0

Here's an easier-to-read stack trace of calls for s2[0] in terms of where it bombs out:

pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
--> 3063                 return self._engine.get_loc(key)
pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
--> 1610             return Index.get_loc(self, key, method, tolerance)
pandas\core\indexes\base.py in get_value(self, series, key)
--> 3088                     return s[self.get_loc(key)]
pandas\core\indexes\datetimes.py in get_value(self, series, key)
--> 1560             return com._maybe_box(self, Index.get_value(self, series, key),
    1561                                   series, key)
pandas\core\series.py in __getitem__(self, key)
--> 766             result = self.index.get_value(self, key)

So the error here may be that the values of a Series containing TZ-aware values end up being a DatetimeIndex, and get_loc(0) isn't defined for that Index. Note that in the code above, if you remove the .tz_localize() part, it works fine.

Maybe when DatetimeIndex is based on ExtensionArray, this code as I wrote it could change so that ExtensionArray and Index are handled the same way.

Or maybe you know how to fix things for DatetimeIndex, but it's not clear what I should be testing to be sure that this issue is fixed.
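
The core of the failure can be sketched in isolation, without the full Series machinery. This is an illustration under assumptions: it uses `pd.date_range` instead of the `makeDateIndex` helper above, and exact behavior for `s2[0]` differs between pandas versions, so only the `get_loc` step is shown.

```python
import pandas as pd

# A small tz-aware index, similar in spirit to dt_tz_index above.
idx = pd.date_range("2000-01-03", periods=3, freq="B", tz="US/Eastern")

# A real timestamp label resolves to its integer position.
pos = idx.get_loc(pd.Timestamp("2000-01-04", tz="US/Eastern"))

# The integer 0 is not a label in this index, so a label-based lookup
# fails; this is why s[self.get_loc(key)] cannot serve positional keys.
try:
    idx.get_loc(0)
    int_lookup_failed = False
except KeyError:
    int_lookup_failed = True
```

So routing every scalar through `get_loc` only works when the key is a label; a positional integer needs a separate fallback.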

@jreback
Contributor

jreback commented May 4, 2018

this #20885 (comment)

is a bug and should be fixed as a precursor to this PR. @Dr-Irv pls make a separate issue (I thought we had one, but couldn't find it).

@Dr-Irv
Contributor Author

Dr-Irv commented May 4, 2018

@jreback I'm not sure how to report the issue. The only way to replicate it is if we implement this PR as you suggested. Should I create the issue that way?

I can't figure out how to demonstrate the bad behavior on 0.23rc2, or, equivalently, what test I could write that would verify that the bug illustrated above exists (since the bug only occurs if the PR is implemented in a different way).

@Dr-Irv
Contributor Author

Dr-Irv commented May 4, 2018

@jreback Above, I mentioned that the change you want also breaks pandas\core\indexes\datetimes.py in test_agg_timezone_round_trip. I think that is related to the bug reported in #20949.

@Dr-Irv
Contributor Author

Dr-Irv commented May 4, 2018

@jreback I've pushed a new commit here, as I found another bug in how I did things, so the new implementation now differs more between Index and ExtensionArray because of how the exceptions are handled. I added another test that captures this difference.

In the case of ExtensionArray, if the key is a scalar, we have a Series backed by an ExtensionArray, so we first have to try to translate that key into an integer index to pass down to ExtensionArray.__getitem__. If we can't do that, then if the scalar is an integer, we pass that integer down. If that fails, we pass back whatever exception is raised by the ExtensionArray.__getitem__() implementation.

If the Series is backed by an Index, then we need to let the existing machinery for Index return the result, and that machinery uses deep pandas internals to get the right value whether it is a loc-type key or an integer.
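
The ExtensionArray branch described above can be modelled with plain Python containers. This is a toy sketch, not the code in the PR: `get_scalar` is a hypothetical name, a list stands in for both the index and the ExtensionArray, and raising `KeyError` for a missing non-integer key is an assumption about the remaining case.

```python
def get_scalar(labels, values, key):
    """Toy model of the scalar lookup for an extension-backed Series."""
    try:
        # Step 1: try to translate the key (a label) into a position.
        pos = labels.index(key)
    except ValueError:
        # Step 2: not a known label. An integer is passed down as a position,
        # and whatever the container raises (IndexError here) propagates.
        if isinstance(key, int) and not isinstance(key, bool):
            return values[key]
        # Assumed behavior for any other unknown key.
        raise KeyError(key)
    return values[pos]


labels = list("abcdef")
values = [10, 20, 30, 40, 50, 60]

by_label = get_scalar(labels, values, "c")   # label found
by_position = get_scalar(labels, values, 4)  # integer fallback
```

An out-of-range integer surfaces the container's own `IndexError`, which matches the "pass back whatever exception is raised" rule in the description.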

Member

@jorisvandenbossche jorisvandenbossche left a comment

this looks good to me

        iloc = self.get_loc(key)
        return s[iloc]
    except KeyError:
        if isinstance(key, (int, np.integer)):
Contributor

use is_integer

Contributor

why doesn’t the pass work?

Contributor

needs comments

Contributor Author

@jreback I've just pushed a new commit that uses is_integer and adds comments.

pass doesn't work because we are catching a different exception. So the way I unified the code between Index and ExtensionArray was by handling the exception differently in the case that the key was a scalar.

# invalid type as an indexer
pass
if is_scalar(key):
if isinstance(s, (Index, ExtensionArray)):
Contributor

this could be an and here.

Contributor Author

fixed

try:
    iloc = self.get_loc(key)
    return s[iloc]
except KeyError:
Contributor

What happened to the IndexError case? Is that not possible now?

result = s.get(slice('b', 'd'))
expected = s.iloc[[1, 2, 3]]
self.assert_series_equal(result, expected)

Contributor

can you add some cases with an out-of-range integer

Contributor Author

done
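
A sketch of what such out-of-range cases can look like. It again uses the nullable `Int64` array as a stand-in for the `data` fixture (an assumption), and relies only on the documented `Series.get` behavior of returning the default when the key cannot be found. Some pandas versions may emit a deprecation warning for the integer key on a non-integer index.

```python
import pandas as pd

s = pd.Series(pd.array([10, 20, 30, 40, 50, 60], dtype="Int64"),
              index=list("abcdef"))

# An integer well past the end is neither a label nor a valid position.
missing_position = s.get(100)

# A label that is not in the index.
missing_label = s.get("z")

# An explicit default is returned instead of None.
with_default = s.get(100, -1)
```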

@jreback
Contributor

jreback commented May 8, 2018

@Dr-Irv ok this change looks ok. can you run the indexing asv's and report if anything is changing? this is a central piece of code so need to check.

@Dr-Irv
Contributor Author

Dr-Irv commented May 8, 2018

@jreback I hope I did this right. I'm doing this on my 16GB 4 core laptop on Windows. I compared upstream/master (last commit on May 5) to my latest commit. Here are the results:

      before           after         ratio
     [bd4332f4]       [65087a6d]
+        30.8±0μs         44.0±5μs     1.43  indexing.NumericSeriesIndexing.time_iloc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>)
-         516±0ms          453±0ms     0.88  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>)
-        83.3±5μs         71.7±0μs     0.86  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Int64Index'>)
-         109±2μs         93.9±0μs     0.86  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>)
-     1.09±0.04ms         934±10μs     0.85  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>)
-        170±10μs          141±0μs     0.83  indexing.IntervalIndexing.time_getitem_list
-         427±0μs         351±30μs     0.82  indexing.MultiIndexing.time_series_ix
-         516±8ms          406±0ms     0.79  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>)
-        498±50ns          382±0ns     0.77  indexing.MethodLookup.time_lookup_loc
-        97.7±6μs         70.7±0μs     0.72  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>)
-         107±0μs         75.0±0μs     0.70  indexing.NumericSeriesIndexing.time_ix_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>)
-        562±20ms          383±4ms     0.68  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>)
-         103±6μs         67.2±2μs     0.65  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>)

So many things got faster. I'm not surprised by this, as we're not falling through exceptions any more when indexing by a scalar.

I also reran just the one that got slower, and when I did that, it flipped to 62.3±4μs for master and 55.6±0μs for the new version, showing an improvement.

I then ran the benchmarks again to better understand variability:

      before           after         ratio
     [bd4332f4]       [65087a6d]
+         567±0μs         758±50μs     1.34  indexing.DataFrameNumericIndexing.time_bool_indexer
+        72.1±6μs         88.8±0μs     1.23  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>)

And once more:

      before           after         ratio
     [bd4332f4]       [65087a6d]
+        2.15±0ms         62.5±4ms    29.09  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Int64Index'>)
-        92.5±0μs         83.8±0μs     0.91  indexing.NonNumericSeriesIndexing.time_getitem_label_slice('datetime')

Given the inconsistent results (probably affected by other background processes happening on my laptop), I think we're OK.

@jreback
Contributor

jreback commented May 9, 2018

@Dr-Irv IIRC there is a flag for affinity to help prevent this kind of variability; also, I think we have some asv options to run this many times. But close enough for now.

@jreback jreback merged commit e978279 into pandas-dev:master May 9, 2018
@jreback
Contributor

jreback commented May 9, 2018

thanks!

Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves
Development

Successfully merging this pull request may close these issues.

BUG: Series.get() on ExtensionArray series (and Categorical) indexed by integer returns incorrect result
4 participants