REF: Back IntervalArray by array instead of Index #36310

jbrockmendel · 2020-09-12T19:04:34Z

The benefit I have in mind here is that we could back it by a single 2xN array and a) avoid the kludge needed to make __setitem__ atomic, b) take a view to get native types for e.g. uniqueness checks, c) possibly share some methods with NDArrayBackedExtensionArray.

Also just in principle having EAs not depend on Index is preferable dependency-structure-wise.
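
To make points (a) and (b) concrete, here is a minimal NumPy-only sketch. It is not the pandas implementation: it assumes an (N, 2) layout rather than the 2xN wording above, so that each left/right pair is contiguous in memory, and the names `combined`, `pair_dtype`, and `pairs` are purely illustrative.

```python
import numpy as np

left = np.array([0, 1, 1, 3], dtype="int64")
right = np.array([1, 2, 2, 4], dtype="int64")

# One buffer holding both endpoints; row i is (left[i], right[i]).
combined = np.ascontiguousarray(np.stack([left, right], axis=1))

# (a) Writing one row updates both endpoints in a single assignment,
#     so there is no window where only one side has been modified.
combined[3] = (5, 6)

# (b) Reinterpret each 16-byte row as one structured element, which lets
#     np.unique compare whole intervals instead of individual endpoints.
pair_dtype = np.dtype([("left", combined.dtype), ("right", combined.dtype)])
pairs = combined.view(pair_dtype).ravel()
is_unique = len(np.unique(pairs)) == len(pairs)
```

Rows 1 and 2 hold the same interval, so `is_unique` comes out false here. Whether a structured view or some other reinterpretation is the better choice for real dtypes is left open.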

cc @jschendel

pandas/core/arrays/interval.py

jreback · 2020-09-12T21:01:20Z

pandas/core/arrays/interval.py

-        right = self.right.fillna(value=value.right)
+        from pandas import Index
+
+        left = Index(self.left).fillna(value=value.left)


umm why do you need to coerce to .fillna?

ATM self.left is an ndarray, which doesn't have fillna
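
A small sketch of the point being made, using only public NumPy and pandas calls. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

left = np.array([0.0, np.nan, 2.0])

# A plain ndarray has no fillna method ...
has_fillna = hasattr(left, "fillna")

# ... so the diff wraps it in an Index to borrow Index.fillna.
filled = pd.Index(left).fillna(value=1.0)
```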

pandas/core/arrays/interval.py

jbrockmendel · 2020-09-13T16:31:45Z

AFAICT the remaining test failures are from arrow

jbrockmendel · 2020-09-14T19:01:13Z

cc @jorisvandenbossche are the pyarrow failures here something we can address on our end?

jorisvandenbossche · 2020-09-16T09:59:37Z

Those are our own tests (testing the conversion that is implemented in pandas itself), so that's something that needs to be fixed in this PR.

Looking at the failure, it seems that setting a missing value in an IntervalArray does not/no longer set a missing value in both left and right arrays. But looking at the changes in __setitem__ in the diff, I don't directly see why that would be the case.
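
The expected behaviour described here can be checked through the public API. This is a sketch of the invariant, not the failing pyarrow test itself.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.IntervalArray.from_breaks([0.0, 1.0, 2.0, 3.0])

# Setting a missing value should mark both endpoints as missing.
arr[1] = np.nan

left_missing = pd.isna(arr.left[1])
right_missing = pd.isna(arr.right[1])
```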

jorisvandenbossche

For backwards compatibility, we could keep .left and .right returning an Index? (since the arrays are actually stored as ._left and ._right)

pandas/core/arrays/interval.py

TomAugspurger · 2020-09-16T18:54:59Z

Nice idea, +1 to changing the storage.

We shouldn't change IntervalArray.left from an index to ndarray without warning. I'm fine with just continuing to return an Index. Or if we really want to change it we can deprecate IntervalArray.left in favor of IntervalArray.left_array.

jreback · 2020-09-19T00:47:43Z

pandas/_testing.py

-    assert_index_equal(left.right, right.right, exact=exact, obj=f"{obj}.left")
+    if left._left.dtype.kind in ["m", "M"]:
+        # We have a DatetimeArray or Timed
+        # TODO: `exact` keyword?


maybe better to

    kwargs = {}
    if left._left.dtype.kind in ["m", "M"]:
        kwargs['check_freq'] = False
    ....

pandas/_testing.py

pandas/core/arrays/interval.py

jreback · 2020-09-19T00:50:37Z

pandas/core/arrays/interval.py

-                new_right = self.right.astype(dtype.subtype)
+                # We need to use Index rules for astype to prevent casting
+                #  np.nan entries to int subtypes
+                new_left = Index(self._left).astype(dtype.subtype)


copy=False ?
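
For context on the code comment in this hunk, here is a sketch of what "Index rules for astype" buys. It assumes, as is the case in the pandas versions I am aware of, that `Index.astype` raises a `ValueError` (or a subclass) instead of casting NaN to an integer.

```python
import numpy as np
import pandas as pd

left = np.array([0.0, np.nan, 2.0])

try:
    # Going through Index refuses the cast instead of producing a
    # meaningless integer in place of NaN.
    pd.Index(left).astype("int64")
except ValueError as err:
    outcome = f"refused: {err}"
else:
    outcome = "cast succeeded"
```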

jreback · 2020-09-19T00:51:10Z

pandas/core/arrays/interval.py

-            fill_value = self.left._na_value
+            from pandas import Index
+
+            fill_value = Index(self._left)._na_value


why can't we get the fill value rather than doing this? if we need to do this add copy=False

jreback · 2020-09-19T00:51:35Z

pandas/core/indexes/interval.py

@@ -865,6 +863,22 @@ def _convert_list_indexer(self, keyarr):

    # --------------------------------------------------------------------

+    @cache_readonly
+    def left(self) -> Index:
+        return Index(self._values.left)


copy=False on these?

can you add this?
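
A sketch of what `copy=False` is asking for when wrapping a stored array in an Index. The variable names are illustrative, and whether the buffer is actually shared is reported by NumPy rather than asserted here.

```python
import numpy as np
import pandas as pd

left_values = np.array([0, 1, 2], dtype="int64")

# copy=False asks the Index constructor not to copy the input when it
# can use the array as-is.
left_index = pd.Index(left_values, copy=False)

# np.shares_memory reports whether the two objects overlap in memory.
shares_buffer = np.shares_memory(left_values, left_index.to_numpy())
```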

jbrockmendel · 2020-09-22T21:38:17Z

i think comments have been addressed, LMK if i missed anything

jreback · 2020-09-22T22:10:53Z

pandas/core/indexes/interval.py

@@ -201,6 +197,8 @@ class IntervalIndex(IntervalMixin, ExtensionIndex):
    _mask = None

    _data: IntervalArray
+    _values: IntervalArray


umm, now i am confused, what is different about these?

without this, mypy thinks _values is ExtensionArray and has a bunch of new complaints since we access self._values.left below

kk, should try to remove this at some point
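
The annotation pattern discussed here can be illustrated without pandas. The classes below are hypothetical stand-ins that only share names with the real ones. Re-declaring the attribute on the subclass narrows its static type for a checker such as mypy and has no runtime effect.

```python
class ExtensionArray:
    """Stand-in for a generic base array type."""


class IntervalArray(ExtensionArray):
    """Stand-in subclass that adds a ``left`` attribute."""

    def __init__(self, left):
        self.left = left


class ExtensionIndex:
    _values: ExtensionArray  # broad type declared on the base class

    def __init__(self, values: ExtensionArray) -> None:
        self._values = values


class IntervalIndex(ExtensionIndex):
    # Without this line a type checker sees ExtensionArray, which has
    # no ``left`` attribute, and flags the access below.
    _values: IntervalArray

    def first_left(self):
        return self._values.left[0]


idx = IntervalIndex(IntervalArray(left=[0, 1, 2]))
```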

jreback · 2020-09-22T22:11:10Z

pandas/core/indexes/interval.py

@@ -865,6 +863,22 @@ def _convert_list_indexer(self, keyarr):

    # --------------------------------------------------------------------

+    @cache_readonly
+    def left(self) -> Index:
+        return Index(self._values.left)


can you add this?

jreback · 2020-09-24T01:33:45Z

this is marked POC, but i assume want to merge it?

jbrockmendel · 2020-09-24T17:26:28Z

> this is marked POC, but i assume want to merge it?

Yes, updated the title.

jbrockmendel · 2020-09-30T18:45:33Z

any more comments here?

TomAugspurger

Looks good generally. Just one change needed in the whatsnew.

This should improve the performance of IntervalArray.__setitem__, right? That could be added in the release notes.

TomAugspurger · 2020-09-30T19:47:19Z

doc/source/whatsnew/v1.2.0.rst

@@ -120,6 +120,7 @@ Other enhancements
 - `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
 - :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
 - :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)
+- :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`)


The docs should only reference public functions. How would users actually pass this through a public API? I suspect it's impossible, since check_freq in assert_frame_equal applies to the index rather than values of an array.

So assert_extension_array_equal could perhaps take kwargs and pass it through. But that's maybe not worth the effort.

so remove this note?

yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?

TomAugspurger · 2020-09-30T19:53:03Z

pandas/core/arrays/interval.py

@@ -588,7 +595,7 @@ def __eq__(self, other):
        if is_interval_dtype(other_dtype):
            if self.closed != other.closed:
                return np.zeros(len(self), dtype=bool)
-            return (self.left == other.left) & (self.right == other.right)
+            return (self._left == other.left) & (self._right == other.right)


Is other here known to be a particular type (like IntervalArray), or is it something like Union[IntervalArray, Series, Index, Interval]? If it's just IntervalArray it'd be a bit faster to compare with other._left and other._right.

at this point other can be an Interval or IntervalArray
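
The comparison in this hunk boils down to an elementwise check on both endpoints, short-circuited when the `closed` sides differ. A NumPy-only sketch follows; the function name and signature are hypothetical and not part of pandas.

```python
import numpy as np


def interval_arrays_equal(left_a, right_a, closed_a, left_b, right_b, closed_b):
    # Intervals closed on different sides are never equal.
    if closed_a != closed_b:
        return np.zeros(len(left_a), dtype=bool)
    # Otherwise both endpoints must match, element by element.
    return (left_a == left_b) & (right_a == right_b)


left = np.array([0, 1, 2])
right = np.array([1, 2, 3])

same_closed = interval_arrays_equal(
    left, right, "right", left, np.array([1, 2, 4]), "right"
)
different_closed = interval_arrays_equal(left, right, "right", left, right, "left")
```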

jbrockmendel · 2020-09-30T21:51:41Z

> This should improve the performance of IntervalArray.__setitem__, right? That could be added in the release notes.

In [2]: ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]) 
In [3]: ia = ii._data                                                           
In [4]: val = ia[0]                                                             

In [5]: %timeit ia[-1] = val                                                    
38.4 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- master
23.9 µs ± 335 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

note added to whatsnew.

jreback · 2020-10-01T02:11:32Z

doc/source/whatsnew/v1.2.0.rst

@@ -120,6 +120,7 @@ Other enhancements
 - `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
 - :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
 - :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)
+- :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`)


yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?

jorisvandenbossche · 2020-10-01T07:21:34Z

pandas/_testing.py

+        # We have a DatetimeArray or TimedeltaArray
+        kwargs["check_freq"] = False
+
+    # TODO: `exact` keyword?


Does this TODO need to be solved first? It was there before, but you now removed it? (so it is not being ignored?)

i think we're ok without the keyword, will remove the comment

jorisvandenbossche · 2020-10-01T07:30:34Z

pandas/core/arrays/interval.py

+        from pandas.core.ops.array_ops import maybe_upcast_datetimelike_array
+
+        left = maybe_upcast_datetimelike_array(left)
+        left = extract_array(left, extract_numpy=True)


Above we are first ensuring that the arrays passed to _simple_new are an index, and then we extract the array again. Is this roundtrip to index and back needed?

I think we can avoid the roundtrip eventually, will be best accomplished by being stricter in what we pass to _simple_new

Sounds good, would be a nice follow-up

pandas/core/indexes/interval.py

jorisvandenbossche · 2020-10-01T07:38:21Z

pandas/core/arrays/interval.py

-        right._values[key] = value_right
-        self._right = right
+        self._left[key] = value_left
+        self._right[key] = value_right  # TODO: needs tests for not breaking views


Isn't the un-xfail-ed test doing that?
(or if not, can you add a test now?)

good catch, will remove comment
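
For reference, a check along the lines of the views test mentioned here might look like the following. It is a sketch written against the public API, assuming post-PR behaviour where `__setitem__` writes into the existing buffers.

```python
import pandas as pd

data = pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3])

# Two ways of obtaining a view on the same underlying data.
view1 = data.view()
view2 = data[:]

# An in-place write on the original should be visible through both views.
data[0] = data[1]
```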

jbrockmendel · 2020-10-02T15:27:14Z

updated per comments

jreback

minor comment

jreback · 2020-10-02T21:57:55Z

pandas/core/arrays/interval.py

-                new_right = self.right.astype(dtype.subtype)
+                # We need to use Index rules for astype to prevent casting
+                #  np.nan entries to int subtypes
+                new_left = Index(self._left, copy=False).astype(dtype.subtype)


could add copy=False to .astype (not sure how much any of this matters though)

merge is ok now and can see if this matters on followup

pandas/core/arrays/interval.py

jreback · 2020-10-02T21:59:05Z

pandas/core/indexes/interval.py

@@ -201,6 +197,8 @@ class IntervalIndex(IntervalMixin, ExtensionIndex):
    _mask = None

    _data: IntervalArray
+    _values: IntervalArray


kk, should try to remove this at some point

POC: back IntervalArray by array instead of Index

8987a0e

jreback reviewed Sep 12, 2020

pandas/core/arrays/interval.py

jreback added the Interval Interval data type label Sep 12, 2020

jreback requested changes Sep 12, 2020

jbrockmendel added 3 commits September 12, 2020 18:08

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

153c87a

…f-ia-ii

Fix failing copy/view tests

8164099

mypy fixup

d545dac

jorisvandenbossche reviewed Sep 16, 2020

pandas/core/arrays/interval.py

jbrockmendel added 2 commits September 16, 2020 08:20

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

6050ec8

…f-ia-ii

Avoid having left and right view the same data

bd6231c

jbrockmendel added 6 commits September 16, 2020 12:29

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

548efe6

…f-ia-ii

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

124938e

…f-ia-ii

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

c479e0a

…f-ia-ii

Restore left and right as Indexes

c4a2229

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

97a0bed

…f-ia-ii

TST: test_left_right_dont_share_data

bfa13bb

jreback requested changes Sep 19, 2020

jbrockmendel added 3 commits September 20, 2020 11:21

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

b45ed46

…f-ia-ii

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

e6d4bd9

…f-ia-ii

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

266512f

…f-ia-ii

jreback added this to the 1.2 milestone Sep 22, 2020

jreback requested changes Sep 22, 2020

jbrockmendel added 2 commits September 22, 2020 15:13

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

f16be73

…f-ia-ii

pass copy=False

4efdc08

jbrockmendel changed the title ~~POC: back IntervalArray by array instead of Index~~ Back IntervalArray by array instead of Index Sep 24, 2020

TomAugspurger reviewed Sep 30, 2020

jbrockmendel added 2 commits September 30, 2020 14:43

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

1ed9623

…f-ia-ii

perf note

ed6a932

jreback requested changes Oct 1, 2020

jbrockmendel added 2 commits September 30, 2020 19:14

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

fee70d8

…f-ia-ii

revert whatsnew

1a22095

jorisvandenbossche changed the title ~~Back IntervalArray by array instead of Index~~ REF: Back IntervalArray by array instead of Index Oct 1, 2020

jorisvandenbossche reviewed Oct 1, 2020

jbrockmendel added 2 commits October 1, 2020 09:36

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

490a8f5

…f-ia-ii

update per comments

865b3fc

jreback requested changes Oct 2, 2020

jreback approved these changes Oct 2, 2020

jreback merged commit 089fad9 into pandas-dev:master Oct 2, 2020

jbrockmendel deleted the ref-ia-ii branch October 3, 2020 00:28

jbrockmendel mentioned this pull request Oct 11, 2020

BUG: IntervalArray.__setitem__ creates copies incorrectly #27147

Closed

TomAugspurger mentioned this pull request Oct 13, 2020

REF: back IntervalArray by a single ndarray #37047

Merged

5 tasks

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

REF: Back IntervalArray by array instead of Index (pandas-dev#36310)

42a5e9e
