CLN/INT: remove Index as a sub-class of NDArray #7891

jreback · 2014-07-31T17:15:09Z

Make Index subclass PandasObject/IndexOpsMixin rather than ndarray.
This should make new Index classes much easier to write (e.g. #7640).

This doesn't change the public API at all, and provides compat

closes #5080
back compat for pickles is now way simpler

ToDo:

closes #5155 (perf fix for Period creation); slight slowdown in plotting
because the plotting routines now hold an array of Periods (rather than a PeriodIndex).

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
period_setitem                               |  16.2210 | 122.9340 |   0.1319 |
timeseries_iter_periodindex                  | 1197.7839 | 6906.1464 |   0.1734 |
timeseries_iter_periodindex_preexit          |  12.4850 |  69.8563 |   0.1787 |
timeseries_period_downsample_mean            |  11.2850 |  11.1457 |   1.0125 |
plot_timeseries_period                       | 107.6056 |  86.5277 |   1.2436 |

cpcloud · 2014-07-31T17:16:05Z

We will call you Mr Anti-NDArray :)

jreback · 2014-07-31T17:16:13Z

cc @Komnomnomnom if you could have a look at the json issues would be great!

jreback · 2014-07-31T17:16:32Z

ndarray sub-classing is SO old world. It's really quite annoying.

cpcloud · 2014-07-31T17:17:27Z

amen to that composition ftw

Komnomnomnom · 2014-08-01T09:10:28Z

No problem guys, I'll take a look this weekend.

jreback · 2014-08-01T13:28:38Z

@sinhrks

would you mind taking a look at this PR and see if you can figure this out?

it seems the PeriodIndex is getting converted to its underlying values (and not preserved, as it is in master).

thxs

======================================================================
ERROR: test_business_freq (pandas.tseries.tests.test_plotting.TestTSPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/home/jreback/pandas/pandas/tseries/tests/test_plotting.py", line 273, in test_business_freq
    self.assertEqual(PeriodIndex(data=idx).freqstr, 'B')
  File "/mnt/home/jreback/pandas/pandas/tseries/period.py", line 598, in __new__
    ordinal, freq = cls._from_arraylike(data, freq, tz)
  File "/mnt/home/jreback/pandas/pandas/tseries/period.py", line 667, in _from_arraylike
    raise ValueError('freq not specified and cannot be '
ValueError: freq not specified and cannot be inferred from first element

======================================================================
ERROR: test_mixed_freq_hf_first (pandas.tseries.tests.test_plotting.TestTSPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/home/jreback/pandas/pandas/tseries/tests/test_plotting.py", line 639, in test_mixed_freq_hf_first
    self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, 'D')
  File "/mnt/home/jreback/pandas/pandas/tseries/period.py", line 598, in __new__
    ordinal, freq = cls._from_arraylike(data, freq, tz)
  File "/mnt/home/jreback/pandas/pandas/tseries/period.py", line 667, in _from_arraylike
    raise ValueError('freq not specified and cannot be '
ValueError: freq not specified and cannot be inferred from first element

sinhrks · 2014-08-01T21:22:09Z

@jreback I think the following lines should be changed so that they don't pass a PeriodIndex. Both tests pass if _mpl_repr is used to pass an ndarray of Period objects explicitly. It doesn't seem to affect how the charts look, but I'll take a further look (especially at the resampling/replot case).

One difference is that the matplotlib axis can no longer hold a PeriodIndex with freq, so freq is re-inferred when the PeriodIndex is reconstructed (which affects other tests that check freq).

-    lines = plotf(ax, series.index, series.values, **kwargs)
+    lines = plotf(ax, series.index._mpl_repr(), series.values, **kwargs)

https://github.com/pydata/pandas/blob/master/pandas/tseries/plotting.py#L64
https://github.com/pydata/pandas/blob/master/pandas/tseries/plotting.py#L155

jreback · 2014-08-01T21:45:35Z

@sinhrks ahh..ok, let me try replacing that and see.

hayd · 2014-08-01T22:50:07Z

Will this make RangeIndex a little easier to implement? xlink #3268 #2420

shoyer · 2014-08-01T22:56:35Z

@hayd I certainly hope so!

I am super stoked about this change.

jreback · 2014-08-01T23:09:38Z

@hayd there is already a lot of work on RangeIndex here: https://github.com/jtratner/pandas/tree/add-range-index

I think it should be a bit easier. However, I have to think about how to fix the real problem, which is that Index.__new__ unfortunately is still used (and each sub-class has a __new__ as well). I think we actually need an IndexFactory (which is then simply called by __new__), with the ability to register / override the creation mechanism.

That will have to wait though. @shoyer when you are ready to integrate IntervalIndex I think we can address that.
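
For illustration only, a registry-based factory of the kind described above might be sketched as follows. Every name here (IndexFactory, BaseIndex, IntIndex, RangeLikeIndex, register, create) is hypothetical and not part of pandas; the sketch only shows __new__ delegating to a factory whose creation rules can be registered and overridden.

```python
class IndexFactory:
    """Hypothetical registry: maps a predicate on the data to an index class."""

    def __init__(self):
        self._creators = []  # (predicate, constructor) pairs, newest first

    def register(self, predicate, constructor):
        # Newer registrations are checked first, so they can override older ones.
        self._creators.insert(0, (predicate, constructor))

    def create(self, data):
        for predicate, constructor in self._creators:
            if predicate(data):
                return constructor(data)
        raise TypeError("no registered index type accepts %r" % type(data))


class BaseIndex:
    """Stand-in for Index: __new__ on the base class only calls the factory."""

    factory = IndexFactory()

    def __new__(cls, data):
        if cls is BaseIndex:
            return cls.factory.create(data)
        self = object.__new__(cls)
        self.data = list(data)
        return self


class IntIndex(BaseIndex):
    pass


class RangeLikeIndex(BaseIndex):
    pass


# The catch-all is registered first, so later, more specific rules win.
BaseIndex.factory.register(lambda data: True, IntIndex)
BaseIndex.factory.register(lambda data: isinstance(data, range), RangeLikeIndex)

print(type(BaseIndex([1, 2])).__name__)
print(type(BaseIndex(range(3))).__name__)
```

With this layout a new index type only has to call register; the base-class __new__ does not need to know about it.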

shoyer · 2014-08-02T06:33:32Z

doc/source/v0.15.0.txt

+
+- pickles <= 0.8.0 may not work if they contain MultiIndexes.
+- you may need to unpickle < 0.15.0 pickles using ``pd.read_pickle`` rathen than ``pickle.loads``. See :ref:`pickle docs <io.pickle>`
+- boolean comparisons of ``DatetimeIndex`` that have ``NaT`` with ``ndarray`` ONLY work if the ndarray is on the right-handle side. An example of this limited case is:


Is this something you could fix (at least for standard ndarrays) by setting __array_priority__ > 1 ?
http://docs.scipy.org/doc/numpy/reference/arrays.classes.html#special-attributes-and-methods

(someone should really update those docs to discourage subclassing!)

hmm maybe if I add __array_prepare__

the issue is that ndarray defines 'lt' for example and I don't know any way to have it reverse the args and call 'ge' on the index instead

do u?

@shoyer I have determined that this CAN be done by intercepting (and interpreting the context) in the __array_prepare__ call (similar to what is done in core/series.py/__array_prepare__). However IMHO this is pretty complicated (as you would need to translate the ufunc and reverse the arguments). Leaving it off for now with just the docs warning. I think this is a very limited case anyhow.

I don't think use of __array_prepare__ is necessary. Numpy will call __gt__ instead of __lt__ if you set a higher array priority for the second argument:

    import numpy as np

    class ArrayLike(object):
        def __init__(self, array, priority):
            self.array = array
            self.__array_priority__ = priority

        def __array__(self):
            return self.array

        # each comparison reports which method numpy ended up calling
        def __lt__(self, other):
            print('subclass used lt')
            return self.array < other

        def __le__(self, other):
            print('subclass used le')
            return self.array <= other

        def __eq__(self, other):
            print('subclass used eq')
            return self.array == other

        def __ne__(self, other):
            print('subclass used ne')
            return self.array != other

        def __gt__(self, other):
            print('subclass used gt')
            return self.array > other

        def __ge__(self, other):
            print('subclass used ge')
            return self.array >= other

Examples:

    In [3]: np.array([0, 0]) < ArrayLike(np.array([1, 1]), priority=None)
    Out[3]: array([ True, True], dtype=bool)

    In [4]: np.array([0, 0]) <= ArrayLike(np.array([1, 1]), priority=2)
    subclass used ge
    Out[4]: array([ True, True], dtype=bool)

    In [5]: 0 <= ArrayLike(np.array([1, 1]), priority=2)
    subclass used ge
    Out[5]: array([ True, True], dtype=bool)

jreback · 2014-08-02T13:30:54Z

@sinhrks I have made the changes for plotting here. jreback@1fc0a1f

can you confirm that the graphs act/look the same? (esp when resampled/zoomed and such)

as an aside, this now makes all lines, whether DatetimeIndex or PeriodIndex, hold arrays of the underlying objects (datetimes/Periods).

jreback · 2014-08-02T19:31:35Z

@shoyer that was a good idea to use __array_priority__, see here: jreback@29d2f44

you have to catch NotImplemented (type) but that's ok

jreback · 2014-08-03T01:46:47Z

@sinhrks I updated this commit a couple of times to put in some period construction optimizations....

pls lmk about the plotting (if you think it's ok). (all tests now pass).

jreback@2f32f46

sinhrks · 2014-08-03T12:27:05Z

@jreback I've checked some plots, and they work the same as before. If anything comes up, I'll check again.

Komnomnomnom · 2014-08-04T09:58:09Z

hey @jreback just taking a look at the json stuff now

shoyer · 2014-08-04T17:22:32Z

pandas/core/index.py

        result = func(other)
+        if result is NotImplemented:


OK, so I finally figured out what is happening here.

Numpy is not performing (ndarray, Index) comparisons here (returning NotImplemented) because Index has a higher array priority.

The fix is to always coerce the right-side argument into a plain ndarray. e.g., result = func(np.asarray(other)) (better to use asarray than array to avoid unnecessary copies). If you do that, you will be able to skip the NotImplemented check.
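
A minimal sketch of that suggestion, assuming a toy wrapper class rather than the real Index (LabelArray and _compare are hypothetical names): the class sets a high __array_priority__ so that ndarray defers to it, and each comparison coerces the other operand with np.asarray before comparing, so the inner comparison is always between plain ndarrays.

```python
import operator

import numpy as np


class LabelArray:
    """Toy stand-in for an Index that wraps, rather than subclasses, ndarray."""

    # Higher than ndarray's default priority, so ndarray defers comparisons to us.
    __array_priority__ = 1000

    def __init__(self, values):
        self._values = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        # Lets np.asarray(obj) recover the wrapped ndarray.
        return self._values if dtype is None else self._values.astype(dtype)

    def _compare(self, other, op):
        # Coerce first: the comparison below is then ndarray vs ndarray,
        # so there is no NotImplemented result to check for.
        return op(self._values, np.asarray(other))

    def __lt__(self, other):
        return self._compare(other, operator.lt)

    def __le__(self, other):
        return self._compare(other, operator.le)

    def __gt__(self, other):
        return self._compare(other, operator.gt)

    def __ge__(self, other):
        return self._compare(other, operator.ge)


a = LabelArray([1, 2, 3])
b = LabelArray([3, 2, 1])
print(a < b)                       # wrapper on both sides
print(np.array([2, 2, 2]) <= a)    # ndarray on the left
```

np.asarray is used rather than np.array so that an operand that is already an ndarray is not copied.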

jreback · 2014-08-04T17:38:24Z

@shoyer jreback@7d66157

done.

also I have put in (well, a little) structure on the Index testing for more generic testing, e.g. jreback@d1c4fbb

so prob worthwhile after this PR is merged to 'fix' the index tests to make them more class based (how Float64/Int64 are done), to make them a bit more generic. E.g.

jreback · 2014-08-05T16:42:05Z

ok monster is ready to merge. any further comments?

has back compat for pickles (I dropped < 0.8.0). I suppose I could resurrect it, any reason to though?
perf is good
testing / infrastructure is better for sub-classing
couple of todos in INT: followup to Index not a sub-class of ndarray #7904

@jorisvandenbossche @cpcloud @shoyer @sinhrks @hayd
cc @immerrr

I think I understand pickle and all of its evils now :) (not sure if that is a net benefit to society though)

shoyer · 2014-08-05T18:24:47Z

pandas/core/index.py

+
+    __array_priority__ = 1000
+
+    def __array_prepare__(self, result, context=None):


I don't think this does anything.

oh __array_prepare__ yes I know....sort of left it in there ...will take out

jorisvandenbossche · 2014-08-06T08:14:09Z

pandas/core/generic.py

@@ -2137,14 +2137,14 @@ def copy(self, deep=True):
        ----------
        deep : boolean, default True
            Make a deep copy, i.e. also copy data
+        axes : string or None, default None
+            View copy of the axes


I don't really understand what this means?

I agree, maybe it's better to merge this with deep argument, e.g.

deep=False : shallow copy

deep=True : deep copy of values, shallow copy of axes

deep='withaxes' : deep copy of everything (withaxes could be any token that clarifies the meaning)

Would be nice to have deep=True to deep-copy everything and deep='values'/deep='axes' to pick only one component, but that seems non-backward compatible.

or maybe accept deep='values' as an alias for deep=True and deep=('values', 'axes') to deep-copy everything

and if we keep the axes arg, I would rather make it a bool like deep

this was all a kludge to avoid repeating code; there is exactly 1 case where I need this: reduce.pyx/Reducer. Basically I need to make a complete copy of an object including a deep copy of its index.

For the index, deep=True does not copy the actual data; rather it is a view on it. This preserves numpy semantics, so memory is shared. We never actually need to copy index data memory as indexes are immutable and so cannot be changed. We always just create a new object (with possibly shared memory).

Meta-data is a different story (e.g. .name), where we almost always want/need to copy this (e.g. .view uses ._shallow_copy for this purpose).

However, in this reducer because of how it actually messes with the pointers, I do actually need to copy the memory.

I needed a 'private' way of doing that. So either make axes private, or just overload deep (default is still always True). Will change to deep=True|False|'all'.

The user never needs to actually copy the index data as it is a view and numpy takes care of that. This is an internal usage.
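
The three modes discussed in this thread can be sketched with a toy container. This is an illustration under stated assumptions, not the pandas implementation: Labeled is a hypothetical class holding a values array and an axis array, and deep='all' is the token proposed above.

```python
import numpy as np


class Labeled:
    """Toy container: a values array plus an axis (label) array."""

    def __init__(self, values, axis):
        self.values = values
        self.axis = axis

    def copy(self, deep=True):
        if isinstance(deep, str) and deep == 'all':
            # Copy everything, including the memory behind the axis labels.
            return Labeled(self.values.copy(), self.axis.copy())
        if deep:
            # Copy the values; the axis is a new object viewing the same memory.
            return Labeled(self.values.copy(), self.axis.view())
        # Shallow: new container, both arrays are views on the original memory.
        return Labeled(self.values.view(), self.axis.view())


orig = Labeled(np.arange(3.0), np.array([10, 20, 30]))

for mode in (False, True, 'all'):
    c = orig.copy(deep=mode)
    print(mode,
          np.shares_memory(orig.values, c.values),
          np.shares_memory(orig.axis, c.axis))
```

Only deep='all' gives the copy its own axis memory, which is the one case the reducer needs.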

jorisvandenbossche · 2014-08-06T09:00:43Z

@jreback Added a bunch of comments (mainly on docs and public API, not familiar enough to comment on technical details)

Further, I wondered, are there things we have learned from the "Series -> NDFrame subclass and no longer ndarray subclass" move that can be relevant here? Issues that came up afterwards (where we had to say "series is no longer ndarray subclass, so this will not work anymore") that we can now warn for beforehand?
The whatsnew section then has some more warnings (http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#internal-refactoring).

immerrr · 2014-08-06T09:32:46Z

pandas/core/index.py

@@ -1894,8 +2008,75 @@ def drop(self, labels):
            raise ValueError('labels %s not contained in axis' % labels[mask])
        return self.delete(indexer)

+    @classmethod
+    def _add_numeric_methods_disabled(cls):


I wonder if _add_numeric_methods_disabled and _add_numeric_methods could be put into pandas.core.ops module to reuse/be reused from such methods implemented for other containers.

I started doing this, but the ops.py was a bit too specific. It could/should be fixed I think. but would require some dedicated effort. Feel free!

jreback · 2014-08-06T13:10:23Z

@jorisvandenbossche
jreback@72f491a#diff-d41d8cd98f00b204e9800998ecf8427e

I just update the properties directly, better I think because then don't have to clutter with the full ndarray definitions.

jorisvandenbossche · 2014-08-06T13:30:06Z

OK, it is a bit of a compromise between both, as for some functions it can also be useful information, while for others it is too much clutter.
For some you could also add

See also
----------
numpy.ndarray.flat

jreback · 2014-08-06T13:33:18Z

ahh ok, that makes sense

jreback · 2014-08-06T15:12:04Z

@jorisvandenbossche ok added a lot of see alsos (both series and index), and put doc-strings on lots of attributes.

jreback · 2014-08-06T16:44:02Z

ok, think this is ready.

I put back the MultiIndex support for really old pickles (wasn't hard). though not sure anyone really has them around.

any final comments

@jorisvandenbossche @cpcloud @immerrr @sinhrks

jreback · 2014-08-06T17:11:30Z

@jorisvandenbossche added the rest of the properties (and now consolidated in 1 place)

jreback@9f86df2

CLN: add searchsorted to core/base (GH6712, GH7447, GH6469)
fixup tests in test_timeseries for reverse ndarray/datetimeindex comparisons
fix algos / multi-index repeat (essentially this is a bug-fix)
ENH: add NumericIndex and operators, related (GH7439)
DOC: indexing/v0.15.0 docs
TST: fixed up plotting issues
COMPAT/API: use __array_priority__ to facility proper comparisons of DatetimeIndex with ndarrays
fixup to do actual views in copy (except in reduce where its needed)
COMPAT: numpy compat with 1.6 for np.may_share_memory
FIX: access values attr in JSON code to support index that's not an ndarry subclass
COMPAT: numpy compat with array_priority fix
CLN: remove constructor pickle compat code as not necessary
COMPAT: fix pickle in sparse
CLN: clean up shallow_copy/simple_new
COMPAT: pickle compat
remove __array_prepare__
COMPAT: tests & compat for numeric operation support only on supported indexes
DOC: fixup for comments
COMPAT: allow older MultiIndex pickles again
CLN: combine properties from index/series for ndarray compat

jreback · 2014-08-07T11:34:52Z

ok, bombs away...

CLN/INT: remove Index as a sub-class of NDArray

jorisvandenbossche · 2014-08-07T11:38:31Z

Nice!

immerrr · 2014-08-07T11:52:52Z

That was huge, great job

cpcloud · 2014-08-07T12:10:58Z

Bravo!

jreback added API Design labels Jul 31, 2014

jreback added this to the 0.15.0 milestone Jul 31, 2014

jreback added the MultiIndex label Jul 31, 2014

jreback added the Refactor label Jul 31, 2014

shoyer reviewed Aug 2, 2014

jreback mentioned this pull request Aug 2, 2014

INT: followup to Index not a sub-class of ndarray #7904

Closed

7 tasks

ischwabacher mentioned this pull request Aug 2, 2014

BUG: date_range(str, tz=str) and date_range(Timestamp) handle tz discontinuity differently #7835

Closed

jreback mentioned this pull request Aug 3, 2014

PERF: fastpath on Period construction from PeriodIndex #5155

Closed

shoyer reviewed Aug 4, 2014

shoyer reviewed Aug 5, 2014

jorisvandenbossche reviewed Aug 6, 2014

immerrr reviewed Aug 6, 2014

jreback added a commit that referenced this pull request Aug 7, 2014

Merge pull request #7891 from jreback/index

c7bfb4e

CLN/INT: remove Index as a sub-class of NDArray

jreback merged commit c7bfb4e into pandas-dev:master Aug 7, 2014

jreback mentioned this pull request Aug 7, 2014

Unify index and multindex (and possibly others) API #3268

Closed

17 tasks

jorisvandenbossche mentioned this pull request Oct 23, 2014

Plotting of DatetimeIndex directly with matplotlib no longer gives datetime formatted axis (0.15) #8614

Closed

This was referenced Jan 27, 2015

TST add a test for repeat() method with MultiIndex, referenced in #9361 #9362

Merged

bug in repeat() method with MultiIndex - fixed in 0.15 #9361

Closed

jreback pushed a commit that referenced this pull request Jan 27, 2015

TEST add a test for repeat() method with MultiIndex, referenced in #9361

01de130
