Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN/INT: remove Index as a sub-class of NDArray #7891

Merged
merged 1 commit into from
Aug 7, 2014

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Jul 31, 2014

make Index now subclass PandasObject/IndexOpsMixin rather than ndarray
should allow much easier new Index classes (e.g. #7640)

This doesn't change the public API at all, and provides compat

closes #5080
back compat for pickles is now way simpler

ToDo:

closes #5155 (perf fix for Period creation), slight increase on the plotting
because of the the plottling routines holding array of Periods (rather than PeriodIndex).

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
period_setitem                               |  16.2210 | 122.9340 |   0.1319 |
timeseries_iter_periodindex                  | 1197.7839 | 6906.1464 |   0.1734 |
timeseries_iter_periodindex_preexit          |  12.4850 |  69.8563 |   0.1787 |
timeseries_period_downsample_mean            |  11.2850 |  11.1457 |   1.0125 |
plot_timeseries_period                       | 107.6056 |  86.5277 |   1.2436 |

@jreback jreback added this to the 0.15.0 milestone Jul 31, 2014
@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2014

We will call you Mr Anti-NDArray :)

@jreback
Copy link
Contributor Author

jreback commented Jul 31, 2014

cc @Komnomnomnom if you could have a look at the json issues would be great!

@jreback
Copy link
Contributor Author

jreback commented Jul 31, 2014

ndarray sub-classing is SO old world. Its really quite annoying.

@cpcloud
Copy link
Member

cpcloud commented Jul 31, 2014

amen to that composition ftw

@Komnomnomnom
Copy link
Contributor

No problem guys, I'll take a look this weekend.

@jreback
Copy link
Contributor Author

jreback commented Aug 1, 2014

@sinhrks

would you mind taking a look at this PR and see if you can figure this out?

seem the PeriodIndex is getting converted to underlying (and not preserverd) like in master.....

thxs

======================================================================
ERROR: test_business_freq (pandas.tseries.tests.test_plotting.TestTSPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/home/jreback/pandas/pandas/tseries/tests/test_plotting.py", line 273, in test_business_freq
    self.assertEqual(PeriodIndex(data=idx).freqstr, 'B')
  File "/mnt/home/jreback/pandas/pandas/tseries/period.py", line 598, in __new__
    ordinal, freq = cls._from_arraylike(data, freq, tz)
  File "/mnt/home/jreback/pandas/pandas/tseries/period.py", line 667, in _from_arraylike
    raise ValueError('freq not specified and cannot be '
ValueError: freq not specified and cannot be inferred from first element

======================================================================
ERROR: test_mixed_freq_hf_first (pandas.tseries.tests.test_plotting.TestTSPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/home/jreback/pandas/pandas/tseries/tests/test_plotting.py", line 639, in test_mixed_freq_hf_first
    self.assertEqual(PeriodIndex(data=l.get_xdata()).freq, 'D')
  File "/mnt/home/jreback/pandas/pandas/tseries/period.py", line 598, in __new__
    ordinal, freq = cls._from_arraylike(data, freq, tz)
  File "/mnt/home/jreback/pandas/pandas/tseries/period.py", line 667, in _from_arraylike
    raise ValueError('freq not specified and cannot be '
ValueError: freq not specified and cannot be inferred from first element

@sinhrks
Copy link
Member

sinhrks commented Aug 1, 2014

@jreback I think following lines should be changed not to pass PeriodIndex. Both tests can be passed by using _mpl_repr to pass ndarray of Period objects explicitly. It looks no affect to chart-looking, but will take a further look (especially resampling/replot case).

One difference is matplotlib axis no longer can hold PeriodIndex wigh freq, thus freq will be re-inferred during PeriodIndex reconstruction (and affects to other tests which checks freq)

-    lines = plotf(ax, series.index, series.values, **kwargs)
+    lines = plotf(ax, series.index._mpl_repr(), series.values, **kwargs)

https://github.com/pydata/pandas/blob/master/pandas/tseries/plotting.py#L64
https://github.com/pydata/pandas/blob/master/pandas/tseries/plotting.py#L155

@jreback
Copy link
Contributor Author

jreback commented Aug 1, 2014

@sinhrks ahh..ok, lmk try to replace that and see.

@hayd
Copy link
Contributor

hayd commented Aug 1, 2014

Will this make RangeIndexs a little easier to implement? xlink #3268 #2420

@shoyer
Copy link
Member

shoyer commented Aug 1, 2014

@hayd I certainly hope so!

I am super stoked about this change.

@jreback
Copy link
Contributor Author

jreback commented Aug 1, 2014

@hayd their is already a lot of work on RangeIndex here: https://github.com/jtratner/pandas/tree/add-range-index

I think it should be a bit easier. However, I have to think about how to fix the real problem, which is that Index.__new__ unfortunatelly is still used (and each sub-class has a __new__ as well). I think actually need a IndexFactory (which is then simply called by __new__), with the ability to register / override the creation mechanism.

That will have to wait though. @shoyer when you are ready to integrate IntervalIndex I think we can address that.


- pickles <= 0.8.0 may not work if they contain MultiIndexes.
- you may need to unpickle < 0.15.0 pickles using ``pd.read_pickle`` rathen than ``pickle.loads``. See :ref:`pickle docs <io.pickle>`
- boolean comparisons of ``DatetimeIndex`` that have ``NaT`` with ``ndarray`` ONLY work if the ndarray is on the right-handle side. An example of this limited case is:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something you could fix (at least for standard ndarrays) by setting __array_priority__ > 1 ?
http://docs.scipy.org/doc/numpy/reference/arrays.classes.html#special-attributes-and-methods

(someone should really update those docs to discourage subclassing!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe if I add array_prepare

the issue is that ndarray defines 'lt' for example and I don't know anyway to have it reverse the args and call 'ge' on the index instead

do u?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer I have determined that this CAN be done by intercepting (and interpreting the context) in the __array_preprare__ call (similar to what is done in core/series.py/__array_prepare__). However IMHO this is pretty complicated (as you would need to translate the ufunc and reverse the arguments). Leaving it off for now with just the docs warning. I think this is a very limited case anyhow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think use of __array_prepare__ is necessary. Numpy will call gt instead of lt if you set a higher array priority for the second argument:

class ArrayLike(object):
    def __init__(self, array, priority):
        self.array = array
        self.__array_priority__ = priority

    def __array__(self):
        return self.array

    def __lt__(self, other):
        print 'subclass used lt'
        return self.array < other

    def __le__(self, other):
        print 'subclass used le'
        return self.array <= other

    def __eq__(self, other):
        print 'subclass used eq'
        return self.array == other

    def __ne__(self, other):
        print 'subclass used ne'
        return self.array != other

    def __gt__(self, other):
        print 'subclass used gt'
        return self.array > other

    def __ge__(self, other):
        print 'subclass used ge'
        return self.array >= other

Examples:

In [3]: np.array([0, 0]) < ArrayLike(np.array([1, 1]), priority=None)
Out[3]: array([ True,  True], dtype=bool)

In [4]: np.array([0, 0]) <= ArrayLike(np.array([1, 1]), priority=2)
subclass used ge
Out[4]: array([ True,  True], dtype=bool)

In [5]: 0 <= ArrayLike(np.array([1, 1]), priority=2)
subclass used ge
Out[5]: array([ True,  True], dtype=bool)

@jreback
Copy link
Contributor Author

jreback commented Aug 2, 2014

@sinhrks I have made the changes for plotting here. jreback@1fc0a1f

can you confirm that the graphs act/look the same? (esp when resampled/zoomed and such)

as an aside this now makes all lines whether DatetimeIndex or PeriodIndex be arrays of the underlying object (datetimes/Period).

@jreback
Copy link
Contributor Author

jreback commented Aug 2, 2014

@shoyer that was a good idea to use _array_priority__, see here: jreback@29d2f44

you have to catch Notimplemented (type) but that 's ok

@jreback
Copy link
Contributor Author

jreback commented Aug 3, 2014

@sinhrks I updated this commit a couple of times to put in some period construction optimizations....

pls lmk about the plottnig (if you think its ok). (all tests now pass).

jreback@2f32f46

@sinhrks
Copy link
Member

sinhrks commented Aug 3, 2014

@jreback I've checked some plots, and these worked as the same as before. If anything, I'll confirm again.

@Komnomnomnom
Copy link
Contributor

hey @jreback just taking a look at the json stuff now

result = func(other)
if result is NotImplemented:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I finally figured out what is happening here.

Numpy is not performing (ndarray, Index) comparisons here (returning NotImplemented) because Index has a higher array priority.

The fix is to always coerce the right-side argument into a plain ndarray. e.g., result = func(np.asarray(other)) (better to use asarray than array to avoid unnecessary copies). If you do that, you will be able to skip the NotImplemented check.

@jreback
Copy link
Contributor Author

jreback commented Aug 4, 2014

@shoyer jreback@7d66157

done.

also I have put in (well a little), structure on the Index testing for more generic testing, e.g. jreback@d1c4fbb

so prob worhwhile after this PR is merged to 'fix' the index tests to make it more class based (how Float64/Int64 are done). to make it a big more generic. E.g.

@jreback
Copy link
Contributor Author

jreback commented Aug 5, 2014

ok monster is ready to merge. any further comments.

@jorisvandenbossche @cpcloud @shoyer @sinhrks @hayd
cc @immerrr

I think I understand pickle and all of its evils now :) (not sure if that is a net benefit to society though)


__array_priority__ = 1000

def __array_prepare__(self, result, context=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this does anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh __array_prepare__ yes I know....sort of left in in their ...will take out

@@ -2137,14 +2137,14 @@ def copy(self, deep=True):
----------
deep : boolean, default True
Make a deep copy, i.e. also copy data
axes : string or None, default None
View copy of the axes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what this means?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, maybe it's better to merge this with deep argument, e.g.

  • deep=False : shallow copy
  • deep=True : deep copy of values, shallow copy of axes
  • deep='withaxes' : deep copy of everything (withaxes could be any token that clarifies the meaning)

Would be nice to have deep=True to deep-copy everything and deep='values'/deep='axes' to pick only one component, but that seems non-backward compatible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe accept deep='values' as an alias for deep=True and deep=('values', 'axes') to deep-copy everything

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and if we keep the axes arg, I would rather make it a bool like deep

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was all a kludge to avoid repeating code, their is exacttly 1 case where I need this: reduce.pyx/Reducer. Basically need to make a complete copy of an object including a deep copy of its index.

deep=True does not copy the actual data, rather it is a view on it. This preserves numpy semantics so memory is shared. We never actually need to copy index data memory as these are immutable and so cannot be changed. We always just create a new object (with possibly shared memory).

Meta-data is a different story (e.g. .name), where we almost always want/need to copy this (e.g. .view uses ._shallow_copy for this purpose).

However, in this reducer because of how it actually messes with the pointers, I do actually need to copy the memory.

I needed a 'private' way of doing that. So either make axes private, or just overload deep (default is still always True). Will change to deep=True|False|'all'.

The user never needs to actually copy the index data as it is a view and numpy takes care of that. This is an internal usage.

@jorisvandenbossche
Copy link
Member

@jreback Added a bunch of comments (mainly on docs and public API, not familiar enough to comment on technical details)

Further, I wondered, are there things we have learned from the "Series -> NDFrame subclass and no longer ndarray subclass" move that can be relevant here? Issues that came up afterwards (where we had to say "series is no longer ndarray subclass, so this will not work anymore) that we can now warn for beforehand?
The whatsnew section then has some more warnings (http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#internal-refactoring).

@@ -1894,8 +2008,75 @@ def drop(self, labels):
raise ValueError('labels %s not contained in axis' % labels[mask])
return self.delete(indexer)

@classmethod
def _add_numeric_methods_disabled(cls):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if _add_numeric_methods_disabled and _add_numeric_methods could be put into pandas.core.ops module to reuse/be reused from such methods implemented for other containers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started doing this, but the ops.py was a bit too specific. It could/should be fixed I think. but would require some dedicated effort. Feel free!

@jreback
Copy link
Contributor Author

jreback commented Aug 6, 2014

@jorisvandenbossche
jreback@72f491a#diff-d41d8cd98f00b204e9800998ecf8427e

I just update the properties directly, better I think because then don't have to clutter with the full ndarray definitions.

@jorisvandenbossche
Copy link
Member

OK, it is a bit a compromise between both, as for some functions it can also be usefull information, for some it is too much clutter ..
For some you could also add

See also
----------
numpy.ndarray.flat

@jreback
Copy link
Contributor Author

jreback commented Aug 6, 2014

ahh ok, that makes sense

@jreback
Copy link
Contributor Author

jreback commented Aug 6, 2014

@jorisvandenbossche ok added a lot of see alsos (bot series and index), and put doc-strings on lots of attributes.

@jreback
Copy link
Contributor Author

jreback commented Aug 6, 2014

ok, think this is ready.

I put back the MultiIndex support for really old pickles (wasn't hard). though not sure anyone really has them around.

any final comments

@jorisvandenbossche @cpcloud @immerrr @sinhrks

@jreback
Copy link
Contributor Author

jreback commented Aug 6, 2014

@jorisvandenbossche added the rest of the properties (and now consolidated in 1 place)

jreback@9f86df2

CLN: add searchsorted to core/base (GH6712, GH7447, GH6469)

fixup tests in test_timeseries for reverse ndarray/datetimeindex comparisons

fix algos / multi-index repeat (essentially this is a bug-fix)

ENH: add NumericIndex and operators, related (GH7439)

DOC: indexing/v0.15.0 docs

TST: fixed up plotting issues

COMPAT/API: use __array_priority__ to facility proper comparisons of DatetimeIndex with ndarrays

fixup to do actual views in copy (except in reduce where its needed)

COMPAT: numpy compat with 1.6 for np.may_share_memory

FIX: access values attr in JSON code to support index that's not an ndarry subclass

COMPAT: numpy compat with array_priority fix

CLN: remove constructor pickle compat code as not necessary

COMPAT: fix pickle in sparse

CLN: clean up shallow_copy/simple_new

COMPAT: pickle compat

remove __array_prepare__

COMPAT: tests & compat for numeric operation support only on supported indexes

DOC: fixup for comments

COMPAT: allow older MultiIndex pickles again

CLN: combine properties from index/series for ndarray compat
@jreback
Copy link
Contributor Author

jreback commented Aug 7, 2014

ok, bombs away...

jreback added a commit that referenced this pull request Aug 7, 2014
CLN/INT: remove Index as a sub-class of NDArray
@jreback jreback merged commit c7bfb4e into pandas-dev:master Aug 7, 2014
@jorisvandenbossche
Copy link
Member

Nice!

@immerrr
Copy link
Contributor

immerrr commented Aug 7, 2014

That was huge, great job

@cpcloud
Copy link
Member

cpcloud commented Aug 7, 2014

Bravo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation MultiIndex Refactor Internal refactoring of code
Projects
None yet
8 participants