Unify index and multindex (and possibly others) API #3268

ghost · 2013-04-07T18:43:42Z

Have you ever written code that looks like this:

if isinstance(d.index, MultiIndex):
    results = []
    for l in d.index.levels:
        for x in baz(l):
            results.append(foo)
elif isinstance(d.index, Index):
    for x in d.index:
        foo

I've had to special-case the handling of Index vs. MultiIndex several times in the past.
Conceptually, I should be able to treat an Index as a special case of MultiIndex
with nlevels = 1, and supporting that in the API would make things nicer.
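
For comparison, recent pandas versions expose nlevels and get_level_values on a flat Index as well, so one loop can often serve both cases. A minimal sketch, assuming a recent pandas; the helper name level_values is hypothetical and not part of pandas:

```python
import pandas as pd


def level_values(index):
    """Return one Index of values per level, for flat and multi-level indexes alike."""
    # nlevels is 1 for a flat Index, so the same loop covers both cases.
    return [index.get_level_values(i) for i in range(index.nlevels)]


flat = pd.Index(["a", "b", "c"])
multi = pd.MultiIndex.from_arrays([["a", "a", "b"], [1, 2, 1]])

print([list(v) for v in level_values(flat)])
print([list(v) for v in level_values(multi)])
```

The caller never checks isinstance; the number of levels alone drives the iteration.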

Edit by @cpcloud:
Tasks :

API Unification

Method unification is relatively simple:


wesm · 2013-04-07T18:53:09Z

Much agreed and this would lead to a lot of code simplification in places.

ghost · 2013-04-07T19:38:27Z

Great, glad you think so. Do you think this should be done as new stub methods
on Index, or should there really be a deeper unification of Index into MultiIndex at
the code level?

wesm · 2013-04-07T20:45:59Z

It would be great to do at a deeper level. Maybe needs to wait until after Index is made "not an ndarray" and we push the code to Cython. To do that we first need to fix the serialization / "pickle problem".

jreback · 2013-09-21T23:59:29Z

@jtratner want to assign yourself?

jtratner · 2013-09-22T00:08:56Z

sure. @wesm (or others) what do you mean by "Maybe needs to wait until after Index is made 'not an ndarray' and we push code to Cython"? Block Manager is still fundamentally using ndarray right? If we're just thinking that the ndarray becomes the equivalent of the _data attribute of NDFrame, then I understand.

jreback · 2013-09-22T00:22:25Z

@jtratner

Right now Index subclasses FrozenNDArray, which is an ndarray subclass. The idea would be to do a has-a instead, similar to what we did with Series. So make the Index have an attribute, probably _data, which holds the data. We would need to add some compat methods and delegate other methods to _data.

The pickle problem is solved the same way as we solved the Series problem (via pickle_compat.py, where we fake the read-back class if needed and just recreate it in the new scheme).
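
The has-a arrangement described above can be sketched in plain Python. This is only an illustration of the composition-plus-delegation idea, not the pandas implementation: the class name SketchIndex is hypothetical, and a tuple stands in for the ndarray that _data would hold.

```python
class SketchIndex:
    """Hypothetical has-a index: wraps its values instead of subclassing them."""

    def __init__(self, values, name=None):
        # _data holds the underlying values (a tuple here, for illustration).
        self._data = tuple(values)
        self.name = name

    # Compat methods: container behaviour is delegated to _data.
    def __len__(self):
        return len(self._data)

    def __iter__(self):
        return iter(self._data)

    def __getitem__(self, key):
        result = self._data[key]
        # Slices are re-wrapped so callers keep getting an index back.
        if isinstance(key, slice):
            return SketchIndex(result, name=self.name)
        return result


idx = SketchIndex([2, 3, 2, 7], name="x")
print(len(idx), idx[1], list(idx[1:3]))
```

Because the wrapper owns its storage, the storage type can change later without changing what callers see.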

jtratner · 2013-09-22T00:40:35Z

@jreback okay, that makes sense.

jtratner · 2013-09-22T00:51:06Z

Just running through this in my head - is this what you'd expect @y-p or others?

Handling Levels and Labels

Are you thinking that the outputted levels and labels are supposed to be equivalent? (given that MI levels appear to be sorted and labels are 0-indexed)?

>>> arr = [2, 3, 2, 7]
>>> ind = Index(arr)
>>> mi = MultiIndex.from_arrays([arr, [1, 1, 1, 1]])
>>> assert all(mi.levels[0] == ind.levels[0])
>>> assert all(mi.labels[0] == ind.labels[0])

>>> ind = Index(list('abcde'))
>>> ind.levels
[['a', 'b', 'c', 'd', 'e']]
>>> ind.labels
[[0, 1, 2, 3, 4]]

This particularly applies when you have duplicates.

>>> ind = Index(list('abcaadec'))
>>> ind.levels
[['a', 'b', 'c', 'd', 'e']]
>>> ind.labels
[[0, 1, 2, 0, 0, 3, 4, 2]]

and for Int64Index, levels and labels may be the same

>>> ind = Index(range(10))
>>> ind.levels
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>> ind.labels
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]

but again, ambiguity with duplicates and ordering

>>> ind = Index([10, 0, 1, 2])
>>> ind.levels
[[0, 1, 2, 10]]
>>> ind.labels
[[3, 0, 1, 2]]
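
The rule the examples above follow (levels are the sorted unique values, labels are 0-based positions into them) can be written out in plain Python. A minimal sketch under that assumption; levels_and_labels is a hypothetical helper, not a pandas function:

```python
def levels_and_labels(values):
    """Hypothetical helper: sorted unique values plus 0-based positions into them."""
    level = sorted(set(values))
    position = {value: i for i, value in enumerate(level)}
    labels = [position[value] for value in values]
    # Wrapped in lists to mirror the shape of a one-level MultiIndex.
    return [level], [labels]


print(levels_and_labels(list("abcaadec")))
print(levels_and_labels([10, 0, 1, 2]))
```

With this rule, duplicates collapse into one level entry and the original order is recoverable only through the labels.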

jreback · 2013-09-22T00:52:58Z

@jtratner
I would move this to the top of this issue (easier to see).

jtratner · 2013-09-22T01:13:32Z

@jreback I will, but can you tell me whether I'm right about how this should work? :P

jreback · 2013-09-22T01:20:18Z

I think the easiest way to do this would be to copy all (or a bunch) of the multi-index tests and just substitute an Index there to see what happens/breaks. The idea would be that from a duck-typing perspective it would walk the same as far as methods go

(which I think you have enumerated)

I think levels/labels are wrong on an Index now, in that an Index should be treated as if it has 1 level (which is not the case now)

jtratner · 2013-09-22T02:44:14Z

@jreback right, but question is about sorting... probably the stopgap solution is to roundtrip through MultiIndex constructor when levels and labels are asked for.
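
The round trip suggested here can be tried with the public constructor. A minimal sketch, assuming a recent pandas, where the attribute this thread calls labels is named codes; levels_via_multiindex is a hypothetical helper name:

```python
import pandas as pd


def levels_via_multiindex(index):
    """Build a one-level MultiIndex from a flat Index and read its levels and codes."""
    mi = pd.MultiIndex.from_arrays([index])
    return mi.levels, mi.codes


levels, codes = levels_via_multiindex(pd.Index([10, 0, 1, 2]))
print(list(levels[0]))
print(list(codes[0]))
```

The constructor does the factorizing, so the flat index inherits whatever ordering MultiIndex applies to its levels.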

ghost · 2013-09-22T11:14:23Z

what jeff said.

jreback · 2013-09-22T14:19:09Z

@jtratner

sorting doesn't matter with a single level, but IIRC the lexsort_depth checks take this into account

wesm · 2013-10-11T16:51:33Z

I have long been of the opinion that Index and friends should not be ndarray subclasses (in particular: MultiIndex currently has a dummy empty array inside, that is a bad smell). I don't know if now is the right time to fix that fundamental design (ndarray subclass) but it might be soonish since Series is now no longer an ndarray.

The "engine" classes that are used internally in the Index classes could be easily consolidated to produce simplified Cython-implemented index classes that don't have the auxiliary engine objects and this would improve performance at a micro level overall (adding up to larger gains, hopefully) and simplify things considerably.

Step 1 is having a consistent API between Index and MultiIndex as you guys have spelled out, but pushing much or all of the index code down into Cython (and consolidating the engine classes) would be a nice next step. I had planned to do this myself but ran out of bandwidth due to the running-a-startup thing.

jtratner · 2013-10-11T21:10:52Z

@wesm - I'm 50-75% of the way through changing Index from ndarray subclass
to object subclass with some delegation to ndarray. I was planning to unify
the API at the same time (since it'll be trivial at this point). I was
thinking that it would make sense to have MI integer levels be represented
as a 2d array under the hood (easy to keep the public API with levels and
labels the same) - I think it would make more sense and seems like it would
be more straightforward when slicing on levels (slice of the MI only has to
slice one ndarray, using xs to only look at one level means selecting a
subset of columns, stack/unstack also is more straightforward), though I'm
definitely not as much of an expert on numpy as you and many of the other
core devs.

Anyways, I will absolutely cede this to you if you want; otherwise this can
be a starting point (particularly in terms of ID'ing which points expect
ndarray vs. able to handle Index) in Python and then Cythonizing it.
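
The 2D layout suggested above can be illustrated with a small NumPy array. This is only a sketch of the idea (rows as index entries, columns as levels), not how pandas stores a MultiIndex; the variable names are illustrative.

```python
import numpy as np

# Rows are index entries, columns are levels; values are integer labels.
labels = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])

rows = labels[1:3]          # slicing the index slices a single array
level0 = labels[:, 0]       # looking at one level selects one column
swapped = labels[:, ::-1]   # swapping two levels reverses the columns

print(rows.tolist(), level0.tolist(), swapped.tolist())
```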

wesm · 2013-10-12T01:27:21Z

Okay that's excellent, glad to hear you're working on that. The other refactoring I wanted to make was to eliminate the creation of simple range(i, j) indexes (and therefore save a ton of memory). I got about 1 hour into doing that about a year ago (https://github.com/pydata/pandas/tree/range-index) and ran out of steam; it would be simpler to trash that work (look at it though, maybe!) and start over given that we're 4000 commits deeper into pandas development.

jtratner · 2013-10-12T01:39:02Z

@wesm yeah I saw that - looked interesting - I'd imagine it would make lookups, slicing, etc. just some arithmetic, which is interesting. I think once we've nailed down the Index interface, it should be pretty easy to do.
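
To make the "just some arithmetic" point concrete, here is a minimal sketch of an index over range(start, stop) that stores only its bounds. It assumes integer labels and step 1; SketchRangeIndex is a hypothetical name and is not the code in the range-index branch.

```python
class SketchRangeIndex:
    """Hypothetical index over range(start, stop) that stores only its bounds."""

    def __init__(self, start, stop):
        self.start = start
        self.stop = max(start, stop)

    def __len__(self):
        return self.stop - self.start

    def get_loc(self, label):
        # Lookup is a bounds check plus a subtraction; no hash table is needed.
        if not (self.start <= label < self.stop):
            raise KeyError(label)
        return label - self.start

    def __getitem__(self, key):
        if isinstance(key, slice):
            lo, hi, step = key.indices(len(self))
            # Only step-1 slices are handled in this sketch.
            if step != 1:
                raise NotImplementedError("only step-1 slices in this sketch")
            return SketchRangeIndex(self.start + lo, self.start + max(lo, hi))
        if not (-len(self) <= key < len(self)):
            raise IndexError(key)
        # Negative positions count from the end, as with a list.
        return self.start + (key % len(self))


idx = SketchRangeIndex(5, 10)
print(len(idx), idx.get_loc(7), idx[-1], len(idx[1:3]))
```

Memory use is constant regardless of length, which is where the savings mentioned above would come from.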

jreback · 2013-10-12T01:40:28Z

related #939

wesm · 2013-10-12T01:41:24Z

yeah. it will definitely yield speed and memory improvements throughout the codebase.

hayd · 2013-12-10T04:28:13Z

@jtratner What's the current story with range index? #2420

re. from_tuples, from_arrays: I wonder if an orient kwarg in the constructor would make more sense (same for MI).

jtratner · 2013-12-10T04:52:20Z

I'm working on it (was interviewing for the past two weeks+ so I basically
dropped off the face of the earth) and I have some of the easy parts done.

What do you mean by "orient kwarg"?

hayd · 2013-12-10T05:23:58Z

coool, was just checking to see if you wanted a hand.

I meant like read_json and DataFrame.from_dict do. Not sure whether I take that back or not: it bugs me that the MI constructor requires you to choose between the (I think) confusing array/tuple variants; I think orient is more descriptive and in line with the others.

jreback · 2014-08-07T12:05:11Z

after #7891
cc @immerrr

see @wesm comments above if you are interested in applying any of the micro optimizations

immerrr · 2014-11-10T11:35:25Z

How about adding searchsorted to the API?

UPD: and making it also work for is_monotonic_decreasing

jreback · 2014-11-10T12:05:01Z

@immerrr updated. Interested in checking some of these off? (What we can do is open PRs and call this the master issue.)

immerrr · 2014-11-10T13:14:23Z

Yup, I'll keep this issue in mind when choosing the next direction.

jbrockmendel · 2017-11-22T18:13:30Z

Some of these are pretty straightforward to implement. If a levels property is added to Index that just returns [self], then the following MultiIndex methods/properties are immediately valid in Index: nlevels, levshape, _inferred_type_levels, _have_mixed_levels, _get_names, _reference_duplicate_name, equal_levels.

A few others can be patched with only minor ugliness: from_product, reorder_levels, lexsort_depth, swaplevel, is_lexsorted.

To go much further than that requires a labels attribute (or per #14443 transition to codes to match CategoricalIndex). Maybe lazily categorize Index under the hood?
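
The first point above (a levels property that returns [self] makes several level-based members work unchanged) can be sketched in plain Python. All class names here are hypothetical; only nlevels and levshape are modelled, and both are derived from levels alone.

```python
class LevelDerivedMixin:
    """Hypothetical mixin: everything here is computed from self.levels alone."""

    @property
    def nlevels(self):
        return len(self.levels)

    @property
    def levshape(self):
        return tuple(len(level) for level in self.levels)


class FlatSketch(LevelDerivedMixin):
    def __init__(self, values):
        self._values = tuple(values)

    def __len__(self):
        return len(self._values)

    @property
    def levels(self):
        # A flat index is treated as its own single level.
        return [self]


class MultiSketch(LevelDerivedMixin):
    def __init__(self, levels):
        self.levels = [FlatSketch(level) for level in levels]


print(FlatSketch("abc").nlevels, FlatSketch("abc").levshape)
print(MultiSketch([[1, 2], [3, 4, 5]]).levshape)
```

Note that returning [self] does not deduplicate values, which is one reason going further needs a labels (or codes) attribute.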

jreback · 2018-01-16T11:00:31Z

Adding .levels to Index breaks a couple of easy-to-fix things in the pandas test suite, mainly checks like

if hasattr(idx, 'levels'):
    # is a MI

should change to

if idx.nlevels > 1:
    # is a MI
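
A small runnable version of the nlevels-based check, assuming a recent pandas; has_multiple_levels is a hypothetical helper name:

```python
import pandas as pd


def has_multiple_levels(idx):
    """True when the index carries more than one level."""
    return idx.nlevels > 1


flat = pd.Index([1, 2, 3])
multi = pd.MultiIndex.from_arrays([[1, 1, 2], ["x", "y", "x"]])
single_level_mi = pd.MultiIndex.from_arrays([[1, 2, 3]])

print(has_multiple_levels(flat))
print(has_multiple_levels(multi))
print(has_multiple_levels(single_level_mi))
```

One caveat: a MultiIndex built from a single array has nlevels == 1, so this check classifies it as flat. Code that needs the type rather than the level count may still want isinstance(idx, pd.MultiIndex).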

patch

diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index a5949c62a..23279e3da 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -1108,6 +1108,10 @@ class Index(IndexOpsMixin, PandasObject):
     def nlevels(self):
         return 1
 
+    @property
+    def levels(self):
+        return [self]
+
     def _get_names(self):
         return FrozenList((self.name, ))

jreback · 2019-01-01T17:30:08Z

this is mostly done. should open specific issues for non compliance here.

ghost mentioned this issue Apr 8, 2013

Thoughts on DataFrame index not being columns, grouping issues #3275

Closed

ghost assigned jtratner Sep 22, 2013

jtratner mentioned this issue Oct 2, 2013

CLN: MultiIndex and Index no longer inherit from ndarray. #5080

Closed

jreback modified the milestones: 0.15.0, 0.14.0 Mar 14, 2014

jreback mentioned this issue May 27, 2014

Support for indices satisfying the Index API that aren't pandas.Index subclasses #7243

Closed

immerrr mentioned this issue Jul 31, 2014

API: add 'level' kwarg to 'Index.isin' method #7892

Merged

2 tasks

hayd mentioned this issue Aug 1, 2014

CLN/INT: remove Index as a sub-class of NDArray #7891

Merged

11 tasks

jorisvandenbossche mentioned this issue Nov 7, 2014

Add a levels member in pandas.Index #8751

Closed

jreback unassigned jtratner Nov 10, 2014

jreback mentioned this issue Jan 18, 2015

sort_index behavior differs for the same DataFrame? #9212

Closed

jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015

jorisvandenbossche mentioned this issue Apr 27, 2015

Towards "pandas 1.0" #10000

Closed

shoyer mentioned this issue May 12, 2015

BUG: drop_duplicates drops name(s). #10116

Closed

jorisvandenbossche mentioned this issue Jan 25, 2017

ENH: add MultiIndex.to_dataframe #15216

Closed

jreback mentioned this issue Jan 16, 2018

COMPAT: MultiIndex checking is fragile pydata/xarray#1833

Closed

jschendel mentioned this issue May 18, 2018

API: implement droplevels() for flat index #21115

Closed

toobaz mentioned this issue May 19, 2018

Idx droplevel #21116

Merged

4 tasks

shoyer mentioned this issue Aug 29, 2018

Assigning with loc using an index with string values #22500

Open

jreback closed this as completed Jan 1, 2019
