Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify index and multindex (and possibly others) API #3268

Closed
17 tasks
ghost opened this issue Apr 7, 2013 · 30 comments
Closed
17 tasks

Unify index and multindex (and possibly others) API #3268

ghost opened this issue Apr 7, 2013 · 30 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code

Comments

@ghost
Copy link

ghost commented Apr 7, 2013

Have you ever written code that looks like this:

if isinstance(d.index, MultiIndex):
    results = []
    for l in d.index.levels:
       for x in baz(l):
          results.append(foo)
elif  isinstance(d.index, Index):
    for x in d.index:
       foo

I've had to special case the handling of index vs. multindex several times in the past.
Conceptually, I should be able to treat index as a private case of MultIndex
with nlevels =1, and supporting that in the API would make things nicer.


Edit by @cpcloud:
Tasks :

API Unification

Method unification is relatively simple:

  • Index.from_tuples and Index.from_arrays are just MultiIndex.from_tuples and MultiIndex.from_arrays moved to classmethods of Index.
  • droplevel, ` just raises on Index (right? what would it mean?): API: implement droplevels() for flat index #21115
  • has_duplicates is straightforward
  • truncate should be equivalent to slicing
  • reorder_levels raises if not level=0 or name of index
  • equal_levels - straightforward
  • levshape - (len(ind),)
  • sortorder - None
  • get_loc_level - I think meaningless with tuple, raises whatever if not 0 or index name
  • is_lexsorted - doesn't need to change
  • is_lexosrted_tuple - doesn't need to change
  • is_monotonic_*
  • lexsort_depth - doesn't need to be changed at all
  • searchsorted
  • repeat
  • levels and labels property for Index - question on whether it should be sorted.
  • change to rename behavior: Index will accept either string or single-element list; MI continues to handle only list
@wesm
Copy link
Member

wesm commented Apr 7, 2013

Much agreed and this would lead to a lot of code simplification in places.

@ghost
Copy link
Author

ghost commented Apr 7, 2013

Great, glad you think so. Do you think this should be done as new stub methods
on Index, or should there really be a deeper unification of Index into MultiIndex at
the code level?

@wesm
Copy link
Member

wesm commented Apr 7, 2013

It would be great to do at a deeper level. Maybe needs to wait until after Index is made "not an ndarray" and we push the code to Cython. To do that we first need to fix the serialization / "pickle problem".

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

@jtratner want to assign yourself?

@ghost ghost assigned jtratner Sep 22, 2013
@jtratner
Copy link
Contributor

sure. @wesm (or others) what do you mean by "Maybe needs to wait until after Index is made 'not an ndarray' and we push code to Cython"? Block Manager is still fundamentally using ndarray right? If we're just thinking that the ndarray becomes the equivalent of the _data attribute of NDFrame, then I understand.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

@jtratner

right now Index subclasses FrozenNDarray which is an ndarray subclass. The idea would be to do a has-a,similar to what we did with Series. So make the Index have an attribute, prob _data which holds the data. Would need to add some compat methods and delegate other methods to _data.

The pickle problem is solved same way as solved the Series problem (via pickle_compat.py, where we fake the read back class if needed and just recreate in the new scheme).

@jtratner
Copy link
Contributor

@jreback okay, that makes sense.

@jtratner
Copy link
Contributor

Just running through this in my head - is this what you'd expect @y-p or others?

Handling Levels and Labels

Are you thinking that the outputted levels and labels are supposed to be equivalent? (given that MI levels appear to be sorted and labels are 0-indexed)?

>>> arr = [2, 3, 2, 7]
>>> ind = Index(arr)
>>> mi = MultiIndex.from_arrays([arr, [1, 1, 1, 1]])
>>> assert all(mi.levels[0] == ind.levels[0])
>>> assert all(mi.labels[0] == ind.labels[0])
>>> ind = Index(list('abcde'))
>>> ind.levels
[['a', 'b', 'c', 'd', 'e']]
>>> ind.labels
[[0, 1, 2, 3, 4, 5]]

This particularly applies when you have duplicates.

>>> ind = Index(list('abcaadec'))
>>> ind.levels
[['a', 'b', 'c', 'd', 'e']]
>>> ind.labels
[[0, 1, 2, 0, 0, 3, 4, 2]]

and for Int64Index, levels and labels may be the same

>>> ind = Index(range(10))
>>> ind.levels
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
>>> ind.labels
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]

but again, ambiguity with duplicates and ordering

>>> ind = Index([10, 0, 1, 2])
>>> ind.levels
[[0, 1, 2, 10]]
>>> ind.labels
[[3, 0, 1, 2]]

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

@jtratner
I would move this to the top of this issue (easier to see).

@jtratner
Copy link
Contributor

@jreback I will, but can you tell me whether I'm right about how this should work? :P

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

I think easiest way to do this would be to copy all/bunch of multi-index tests and just substitue an Index there to see what happens/breaks. the idea would be that from a duck perspective it would walk the same as far as methods go

(which I think you have enumerated)

I think levels/labels are wrong on an Index now, in that levels should be treated as if it has 1 level (but not the case now)

@jtratner
Copy link
Contributor

@jreback right, but question is about sorting... probably the stopgap solution is to roundtrip through MultiIndex constructor when levels and labels are asked for.

@ghost
Copy link
Author

ghost commented Sep 22, 2013

what jeff said.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

@jtratner

sorting doesn't matter with a single level, but IIRC the lexsort_depth checks take this into account

@wesm
Copy link
Member

wesm commented Oct 11, 2013

I have long been of the opinion that Index and friends should not be ndarray subclasses (in particular: MultiIndex currently has a dummy empty array inside, that is a bad smell). I don't know if now is the right time to fix that fundamental design (ndarray subclass) but it might be soonish since Series is now no longer an ndarray.

The "engine" classes that are used internally in the Index classes could be easily consolidated to produce simplified Cython-implemented index classes that don't have the auxiliary engine objects and this would improve performance at a micro level overall (adding up to larger gains, hopefully) and simplify things considerably.

Step 1 is having a consistent API between Index and MultiIndex as you guys have spelled out, but pushing much or all of the index code down into Cython (and consolidating the engine classes) would be a nice next step. I had planned to do this myself but ran out of bandwidth due to the running-a-startup thing.

@jtratner
Copy link
Contributor

@wesm - I'm 50-75% of the way through changing Index from ndarray subclass
to object subclass with some delegation to ndarray. I was planning to unify
the API at the same time (since it'll be trivial at this point). I was
thinking that it would make sense to have MI integer levels be represented
as a 2d array under the hood (easy to keep the public API with levels and
labels the same) - I think it would make more sense and seems like it would
be more straightforward when slicing on levels (slice of the MI only has to
slice one ndarray, using xs to only look at one level means selecting a
subset of columns, stack/unstack also is more straightforward), though I'm
definitely not as much of an expert on numpy as you and many of the other
core devs.

Anyways, I will absolutely cede this to you if you want; otherwise this can
be a starting point (particularly in terms of ID'ing which points expect
ndarray vs. able to handle Index) in Python and then Cythonizing it.

@wesm
Copy link
Member

wesm commented Oct 12, 2013

Okay that's excellent, glad to hear you're working on that. The other refactoring I wanted to make was to eliminate the creation of simple range(i, j) indexes (and therefore save a ton of memory). I got about 1 hour into doing that about a year ago (https://github.com/pydata/pandas/tree/range-index) and ran out of steam; it would be simpler to trash that work (look at it though, maybe!) and start over given that we're 4000 commits deeper into pandas development.

@jtratner
Copy link
Contributor

@wesm yeah I saw that - looked interesting - I'd imagine it would make lookups, slicing, etc. just some arithmetic, which is interesting. I think once we've nailed down the Index interface, it should be pretty easy to do.

@jreback
Copy link
Contributor

jreback commented Oct 12, 2013

related #939

@wesm
Copy link
Member

wesm commented Oct 12, 2013

yeah. it will definitely yield speed and memory improvements throughout the codebase.

@hayd
Copy link
Contributor

hayd commented Dec 10, 2013

@jtratner What's the current story with range index? #2420

re. from_tuples, from_array, I wonder if an orient kwarg in the constructor would make more sense (same for MI).

@jtratner
Copy link
Contributor

I'm working on it (was interviewing for the past two weeks+ so I basically
dropped off the face of the earth) and I have some of the easy parts done.

What do you mean by "orient kwarg"?

@hayd
Copy link
Contributor

hayd commented Dec 10, 2013

coool, was just checking to see if you wanted a hand.

I meant like read_json and DataFrame.from_dict do, not sure if I take that back or not: it bugs me that MI constructor requires you to choose from (I think) confusing array/tuple, think orient is more descriptive/inline with others.

@jreback
Copy link
Contributor

jreback commented Aug 7, 2014

after #7891
cc @immerrr

see @wesm comments above if you are interested in applying any of the micro optimizations

@immerrr
Copy link
Contributor

immerrr commented Nov 10, 2014

How about adding searchsorted to the API?

UPD: and making it also work for is_monotonical_decreasing

@jreback
Copy link
Contributor

jreback commented Nov 10, 2014

@immerrr updated. Interesting in checking some of these off (what can do is do PR's and call this the master issue) ?

@immerrr
Copy link
Contributor

immerrr commented Nov 10, 2014

Yup, I'll keep this issue in mind when choosing the next direction.

@jbrockmendel
Copy link
Member

Some of these are pretty straightforward to implement. If a levels property is added to Index that just returns [self], then the following MultiIndex methods/properties are immediately valid in Index: nlevels, levshape, _inferred_type_levels, _have_mixed_levels, _get_names, _reference_duplicate_name, equal_levels.

A few others can be patched with only minor ugliness: from_product, reorder_levels, lexsort_depth, swaplevel, is_lexsorted.

To go much further than that requires a labels attribute (or per #14443 transition to codes to match CategoricalIndex). Maybe lazily categorize Index under the hood?

@jreback
Copy link
Contributor

jreback commented Jan 16, 2018

Add in .levels to Index breaks a couple of easy-to-fix issues in pandas test suite, mainly checks like

if hasattr(idx, 'levels'):
    # is a MI

should change to

if idx.nlevels > 1:
    # is a MI

patch

diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py
index a5949c62a..23279e3da 100644
--- a/pandas/core/indexes/base.py
+++ b/pandas/core/indexes/base.py
@@ -1108,6 +1108,10 @@ class Index(IndexOpsMixin, PandasObject):
     def nlevels(self):
         return 1
 
+    @property
+    def levels(self):
+        return [self]
+
     def _get_names(self):
         return FrozenList((self.name, ))
 

@jreback
Copy link
Contributor

jreback commented Jan 1, 2019

this is mostly done. should open specific issues for non compliance here.

@jreback jreback closed this as completed Jan 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Projects
None yet
Development

No branches or pull requests

6 participants