Unify index and multindex (and possibly others) API #3268
Comments
Much agreed and this would lead to a lot of code simplification in places.
Great, glad you think so. Do you think this should be done as new stub methods
It would be great to do at a deeper level. Maybe needs to wait until after Index is made "not an ndarray" and we push the code to Cython. To do that we first need to fix the serialization / "pickle problem".
@jtratner want to assign yourself?
sure. @wesm (or others) what do you mean by "Maybe needs to wait until after Index is made 'not an ndarray' and we push code to Cython"? Block Manager is still fundamentally using ndarray right? If we're just thinking that the ndarray becomes the equivalent of the
right now The pickle problem is solved same way as solved the
@jreback okay, that makes sense.
Just running through this in my head - is this what you'd expect @y-p or others?
Handling Levels and Labels
Are you thinking that the outputted levels and labels are supposed to be equivalent (given that MI levels appear to be sorted and labels are 0-indexed)?
>>> arr = [2, 3, 2, 7]
>>> ind = Index(arr)
>>> mi = MultiIndex.from_arrays([arr, [1, 1, 1, 1]])
>>> assert all(mi.levels[0] == ind.levels[0])
>>> assert all(mi.labels[0] == ind.labels[0])
>>> ind = Index(list('abcde'))
>>> ind.levels
[['a', 'b', 'c', 'd', 'e']]
>>> ind.labels
[[0, 1, 2, 3, 4]]
This particularly applies when you have duplicates.
>>> ind = Index(list('abcaadec'))
>>> ind.levels
[['a', 'b', 'c', 'd', 'e']]
>>> ind.labels
[[0, 1, 2, 0, 0, 3, 4, 2]]
and for Int64Index, levels and labels may be the same
>>> ind = Index(range(10))
>>> ind.levels
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>> ind.labels
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
but again, there is ambiguity with duplicates and ordering
>>> ind = Index([10, 0, 1, 2])
>>> ind.levels
[[0, 1, 2, 10]]
>>> ind.labels
[[3, 0, 1, 2]]
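The sorting question in the comment above can be made concrete with a small pure-Python sketch. It is not pandas code: `factorize` here is a hypothetical helper written only for illustration, and the `sort` flag is an assumption used to contrast the two candidate behaviours (sorted levels, as MultiIndex appears to produce, versus levels in order of first appearance).

```python
def factorize(values, sort=True):
    """Return (levels, labels) for a flat sequence of hashable values.

    levels: the distinct values, sorted if sort=True, otherwise in
            order of first appearance.
    labels: for each input value, its 0-based position in levels.
    """
    if sort:
        levels = sorted(set(values))
    else:
        # dict preserves insertion order, so this keeps first appearances
        levels = list(dict.fromkeys(values))
    position = {value: i for i, value in enumerate(levels)}
    labels = [position[value] for value in values]
    return levels, labels


# Sorted levels: labels are a permutation, not simply 0..n-1
print(factorize([10, 0, 1, 2]))              # ([0, 1, 2, 10], [3, 0, 1, 2])

# First-appearance order: labels are 0..n-1 when there are no duplicates
print(factorize([10, 0, 1, 2], sort=False))  # ([10, 0, 1, 2], [0, 1, 2, 3])
```

With duplicates, both variants give labels that repeat; they differ only in which ordering of the distinct values the labels point into, which is the ambiguity raised above.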
@jtratner
@jreback I will, but can you tell me whether I'm right about how this should work? :P
I think the easiest way to do this would be to copy all (or a bunch) of the MultiIndex tests and just substitute an Index there to see what happens/breaks. The idea would be that, from a duck-typing perspective, it would walk the same as far as methods go (which I think you have enumerated). I think levels/labels are wrong on an Index now, in that an Index should be treated as if it has 1 level (but that is not the case now).
@jreback right, but question is about sorting... probably the stopgap solution is to roundtrip through MultiIndex constructor when
what jeff said.
sorting doesn't matter with a single level, but IIRC the lexsort_depth checks take this into account
I have long been of the opinion that Index and friends should not be ndarray subclasses (in particular: MultiIndex currently has a dummy empty array inside, that is a bad smell). I don't know if now is the right time to fix that fundamental design (ndarray subclass) but it might be soonish since Series is now no longer an ndarray. The "engine" classes that are used internally in the Index classes could be easily consolidated to produce simplified Cython-implemented index classes that don't have the auxiliary engine objects and this would improve performance at a micro level overall (adding up to larger gains, hopefully) and simplify things considerably. Step 1 is having a consistent API between Index and MultiIndex as you guys have spelled out, but pushing much or all of the index code down into Cython (and consolidating the engine classes) would be a nice next step. I had planned to do this myself but ran out of bandwidth due to the running-a-startup thing.
@wesm - I'm 50-75% of the way through changing Index from ndarray subclass Anyways, I will absolutely cede this to you if you want; otherwise this can
Okay that's excellent, glad to hear you're working on that. The other refactoring I wanted to make was to eliminate the creation of simple
@wesm yeah I saw that - looked interesting - I'd imagine it would make lookups, slicing, etc. just some arithmetic, which is interesting. I think once we've nailed down the Index interface, it should be pretty easy to do.
related #939
yeah. it will definitely yield speed and memory improvements throughout the codebase.
I'm working on it (was interviewing for the past two weeks+ so I basically What do you mean by "orient kwarg"?
cool, was just checking to see if you wanted a hand. I meant like read_json and DataFrame.from_dict do, not sure if I take that back or not: it bugs me that the MI constructor requires you to choose from (I think) confusing array/tuple; I think orient is more descriptive and in line with the others.
How about adding UPD: and making it also work for
@immerrr updated. Interested in checking some of these off? (What we can do is do PRs and call this the master issue.)
Yup, I'll keep this issue in mind when choosing the next direction.
Some of these are pretty straightforward to implement. If a A few others can be patched with only minor ugliness: To go much further than that requires a
Add in
should change to
patch
this is mostly done. should open specific issues for non-compliance here.
Have you ever written code that looks like this:
I've had to special-case the handling of Index vs. MultiIndex several times in the past.
Conceptually, I should be able to treat Index as a special case of MultiIndex
with nlevels = 1, and supporting that in the API would make things nicer.
Edit by @cpcloud:
Tasks:
API Unification
Method unification is relatively simple:
- Index.from_tuples and Index.from_arrays are just MultiIndex.from_tuples and MultiIndex.from_arrays moved to classmethods of Index.
- droplevel just raises on Index (right? what would it mean?): see "API: implement droplevels() for flat index" #21115
- has_duplicates is straightforward
- truncate should be equivalent to slicing
- reorder_levels raises if not level=0 or name of index
- equal_levels - straightforward
- levshape - (len(ind),)
- sortorder - None
- get_loc_level - I think meaningless with tuple, raises whatever if not 0 or index name
- is_lexsorted - doesn't need to change
- is_lexsorted_tuple - doesn't need to change
- is_monotonic_*
- lexsort_depth - doesn't need to be changed at all
- searchsorted
- repeat
- levels and labels property for Index - question on whether it should be sorted.
- rename behavior: Index will accept either string or single-element list; MI continues to handle only list
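A few of the proposed single-level behaviours can be sketched in plain Python. This is only an illustration under stated assumptions, not pandas' implementation: `FlatIndex` is a hypothetical stand-in class, the exception types are guesses, and the `levshape` and `sortorder` values simply follow the task list.

```python
class FlatIndex:
    """Hypothetical single-level index with MultiIndex-style attributes."""

    nlevels = 1
    sortorder = None  # the task list proposes None for a flat index

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    @property
    def levshape(self):
        # The task list proposes (len(ind),) for the single level.
        return (len(self.values),)

    @property
    def has_duplicates(self):
        return len(set(self.values)) != len(self.values)

    def _validate_level(self, level):
        # Only level 0 or the index's own name refers to the single level.
        if level == 0 or (self.name is not None and level == self.name):
            return
        raise KeyError(f"Level {level!r} not found")  # exception type assumed

    def reorder_levels(self, order):
        order = list(order)
        if len(order) != 1:
            raise ValueError("a flat index has exactly one level")
        self._validate_level(order[0])
        return self  # reordering a single level is a no-op

    def droplevel(self, level=0):
        self._validate_level(level)
        # Per the task list, dropping the only level just raises.
        raise ValueError("cannot drop the only level of a flat index")


ind = FlatIndex([2, 3, 2, 7], name="key")
print(ind.nlevels, ind.levshape, ind.sortorder, ind.has_duplicates)
```

Code written against these attributes would not need to branch on whether it received a flat or a multi-level index, which is the motivation given at the top of the issue.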