ENH: RangeIndex redux #11892

Merged
merged 6 commits into from
Jan 16, 2016

Conversation

jreback
Contributor

@jreback jreback commented Dec 23, 2015

closes #939
replaces #9977

ToDo:

  • test for packers.py
  • more code review

Much commentary on the original issue #9977

but in essence RangeIndex is a complete replacement for Int64Index, with all indexing semantics and interop. This is now the default index upon construction. It should be completely transparent to the end user.

It provides a constant memory footprint for any size of index. There is a tiny penalty for fewer than about 10 elements (which is actually trivial to fix, e.g. we could simply instantiate an Int64Index for these cases). But I think it is more natural to always get a RangeIndex.

One other change here is to assert_index_equal: the exact kw now takes 'equiv' as the default (in addition to a boolean). This allows for exact comparisons, except that Int64Index and RangeIndex are considered equivalent (as are string/unicode as inferred types; this was pre-existing).

In [1]: s = Series(range(5))

In [2]: s.index
Out[2]: RangeIndex(start=0, stop=5, step=1)

In [3]: s.nbytes
Out[3]: 40

In [4]: s.index.nbytes
Out[4]: 72

In [5]: s.index.astype(int).nbytes
Out[5]: 40

In [6]: s = Series(range(100))

In [7]: s.index.astype(int).nbytes
Out[7]: 800

In [8]: s.index.nbytes
Out[8]: 72
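
The numbers in this session follow from simple arithmetic: an explicit int64 index costs 8 bytes per label, while the RangeIndex here reports a constant 72 bytes. A minimal sketch of that arithmetic, assuming the 72-byte figure shown above (it is taken from this session and is not a guaranteed constant); the function names are made up for illustration:

```python
INT64_ITEMSIZE = 8        # bytes per int64 label
RANGE_INDEX_NBYTES = 72   # assumption: value of s.index.nbytes in the session above


def materialized_nbytes(n):
    """Bytes needed to store n int64 labels explicitly."""
    return n * INT64_ITEMSIZE


def break_even_length():
    """Smallest length at which the explicit array is at least as
    large as the constant footprint (ceiling division)."""
    return -(-RANGE_INDEX_NBYTES // INT64_ITEMSIZE)


print(materialized_nbytes(5))    # 40
print(materialized_nbytes(100))  # 800
print(break_even_length())       # 9
```

Under these assumptions the break-even point is 9 labels, which is consistent with the small penalty for fewer than about 10 elements mentioned above.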

@jreback jreback added Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Dec 23, 2015
@jreback jreback added this to the 0.18.0 milestone Dec 23, 2015
@jreback jreback mentioned this pull request Dec 23, 2015
25 tasks
@jreback
Contributor Author

jreback commented Dec 23, 2015

@jreback
Contributor Author

jreback commented Dec 29, 2015

@shoyer if you can review when you have a chance. as you had a number of comments on the original.

In [2]: s.index
Out[2]: Int64Index([0, 1, 2, 3, 4], dtype='int64')

.. ipython:: python
Member

add "New behavior:"

@jreback
Contributor Author

jreback commented Dec 30, 2015

@shoyer all fixed up

raise TypeError('Invalid to pass a non-int64 dtype to RangeIndex')

# RangeIndex
if isinstance(start, RangeIndex):
Member

I still believe pretty strongly that it's a bad idea to make the public API for the constructor this flexible. You can use Index(range(...)) for this.

Contributor Author

you mean the dtype part? or accepting range part?

@jreback
Contributor Author

jreback commented Jan 13, 2016

note that I had already removed start,stop,step from the public API as this is an implementation detail (and more like range)

not sure why 1) but having a nice constructor is a problem.

@shoyer
Member

shoyer commented Jan 13, 2016

note that I had already removed start,stop,step from the public API as this is an implementation detail (and more like range)

Great. This is like how xrange works on Python 2, but Python 3's range does have start, stop, step as attributes.

not sure why 1) but having a nice constructor is a problem.

Well, we could make the constructor only accept a single array/index/range argument instead, and require another dedicated method from_steps (?) for start, stop, step -- or even skip the dedicated method entirely, requiring passing in a range object to parse it that way. But I think I would prefer the other way around.
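
To make the alternative concrete, here is a rough sketch of a constructor that takes a single range-like argument, plus a dedicated classmethod for start, stop, step. `SketchRangeIndex` is a hypothetical stand-in built on Python's built-in `range`, not pandas code, and `from_steps` is just the tentative name floated above:

```python
class SketchRangeIndex:
    """Hypothetical stand-in used only to illustrate the proposal."""

    def __init__(self, data):
        # Accept an existing instance (so wrapping twice is a no-op) or a range.
        if isinstance(data, SketchRangeIndex):
            data = data._range
        if not isinstance(data, range):
            raise TypeError("expected a range or SketchRangeIndex")
        self._range = data

    @classmethod
    def from_steps(cls, start, stop, step=1):
        # Dedicated alternate constructor for explicit start/stop/step.
        return cls(range(start, stop, step))

    def __eq__(self, other):
        if not isinstance(other, SketchRangeIndex):
            return NotImplemented
        return self._range == other._range

    def __repr__(self):
        r = self._range
        return f"SketchRangeIndex(start={r.start}, stop={r.stop}, step={r.step})"


a = SketchRangeIndex(range(0, 20, 2))
b = SketchRangeIndex.from_steps(0, 20, 2)
print(a == b)                    # True
print(SketchRangeIndex(a) == a)  # True
```

In this sketch, passing an existing instance back into the constructor returns an equal object, so the single-argument form stays idempotent.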

@kawochen
Contributor

another attempt to change equals 👊

In [75]: range(0, 9, 2) == range(0, 10, 2)
Out[75]: True

And if the length is 1 you should only use _start
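
For reference, the comparison rule described here can be sketched in plain Python. This is only an illustration using built-in `range` objects (Python 3, where `range` exposes `start`, `stop`, `step`); `ranges_equal` is a hypothetical helper, not the code in this PR:

```python
def ranges_equal(a, b):
    """Compare two ranges by the sequence of values they produce."""
    if len(a) != len(b):
        return False
    if len(a) == 0:
        # All empty ranges are equal, whatever their start/stop/step.
        return True
    if a.start != b.start:
        return False
    if len(a) == 1:
        # A single element is fully determined by start; step is irrelevant.
        return True
    return a.step == b.step


print(ranges_equal(range(0, 9, 2), range(0, 10, 2)))  # True
print(ranges_equal(range(0, 1, 5), range(0, 3, 7)))   # True
print(ranges_equal(range(0), range(5, 5)))            # True
print(ranges_equal(range(1, 3), range(0, 2)))         # False
```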

@jreback
Contributor Author

jreback commented Jan 13, 2016

@kawochen can you add a commit for this (and some addtl tests)?

@kawochen
Contributor

Oh OK.

@TomAugspurger
Contributor

@jreback submitted some tests at jreback#15

@jreback
Contributor Author

jreback commented Jan 15, 2016

@TomAugspurger @kawochen incorporated your changes thanks!

ok, ready to go on this.

@TomAugspurger
Contributor

LGTM (assuming tests pass). Thanks.

@jreback
Contributor Author

jreback commented Jan 15, 2016

@TomAugspurger I added the floordiv enhancement to #12034 but low priority :>

ARF and others added 6 commits January 16, 2016 10:37
`RangeIndex(1, 10, 2)` is a memory saving alternative to
`Index(np.arange(1, 10,2))`: c.f. pandas-dev#939.

This re-implementation is compatible with the current `Index()` api and is a
drop-in replacement for `Int64Index()`. It automatically converts to
Int64Index() when required by operations.

At present only for a minimum number of operations the type is
conserved (e.g. slicing, inner-, left- and right-joins). Most other operations
trigger creation of an equivalent Int64Index (or at least an equivalent numpy
array) and fall back to its implementation.
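
The distinction between type-conserving operations and fallbacks can be illustrated with Python's built-in `range` as an analogy (this is not the pandas implementation; a plain list stands in for the materialized Int64Index):

```python
r = range(1, 10, 2)            # 1, 3, 5, 7, 9

# Slicing can be expressed as a new start/stop/step, so the result
# is still a range and no elements are materialized.
sliced = r[1:4]
print(type(sliced).__name__)   # range
print(list(sliced))            # [3, 5, 7]

# An arbitrary selection cannot be expressed as start/stop/step,
# so the values have to be materialized.
picked = [r[i] for i in (0, 3, 1)]
print(picked)                  # [1, 7, 3]
```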

This PR also extends the functionality of the `Index()` constructor to allow
creation of `RangeIndexes()` with
```
Index(20)
Index(2, 20)
Index(0, 20, 2)
```
in analogy to
```
range(20)
range(2, 20)
range(0, 20, 2)
```

restore Index() fastpath precedence

Various fixes suggested by @jreback and @shoyer

Cache a private Int64Index object the first time it or its values are required.
Restore Index(5) as error. Restore its test. Allow Index(0, 5) and Index(0, 5, 1).
Make RangeIndex immutable. See start, stop, step properties.
In test_constructor(): check class, attributes (possibly including dtype).
In test_copy(): check that copy is not identical (but equal) to the existing.
In test_duplicates(): Assert is_unique and has_duplicates return correct values.
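
A minimal sketch of the caching and immutability described in this commit, under loud assumptions: `LazyRange` is a hypothetical class, and a tuple stands in for the cached Int64Index; the real code in this PR differs.

```python
class LazyRange:
    """Hypothetical illustration: a tuple stands in for the cached Int64Index."""

    def __init__(self, start, stop, step=1):
        # Bypass our own __setattr__ during construction.
        object.__setattr__(self, "_range", range(start, stop, step))
        object.__setattr__(self, "_cached", None)

    def __setattr__(self, name, value):
        raise AttributeError("LazyRange is immutable")

    @property
    def start(self):
        return self._range.start

    @property
    def stop(self):
        return self._range.stop

    @property
    def step(self):
        return self._range.step

    @property
    def values(self):
        # Materialize once, on first access, then reuse the cached copy.
        if self._cached is None:
            object.__setattr__(self, "_cached", tuple(self._range))
        return self._cached


idx = LazyRange(1, 10, 2)
print(idx._cached)               # None
print(idx.values)                # (1, 3, 5, 7, 9)
print(idx.values is idx.values)  # True
```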

fix slicing

fix view

Set RangeIndex as default index
* enh: set RangeIndex as default index
* fix: pandas.io.packers: encode() and decode() for RangeIndex
* enh: array argument pass-through
* fix: reindex
* fix: use _default_index() in pandas.core.frame.extract_index()
* fix: pandas.core.index.Index._is()
* fix: add RangeIndex to ABCIndexClass
* fix: use _default_index() in _get_names_from_index()
* fix: pytables tests
* fix: MultiIndex.get_level_values()
* fix: RangeIndex._shallow_copy()
* fix: null-size RangeIndex equals() comparison
* enh: make RangeIndex.is_unique immutable

enh: various performance optimizations

 * optimize argsort()
 * optimize tolist()
 * comment clean-up
@jreback
Contributor Author

jreback commented Jan 16, 2016

ok, anything else I suppose can just add to the enhancements list.

bombs away.

jreback added a commit that referenced this pull request Jan 16, 2016
@jreback jreback merged commit 723a147 into pandas-dev:master Jan 16, 2016
@hayd
Contributor

hayd commented Jan 16, 2016

💥

@shoyer
Member

shoyer commented Jan 17, 2016

@jreback it's great to get this in, but I'm a little disappointed that you merged this over my API objections... next time, can we please discuss such things more thoroughly and actually reach consensus before merging?

@jreback
Contributor Author

jreback commented Jan 17, 2016

@shoyer

well your objections would be to make an API change which does not preserve idempotency - so you can certainly raise that change in another issue but it would be a major break with the current Index design

this keeps this enhancement quite straightforward

further this change needs to live in master for a while - since we're planning on doing an rc in a couple of weeks,
letting us see how actual users interact is beneficial

further this was discussed quite a bit - last thing that we need is to have endless debates

@shoyer
Member

shoyer commented Jan 17, 2016

@jreback just made a new issue: #12067

I agree with your other points (especially testing), just would have appreciated a chance to raise the issue with a broader audience (like I did now) before you merged over my objection. I understand that this is not final (given that we haven't made a release yet) and that you could possibly be convinced on this (depending on how others chime in), but that wouldn't be obvious to new contributors. Just more generally, I would appreciate it if you tried harder to operate by consensus, per our new governance docs -- as opposed to "s/he who presses the merge button makes all final decisions!" :)

@wesm
Member

wesm commented Jan 17, 2016

I think we can better avoid miscommunication in the future with a "-1, until we resolve X Y Z". It's too bad that GitHub doesn't have a voting system (cf https://github.com/dear-github/dear-github). As a matter of process I've seen in other projects, usually it's someone other than the person who proposed the patch who presses the Merge button, or the Merge button cannot be pressed until a code reviewer explicitly gives a +1 for the patch. For large patches we can try to be better about this.

Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Development

Successfully merging this pull request may close these issues.

Add a more memory-efficient RangeIndex-sort of thing to avoid large arange(N) indexes in some cases
8 participants