Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: New Index type for binned data (IntervalIndex) #7640

Closed
shoyer opened this issue Jul 1, 2014 · 20 comments
Closed

Proposal: New Index type for binned data (IntervalIndex) #7640

shoyer opened this issue Jul 1, 2014 · 20 comments
Labels
API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation
Milestone

Comments

@shoyer
Copy link
Member

shoyer commented Jul 1, 2014

Design

The idea is to have a natural representation of the grids that ubiquitously appear in simulations and measurements of physical systems. Instead of referencing a single value, a grid cell references a range of values, based on the chosen discretization. Typically, cells boundaries would be specified by floating point numbers. In one dimension, a grid cell corresponds to an interval, the name we use here.

The key feature of IntervalIndex is that looking up an indexer should return all intervals in which the indexer's values fall. FloatIndex is a poor substitute, because of floating point precision issues, and because I don't want to label values by a single point.

A IntervalIndex is uniquely identified by its intervals and closed ('left' or 'right') properties, an ndarray of shape (len(idx), 2), indicating each interval. Other useful properties for IntervalIndex would include left, right and mid, which should return arrays (indexes?) corresponding to the left, right or mid-points of each interval.

The constructor should allow the optional keyword argument breaks (an array of length len(idx) + 1) to specified instead of intervals.

It's not entirely obvious what idx.values should be (idx.mid? strings like '(0, 1]'? an array of tuples or Interval objects?). I think the most useful choice for cross compatibility would probably be to an ndarray like idx.mid.

IntervalIndex should support mathematical operations (e.g., idx + 1), which are calculated by vectorizing the operation over the breaks.

Examples

An example already in pandas that should be a IntervalIndex is the levels property of categorical returned by cut, which is currently an object array of strings:

>>> pd.cut([], [0, 5, 10]).levels
Index([u'(0, 5]', u'(5, 10]'], dtype='object')

Example usage:

>>> # should be equivalent to pd.cut([], [0, 1, 2]).levels
>>> idx = IntervalIndex(intervals=[(0, 1), (1, 2)]) 
>>> idx2 = IntervalIndex(breaks=[0, 1, 2]) # equivalent
>>> idx
IntervalIndex([(0, 1), (1, 2)], closed='right')
>>> idx.left
np.array([0, 1]) 
>>> idx.right
np.array([1, 2]) 
>>> idx.mid
np.array([0.5, 1.5]) 
>>> s = pd.Series([1, 2], idx)
(0, 1]    1
(1, 2]    2
dtype: int64
>>> s.loc[1]
1
>>> s.loc[0.5]
1
>>> s.loc[0]
KeyError

Implementation

A IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals. It is not required to be contiguous. A scalar Interval would correspond to a contiguous interval between start and stop values (e.g., given by integers, floating point numbers or datetimes).

For index lookups, I propose to do a binary search (np.searchsorted) on idx.left. If we add the constraint that all intervals must have a fixed width, we could calculate the bin using a formula in constant time, but I'm not sure the loss in flexibility would be worth the speedup.

IntervalIndex should play nicely when used as the levels for Categorical variable (#7217), but it is not the same as a CategoricalIndex (#7629). For example, a IntervalIndex should not allow for redundant values. To represent redundant or non-continuous intervals, you would need to make in a Categorical or CategoricalIndex which uses a IntervalIndex for the levels. Calling df.reset_index() on an DataFrame with an IntervalIndex would create a new Categorical column.


Note: I'm not entirely sure if this design doc belongs here or on mailing list (I'm happy to post it there if requested).

Here is the comment where I brought this up previously: #5460 (comment)

CC @hugadams -- I expect IntervalIndex would be very handy for your pyuvvis.

@dsm054
Copy link
Contributor

dsm054 commented Jul 2, 2014

FWIW, in our local in-house n-dim library we have something similar (an IntervalAxis), and it works quite well.

@jreback jreback added this to the 0.15.0 milestone Jul 2, 2014
@jreback
Copy link
Contributor

jreback commented Jul 2, 2014

@shoyer all for this!

I know you are against this, but I would encorage you to inherit from Index. OR create a new base class that is ABC like which we can eventually use as a base class for Index.

@cpcloud
Copy link
Member

cpcloud commented Jul 2, 2014

+1 here too. tho i think IntervalIndex might be a better name.

@shoyer great idea and excellent write up :)

@shoyer
Copy link
Member Author

shoyer commented Jul 2, 2014

Thanks for the support! I'm not sure when I'll get around to implementing this, but I will add it to my source open backlog :).

@jreback Agreed, for an new index class inside pandas, it is OK to subclass from Index. I haven't thought too much about the details of implementing this in pandas yet.

@cpcloud Also agreed, IntervalIndex is a better name for the described functionality. I will update the first comment. CellIndex makes more sense for an index that is actually constrained to a grid. That would also be useful, but is less general.

@shoyer shoyer changed the title Proposal: New Index type for binned data (CellIndex) Proposal: New Index type for binned data (IntervalIndex) Jul 2, 2014
@ischwabacher
Copy link
Contributor

This would be very useful for me, too. Currently I'm using a DatetimeIndex that's one longer than my data, which are padded with a row of nans at the end, so that df.index[i]:df.index[i+1] is the "index" corresponding to iloc[i]. It seemed clever when I started the project.

This also seems like it will help make contiguous groupby (#5494) easier, since it gives a natural choice of index for the groupings.

@jreback
Copy link
Contributor

jreback commented Jul 28, 2014

@shoyer

if u (or anyone else)
could post test pairs for this would really help it along

essentially test cases for everything from construction to various indexing ops
that define as much behavior as possible

eg for Int64Index

result = Index([1,2,3])
expected = [1,2,3]
assert_almost_equal(result,expected)

@hughesadam87
Copy link

@shoyer

Thanks for including me; sorry I didn't notice earlier (mail filter was throwing github alerts out). Indeed, I think a general interval index is probably a great addition; although, I lack the breadth in vision to see a general solution.

I did actually implement a hacky version of an interval index in pyuvvis that converts a datetime index to intervals of seconds, minutes etc... The main lesson I learned is that your interval index should be able to map back to the original data. To do this, I actually retain the original datetimeindex, and use metadata like "_interval=True" to navigate between all of the logic. In my case, this mapping is stored on the TimeSpectra object (dataframe + metadata).

I put a demo of this up in case seeing a hack in action might help in the design of a general solution.

http://nbviewer.ipython.org/github/hugadams/pyuvvis/blob/master/examples/Notebooks/intervals.ipynb

@shoyer
Copy link
Member Author

shoyer commented Jul 28, 2014

@hugadams Looking at your notebook, it appears you may be thinking of a TimedeltaIndex?

The idea behind IntervalIndex is somewhat distinct -- although I can imagine that an IntervalIndex wrapping a TimedeltaIndex could be useful in some cases.

@jreback Sounds like a good idea, when I get the chance I will start writing some test cases and add them to this issue.

@hughesadam87
Copy link

Ha ya exactly! Thanks, never even saw this thread. I'll post my notebook there for reference as well. I must not understand the intervalindex then.

@shoyer
Copy link
Member Author

shoyer commented Jul 30, 2014

Here are a bunch of test cases: shoyer@838a597

I can open a PR if that makes things easier.

@jreback
Copy link
Contributor

jreback commented Jul 30, 2014

@shoyer that's a nice test suite...link is good for now. but of course expand to an actual impl!

@shoyer
Copy link
Member Author

shoyer commented Jul 30, 2014

I have updated the first post with some revisions to implementation details (per by test-cases). Basically, I realized that there is no a strong need to require that intervals be contiguous, and dropping that requirement should add some nice flexibility (e.g., the ability to subsample intervals with idx[::step]).

@jreback Haha, I thought that was your job? ;)

In all seriousness, I will probably get around to this at some point but the existing Index objects are pretty complex. #5080 would help -- I'm not looking forward to writing kludges around this also being an ndarray.

@jreback
Copy link
Contributor

jreback commented Jul 30, 2014

well this is the removing of ndarray from Index!

https://github.com/jreback/pandas/tree/index

almost done

shoyer added a commit to shoyer/pandas that referenced this issue Nov 2, 2014
Fixes pandas-dev#7640, pandas-dev#8625

This is a work in progress, but it's far enough along that I'd love to get
some feedback.

TODOs (more called out in the code):

- [ ] documentation + docstrings
- [ ] finish the index methods:
   - [ ] `get_loc`
   - [ ] `get_indexer`
   - [ ] `slice_locs`
- [ ] comparison operations
- [ ] fix `is_monotonic` (pending pandas-dev#8680)
- [ ] ensure sorting works
- [ ] arithmetic operations (not essential for MVP)
- [ ] cythonize the bottlenecks:
   - [ ] `from_breaks`
   - [ ] `_data`
   - [ ] `Interval`?
- [ ] `MultiIndex`
- [ ] `Categorical`/`cut`
- [ ] serialization
- [ ] lots more tests

CC @jreback @cpcloud @immerrr
@shoyer shoyer mentioned this issue Nov 2, 2014
35 tasks
@shoyer
Copy link
Member Author

shoyer commented Nov 2, 2014

First draft PR is up in #8707. So far, this is actually much easier than I feared...

@shoyer
Copy link
Member Author

shoyer commented Jan 28, 2015

For those of you not following along in #901 (which is honestly a dup of this issue), I am now thinking that the implementation here should probably use an actual interval-tree rather than relying on sortedness.

Also, for future reference: a suitable data-structure for an index of multi-dimensional intervals (an NDIntervalIndex) is an "R-tree". And in fact, this is quite a handy data-structure for GIS queries -- there is an R-tree now implemented in Geopandas: geopandas/geopandas#140

jreback pushed a commit to jreback/pandas that referenced this issue Feb 7, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Feb 8, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Feb 15, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Feb 15, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Feb 16, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 8, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 14, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 14, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 17, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 17, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 20, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 24, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 27, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 28, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Mar 31, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Apr 3, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Apr 4, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Apr 6, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Apr 7, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Apr 7, 2017
@jreback jreback modified the milestones: 0.20.0, Next Major Release Apr 11, 2017
jreback pushed a commit to jreback/pandas that referenced this issue Apr 13, 2017
@blalterman
Copy link

I'm not sure if this belongs here or elsewhere. However, I'm trying to not clutter everything uselessly by just adding to the ever growing list of issues. If this belongs elsewhere, I'm happy to move it.

Is there a reason pd.cut returns a CategoricalIndex instead of an IntervalIndex? The current behavior is

>>> pd.cut(np.linspace(0,` 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index().index
CategoricalIndex([   (0.0, 11.222], (11.222, 22.444], (22.444, 33.667],
                  (33.667, 44.889], (44.889, 56.111], (56.111, 67.333],
                  (67.333, 78.556], (78.556, 89.778],  (89.778, 101.0]],
                 categories=[(0.0, 11.222], (11.222, 22.444], (22.444, 33.667], (33.667, 44.889], (44.889, 56.111], (56.111, 67.333], (67.333, 78.556], (78.556, 89.778], ...], ordered=True, dtype='category')

instead of the following.

IntervalIndex([(0.0, 11.222], (11.222, 22.444], (22.444, 33.667], (33.667, 44.889], (44.889, 56.111], (56.111, 67.333], (67.333, 78.556], (78.556, 89.778], (89.778, 101.0]]
              closed='right',
              dtype='interval[float64]')

I would naively think that an IntervalIndex would make more sense here. It might also allow simplified plotting behavior such that

cut = pd.cut(np.linspace(0, 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index()
cut.plot(index_part="mid")

would plot the counts vs. the index mid point.

@jreback
Copy link
Contributor

jreback commented May 7, 2018

the categories are an interval index or whatever type we are actually binning

cut/qcut return categorical always

@blalterman
Copy link

blalterman commented May 7, 2018 via email

@jreback
Copy link
Contributor

jreback commented May 8, 2018

@bla1089 I did consider return an IntervalIndex from cut/qcut. But rejected as:

  • it broken backward compat in a big way
  • the implementation of II is not efficient when stored in a Series (going to better in 0.24 with ExtensionArray)
  • indexing is quite a bit simpler

So in theory it is possible, but I don't really see a compelling reason to switch. cats are a nicer holder type of data like this. What exactly is the issue?

@blalterman
Copy link

@jreback I find myself doing the following type of thing rather often:

cut = pd.cut(np.linspace(0, 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index()
cut.index = pd.IntervalIndex(cut.index).mid.astype(float)
cut.plot(drawstyle="steps-mid")

I hadn't seen a particular issue for it and I was wondering if I was missing something. The backwards compat issue is certainly relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants