Proposal: New Index type for binned data (IntervalIndex) #7640

shoyer · 2014-07-01T21:18:41Z

Design

The idea is to have a natural representation of the grids that ubiquitously appear in simulations and measurements of physical systems. Instead of referencing a single value, a grid cell references a range of values, based on the chosen discretization. Typically, cell boundaries would be specified by floating point numbers. In one dimension, a grid cell corresponds to an interval, the name we use here.

The key feature of IntervalIndex is that looking up an indexer should return all intervals in which the indexer's values fall. FloatIndex is a poor substitute, because of floating point precision issues, and because I don't want to label values by a single point.

An IntervalIndex is uniquely identified by its intervals and closed ('left' or 'right') properties, where intervals is an ndarray of shape (len(idx), 2) giving the bounds of each interval. Other useful properties for IntervalIndex would include left, right and mid, which should return arrays (indexes?) corresponding to the left, right or mid-points of each interval.

The constructor should allow the optional keyword argument breaks (an array of length len(idx) + 1) to be specified instead of intervals.

It's not entirely obvious what idx.values should be (idx.mid? strings like '(0, 1]'? an array of tuples or Interval objects?). I think the most useful choice for cross compatibility would probably be an ndarray like idx.mid.

IntervalIndex should support mathematical operations (e.g., idx + 1), which are calculated by vectorizing the operation over the breaks.
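
As a rough illustration of how breaks, intervals and the left/right/mid properties relate, here is a minimal NumPy sketch. The helper name intervals_from_breaks is made up for this example and is not a pandas function; it assumes the breaks are already sorted.

```python
import numpy as np


def intervals_from_breaks(breaks):
    """Pair consecutive break points into an array of shape (n, 2).

    Hypothetical helper for illustration only; not part of pandas.
    """
    breaks = np.asarray(breaks)
    return np.column_stack([breaks[:-1], breaks[1:]])


breaks = np.array([0, 1, 2])              # length len(idx) + 1
intervals = intervals_from_breaks(breaks)  # shape (len(idx), 2)

left = intervals[:, 0]
right = intervals[:, 1]
mid = (left + right) / 2

# "idx + 1" computed by vectorizing the operation over the breaks:
# every interval is shifted by one.
shifted = intervals_from_breaks(breaks + 1)
```

Going through the breaks keeps shared edges consistent: both neighbours of a break move together, so the result is still a valid set of adjacent intervals.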

Examples

An example already in pandas that should be an IntervalIndex is the levels property of the categorical returned by cut, which is currently an object array of strings:

>>> pd.cut([], [0, 5, 10]).levels
Index([u'(0, 5]', u'(5, 10]'], dtype='object')

Example usage:

>>> # should be equivalent to pd.cut([], [0, 1, 2]).levels
>>> idx = IntervalIndex(intervals=[(0, 1), (1, 2)]) 
>>> idx2 = IntervalIndex(breaks=[0, 1, 2]) # equivalent
>>> idx
IntervalIndex([(0, 1), (1, 2)], closed='right')
>>> idx.left
np.array([0, 1]) 
>>> idx.right
np.array([1, 2]) 
>>> idx.mid
np.array([0.5, 1.5]) 
>>> s = pd.Series([1, 2], idx)
>>> s
(0, 1]    1
(1, 2]    2
dtype: int64
>>> s.loc[1]
1
>>> s.loc[0.5]
1
>>> s.loc[0]
KeyError
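
For readers on a pandas release that ships IntervalIndex (the issue was closed for the 0.20.0 milestone), the lookups above can be checked roughly as follows. Note that this uses the released constructor pd.IntervalIndex.from_breaks rather than the breaks= keyword proposed here, so treat it as a sketch of the eventual behavior, not of this proposal's exact API.

```python
import pandas as pd

# Released spelling; the proposal above wrote IntervalIndex(breaks=[0, 1, 2]).
idx = pd.IntervalIndex.from_breaks([0, 1, 2])  # closed='right' by default
s = pd.Series([1, 2], index=idx)

at_edge = s.loc[1]    # 1 is the closed right edge of (0, 1]
inside = s.loc[0.5]   # 0.5 lies strictly inside (0, 1]

try:
    s.loc[0]          # 0 is the open left edge of (0, 1], so no interval matches
    missing_raises = False
except KeyError:
    missing_raises = True
```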

Implementation

An IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals. It is not required to be contiguous. A scalar Interval would correspond to a contiguous interval between start and stop values (e.g., given by integers, floating point numbers or datetimes).

For index lookups, I propose to do a binary search (np.searchsorted) on idx.left. If we add the constraint that all intervals must have a fixed width, we could calculate the bin using a formula in constant time, but I'm not sure the loss in flexibility would be worth the speedup.
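
To make the binary-search idea concrete, here is a minimal sketch of a scalar lookup built on np.searchsorted over the left edges. The function name locate and its signature are hypothetical; it assumes the intervals are sorted and non-overlapping, as described above, and it allows gaps between them.

```python
import numpy as np


def locate(left, right, value, closed="right"):
    """Return the position of the interval containing value.

    Hypothetical helper for illustration. Raises KeyError when value
    falls outside every interval (including into a gap).
    """
    left = np.asarray(left)
    right = np.asarray(right)
    if closed == "right":
        # Last interval whose left edge is strictly below value.
        i = np.searchsorted(left, value, side="left") - 1
        hit = i >= 0 and value <= right[i]
    elif closed == "left":
        # Last interval whose left edge is at or below value.
        i = np.searchsorted(left, value, side="right") - 1
        hit = i >= 0 and value < right[i]
    else:
        raise ValueError("closed must be 'left' or 'right'")
    if not hit:
        raise KeyError(value)
    return int(i)


# Intervals (0, 1] and (1, 2], as in the example usage above.
left, right = [0, 1], [1, 2]
pos_edge = locate(left, right, 1)      # falls in (0, 1]
pos_inside = locate(left, right, 1.5)  # falls in (1, 2]
```

The check against right[i] is what makes non-contiguous indexes work: searchsorted alone only finds the nearest left edge, not whether the value is actually covered.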

IntervalIndex should play nicely when used as the levels for a Categorical variable (#7217), but it is not the same as a CategoricalIndex (#7629). For example, an IntervalIndex should not allow for redundant values. To represent redundant or non-continuous intervals, you would need a Categorical or CategoricalIndex which uses an IntervalIndex for the levels. Calling df.reset_index() on a DataFrame with an IntervalIndex would create a new Categorical column.

Note: I'm not entirely sure if this design doc belongs here or on the mailing list (I'm happy to post it there if requested).

Here is the comment where I brought this up previously: #5460 (comment)

CC @hugadams -- I expect IntervalIndex would be very handy for your pyuvvis.

dsm054 · 2014-07-02T04:51:16Z

FWIW, in our local in-house n-dim library we have something similar (an IntervalAxis), and it works quite well.

jreback · 2014-07-02T11:48:18Z

@shoyer all for this!

I know you are against this, but I would encourage you to inherit from Index. OR create a new base class that is ABC like which we can eventually use as a base class for Index.

cpcloud · 2014-07-02T15:43:10Z

+1 here too. tho i think IntervalIndex might be a better name.

@shoyer great idea and excellent write up :)

shoyer · 2014-07-02T17:14:55Z

Thanks for the support! I'm not sure when I'll get around to implementing this, but I will add it to my open source backlog :).

@jreback Agreed, for a new index class inside pandas, it is OK to subclass from Index. I haven't thought too much about the details of implementing this in pandas yet.

@cpcloud Also agreed, IntervalIndex is a better name for the described functionality. I will update the first comment. CellIndex makes more sense for an index that is actually constrained to a grid. That would also be useful, but is less general.

ischwabacher · 2014-07-28T20:28:56Z

This would be very useful for me, too. Currently I'm using a DatetimeIndex that's one longer than my data, which are padded with a row of nans at the end, so that df.index[i]:df.index[i+1] is the "index" corresponding to iloc[i]. It seemed clever when I started the project.

This also seems like it will help make contiguous groupby (#5494) easier, since it gives a natural choice of index for the groupings.

jreback · 2014-07-28T20:39:27Z

@shoyer

if u (or anyone else)
could post test pairs for this would really help it along

essentially test cases for everything from construction to various indexing ops
that define as much behavior as possible

eg for Int64Index

result = Index([1,2,3])
expected = [1,2,3]
assert_almost_equal(result,expected)

hughesadam87 · 2014-07-28T21:03:27Z

@shoyer

Thanks for including me; sorry I didn't notice earlier (mail filter was throwing github alerts out). Indeed, I think a general interval index is probably a great addition; although, I lack the breadth in vision to see a general solution.

I did actually implement a hacky version of an interval index in pyuvvis that converts a datetime index to intervals of seconds, minutes etc... The main lesson I learned is that your interval index should be able to map back to the original data. To do this, I actually retain the original datetimeindex, and use metadata like "_interval=True" to navigate between all of the logic. In my case, this mapping is stored on the TimeSpectra object (dataframe + metadata).

I put a demo of this up in case seeing a hack in action might help in the design of a general solution.

http://nbviewer.ipython.org/github/hugadams/pyuvvis/blob/master/examples/Notebooks/intervals.ipynb

shoyer · 2014-07-28T22:16:54Z

@hugadams Looking at your notebook, it appears you may be thinking of a TimedeltaIndex?

The idea behind IntervalIndex is somewhat distinct -- although I can imagine that an IntervalIndex wrapping a TimedeltaIndex could be useful in some cases.

@jreback Sounds like a good idea, when I get the chance I will start writing some test cases and add them to this issue.

hughesadam87 · 2014-07-28T22:23:03Z

Ha ya exactly! Thanks, never even saw this thread. I'll post my notebook there for reference as well. I must not understand the intervalindex then.

shoyer · 2014-07-30T08:48:18Z

Here are a bunch of test cases: shoyer@838a597

I can open a PR if that makes things easier.

jreback · 2014-07-30T12:27:19Z

@shoyer that's a nice test suite...link is good for now. but of course expand to an actual impl!

shoyer · 2014-07-30T17:23:53Z

I have updated the first post with some revisions to implementation details (per my test cases). Basically, I realized that there is not a strong need to require that intervals be contiguous, and dropping that requirement should add some nice flexibility (e.g., the ability to subsample intervals with idx[::step]).

@jreback Haha, I thought that was your job? ;)

In all seriousness, I will probably get around to this at some point but the existing Index objects are pretty complex. #5080 would help -- I'm not looking forward to writing kludges around this also being an ndarray.

jreback · 2014-07-30T17:26:26Z

well this is the removing of ndarray from Index!

https://github.com/jreback/pandas/tree/index

almost done

@jreback

Fixes pandas-dev#7640, pandas-dev#8625

This is a work in progress, but it's far enough along that I'd love to get some feedback.

TODOs (more called out in the code):

- [ ] documentation + docstrings
- [ ] finish the index methods:
- [ ] `get_loc`
- [ ] `get_indexer`
- [ ] `slice_locs`
- [ ] comparison operations
- [ ] fix `is_monotonic` (pending pandas-dev#8680)
- [ ] ensure sorting works
- [ ] arithmetic operations (not essential for MVP)
- [ ] cythonize the bottlenecks:
- [ ] `from_breaks`
- [ ] `_data`
- [ ] `Interval`?
- [ ] `MultiIndex`
- [ ] `Categorical`/`cut`
- [ ] serialization
- [ ] lots more tests

CC @jreback @cpcloud @immerrr

shoyer · 2014-11-02T09:11:00Z

First draft PR is up in #8707. So far, this is actually much easier than I feared...

shoyer · 2015-01-28T07:58:37Z

For those of you not following along in #901 (which is honestly a dup of this issue), I am now thinking that the implementation here should probably use an actual interval-tree rather than relying on sortedness.

Also, for future reference: a suitable data-structure for an index of multi-dimensional intervals (an NDIntervalIndex) is an "R-tree". And in fact, this is quite a handy data-structure for GIS queries -- there is an R-tree now implemented in Geopandas: geopandas/geopandas#140

closes pandas-dev#7640 closes pandas-dev#8625

blalterman · 2018-05-07T14:54:01Z

I'm not sure if this belongs here or elsewhere. However, I'm trying to not clutter everything uselessly by just adding to the ever growing list of issues. If this belongs elsewhere, I'm happy to move it.

Is there a reason pd.cut returns a CategoricalIndex instead of an IntervalIndex? The current behavior is

>>> pd.cut(np.linspace(0, 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index().index
CategoricalIndex([   (0.0, 11.222], (11.222, 22.444], (22.444, 33.667],
                  (33.667, 44.889], (44.889, 56.111], (56.111, 67.333],
                  (67.333, 78.556], (78.556, 89.778],  (89.778, 101.0]],
                 categories=[(0.0, 11.222], (11.222, 22.444], (22.444, 33.667], (33.667, 44.889], (44.889, 56.111], (56.111, 67.333], (67.333, 78.556], (78.556, 89.778], ...], ordered=True, dtype='category')

instead of the following.

IntervalIndex([(0.0, 11.222], (11.222, 22.444], (22.444, 33.667], (33.667, 44.889], (44.889, 56.111], (56.111, 67.333], (67.333, 78.556], (78.556, 89.778], (89.778, 101.0]],
              closed='right',
              dtype='interval[float64]')

I would naively think that an IntervalIndex would make more sense here. It might also allow simplified plotting behavior such that

cut = pd.cut(np.linspace(0, 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index()
cut.plot(index_part="mid")

would plot the counts vs. the index mid point.

jreback · 2018-05-07T23:19:28Z

the categories are an interval index or whatever type we are actually binning

cut/qcut return categorical always
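
A small sketch of what this means in practice, assuming a pandas release that includes IntervalIndex: the object returned by cut is a Categorical, and the interval-specific attributes are reachable through its categories. The variable names and the tiny input are chosen only for illustration.

```python
import pandas as pd

values = [1, 6, 7]
bins = [0, 5, 10]

binned = pd.cut(values, bins)    # a Categorical
categories = binned.categories   # the bins themselves, as an IntervalIndex

# Count per bin, ordered by bin; the result is indexed by a CategoricalIndex.
counts = pd.Series(binned).value_counts().sort_index()

# Interval-only attributes such as mid live on the categories,
# not on the CategoricalIndex itself.
midpoints = counts.index.categories.mid
```

Because value_counts on categorical data keeps every category, the sorted index here lines up with the categories one to one; with filtered or reordered data that correspondence would need to be checked rather than assumed.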

blalterman · 2018-05-07T23:55:04Z

I understand. However, a categorical index does not have the same methods and properties available as an interval index. Is it at all reasonable to return an interval index when the categories are purely numeric? Are there reasons to use a categorical over an interval?

Ben

--------------------- B. L. Alterman Candidate, Applied Physics Solar and Heliospheric Research Group Climate and Space Sciences and Engineering University of Michigan balterma@umich.edu

jreback · 2018-05-08T00:02:19Z

@bla1089 I did consider returning an IntervalIndex from cut/qcut, but rejected it because:

it broke backward compat in a big way
the implementation of II is not efficient when stored in a Series (going to be better in 0.24 with ExtensionArray)
indexing is quite a bit simpler

So in theory it is possible, but I don't really see a compelling reason to switch. cats are a nicer holder type for data like this. What exactly is the issue?

blalterman · 2018-05-08T00:06:51Z

@jreback I find myself doing the following type of thing rather often:

cut = pd.cut(np.linspace(0, 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index()
cut.index = pd.IntervalIndex(cut.index).mid.astype(float)
cut.plot(drawstyle="steps-mid")

I hadn't seen a particular issue for it and I was wondering if I was missing something. The backwards compat issue is certainly relevant.

jreback added API Design labels Jul 2, 2014

jreback added this to the 0.15.0 milestone Jul 2, 2014

cpcloud added the Enhancement label Jul 2, 2014

shoyer changed the title ~~Proposal: New Index type for binned data (CellIndex)~~ Proposal: New Index type for binned data (IntervalIndex) Jul 2, 2014

hughesadam87 mentioned this issue Jul 28, 2014

ENH: TimeDeltaIndex and corresponding scalar #3009

Closed

6 tasks

jreback mentioned this issue Jul 31, 2014

CLN/INT: remove Index as a sub-class of NDArray #7891

Merged

11 tasks

shoyer mentioned this issue Oct 24, 2014

API/ENH: create Interval class #8625

Closed

shoyer mentioned this issue Nov 2, 2014

WIP/API/ENH: IntervalIndex #8707

Closed

35 tasks

shoyer mentioned this issue Nov 9, 2014

Feature Request: Array indices which understand units astropy/astropy#3053

Open

shoyer mentioned this issue Jan 28, 2015

timeseries branch, intervals w/ alternate durations #901

Closed

jreback pushed a commit to jreback/pandas that referenced this issue Feb 7, 2017

API/ENH: IntervalIndex

3acdf7d

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Feb 8, 2017

API/ENH: IntervalIndex

455b3fd

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Feb 15, 2017

API/ENH: IntervalIndex

b67b098

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Feb 15, 2017

API/ENH: IntervalIndex

439b335

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Feb 16, 2017

API/ENH: IntervalIndex

0193f57

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 8, 2017

API/ENH: IntervalIndex

e929645

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 14, 2017

API/ENH: IntervalIndex

3e872fa

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 14, 2017

API/ENH: IntervalIndex

c30ef44

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 17, 2017

API/ENH: IntervalIndex

e3eaacc

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 17, 2017

API/ENH: IntervalIndex

4c0217b

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 20, 2017

API/ENH: IntervalIndex

a4b82bd

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 24, 2017

API/ENH: IntervalIndex

7398643

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 27, 2017

API/ENH: IntervalIndex

2fc322d

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 28, 2017

API/ENH: IntervalIndex

2a58edb

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Mar 31, 2017

API/ENH: IntervalIndex

2611eee

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Apr 3, 2017

API/ENH: IntervalIndex

6225115

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Apr 4, 2017

API/ENH: IntervalIndex

604b399

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Apr 6, 2017

API/ENH: IntervalIndex

a46e895

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Apr 7, 2017

API/ENH: IntervalIndex

d4a3c5d

closes pandas-dev#7640 closes pandas-dev#8625

jreback pushed a commit to jreback/pandas that referenced this issue Apr 7, 2017

API/ENH: IntervalIndex

7dfcc51

closes pandas-dev#7640 closes pandas-dev#8625

jreback modified the milestones: 0.20.0, Next Major Release Apr 11, 2017

jreback pushed a commit to jreback/pandas that referenced this issue Apr 13, 2017

API/ENH: IntervalIndex

74162aa

closes pandas-dev#7640 closes pandas-dev#8625

jreback closed this as completed in 9991579 Apr 14, 2017

JiaweiZhuang mentioned this issue Jul 11, 2017

Allow DataArray to hold cell boundaries as coordinate variables pydata/xarray#1475

Open
