Add Float64Index class? #236

wesm · 2011-10-14T12:40:38Z

Idea from conversation with @CRP in #235

lodagro · 2012-03-23T07:51:15Z

Same idea came up on this mailing list thread.

kghose · 2013-05-29T16:23:39Z

Yes, I would voice support for a general index that keeps the original index dtype. I used a float as index (it was time in seconds) and was delighted when everything, including df.plot(), worked swimmingly. But then I wasted 30 minutes figuring out why pylab.exp(df.index.values) was failing with the mysterious AttributeError: exp. pandas and Python normally make things so pleasant, but unexpected behavior like this reminds me of my dark days debugging C :(

baldwint · 2013-06-04T02:12:37Z

@kghose: I agree. I also use indices to store things like the time in seconds (e.g. oscilloscope traces), and am constantly having to do array(df.index.values, dtype=float) in place of a simple df.index.values to get an array that I can use with scipy fitting functions. It's an awkward idiom.
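For readers following along, the cast described above can be sketched as follows. The trace data and the column name `volts` are made up for illustration; the only point is that an explicit cast gives a float64 array no matter how the index stores its labels internally.

```python
import numpy as np
import pandas as pd

# Hypothetical oscilloscope-style trace: time in seconds used as the index.
t = np.linspace(0.0, 1.0, 5)
df = pd.DataFrame({"volts": np.sin(2 * np.pi * t)}, index=t)

# Explicit cast: the result is float64 even if the index holds object labels.
times = np.asarray(df.index, dtype=float)

# The array can now be passed to numeric functions that reject object arrays.
decay = np.exp(-times)
```

This is only a workaround sketch; it does not change what df.index.values itself returns.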

cpcloud · 2013-06-04T05:02:03Z

+1 here. The Matplotlib issue has tripped me up a number of times when I needed to make custom plots.

jreback · 2013-06-04T14:51:13Z

See http://pandas.pydata.org/pandas-docs/dev/indexing.html#fallback-indexing. It is rarely necessary to actually use a float index; you are often better served by using a column. The point of the index is to make lookup of individual elements faster, e.g. df[1.0], but this is quite tricky with floats, which is the reason for having an issue about this.

kghose · 2013-06-04T15:17:25Z

Yes, it's true, because whether two floats are the same depends on precision, but it's nice to be able to have that as a time index.
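A small illustration of the precision point, using plain NumPy rather than any pandas indexing API. The labels are made up; the tolerance is whatever np.isclose uses by default, which is an assumption and not something pandas applies to index lookups.

```python
import numpy as np

# Two values that look equal on paper but differ in binary floating point.
a = 0.1 + 0.2
b = 0.3
exact_match = (a == b)                  # exact comparison fails
close_match = bool(np.isclose(a, b))    # tolerance-based comparison succeeds

# The same effect on an array of hypothetical index labels.
labels = np.array([0.1, 0.2, 0.1 + 0.2])
exact_hits = np.flatnonzero(labels == 0.3)
tolerant_hits = np.flatnonzero(np.isclose(labels, 0.3))
```

An exact label lookup behaves like exact_hits here, which is why a label computed one way may not be found when it was stored via a different computation.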

cpcloud · 2013-06-04T15:17:48Z

In my cases I don't really care about being able to select via getitem-style indexing. I usually want to loop over the index/series pairs, or I have them in a frame that I want to show as an image with the index in the columns. The object dtype makes matplotlib show the index to full precision, which is really annoying since I then have to go in and format the tick labels by hand. I wholeheartedly agree that float indexes are to be avoided, but sometimes they make sense. My cases are mostly plotting issues, which only matter when I can't use pandas' plotting abilities, which thankfully isn't that often.

jreback · 2013-06-04T15:21:34Z

@kghose consider using a datetime64[ns] index (if you are dealing with time) or, as I said, use it as a column; you can do nearly everything you need (with an occasional set_index/reset_index). What are you trying to do? As @cpcloud indicates, the only real issue with not having a FloatIndex typed as float is plotting (without manual conversion).
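The column-based approach suggested above might look roughly like this. The frame, the column names `t` and `signal`, and the window bounds are all invented for illustration.

```python
import pandas as pd

# Hypothetical data: keep the float time axis as an ordinary column.
df = pd.DataFrame({"t": [0.0, 0.5, 1.0, 1.5],
                   "signal": [1.0, 0.6, 0.4, 0.2]})

# As a column, "t" is a float64 Series, so range selection is a boolean mask
# and does not depend on exact label matches.
window = df[(df["t"] >= 0.4) & (df["t"] <= 1.1)]

# Move the column into the index only when an operation needs it there,
# then move it back out again.
indexed = df.set_index("t")
restored = indexed.reset_index()
```

The trade-off is the occasional set_index/reset_index round trip mentioned in the comment.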

cpcloud · 2013-06-04T15:36:40Z

General index dtype retention is probably not worth the amount of complexity and code that it would require to do it right. Datetime indexes are your friend. @jreback what about attempting coercion of object indexes when accessing the values attribute?

jreback · 2013-06-04T15:50:17Z

Something like this is pretty easy (@cpcloud, can't change the way values works or everything breaks)

Is this useful?

In [1]: idx = pd.Index(np.arange(10).astype('float64'))

In [3]: idx
Out[3]: Index([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], dtype=object)

In [4]: idx.inferred_values
Out[4]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

jreback · 2013-06-04T15:54:16Z

Of course for datetimes you get this

In [1]: idx = date_range('20130101',periods=5)

In [2]: idx
Out[2]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-05 00:00:00]
Length: 5, Freq: D, Timezone: None

In [3]: idx.inferred_values
Out[3]: 
array([1356998400000000000, 1357084800000000000, 1357171200000000000,
       1357257600000000000, 1357344000000000000])

cpcloud · 2013-06-04T16:00:50Z

Well it's consistent... But it looks like it would only be useful in the float case... What would strings return?

cpcloud · 2013-06-04T16:03:12Z

Shouldn't dates return array of date time?

jreback · 2013-06-04T16:10:30Z

I could return anything; a datetime64[ns] numpy array, for example, is easy enough. Strings will return the same thing as .values (an object array).

jreback · 2013-06-04T16:14:34Z

With numpy 1.7 (this is the same as .values, though):

In [5]: x = date_range('20130101',periods=5)

In [6]: x.inferred_values
Out[6]: 
array(['2012-12-31T19:00:00.000000000-0500',
       '2013-01-01T19:00:00.000000000-0500',
       '2013-01-02T19:00:00.000000000-0500',
       '2013-01-03T19:00:00.000000000-0500',
       '2013-01-04T19:00:00.000000000-0500'], dtype='datetime64[ns]')

In [7]: x.inferred_values[0]
Out[7]: numpy.datetime64('2012-12-31T19:00:00.000000000-0500')

jreback · 2013-06-04T16:15:17Z

@cpcloud I think you are right, only float is different...

cpcloud · 2013-06-04T16:50:33Z

I mean... I don't feel super strongly about this, since it seems like there are so few use cases for float indices. I do think that it should return the "highest level" dtype possible that can be represented by numpy, e.g., return dates as dates like you show, if this is going to be done. Again though, inferred_values will be the same as values in every case except float, and maybe you could return a 2D array for MultiIndex...

baldwint · 2013-06-04T17:11:00Z

I've been using float indices a lot, so I would love inferred_values, or some property that gives you back an array cast to the same dtype as the one originally passed to the index= keyword argument.

A time axis is not the only use case for a float index; sometimes I work with spectral data where the X axis is a floating point value representing frequency or wavelength.

jreback · 2013-06-04T17:19:09Z

Do this somewhere in your code, before you first use it. Note that this is a monkey patch.

In [8]: import numpy as np

In [9]: import pandas as pd

In [10]: def inferred_values(self):
   ....:     # cast float-like indexes to float64; return everything else unchanged
   ....:     if self.inferred_type == 'floating':
   ....:         return np.asarray(self, dtype='float64')
   ....:     return np.asarray(self)
   ....: 

In [11]: pd.Index.inferred_values = property(inferred_values)

In [12]: idx = pd.Index(np.arange(5) + 0.1)

In [13]: idx
Out[13]: Index([0.1, 1.1, 2.1, 3.1, 4.1], dtype=object)

In [14]: idx.inferred_values
Out[14]: array([ 0.1,  1.1,  2.1,  3.1,  4.1])

cpcloud · 2013-06-04T17:24:02Z

> A time axis is not the only use case for a float index; sometimes I work with spectral data where the X axis is a floating point value representing frequency or wavelength.

You're absolutely right here; I also use it for things other than time. It would be great if there was some way to integrate pandas with quantities, but that's probably a long way away...

njsmith · 2013-09-16T14:21:41Z

I (inadvertently) started a thread about this on the pystatsmodels list; thread link: https://groups.google.com/forum/#!topic/pystatsmodels/ua7WpNd-U8Q

My use case is also for time values (and DatetimeIndex is not useful for a variety of reasons, most notably that all I have are deltas against some unknown epoch defined as "whenever someone hit the record button"). My concern, though, isn't so much having a useful .values attribute (though I guess that might be nice too!), but having a reliable way to do time-based indexing, mostly for ad hoc interactive use. The main features I'm looking for are:

- Make it reliably predictable whether any given indexing expression will go by time-in-milliseconds or offset-in-array.
- For a time-based indexing expression, slices should give all values that fall within their bounds, whether or not the exact endpoints are present in the array. (NB: my index will always be sorted.) This is to support ad hoc queries like "eh, let's see one second of data from channel P4" -> plot(df.loc[:1000, "P4"]). For this kind of usage, no one cares whether there was a sample taken at exactly 1000 milliseconds or not. Currently .ix does interpret floating point slices like this, but .loc does not.
- For bonus points: if a time does happen to be a nice exact integer value, then there should be some way to write down a time-based indexing expression that picks it out exactly.
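The slice semantics requested here (every sample whose label falls inside the bounds, whether or not the endpoints exist) can be sketched with a binary search over the sorted labels. This is an illustration in plain NumPy, not how pandas implements .loc; the helper name slice_by_value and the sample data are hypothetical.

```python
import numpy as np

def slice_by_value(sorted_labels, start, stop):
    """Return a positional slice covering start <= label <= stop.

    Assumes sorted_labels is ascending. The endpoints need not be present.
    """
    lo = np.searchsorted(sorted_labels, start, side="left")
    hi = np.searchsorted(sorted_labels, stop, side="right")
    return slice(int(lo), int(hi))

# Hypothetical recording: sample times in milliseconds and sample values.
times_ms = np.array([0.0, 2.5, 5.0, 7.5, 10.0])
samples = np.arange(5)

# Neither 1.0 nor 3.0 is a label, yet the sample at 2.5 is selected.
picked = samples[slice_by_value(times_ms, 1.0, 3.0)]

# Endpoints that are present are included on both sides.
inclusive = samples[slice_by_value(times_ms, 2.5, 7.5)]
```

Because the lookup is by value only, there is no ambiguity between "label" and "position" for integer-looking bounds such as 5.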

jreback · 2013-09-16T14:44:47Z

@njsmith

I made a couple of minor changes to .loc to get the following behavior, which I believe is still consistent with label-based indexing but does NOT fall back to positional indexing. (The end-points of a slice are allowed to not be in the index, which is slightly inconsistent, but they still select on a label basis, so I think that is ok.) Pls review and lmk.

In [1]: s = Series(np.arange(5), index=np.arange(5) * 2.5)

In [2]: s
Out[2]: 
0.0     0
2.5     1
5.0     2
7.5     3
10.0    4
dtype: int64

In [3]: # label based slicing

In [4]: s[1.0:3.0]
Out[4]: 
2.5    1
dtype: int64

In [5]: s.ix[1.0:3.0]
Out[5]: 
2.5    1
dtype: int64

In [6]: s.loc[1.0:3.0]
Out[6]: 
2.5    1
dtype: int64

In [7]: # exact indexing when found

In [8]: s[5.0]
Out[8]: 2

In [9]: s.loc[5.0]
Out[9]: 2

In [10]: s.ix[5.0]
Out[10]: 2

In [11]: # non-fallback location based should raise this error (__getitem__,ix fallback here)

In [12]: s.loc[4.0]
KeyError: 'the label [4.0] is not in the [index]'

In [13]: s[4.0] == s[4]
Out[13]: True

In [14]: s[4] == s[4]
Out[14]: True

# confusing slicing patterns in __getitem__/ix, loc is clear

In [15]: s.loc[2.0:5.0]
Out[15]: 
2.5    1
5.0    2
dtype: int64

In [16]: s.loc[2.0:5]
Out[16]: 
2.5    1
5.0    2
dtype: int64

In [17]: s.loc[2.1:5]
Out[17]: 
2.5    1
5.0    2
dtype: int64

In [18]: # these are what __getitem__/ix does

In [19]: s.ix[2.0:5.0]
Out[19]: 
2.5    1
5.0    2
dtype: int64

In [20]: s.ix[2.0:5]
Out[20]: 
5.0     2
7.5     3
10.0    4
dtype: int64

In [21]: s.ix[2.1:5]
Out[21]: 
2.5    1
5.0    2
dtype: int64

In [22]: s[2.0:5.0]
Out[22]: 
2.5    1
5.0    2
dtype: int64

In [23]: s[2.0:5]
Out[23]: 
5.0     2
7.5     3
10.0    4
dtype: int64

In [24]: s[2.1:5]
Out[24]: 
2.5    1
5.0    2
dtype: int64

jreback · 2013-09-16T15:09:27Z

cc @dragoljub, cc @nehalecky

you guys have had interests in indexing in the past....not sure if you have any comments wrt this

njsmith · 2013-09-16T18:16:37Z

BTW, I'd be in favor of a Float64Index class that specifically implemented the limited subset of stuff that makes sense for floating point indices and not the stuff that didn't. So e.g. trying to groupby would just be an error, and I can even see the argument for making scalar indexing an error. This would be much safer than the current situation, making @jreback happy :-). But, stuff like slicing and .values could still act the way people want, making users happy too.
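To make the "limited subset" idea concrete, here is a rough sketch of what such a restricted float axis could look like. SortedFloatAxis is a hypothetical toy class written for this illustration; it is not a pandas class and says nothing about how a real Float64Index would be designed. It supports value-based slicing and a float64 .values, and deliberately rejects scalar lookup.

```python
import numpy as np

class SortedFloatAxis:
    """Hypothetical restricted float axis: slicing and .values only."""

    def __init__(self, labels):
        self._labels = np.asarray(labels, dtype="float64")
        if np.any(np.diff(self._labels) < 0):
            raise ValueError("labels must be sorted in ascending order")

    @property
    def values(self):
        # Always a float64 array, never object dtype.
        return self._labels

    def __getitem__(self, key):
        # Scalar lookup is refused because exact float equality is fragile.
        if not isinstance(key, slice):
            raise TypeError("scalar lookup is not supported; use a slice")
        if key.step is not None:
            raise TypeError("slice steps are not supported")
        n = len(self._labels)
        lo = 0 if key.start is None else int(
            np.searchsorted(self._labels, key.start, side="left"))
        hi = n if key.stop is None else int(
            np.searchsorted(self._labels, key.stop, side="right"))
        # Return a positional slice that can be applied to the data array.
        return slice(lo, hi)

axis = SortedFloatAxis([0.0, 2.5, 5.0, 7.5, 10.0])
data = np.arange(5)
first_part = data[axis[:5.0]]
```

Raising on scalar keys is the stricter of the two options floated in the comment; a real implementation could equally well allow exact matches.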

dragoljub · 2013-09-16T18:43:18Z

For much of the data I work with I have been OK with using Object/Int64 index types, however I do also keep a copy of my indexers as data columns to enable easier plotting/slicing for some cases.

IMO, anything that enables a smoother interface to MatplotLib, Galry or Scikit-Learn I'm 👍

dalejung mentioned this issue Mar 29, 2013

MultiIndex promotes float64 dtype to object #3211

Closed

jreback mentioned this issue Sep 16, 2013

BUG/ENH: provide better .loc based semantics for float based indicies, continuing not to fallback (related GH236) #4850

Merged

jreback closed this as completed in #4850 Sep 25, 2013
