Add Float64Index class? #236

Closed
wesm opened this issue Oct 14, 2011 · 25 comments · Fixed by #4850
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Comments

@wesm
Member

wesm commented Oct 14, 2011

Idea from conversation with @CRP in #235

@lodagro
Contributor

lodagro commented Mar 23, 2012

Same idea came up on this mailing list thread.

@kghose

kghose commented May 29, 2013

Yes, I would voice support for a general index that keeps the original index dtype. I used a float as index (it was time in seconds) and was delighted when everything, including df.plot(), worked swimmingly. But then I wasted 30 minutes figuring out why pylab.exp(df.index.values) was failing with the mysterious AttributeError: exp. pandas and Python normally make things so pleasant, but unexpected behavior like this reminds me of my dark days debugging C :(
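
The failure described above can be reproduced with plain NumPy, without pandas: a ufunc applied to an object-dtype array of Python floats raises, while the same values cast to float64 work. This is only a minimal sketch. The variable names are made up for illustration, and the exact exception type is an assumption that depends on the NumPy version (AttributeError as reported in this comment, TypeError in more recent releases), so both are caught.

```python
import numpy as np

# An object-dtype array holding Python floats, standing in for the .values
# of an object-dtype Index as described in this thread.
obj_values = np.array([0.0, 0.5, 1.0], dtype=object)

try:
    np.exp(obj_values)
    failed = False
except (AttributeError, TypeError) as err:
    # For dtype=object the ufunc tries to call an .exp() method on each
    # element, which plain Python floats do not have.
    failed = True
    print("np.exp on object array failed:", type(err).__name__)

# Casting to a real float dtype restores normal ufunc behaviour.
float_values = obj_values.astype(np.float64)
result = np.exp(float_values)
print(result.dtype)
```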

@baldwint

baldwint commented Jun 4, 2013

@kghose: I agree. I also use indices to store things like the time in seconds (e.g. oscilloscope traces), and am constantly having to do array(df.index.values, dtype=float) in place of a simple df.index.values to get an array that I can use with scipy fitting functions. It's an awkward idiom.

@cpcloud
Member

cpcloud commented Jun 4, 2013

+1 here. The Matplotlib issue has tripped me up a number of times when I needed to make custom plots.

@jreback
Contributor

jreback commented Jun 4, 2013

See http://pandas.pydata.org/pandas-docs/dev/indexing.html#fallback-indexing. It is rarely necessary to actually use a float index; you are often better served by using a column. The point of the index is to make lookup of individual elements faster, e.g. df[1.0], but this is quite tricky with floats, which is the reason for having an issue about this.

@kghose

kghose commented Jun 4, 2013

Yes, it's true, because whether two floats are the same depends on precision, but it's nice to be able to have that as a time index.
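
To make the precision point concrete, here is a small standard-library sketch (no pandas involved; the names are illustrative only). Two ways of arriving at "0.3 seconds" differ in the last bit, so an exact comparison and a hash-based lookup both miss, while a tolerance-based comparison succeeds.

```python
import math

# Two ways of computing "0.3 seconds" that differ in the last bit.
t_sum = 0.1 + 0.2
t_literal = 0.3

exact_match = (t_sum == t_literal)             # exact comparison
close_match = math.isclose(t_sum, t_literal)   # comparison within a tolerance

# A hash-based lookup, which is roughly what an exact label lookup
# amounts to, misses for the same reason.
samples = {0.1: "a", 0.2: "b", 0.3: "c"}
found = t_sum in samples

print(exact_match, close_match, found)  # prints: False True False
```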

@cpcloud
Member

cpcloud commented Jun 4, 2013

In my cases I don't really care about being able to select via __getitem__-style indexing. I usually want to loop over the (index, series) pairs, or I have them in a frame that I want to show as an image with the index in the columns. The object dtype makes matplotlib show the index to full precision, which is really annoying since I then have to go in and format the tick labels by hand. I wholeheartedly agree that float indexes are to be avoided, but sometimes they make sense. My cases are mostly plotting issues, which only matter when I can't use pandas' plotting abilities, which thankfully isn't that often.

@jreback
Contributor

jreback commented Jun 4, 2013

@kghose consider using a datetime64[ns] index (if you are dealing with time), or, as I said, use it as a column; you can do nearly everything you need (with an occasional set_index/reset_index). What are you trying to do? As @cpcloud indicates, the only real issue with not having a FloatIndex typed as float is for plotting (w/o manual conversion).

@cpcloud
Member

cpcloud commented Jun 4, 2013

General index dtype retention is probably not worth the amount of complexity and code that it would require to do it right. Datetime indexes are your friend. @jreback what about attempting coercion of object indexes when accessing the values attribute?

@jreback
Contributor

jreback commented Jun 4, 2013

Something like this is pretty easy (@cpcloud, can't change the way values works or everything breaks)

Is this useful?

In [1]: idx = pd.Index(np.arange(10).astype('float64'))

In [3]: idx
Out[3]: Index([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], dtype=object)

In [4]: idx.inferred_values
Out[4]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

@jreback
Contributor

jreback commented Jun 4, 2013

Of course for datetimes you get this

In [1]: idx = date_range('20130101',periods=5)

In [2]: idx
Out[2]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-05 00:00:00]
Length: 5, Freq: D, Timezone: None

In [3]: idx.inferred_values
Out[3]: 
array([1356998400000000000, 1357084800000000000, 1357171200000000000,
       1357257600000000000, 1357344000000000000])

@cpcloud
Member

cpcloud commented Jun 4, 2013

Well it's consistent... But it looks like it would only be useful in the float case... What would strings return?

@cpcloud
Member

cpcloud commented Jun 4, 2013

Shouldn't dates return array of date time?

@jreback
Contributor

jreback commented Jun 4, 2013

I could return anything (e.g. a datetime64[ns] numpy array is easy enough); strings will return the same (an object array).

@jreback
Contributor

jreback commented Jun 4, 2013

numpy 1.7 (this is the same as .values though)

In [5]: x = date_range('20130101',periods=5)

In [6]: x.inferred_values
Out[6]: 
array(['2012-12-31T19:00:00.000000000-0500',
       '2013-01-01T19:00:00.000000000-0500',
       '2013-01-02T19:00:00.000000000-0500',
       '2013-01-03T19:00:00.000000000-0500',
       '2013-01-04T19:00:00.000000000-0500'], dtype='datetime64[ns]')

In [7]: x.inferred_values[0]
Out[7]: numpy.datetime64('2012-12-31T19:00:00.000000000-0500')

@jreback
Contributor

jreback commented Jun 4, 2013

@cpcloud I think you are right, only float is different...

@cpcloud
Member

cpcloud commented Jun 4, 2013

i mean...i don't feel super strongly about this since it seems like there are so few use cases for float indices. i do think that it should return the "highest level" dtype possible that can be represented by numpy, e.g., return dates as dates like you show, if this is going to be done. again though, inferred_values will be the same as values in every case except float, and maybe you could return a 2D array for MultiIndex...

@baldwint

baldwint commented Jun 4, 2013

I've been using float indices a lot, so I would love inferred_values, or some property that gives you back an array cast to the same dtype as the one originally passed to the index= keyword argument.

A time axis is not the only use case for a float index; sometimes I work with spectral data where the X axis is a floating point value representing frequency or wavelength.

@jreback
Contributor

jreback commented Jun 4, 2013

Do this somewhere in your code (before you use it!)
This is a monkey patch

In [8]: import numpy as np

In [9]: import pandas as pd

In [10]: def inferred_values(self):
   ....:     # cast only when pandas infers that the index holds floats
   ....:     if self.inferred_type == 'floating':
   ....:         return np.asarray(self,dtype='float64')
   ....:     return np.asarray(self)
   ....: 

In [11]: pd.Index.inferred_values = property(inferred_values)

In [12]: idx = pd.Index(np.arange(5)+0.1)

In [13]: idx
Out[13]: Index([0.1, 1.1, 2.1, 3.1, 4.1], dtype=object)

In [14]: idx.inferred_values
Out[14]: array([ 0.1,  1.1,  2.1,  3.1,  4.1])

@cpcloud
Member

cpcloud commented Jun 4, 2013

> A time axis is not the only use case for a float index; sometimes I work with spectral data where the X axis is a floating point value representing frequency or wavelength.

You're absolutely right here; I also use it for things other than time. It would be great if there was some way to integrate pandas with quantities, but that's probably a long way away...

@njsmith

njsmith commented Sep 16, 2013

I (inadvertently) started a thread about this on the pystatsmodels list; thread link: https://groups.google.com/forum/#!topic/pystatsmodels/ua7WpNd-U8Q

My use case is also for time values (and DatetimeIndex is not useful for a variety of reasons, most notably that all I have are deltas against some unknown epoch defined as "whenever someone hit the record button"). My concern, though, isn't so much having a useful .values attribute (though I guess that might be nice too!), but having a reliable way to do time-based indexing, mostly for ad hoc interactive use. The main features I'm looking for are:

  • Make it reliably predictable whether any given indexing expression will go by time-in-milliseconds or offset-in-array
  • For a time-based indexing expression, slices should give all values that fall within their bounds, whether or not the exact endpoints are present in the array. (NB: my index will always be sorted.) This is to support ad hoc queries like "eh, let's see one second of data from channel P4" -> plot(df.loc[:1000, "P4"]). For this kind of usage, no-one cares whether there was a sample taken at exactly 1000 milliseconds or not. Currently .ix does interpret floating point slices like this, but .loc does not.
  • For bonus points: if a time does happen to be a nice exact integer value, then there should be some way to write down a time-based indexing expression that picks it out exactly.
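
The slice semantics requested in the second bullet can be sketched with the standard library alone. This is not pandas code: the helper `slice_by_bounds` and the sample data are hypothetical, and the sketch assumes a sorted index with both bounds treated as inclusive.

```python
from bisect import bisect_left, bisect_right

def slice_by_bounds(index, values, start=None, stop=None):
    """Return the values whose labels lie in [start, stop], both inclusive.

    `index` must be sorted ascending; neither bound has to be present in it.
    """
    lo = 0 if start is None else bisect_left(index, start)
    hi = len(index) if stop is None else bisect_right(index, stop)
    return values[lo:hi]

# Hypothetical recording: sample times in milliseconds and sample values.
times_ms = [0.0, 2.5, 5.0, 7.5, 10.0]
samples = [0, 1, 2, 3, 4]

window_a = slice_by_bounds(times_ms, samples, 1.0, 3.0)  # no endpoint is a label
window_b = slice_by_bounds(times_ms, samples, 2.1, 5)    # integer bound, exact hit
window_c = slice_by_bounds(times_ms, samples, stop=6.0)  # open start

print(window_a, window_b, window_c)  # prints: [1] [1, 2] [0, 1, 2]
```

Because the bounds are located by binary search on the labels, the result never depends on whether a bound is written as an int or a float, which is the predictability the first bullet asks for.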

@jreback
Contributor

jreback commented Sep 16, 2013

@njsmith

I made a couple of minor changes to .loc to get the following behavior, which I believe is still consistent with label-based indexing but does NOT fall back. (The end-points of a slice are allowed to not be in the index, which is slightly inconsistent, but they still select on a label basis, so I think that is ok.) Pls review and lmk.

In [1]: s = Series(np.arange(5), index=np.arange(5) * 2.5)

In [2]: s
Out[2]: 
0.0     0
2.5     1
5.0     2
7.5     3
10.0    4
dtype: int64

In [3]: # label based slicing

In [4]: s[1.0:3.0]
Out[4]: 
2.5    1
dtype: int64

In [5]: s.ix[1.0:3.0]
Out[5]: 
2.5    1
dtype: int64

In [6]: s.loc[1.0:3.0]
Out[6]: 
2.5    1
dtype: int64

In [7]: # exact indexing when found

In [8]: s[5.0]
Out[8]: 2

In [9]: s.loc[5.0]
Out[9]: 2

In [10]: s.ix[5.0]
Out[10]: 2

In [11]: # non-fallback location based should raise this error (__getitem__,ix fallback here)

In [12]: s.loc[4.0]
KeyError: 'the label [4.0] is not in the [index]'

In [13]: s[4.0] == s[4]
Out[13]: True

In [14]: s[4] == s[4]
Out[14]: True

# confusing slicing patterns in __getitem__/ix, loc is clear

In [15]: s.loc[2.0:5.0]
Out[15]: 
2.5    1
5.0    2
dtype: int64

In [16]: s.loc[2.0:5]
Out[16]: 
2.5    1
5.0    2
dtype: int64

In [17]: s.loc[2.1:5]
Out[17]: 
2.5    1
5.0    2
dtype: int64

In [18]: # these are what __getitem__/ix does

In [19]: s.ix[2.0:5.0]
Out[19]: 
2.5    1
5.0    2
dtype: int64

In [20]: s.ix[2.0:5]
Out[20]: 
5.0     2
7.5     3
10.0    4
dtype: int64

In [21]: s.ix[2.1:5]
Out[21]: 
2.5    1
5.0    2
dtype: int64

In [22]: s[2.0:5.0]
Out[22]: 
2.5    1
5.0    2
dtype: int64

In [23]: s[2.0:5]
Out[23]: 
5.0     2
7.5     3
10.0    4
dtype: int64

In [24]: s[2.1:5]
Out[24]: 
2.5    1
5.0    2
dtype: int64

@jreback
Contributor

jreback commented Sep 16, 2013

cc @dragoljub, cc @nehalecky

you guys have had interests in indexing in the past....not sure if you have any comments wrt this

@njsmith

njsmith commented Sep 16, 2013

BTW, I'd be in favor of a Float64Index class that specifically implemented the limited subset of stuff that makes sense for floating point indices and not the stuff that didn't. So e.g. trying to groupby would just be an error, and I can even see the argument for making scalar indexing an error. This would be much safer than the current situation, making @jreback happy :-). But, stuff like slicing and .values could still act the way people want, making users happy too.
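
As a rough illustration of the "limited subset" idea, here is a hypothetical sketch. It is not the pandas API and not what was eventually implemented; the class name `FloatIndexSketch` and its methods are invented for this example. It keeps the labels as a float64 array, supports label-based slicing, and refuses scalar lookup.

```python
import numpy as np

class FloatIndexSketch:
    """Hypothetical, deliberately small float index (not the pandas API)."""

    def __init__(self, labels):
        self._values = np.asarray(labels, dtype=np.float64)
        if np.any(np.diff(self._values) < 0):
            raise ValueError("labels must be sorted ascending")

    @property
    def values(self):
        # Always a float64 array, never dtype=object.
        return self._values

    def slice_positions(self, start, stop):
        """Positional slice covering labels in [start, stop], inclusive."""
        lo = int(np.searchsorted(self._values, start, side="left"))
        hi = int(np.searchsorted(self._values, stop, side="right"))
        return slice(lo, hi)

    def get_loc(self, label):
        # Scalar lookup is refused in this sketch because exact float
        # equality is the fragile operation.
        raise TypeError("scalar lookup is not supported; use a slice")

idx = FloatIndexSketch([0.0, 2.5, 5.0, 7.5, 10.0])
data = np.arange(5)
window = data[idx.slice_positions(2.0, 5.0)]
print(idx.values.dtype, window)
```

Whether scalar lookup should raise, as it does here, or be allowed for exact matches is exactly the design question left open in this comment.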

@dragoljub

For much of the data I work with I have been OK with using Object/Int64 index types, however I do also keep a copy of my indexers as data columns to enable easier plotting/slicing for some cases.

IMO, I'm 👍 on anything that enables a smoother interface to Matplotlib, Galry or scikit-learn.

8 participants