Arithmetic by DataFrame index #7439

Closed
mmajewsk opened this issue Jun 12, 2014 · 15 comments
Labels
API Design Docs Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@mmajewsk

I encountered a problem doing arithmetic on an index. When the index is a time index (datetime64) and I want to compute something from it, I have no option other than to assign it to a column in the DataFrame object.
import pandas as pd

rng = pd.date_range('1/1/2011', periods=4, freq='H')
ts = pd.Series(rng, index=rng)

print "Data:"
print ts

print "\nSubstraction from column"
print ts-ts[0]

print "\nIndex to column"
ts['lol']=ts.index
print ts['lol']-ts['lol'][0]

print "\nSubstraction by index"
df = ts.index
print df-df[0]

result:

Data:
2011-01-01 00:00:00   2011-01-01 00:00:00
2011-01-01 01:00:00   2011-01-01 01:00:00
2011-01-01 02:00:00   2011-01-01 02:00:00
2011-01-01 03:00:00   2011-01-01 03:00:00
Freq: H, dtype: datetime64[ns]

Substraction from column
2011-01-01 00:00:00   00:00:00
2011-01-01 01:00:00   01:00:00
2011-01-01 02:00:00   02:00:00
2011-01-01 03:00:00   03:00:00
Freq: H, dtype: timedelta64[ns]

Index to column
lol   00:00:00
lol   01:00:00
lol   02:00:00
lol   03:00:00
dtype: timedelta64[ns]

Substraction by index
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-146-5a8539747b5a> in <module>()
     13 print "\nSubstraction by index"
     14 df = ts.index
---> 15 print df-df[0]
     16 

C:\winpy\WinPython-64bit-2.7.6.4\python-2.7.6.amd64\lib\site-packages\pandas\core\index.pyc in __sub__(self, other)
    853 
    854     def __sub__(self, other):
--> 855         return self.diff(other)
    856 
    857     def __and__(self, other):

C:\winpy\WinPython-64bit-2.7.6.4\python-2.7.6.amd64\lib\site-packages\pandas\core\index.pyc in diff(self, other)
    981 
    982         if not hasattr(other, '__iter__'):
--> 983             raise TypeError('Input must be iterable!')
    984 
    985         if self.equals(other):

TypeError: Input must be iterable!

Maybe it's just a conceptual problem, but if I want to do anything with a date index I have to keep an additional column (with the same values as the index!).
With huge datasets this can be a problem, because I have to store the same data twice or create an extra column for calculations, which is no better.

P.S. pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 37 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.0.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: 0.9.4
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
@jreback
Contributor

jreback commented Jun 12, 2014

Why are you using a Series in your example and not a DataFrame?

Have you tried reset_index()? This is exactly its purpose.
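
For readers following along on Python 3, here is a minimal sketch of the reset_index() route suggested above. The frame, its column name A, and the hourly timestamps are illustrative assumptions, not data from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one value column, hourly timestamps as the index.
idx = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"A": np.arange(4.0)}, index=idx)

# reset_index() returns a new frame; an unnamed index becomes a column called "index".
flat = df.reset_index()

# Ordinary Series arithmetic now applies to the former index values.
elapsed = flat["index"] - flat["index"].iloc[0]
print(elapsed)
```

Note that this copies the index into a column of a new frame, which is the duplication the original report wanted to avoid; df itself is left unchanged.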

@mmajewsk
Author

I did not mean to only subtract the first value; I used it only as an example.
Actually, I encountered this problem when I tried to compute an indefinite integral of one column. A similar problem occurred when I tried to use df.index as x in numpy.

As for why I'm using a Series: the problem is the same for DataFrames, I just needed a simple example.
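
The integration use case mentioned here might look roughly like the sketch below. It assumes a recent pandas release in which subtracting a Timestamp from a DatetimeIndex is elementwise (not the 0.13.1 behaviour in the report), and the constant signal and column name A are made up for illustration. The cumulative trapezoidal rule is written out by hand rather than taken from a library.

```python
import numpy as np
import pandas as pd

# Hypothetical signal sampled once per second; a constant 2.0 keeps the result easy to check.
idx = pd.date_range("2013-01-01", periods=5, freq=pd.Timedelta(seconds=1))
df = pd.DataFrame({"A": np.full(5, 2.0)}, index=idx)

# Elapsed seconds since the first sample, as a plain float array usable as x in NumPy.
x = np.asarray((df.index - df.index[0]) / pd.Timedelta(seconds=1), dtype=float)
y = df["A"].to_numpy()

# Cumulative trapezoidal rule: area of each interval, accumulated from zero.
areas = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
integral = np.concatenate(([0.0], np.cumsum(areas)))

print(pd.Series(integral, index=df.index))
```

No extra column is added to df; the elapsed-time array lives only in the local variable x.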

@jreback
Contributor

jreback commented Jun 12, 2014

What are you actually trying to do? You simply need to use reset_index().

@mmajewsk
Author

I want to use the index for some calculations, without changing the DataFrame or adding another column to it.

@jreback
Contributor

jreback commented Jun 13, 2014

I get that, pls show an example of what you want to do. The above does not show what you are saying.

@mmajewsk
Author

time = df.index-df.index[0]
time = time/M.np.timedelta64(1,'ms')
print time

@jreback
Contributor

jreback commented Jun 13, 2014

- for an Index is a set operation, NOT a timedelta type of operation
+ is a union operation

Simply convert to a Series and it will work

In [10]: df = DataFrame(np.random.randn(10,1),columns=['A'],index=pd.date_range('20130101',periods=10,freq='s'))

In [11]: df
Out[11]: 
                            A
2013-01-01 00:00:00 -0.590790
2013-01-01 00:00:01 -0.124065
2013-01-01 00:00:02  1.584884
2013-01-01 00:00:03  0.765875
2013-01-01 00:00:04 -1.760484
2013-01-01 00:00:05 -0.963729
2013-01-01 00:00:06 -1.045833
2013-01-01 00:00:07  0.641942
2013-01-01 00:00:08 -0.808226
2013-01-01 00:00:09  0.027466

In [12]: (df.index.to_series()-df.index[0])/np.timedelta64(1,'ms')
Out[12]: 
2013-01-01 00:00:00       0
2013-01-01 00:00:01    1000
2013-01-01 00:00:02    2000
2013-01-01 00:00:03    3000
2013-01-01 00:00:04    4000
2013-01-01 00:00:05    5000
2013-01-01 00:00:06    6000
2013-01-01 00:00:07    7000
2013-01-01 00:00:08    8000
2013-01-01 00:00:09    9000
Freq: S, dtype: float64

In [13]: (df.index.to_series()-df.index[0]).astype('timedelta64[ms]')
Out[13]: 
2013-01-01 00:00:00       0
2013-01-01 00:00:01    1000
2013-01-01 00:00:02    2000
2013-01-01 00:00:03    3000
2013-01-01 00:00:04    4000
2013-01-01 00:00:05    5000
2013-01-01 00:00:06    6000
2013-01-01 00:00:07    7000
2013-01-01 00:00:08    8000
2013-01-01 00:00:09    9000
Freq: S, dtype: float64
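
The same workaround as a self-contained Python 3 script, for anyone who wants to run it outside IPython. This is only a sketch: the zero-filled column A stands in for the random data above, and the frequency is passed as a Timedelta to avoid depending on version-specific frequency aliases.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", periods=10, freq=pd.Timedelta(seconds=1))
df = pd.DataFrame({"A": np.zeros(10)}, index=idx)

# to_series() gives a Series whose values (and index) are the timestamps,
# so subtraction is elementwise datetime arithmetic, not a set difference.
delta = df.index.to_series() - df.index[0]

# Dividing by a one-millisecond timedelta yields plain floats.
millis = delta / np.timedelta64(1, "ms")
print(millis)
```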

@shoyer
Member

shoyer commented Jun 13, 2014

@jreback pandas is not entirely set-like with index math. For example, subtracting a DateOffset object does work:

import pandas as pd

dates = pd.date_range('2000-01-01', periods=100)
offset = pd.tseries.offsets.MonthBegin()

print dates - offset

The same is true for numeric indices. Compare:

>>> pd.Index(np.arange(4)) + 40
Int64Index([40, 41, 42, 43], dtype='int64')
>>> pd.Index(np.arange(4)) + [40]
Int64Index([40, 41, 42, 43], dtype='int64')
>>> pd.Index(np.arange(4)) + np.array([40])
Int64Index([40, 41, 42, 43], dtype='int64')
>>> pd.Index(np.arange(4)) + pd.Index([40])
Int64Index([0, 1, 2, 3, 40], dtype='int64')

In my opinion, since Index is ndarray like, it would be less surprising if Index only supported math operations like an ndarray rather than like a set. Does anyone really use the overloaded operators for set operations? At the very least, pandas should pick one. Supporting both results in some very ambiguous cases (like my 2nd and 3rd examples).

@jreback
Contributor

jreback commented Jun 13, 2014

@shoyer this has long been an issue.

The problem is that union + is VERY common, while - is much less common, and symmetric difference ^ is also not too common.

Further, since an Index is very much like a Series, you would expect addition-type ops on an Int64Index to work, though I don't suspect this is that common.

I don't see a problem with adding/subtracting a DateOffset.

so it IS type sensitive.

Not sure what the answer is here. I think changing this could be a problem, it's pretty ingrained. That said, if someone wants to come up with a (better!) rational scheme, could take a look.

@shoyer
Member

shoyer commented Jun 13, 2014

My solution would be to only have named methods for set operations (union, intersection, difference, symmetric_difference) and leave the operators as purely mathematical (ndarray/Series-like). Typing s.union(t) is not that bad.

This seems much more Pythonic to me: "In the face of ambiguity, refuse the temptation to guess."

It would require a long deprecation cycle, but eventually we could fix cases like this issue. The current dual purpose operator overloading is clearly confusing to new users -- and I expect even for expert users in some cases.

For what it's worth, personally I do more index math than set operations. But ultimately which operation to use for infix operators comes down to whether Index is more set-like or ndarray-like, and for which functionality there are obvious alternatives. I would argue that Index is a bit more ndarray-like and infix notation is also much more obvious/standard for arrays than sets.
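
To make the proposal concrete, the named set methods listed above already exist on Index and can be used as in this small sketch (the two integer indexes are arbitrary examples):

```python
import pandas as pd

a = pd.Index([0, 1, 2, 3])
b = pd.Index([2, 3, 40])

print(a.union(b))                 # elements in either index
print(a.intersection(b))          # elements in both
print(a.difference(b))            # elements of a that are not in b
print(a.symmetric_difference(b))  # elements in exactly one of the two
```

Spelling the operation out this way leaves no doubt that a set operation, not elementwise math, is intended.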

@jreback
Contributor

jreback commented Jun 13, 2014

maybe but then you lose this, which IMHO is prob the most used

date_range('20130101',periods=5) + date_range('20130201',periods=5)

@jorisvandenbossche
Member

There is a small section on this in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#set-operations-on-index-objects, but should we make this a DOC issue for now? E.g. by adding a section in the 'gotchas' about this?

@jreback
Contributor

jreback commented Jun 14, 2014

yeh...let's make a doc issue for now / and/or think about this for 0.15

@jorisvandenbossche
Member

@jreback I think this can be closed now? The issue with set operations on index should be handled in the meantime? (they are deprecated for adding two indexes, and eg df.index-df.index[0] now works arithmetically)
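
A quick way to check the arithmetic behaviour described here, assuming a pandas release well after the 0.13.1 from the original report (the hourly index mirrors the one in the report, written for Python 3):

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))

# Elementwise subtraction of a Timestamp: the result is a TimedeltaIndex.
elapsed = idx - idx[0]
print(elapsed)

# Convert to a number of milliseconds without touching any DataFrame column.
millis = elapsed / pd.Timedelta(milliseconds=1)
print(millis)
```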

@jreback
Contributor

jreback commented Feb 24, 2016

yep, no need for addtl validation tests as this is already tested pretty

@jreback jreback closed this as completed Feb 24, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Feb 24, 2016
4 participants