inconsistent behavior between DataFrame and Series when underlying objects are datetime #3320

Closed · mirage007 opened this issue Apr 11, 2013 · 9 comments

@mirage007

I just upgraded from 0.10.0 to 0.10.1 to fix a previous issue I had, and I think I'm seeing regressions or inconsistent behavior related to issue #2627.

The test below compares 0.10.0 and 0.10.1 under Windows 32-bit Python 2.7.2:

import numpy
print numpy.__version__
import pandas
from pandas import DataFrame, Series
print pandas.__version__
import datetime

dt1d = [datetime.datetime(2013,1,1) + datetime.timedelta(days= i) for i in range(5)]
dtSeries = pandas.Series(dt1d)
print dtSeries.dtype
print dtSeries.apply(lambda x: x.date())
#this works
print [x.date() for x in dtSeries.values]

dt2d = [[datetime.datetime(2013,1,1) + datetime.timedelta(days= i)] for i in range(5)]
dtDf = DataFrame(dt2d)
print dtDf.dtypes
print dtDf.icol(0).apply(lambda x: x.date())
#the following do not work on 0.10.1 but work on 0.10.0
print [x.date() for x in dtDf.values[:,0]]
print [x.date() for x in dtDf.icol(0).values]

The output for 0.10.0 in IPython:

1.6.2
0.10.0
object
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
[datetime.date(2013, 1, 1), datetime.date(2013, 1, 2), datetime.date(2013, 1, 3), datetime.date(2013, 1, 4), datetime.date(2013, 1, 5)]
0 object
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
Name: 0
[datetime.date(2013, 1, 1), datetime.date(2013, 1, 2), datetime.date(2013, 1, 3), datetime.date(2013, 1, 4), datetime.date(2013, 1, 5)]
[datetime.date(2013, 1, 1), datetime.date(2013, 1, 2), datetime.date(2013, 1, 3), datetime.date(2013, 1, 4), datetime.date(2013, 1, 5)]

The output in 0.10.1:

1.6.2
0.10.1
object
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
[datetime.date(2013, 1, 1), datetime.date(2013, 1, 2), datetime.date(2013, 1, 3), datetime.date(2013, 1, 4), datetime.date(2013, 1, 5)]
0 datetime64[ns]
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05

Name: 0

AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>()
     18 print dtDf.icol(0).apply(lambda x: x.date())
     19 #the following do not work on 0.10.1 but work on 0.10.0
---> 20 print [x.date() for x in dtDf.values[:,0]]
     21 print [x.date() for x in dtDf.icol(0).values]

AttributeError: 'numpy.datetime64' object has no attribute 'date'

@jreback
Contributor

jreback commented Apr 11, 2013

This all works in 0.11, please try that.

@mirage007
Author

I downloaded the zip of master and installed the package via python setup.py install.
Now I get even more issues, i.e. the Series part no longer works.

The output in 0.11.dev:

1.6.2
0.11.0.dev
datetime64[ns]
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
dtype: object

AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>()
     11 print dtSeries.apply(lambda x: x.date())
     12 #this works
---> 13 print [x.date() for x in dtSeries.values]
     14
     15 dt2d = [[datetime.datetime(2013,1,1) + datetime.timedelta(days= i)] for i in range(5)]

AttributeError: 'numpy.datetime64' object has no attribute 'date'

@jreback
Contributor

jreback commented Apr 11, 2013

The issue is that your column is of type object; how did you create this data?

@jreback
Contributor

jreback commented Apr 11, 2013

Are you trying to do something like this?

In [32]: df = pd.DataFrame([[datetime.datetime(2013,1,1) + datetime.timedelta(days= i)] for i in range(5)])

In [35]: df[0]
Out[35]: 
0   2013-01-01 00:00:00
1   2013-01-02 00:00:00
2   2013-01-03 00:00:00
3   2013-01-04 00:00:00
4   2013-01-05 00:00:00
Name: 0, dtype: datetime64[ns]

In [33]: df[0].apply(lambda x: x.date())
Out[33]: 
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
4    2013-01-05
Name: 0, dtype: object

In [34]: type(df[0].apply(lambda x: x.date())[0])
Out[34]: datetime.date

@mirage007
Author

Yes, the issue is that I understood the .values property to be a numpy array of the original data.

In the code above, dtSeries.values gives me back datetime64[ns] in 0.11, whereas np.array(dt1d) gives me an array of objects.

You can imagine this being used heavily, and probably interchangeably, in many places in code where we don't care about the indices and want to operate on the array directly, so the fact that it's inconsistent for dates causes some errors.

I notice that across versions, when the dates were stored as objects, the two were one and the same, but from the code and printout above it looks like the dtype after creation changes as the versions increase:

In 0.10.0 both dtSeries and dtDf create the column as object, so the operations are consistent whether using apply or iterating over .values.

In 0.10.1 Series still keeps them as object, whereas DataFrame converts them into datetime64[ns], so it breaks when not using apply.

In 0.11.0 it seems the fix was to have Series also convert the dtype to datetime64[ns], so that this no longer works:

print [x.date() for x in dtSeries.values]

but this works:

print [x.date() for x in dtSeries]

So how should I now think of .values?
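
For reference, a minimal sketch of approaches that stay consistent whether the underlying storage is object or datetime64[ns] (this only uses apply and plain iteration over the Series, both shown above; nothing beyond that is assumed):

import datetime
import pandas as pd

dt1d = [datetime.datetime(2013, 1, 1) + datetime.timedelta(days=i) for i in range(5)]
s = pd.Series(dt1d)

# apply hands each element over as a datetime-like scalar, so .date() works
# regardless of whether the dtype is object or datetime64[ns]
dates_via_apply = s.apply(lambda x: x.date())

# iterating the Series itself (rather than .values) also yields
# datetime-like scalars that have a .date() method
dates_via_iter = [x.date() for x in s]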

@jreback
Contributor

jreback commented Apr 11, 2013

The type conversions on datetime-like objects (started in 0.10.1) are meant to hide a lot of buggy numpy code (much of it has been fixed in numpy 1.7.0, but pandas still accepts 1.6.2). Storing data as object dtype in numpy is almost never a good idea if you can use a native type; the performance differences are drastic. Datetimes are inherently integers, so they work very naturally that way.
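
To illustrate the point that datetimes are just integers underneath, a minimal sketch (the int64 view is an assumption about the datetime64[ns] storage layout, which counts nanoseconds since the Unix epoch):

import datetime
import pandas as pd

s = pd.Series([datetime.datetime(2013, 1, 1) + datetime.timedelta(days=i) for i in range(5)])

# a datetime64[ns] value is an int64 count of nanoseconds since 1970-01-01,
# so vectorized datetime operations are really just integer operations
print(s.values.view('int64'))
# e.g. [1356998400000000000 1357084800000000000 ...]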

Accessing types like float with .values will be OK, but if you mix types (e.g. float32/float64), which is now possible in 0.11, .values gives you the raw upcast type. I cannot think of a single reason to ever use .values directly. All pandas objects respect the array interface, so they can be used in place of numpy objects, and they provide proper NaN handling, including with dates (which use NaT).

So unless you have a really good reason, there is no reason to use .values directly.
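
A minimal sketch of the upcasting behavior described above (the resulting float64 dtype is an assumption based on numpy's promotion rules, not something stated in this thread):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.array([1.0, 2.0], dtype='float32'),
    'b': np.array([3.0, 4.0], dtype='float64'),
})

print(df.dtypes)        # the columns keep their own dtypes inside pandas
print(df.values.dtype)  # .values combines them into one array, upcast to float64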

@jreback
Contributor

jreback commented Apr 11, 2013

Following up... internally, different dtypes are stored separately, so when you ask for .values you are saying: combine all of the data and shove it into a single container capable of holding it. If you try this in numpy you get weird results, so it is a very odd thing to do.

This is how numpy misbehaves (it's really just a printing issue, though):

In [38]: pd.Series([datetime.datetime(2013,1,1) + datetime.timedelta(days= i) for i in range(5) ])
Out[38]: 
0   2013-01-01 00:00:00
1   2013-01-02 00:00:00
2   2013-01-03 00:00:00
3   2013-01-04 00:00:00
4   2013-01-05 00:00:00
dtype: datetime64[ns]

In [39]: pd.Series([datetime.datetime(2013,1,1) + datetime.timedelta(days= i) for i in range(5) ]).values
Out[39]: 
array([1970-01-16 48:00:00, 1970-01-16 72:00:00, 1970-01-16 96:00:00,
       1970-01-16 120:00:00, 1970-01-16 144:00:00], dtype=datetime64[ns])
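
If the raw datetime64[ns] values do need to be turned back into Python datetime objects, a hedged workaround is to route them through a DatetimeIndex (this assumes pd.to_datetime and DatetimeIndex.to_pydatetime behave as they do in later pandas releases):

import datetime
import pandas as pd

s = pd.Series([datetime.datetime(2013, 1, 1) + datetime.timedelta(days=i) for i in range(5)])

# wrap the raw datetime64[ns] array in a DatetimeIndex; its elements are
# Timestamp objects, which do have a .date() method
idx = pd.to_datetime(s.values)
print([ts.date() for ts in idx])

# or convert all the way back to datetime.datetime objects
print(idx.to_pydatetime())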

@jreback
Contributor

jreback commented Apr 11, 2013

Close this?

@mirage007
Author

Yes please, and thanks for the clarification. I'm not online at the moment, if you would do the honors.
