inconsistent behavior between DataFrame and Series when underlying objects are datetime #3320

Closed · mirage007 opened this issue Apr 11, 2013 · 9 comments

@mirage007

I just upgraded from 0.10.0 to 0.10.1 to fix a previous issue I had, and I think I'm seeing regressions or inconsistent behavior related to issue #2627.

The test below compares 0.10.0 and 0.10.1 under Windows 32-bit Python 2.7.2:

import numpy
print numpy.__version__
import pandas
from pandas import DataFrame, Series
print pandas.__version__
import datetime

dt1d = [datetime.datetime(2013,1,1) + datetime.timedelta(days= i) for i in range(5)]
dtSeries = pandas.Series(dt1d)
print dtSeries.dtype
print dtSeries.apply(lambda x: x.date())
#this works
print [x.date() for x in dtSeries.values]

dt2d = [[datetime.datetime(2013,1,1) + datetime.timedelta(days= i)] for i in range(5)]
dtDf = DataFrame(dt2d)
print dtDf.dtypes
print dtDf.icol(0).apply(lambda x: x.date())
#the following do not work on 0.10.1 but work on 0.10.0
print [x.date() for x in dtDf.values[:,0]]
print [x.date() for x in dtDf.icol(0).values]

The output for 0.10.0 in IPython:

1.6.2
0.10.0
object
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
[datetime.date(2013, 1, 1), datetime.date(2013, 1, 2), datetime.date(2013, 1, 3), datetime.date(2013, 1, 4), datetime.date(2013, 1, 5)]
0 object
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
Name: 0
[datetime.date(2013, 1, 1), datetime.date(2013, 1, 2), datetime.date(2013, 1, 3), datetime.date(2013, 1, 4), datetime.date(2013, 1, 5)]
[datetime.date(2013, 1, 1), datetime.date(2013, 1, 2), datetime.date(2013, 1, 3), datetime.date(2013, 1, 4), datetime.date(2013, 1, 5)]

The output in 0.10.1:

1.6.2
0.10.1
object
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
[datetime.date(2013, 1, 1), datetime.date(2013, 1, 2), datetime.date(2013, 1, 3), datetime.date(2013, 1, 4), datetime.date(2013, 1, 5)]
0 datetime64[ns]
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05

Name: 0

AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>()
     18 print dtDf.icol(0).apply(lambda x: x.date())
     19 #the following do not work on 0.10.1 but work on 0.10.0
---> 20 print [x.date() for x in dtDf.values[:,0]]
     21 print [x.date() for x in dtDf.icol(0).values]

AttributeError: 'numpy.datetime64' object has no attribute 'date'

@jreback
Contributor

jreback commented Apr 11, 2013

This all works in 0.11, please try that.

@mirage007
Author

I downloaded the zip of master and installed the package via python setup.py install.
Now I get even more issues, i.e. the Series part no longer works.

The output in 0.11.dev:

1.6.2
0.11.0.dev
datetime64[ns]
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
dtype: object

AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>()
     11 print dtSeries.apply(lambda x: x.date())
     12 #this works
---> 13 print [x.date() for x in dtSeries.values]
     14
     15 dt2d = [[datetime.datetime(2013,1,1) + datetime.timedelta(days= i)] for i in range(5)]

AttributeError: 'numpy.datetime64' object has no attribute 'date'

@jreback
Contributor

jreback commented Apr 11, 2013

The issue is that your column is of type object; how did you create this data?

@jreback
Contributor

jreback commented Apr 11, 2013

Are you trying to do something like this?

In [32]: df = pd.DataFrame([[datetime.datetime(2013,1,1) + datetime.timedelta(days= i)] for i in range(5)])

In [35]: df[0]
Out[35]: 
0   2013-01-01 00:00:00
1   2013-01-02 00:00:00
2   2013-01-03 00:00:00
3   2013-01-04 00:00:00
4   2013-01-05 00:00:00
Name: 0, dtype: datetime64[ns]

In [33]: df[0].apply(lambda x: x.date())
Out[33]: 
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
4    2013-01-05
Name: 0, dtype: object

In [34]: type(df[0].apply(lambda x: x.date())[0])
Out[34]: datetime.date

@mirage007
Author

Yes, the issue is that I understood the .values property to be a numpy array of the original data.

In the code above, dtSeries.values gives me back datetime64[ns] in 0.11, whereas np.array(dt1d) gives me an array of objects.

You can imagine this being used heavily, and probably interchangeably, in many places in code where we don't care about the indices and want to operate on the array directly, so the fact that it's inconsistent for dates causes some errors.

I notice that across versions, when the dates were stored as objects, the two were one and the same, but from the code and printout above it looks like the dtype after creation changes as the versions increase:

In 0.10.0 both dtSeries and dtDf create the column as object, so the operations are consistent whether using apply or iterating over .values.

In 0.10.1 Series still keeps them as object, whereas DataFrame converts them into datetime64[ns], so it breaks when not using apply.

In 0.11.0 it seems the fix was to have Series also convert the dtype to datetime64[ns], so that this no longer works:

print [x.date() for x in dtSeries.values]

but this works:

print [x.date() for x in dtSeries]

So how should I now think of .values?
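
For reference, a minimal sketch of approaches that stay consistent whether the underlying storage is object or datetime64[ns] (this only uses apply and plain iteration over the Series, both shown above; nothing beyond that is assumed):

import datetime
import pandas as pd

dt1d = [datetime.datetime(2013, 1, 1) + datetime.timedelta(days=i) for i in range(5)]
s = pd.Series(dt1d)

# apply hands each element over as a datetime-like scalar, so .date() works
# regardless of whether the dtype is object or datetime64[ns]
dates_via_apply = s.apply(lambda x: x.date())

# iterating the Series itself (rather than .values) also yields
# datetime-like scalars that have a .date() method
dates_via_iter = [x.date() for x in s]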

@jreback
Contributor

jreback commented Apr 11, 2013

The type conversions on datetime-like objects (started in 0.10.1) are meant to hide a lot of buggy numpy code (much of it has been fixed in numpy 1.7.0, but pandas still accepts 1.6.2). Storing data as object dtype in numpy is almost never a good idea if you can use a native type; the performance differences are drastic. Datetimes are inherently integers, so they work very naturally that way.
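
To illustrate the point that datetimes are just integers underneath, a minimal sketch (the int64 view is an assumption about the datetime64[ns] storage layout, which counts nanoseconds since the Unix epoch):

import datetime
import pandas as pd

s = pd.Series([datetime.datetime(2013, 1, 1) + datetime.timedelta(days=i) for i in range(5)])

# a datetime64[ns] value is an int64 count of nanoseconds since 1970-01-01,
# so vectorized datetime operations are really just integer operations
print(s.values.view('int64'))
# e.g. [1356998400000000000 1357084800000000000 ...]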

Accessing types like float with .values will be OK, but if you mix types (e.g. float32/float64), which is now possible in 0.11, .values gives you the raw upcast type. I cannot think of a single reason to ever use .values directly. All pandas objects respect the array interface, so they can be used in place of numpy objects, and they provide proper NaN handling, including with dates (which use NaT).

So unless you have a really good reason, there is no reason to use .values directly.
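
A minimal sketch of the upcasting behavior described above (the resulting float64 dtype is an assumption based on numpy's promotion rules, not something stated in this thread):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.array([1.0, 2.0], dtype='float32'),
    'b': np.array([3.0, 4.0], dtype='float64'),
})

print(df.dtypes)        # the columns keep their own dtypes inside pandas
print(df.values.dtype)  # .values combines them into one array, upcast to float64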

@jreback
Contributor

jreback commented Apr 11, 2013

Following up... internally, different dtypes are stored separately, so when you ask for .values you are saying: combine all of the data and shove it into a single container capable of holding it. If you try this in numpy you get weird results, so it is a very odd thing to do.

This is how numpy misbehaves (it's really just a printing issue, though):

In [38]: pd.Series([datetime.datetime(2013,1,1) + datetime.timedelta(days= i) for i in range(5) ])
Out[38]: 
0   2013-01-01 00:00:00
1   2013-01-02 00:00:00
2   2013-01-03 00:00:00
3   2013-01-04 00:00:00
4   2013-01-05 00:00:00
dtype: datetime64[ns]

In [39]: pd.Series([datetime.datetime(2013,1,1) + datetime.timedelta(days= i) for i in range(5) ]).values
Out[39]: 
array([1970-01-16 48:00:00, 1970-01-16 72:00:00, 1970-01-16 96:00:00,
       1970-01-16 120:00:00, 1970-01-16 144:00:00], dtype=datetime64[ns])
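
If the raw datetime64[ns] values do need to be turned back into Python datetime objects, a hedged workaround is to route them through a DatetimeIndex (this assumes pd.to_datetime and DatetimeIndex.to_pydatetime behave as they do in later pandas releases):

import datetime
import pandas as pd

s = pd.Series([datetime.datetime(2013, 1, 1) + datetime.timedelta(days=i) for i in range(5)])

# wrap the raw datetime64[ns] array in a DatetimeIndex; its elements are
# Timestamp objects, which do have a .date() method
idx = pd.to_datetime(s.values)
print([ts.date() for ts in idx])

# or convert all the way back to datetime.datetime objects
print(idx.to_pydatetime())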

@jreback
Contributor

jreback commented Apr 11, 2013

Close this?

@mirage007
Author

Yes please, and thanks for the clarification. I'm not online at the moment, if you would do the honors.
