Bug in DataFrame.duplicated() when dealing with datetime64 #1833

Closed
snth opened this issue Sep 3, 2012 · 1 comment
Labels
Bug Datetime Datetime data dtype
Comments

snth commented Sep 3, 2012

The following looks like a bug to me: DataFrame.duplicated() gives different results on what should be identical inputs, depending on whether the column list is passed explicitly. The problem seems to lie with the datetime64 values: in the output of dates.values, the last 4 values are displayed identically to the first 4, and those last 4 rows are the ones flagged as duplicates.

Please see the code below that reproduces the problem:

In [701]: dates = date_range('2010-07-01', end='2010-08-05')

In [702]: tst = DataFrame({'symbol': 'AAA', 'date': dates})

In [703]: tst.tail()
Out[703]: 
                  date symbol
31 2010-08-01 00:00:00    AAA
32 2010-08-02 00:00:00    AAA
33 2010-08-03 00:00:00    AAA
34 2010-08-04 00:00:00    AAA
35 2010-08-05 00:00:00    AAA

In [704]: tst.duplicated().tail()
Out[704]: 
31    False
32    False
33    False
34    False
35    False

In [705]: tst.duplicated(['date', 'symbol']).tail()
Out[705]: 
31    False
32     True
33     True
34     True
35     True

In [706]: dates.values
Out[706]: 
array([1970-01-15 40:00:00, 1970-01-15 64:00:00, 1970-01-15 88:00:00,
       1970-01-15 112:00:00, 1970-01-15 136:00:00, 1970-01-15 160:00:00,
       1970-01-15 184:00:00, 1970-01-15 208:00:00, 1970-01-15 232:00:00,
       1970-01-15 00:00:00, 1970-01-15 24:00:00, 1970-01-15 48:00:00,
       1970-01-15 72:00:00, 1970-01-15 96:00:00, 1970-01-15 120:00:00,
       1970-01-15 144:00:00, 1970-01-15 168:00:00, 1970-01-15 192:00:00,
       1970-01-15 216:00:00, 1970-01-15 240:00:00, 1970-01-15 08:00:00,
       1970-01-15 32:00:00, 1970-01-15 56:00:00, 1970-01-15 80:00:00,
       1970-01-15 104:00:00, 1970-01-15 128:00:00, 1970-01-15 152:00:00,
       1970-01-15 176:00:00, 1970-01-15 200:00:00, 1970-01-15 224:00:00,
       1970-01-15 248:00:00, 1970-01-15 16:00:00, 1970-01-15 40:00:00,
       1970-01-15 64:00:00, 1970-01-15 88:00:00, 1970-01-15 112:00:00], dtype=datetime64[ns])

In [707]: dates
Out[707]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-07-01 00:00:00, ..., 2010-08-05 00:00:00]
Length: 36, Freq: D, Timezone: None

In [708]: dates.dtype
Out[708]: dtype('datetime64[ns]')

In [709]: dates.values.dtype
Out[709]: dtype('datetime64[ns]')

In [710]: sys.version
Out[710]: '2.7.3 (default, Aug  1 2012, 05:14:39) \n[GCC 4.6.3]'

In [711]: np.version.version
Out[711]: '1.6.1'

In [713]: pd.version.version
Out[713]: '0.8.1'

I remember reading somewhere that there are problems with datetime64 in numpy 1.6, but I don't understand what coercions are taking place behind the scenes. Also, could someone please explain why the dates in dates.values above are wrong and how to avoid this? I would appreciate it.
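
For anyone retesting this, here is a minimal sketch of the same check with explicit imports. It assumes a recent pandas and NumPy rather than the 0.8.1 / 1.6.1 versions in the session above, and the names raw, n_unique, dup_all and dup_cols are illustrative, not from the original session. Instead of relying on the printed datetime64 values, it looks at the underlying integers and then compares the two duplicated() calls.

```python
import numpy as np
import pandas as pd

# Same inputs as the session above, with explicit imports.
dates = pd.date_range("2010-07-01", end="2010-08-05")
tst = pd.DataFrame({"symbol": "AAA", "date": dates})

# datetime64 values are stored as an int64 count of time units since the
# epoch (nanoseconds for datetime64[ns]). Viewing them as integers avoids
# depending on how the datetime64 values are printed.
raw = dates.values.view("i8")
n_unique = len(np.unique(raw))

# With and without an explicit column list, the results should agree,
# since both calls consider the same two columns.
dup_all = tst.duplicated()
dup_cols = tst.duplicated(["date", "symbol"])

print(n_unique, dup_all.any(), dup_cols.any())
```

If the integer view shows as many distinct values as there are rows, the dates themselves are distinct, and any duplicates reported by only one of the two calls would point at the comparison rather than at the data.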

@wesm wesm closed this as completed in dc0db65 Sep 9, 2012

wesm commented Sep 9, 2012

Thanks for reporting-- fixed.
