Weird results grouping data by day #1580

Closed · danse opened this issue Jul 7, 2012 · 7 comments
Labels: Bug, Datetime, Groupby

danse commented Jul 7, 2012

Hi, I was feeding pandas (0.8.0rc2) with dates and found some errors. The amounts from the following CSV file are grouped by date, but the sums for some days are wrong:

2011-02-02 resulting: 0 correct: 40
2011-08-21 resulting: 3 correct: 133
2012-10-22 resulting: 157 correct: 27

This is the script I am running:

import sys
import pandas

f=pandas.read_csv(sys.stdin, index_col=1, parse_dates=True)

f.sort()
f=f.resample('D', how='sum')
f['amount'] = f['amount'].fillna(0)

f.to_csv(sys.stdout)

Maybe I'm using the time series methods in the wrong way.

The data file is not too long; it is hosted here: https://raw.github.com/danse/sparkles/master/cleaned.csv

danse commented Jul 7, 2012

This is weird because all the other sums look correct. Notice that the 2011-08-21 date has an entry out of order in the .csv file, which is why I sort the frame before resampling.

For the 2012-02-02 entry, the behaviour is as follows: the correct value (40) is shifted to the next day, and from that day to the end of the file all entries are shifted by one day.
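
In case it helps to reproduce this, here is a quick way to check whether the parsed index is actually in order (just a sketch; it assumes Index.is_monotonic is available in this pandas version):

f = pandas.read_csv('cleaned.csv', index_col=1, parse_dates=True)
print(f.index.is_monotonic)  # False here, because of the out-of-order 2011-08-21 row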

danse commented Jul 7, 2012

I am using Python 3.2.3.

changhiskhan (Contributor) commented

By default, DataFrame.sort is NOT inplace.

Try f.sort().resample('D', how='sum') instead.
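
To spell out the difference (a minimal sketch using the 0.8-era API from this thread; DataFrame.sort was later replaced by sort_index/sort_values):

import pandas

f = pandas.read_csv('cleaned.csv', index_col=1, parse_dates=True)

f.sort()      # returns a new, sorted frame and throws it away; f itself is unchanged
f = f.sort()  # reassign the sorted result...
daily = f.resample('D', how='sum')  # ...or chain the calls: f.sort().resample('D', how='sum')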

wesm commented Jul 11, 2012

This is definitely a bug: the resampling code was not checking for monotonicity (sortedness) in the data. I'm adding a check (and sorting when the index is not monotonic), and this problem goes away.
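
Roughly the kind of guard described here (a hypothetical sketch, not the actual patch in 9a8ad65):

# make sure the index is sorted before binning; sort it if it is not
if not f.index.is_monotonic:
    f = f.sort_index()
result = f.resample('D', how='sum')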

wesm closed this as completed in 9a8ad65 on Jul 11, 2012
danse commented Jul 12, 2012

Thanks, treating the sort as not in place fixed most of the errors, but the shift in the resampling at the same point still remains. I reduced the data file to a few rows. Processing the following rows:

amount,date
370,2012-02-01T00:00:00
0,2012-02-01T22:27:48.633911
20,2012-02-02T13:27:24.828871
20,2012-02-02T21:35:41.482386
20,2012-02-04T12:13:37.426859
20,2012-02-04T21:42:11.164113
200,2012-02-05T18:48:24.171116
50,2012-02-06T19:11:07.339103
18,2012-02-07T20:11:59.232420
21,2012-02-11T09:16:25.231366
40,2012-02-11T09:16:44.850074
35,2012-02-13T00:28:00.666619
40,2012-02-13T20:16:40.714301

With the following script:

import sys
import pandas

f=pandas.read_csv(sys.stdin, index_col=1, parse_dates=True)
f=f.sort().resample('D', how='sum')
f['amount'] = f['amount'].fillna(0)
f.to_csv(sys.stdout)

The result is:

,amount
2012-02-01 00:00:00,370.0
2012-02-02 00:00:00,0.0
2012-02-03 00:00:00,40.0
2012-02-04 00:00:00,0.0
2012-02-05 00:00:00,40.0
2012-02-06 00:00:00,200.0
2012-02-07 00:00:00,50.0
2012-02-08 00:00:00,18.0
2012-02-09 00:00:00,0.0
2012-02-10 00:00:00,0.0
2012-02-11 00:00:00,0.0
2012-02-12 00:00:00,61.0
2012-02-13 00:00:00,0.0
2012-02-14 00:00:00,75.0

So from 2012-02-02 all the sums are shifted by one day. I do not have permission to reopen the issue; should I file a new one?

wesm reopened this on Jul 12, 2012
wesm commented Jul 12, 2012

I think what you're really looking for is one of:

In [9]: f.sort().resample('D', how='sum', kind='period')
Out[9]: 
             amount
01-Feb-2012     370
02-Feb-2012      40
03-Feb-2012     NaN
04-Feb-2012      40
05-Feb-2012     200
06-Feb-2012      50
07-Feb-2012      18
08-Feb-2012     NaN
09-Feb-2012     NaN
10-Feb-2012     NaN
11-Feb-2012      61
12-Feb-2012     NaN
13-Feb-2012      75

or

In [11]: from pandas.tseries.tools import normalize_date

In [12]: f.groupby(normalize_date).sum()
Out[12]: 
            amount
2012-02-01     370
2012-02-02      40
2012-02-04      40
2012-02-05     200
2012-02-06      50
2012-02-07      18
2012-02-11      61
2012-02-13      75

The thing about the resampling algorithm is that it segments the data by bin edges, then has to assign a label to each bin. This actually gives you what you want as timestamps:

In [16]: f.sort().resample('D', how='sum', label='left', closed='left')
Out[16]: 
            amount
2012-02-01     370
2012-02-02      40
2012-02-03     NaN
2012-02-04      40
2012-02-05     200
2012-02-06      50
2012-02-07      18
2012-02-08     NaN
2012-02-09     NaN
2012-02-10     NaN
2012-02-11      61
2012-02-12     NaN
2012-02-13      75

danse commented Jul 12, 2012

Thanks, kind="period" will be perfect. normalize_date is not a solution because I need to fill the gaps, and I prefer kind="period" over label='left', closed='left'. I thought it was a binning issue, but I was misled by the fact that the shift happened at a precise point in the data. On the original data everything is correct using kind="period". Many thanks! :)
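
For the record, putting the pieces together (a sketch assuming the same 0.8-era API used throughout this thread):

import sys
import pandas

f = pandas.read_csv(sys.stdin, index_col=1, parse_dates=True)
f = f.sort().resample('D', how='sum', kind='period')
f['amount'] = f['amount'].fillna(0)  # fill the empty days with zero
f.to_csv(sys.stdout)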

danse closed this as completed on Jul 12, 2012