Weird results grouping data by day #1580

Closed · danse opened this issue Jul 7, 2012 · 7 comments
Labels: Bug, Datetime, Groupby

danse commented Jul 7, 2012

Hi, I was feeding pandas (0.8.0rc2) with dates and found some errors. The amounts from the following CSV file are grouped by date, but the sums for some days are wrong:

2011-02-02 resulting: 0 correct: 40
2011-08-21 resulting: 3 correct: 133
2012-10-22 resulting: 157 correct: 27

This is the script I am running:

import sys
import pandas

f=pandas.read_csv(sys.stdin, index_col=1, parse_dates=True)

f.sort()
f=f.resample('D', how='sum')
f['amount'] = f['amount'].fillna(0)

f.to_csv(sys.stdout)

Maybe I'm using the time series methods in the wrong way.

The data file is not too long; it is hosted here: https://raw.github.com/danse/sparkles/master/cleaned.csv

danse commented Jul 7, 2012

This is weird because all the other sums look correct. Notice that the 2011-08-21 date has an entry out of order in the .csv file, which is why I sort the frame before resampling.

For the 2012-02-02 entry, the behaviour is as follows: the correct value (40) is shifted to the next day, and from that day to the end of the file all entries are shifted by one day.
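
In case it helps to reproduce this, here is a quick way to check whether the parsed index is actually in order (just a sketch; it assumes Index.is_monotonic is available in this pandas version):

f = pandas.read_csv('cleaned.csv', index_col=1, parse_dates=True)
print(f.index.is_monotonic)  # False here, because of the out-of-order 2011-08-21 row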

danse commented Jul 7, 2012

I am using Python 3.2.3.

changhiskhan (Contributor) commented

By default, DataFrame.sort is NOT inplace.

Try f.sort().resample('D', how='sum') instead.
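
To spell out the difference (a minimal sketch using the 0.8-era API from this thread; DataFrame.sort was later replaced by sort_index/sort_values):

import pandas

f = pandas.read_csv('cleaned.csv', index_col=1, parse_dates=True)

f.sort()      # returns a new, sorted frame and throws it away; f itself is unchanged
f = f.sort()  # reassign the sorted result...
daily = f.resample('D', how='sum')  # ...or chain the calls: f.sort().resample('D', how='sum')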

wesm commented Jul 11, 2012

This is definitely a bug: the resampling code was not checking for monotonicity (sortedness) in the data. I'm adding a check (and sorting when the index is not monotonic), and this problem goes away.
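
Roughly the kind of guard described here (a hypothetical sketch, not the actual patch in 9a8ad65):

# make sure the index is sorted before binning; sort it if it is not
if not f.index.is_monotonic:
    f = f.sort_index()
result = f.resample('D', how='sum')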

wesm closed this as completed in 9a8ad65 on Jul 11, 2012
danse commented Jul 12, 2012

Thanks, treating the sort as not in place fixed most of the errors, but the shift in the resampling at the same point still remains. I reduced the data file to a few rows. Processing the following rows:

amount,date
370,2012-02-01T00:00:00
0,2012-02-01T22:27:48.633911
20,2012-02-02T13:27:24.828871
20,2012-02-02T21:35:41.482386
20,2012-02-04T12:13:37.426859
20,2012-02-04T21:42:11.164113
200,2012-02-05T18:48:24.171116
50,2012-02-06T19:11:07.339103
18,2012-02-07T20:11:59.232420
21,2012-02-11T09:16:25.231366
40,2012-02-11T09:16:44.850074
35,2012-02-13T00:28:00.666619
40,2012-02-13T20:16:40.714301

With the following script:

import sys
import pandas

f=pandas.read_csv(sys.stdin, index_col=1, parse_dates=True)
f=f.sort().resample('D', how='sum')
f['amount'] = f['amount'].fillna(0)
f.to_csv(sys.stdout)

The result is:

,amount
2012-02-01 00:00:00,370.0
2012-02-02 00:00:00,0.0
2012-02-03 00:00:00,40.0
2012-02-04 00:00:00,0.0
2012-02-05 00:00:00,40.0
2012-02-06 00:00:00,200.0
2012-02-07 00:00:00,50.0
2012-02-08 00:00:00,18.0
2012-02-09 00:00:00,0.0
2012-02-10 00:00:00,0.0
2012-02-11 00:00:00,0.0
2012-02-12 00:00:00,61.0
2012-02-13 00:00:00,0.0
2012-02-14 00:00:00,75.0

So from 2012-02-02 all the sums are shifted by one day. I do not have permission to reopen the issue; should I file a new one?

wesm reopened this on Jul 12, 2012
wesm commented Jul 12, 2012

I think what you're really looking for is one of:

In [9]: f.sort().resample('D', how='sum', kind='period')
Out[9]: 
             amount
01-Feb-2012     370
02-Feb-2012      40
03-Feb-2012     NaN
04-Feb-2012      40
05-Feb-2012     200
06-Feb-2012      50
07-Feb-2012      18
08-Feb-2012     NaN
09-Feb-2012     NaN
10-Feb-2012     NaN
11-Feb-2012      61
12-Feb-2012     NaN
13-Feb-2012      75

or

In [11]: from pandas.tseries.tools import normalize_date

In [12]: f.groupby(normalize_date).sum()
Out[12]: 
            amount
2012-02-01     370
2012-02-02      40
2012-02-04      40
2012-02-05     200
2012-02-06      50
2012-02-07      18
2012-02-11      61
2012-02-13      75

The thing about the resampling algorithm is that it segments the data by bin edges, then has to assign a label to each bin. This actually gives you what you want as timestamps:

In [16]: f.sort().resample('D', how='sum', label='left', closed='left')
Out[16]: 
            amount
2012-02-01     370
2012-02-02      40
2012-02-03     NaN
2012-02-04      40
2012-02-05     200
2012-02-06      50
2012-02-07      18
2012-02-08     NaN
2012-02-09     NaN
2012-02-10     NaN
2012-02-11      61
2012-02-12     NaN
2012-02-13      75

danse commented Jul 12, 2012

Thanks, kind="period" will be perfect. normalize_date is not a solution because I need to fill the gaps, and I prefer kind="period" over label='left', closed='left'. I thought it was a binning issue, but I was misled by the fact that the shift happened at a precise point in the data. On the original data everything is correct using kind="period". Many thanks! :)
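
For the record, putting the pieces together (a sketch assuming the same 0.8-era API used throughout this thread):

import sys
import pandas

f = pandas.read_csv(sys.stdin, index_col=1, parse_dates=True)
f = f.sort().resample('D', how='sum', kind='period')
f['amount'] = f['amount'].fillna(0)  # fill the empty days with zero
f.to_csv(sys.stdout)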

danse closed this as completed on Jul 12, 2012