Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug/inconsistent result when indexing DataFrame by YY-MM-DD #2306

Closed
pikeas opened this issue Nov 20, 2012 · 10 comments
Closed

Bug/inconsistent result when indexing DataFrame by YY-MM-DD #2306

pikeas opened this issue Nov 20, 2012 · 10 comments
Labels
Milestone

Comments

@pikeas
Copy link

pikeas commented Nov 20, 2012

rng = date_range('2012-01-01', periods=1000, freq='12H')
vals = randn(1000)
df = DataFrame(vals, index=rng)

df.ix['2012'] and df['2012-05'] (selection by year or year and month) work as expected. However:

In [314]: df.ix['2012-01-01']
Out[314]: 
0    0.107819
Name: 2012-01-01 00:00:00

The output from indexing by year/month/day is unexpected - we should receive either all values matching the day or an error (if indexing by day has not been implemented), instead we get only the first match.

@wesm
Copy link
Member

wesm commented Nov 20, 2012

I've noticed this bug as well. I'll have to see how difficult it is to fix

@pikeas
Copy link
Author

pikeas commented Nov 20, 2012

Probably not an issue with string parsing, as I just tried df.ix[datetime(2012, 1, 5)] with the same result.

@pikeas
Copy link
Author

pikeas commented Nov 20, 2012

More info that may help you track it down:

Slicing with df.ix['1/5/2012':'1/6/2012'] works correctly, but df.ix['1/5/2012':'1/5/2012'] shows the same incorrect result.

I then tried `df.ix['1/2012':'1/2012'] to see what the "correct" response for this sort of slice is, and was surprised to see the same bug! Only this time the outputted value seems...a bit random:

In [321]: df2.ix['1/2012':'1/2012']
Out[321]: 
                   A
2012-01-20 -0.079448

Why on earth would slicing from the year to itself return a single value from the middle of the range?

Hope this helps!

Correction: df.ix['1/5/2012':'1/6/2012'] is actually incorrect - it fetches all values for 1/5, but also gets the first value from 1/6.

@pikeas
Copy link
Author

pikeas commented Nov 20, 2012

(Sorry for the comment spam!)

I've just found another, possibly related, bug. Let me know if you'd like me to break this off into a separate GitHub issue.

rng_dates = DatetimeIndex('1/1/2012', '1/2/2012', '1/2/2012', '1/4/2012')
vals = np.arange(4)
df = DataFrame(vals, index=rng_dates)

Here, df.resample('D') works as expected:

Out[337]: 
              0
2012-01-01  1.0
2012-01-02  2.5
2012-01-03  NaN
2012-01-04  4.0

However, watch what happens when we change the datetimes to timestamps:

# Switching formats slightly so the parser doesn't complain
rng_times = DatetimeIndex(['1-1-2012 01:00', '1-2-2012 1:00', '1-2-2012 1:01', '1-4-2012 1:00'])
df = DataFrame(vals, index=rng_times)
df.resample('D')

Produces:

Out[355]: 
              0
2012-01-02  0.0
2012-01-03  1.5
2012-01-04  NaN
2012-01-05  3.0

I thought this might be because the days were aligning to the hour I chose (1:00), so I tried:

rng_times = DatetimeIndex(['1-1-2012 00:00', '1-2-2012 0:00', '1-2-2012 0:01', '1-4-2012 0:00'])
df = DataFrame(vals, index=rng_times)
df.resample('D')

This worked as expected, so I tried one last thing:

rng_times = DatetimeIndex(['1-1-2012 00:00', '1-2-2012 0:00', '1-2-2012 0:01', '1-4-2012 1:00'])
df = DataFrame(vals, index=rng_times)
df.resample('D')

The only change being the last data point, moved from 0:00 to 1:00, but remaining on the same day. And behold:

Out[367]: 
              0
2012-01-01  0.0
2012-01-02  1.5
2012-01-03  NaN
2012-01-04  NaN
2012-01-05  3.0

So...uh...I feel like I've gone down a rabbit hole! Am I in bug central, or am I misunderstanding how pandas date ranges work?

@pikeas
Copy link
Author

pikeas commented Nov 26, 2012

Do you have any recommendations on how to work around this issue, and on when this bug can be expected to trigger?

@changhiskhan
Copy link
Contributor

DataFrame.resample has keywords closed and label both defaulted to 'right'. If you change them to 'left' it should give you the right results for your second example. See docstring for details:

# Switching formats slightly so the parser doesn't complain
rng_times = DatetimeIndex(['1-1-2012 01:00', '1-2-2012 1:00', '1-2-2012 1:01', '1-4-2012 1:00'])
df = DataFrame(vals, index=rng_times)
df.resample('D', closed='left', label='right')
Out[47]: 
              0
2012-01-01  0.0
2012-01-02  1.5
2012-01-03  NaN
2012-01-04  3.0

Now your last example seems to be a bug in the parser. If you examine the index, '1-2-2012 0:01' is actually being parsed as midnight for some reason.

In [54]: rng_times = DatetimeIndex(['1-1-2012 00:00', '1-2-2012 0:00', '1-2-2012 0:01', '1-4-2012 0:00'])

In [55]: rng_times[2]
Out[55]: <Timestamp: 2012-01-02 00:00:00>

We'll put in a fix for this. As a workaround, use pd.to_datetime:

In [56]: rng_times = pd.to_datetime(['1-1-2012 00:00', '1-2-2012 0:00', '1-2-2012 0:01', '1-4-2012 0:00'])

In [57]: rng_times[2]
Out[57]: <Timestamp: 2012-01-02 00:01:00>

@changhiskhan
Copy link
Contributor

Should have broken this into two issues. This commit does not fix original issue but fixes datetime parsing bug in section starting "(Sorry for the comment spam!)"

@changhiskhan
Copy link
Contributor

@pikeas as for the original issue, python dateutil parses '2012-01-01' as a datetime at midnight, so that's why it returns the first entry.

There is special handling inside pandas for YYYY and YYYY-MM type of strings to use as "partial date slices". I'll have to do more digging but I don't think there's an easy way to do that for YYYY-MM-DD that both retains performance and doesn't impact other use cases.

@wesm
Copy link
Member

wesm commented Nov 27, 2012

note: indexing time series with strings is purely convenience-- so we could sacrifice performance for correctness here. The main issue I saw in this is recognizing daily-only time series from intraday time series-- in some sense in this case you might want to check that there are all midnight timestamps and cache that so you know whether to return scalar values or sub-TimeSeries

@changhiskhan
Copy link
Contributor

Oh you know what is_normalized is already cached so we can use that

On Nov 27, 2012, at 11:39 AM, Wes McKinney notifications@github.com wrote:

note: indexing time series with strings is purely convenience-- so we could sacrifice performance for correctness here. The main issue I saw in this is recognizing daily-only time series from intraday time series-- in some sense in this case you might want to check that there are all midnight timestamps and cache that so you know whether to return scalar values or sub-TimeSeries

@wesm wesm closed this as completed Dec 2, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants