Bug/inconsistent result when indexing DataFrame by YY-MM-DD #2306

pikeas · 2012-11-20T21:36:29Z

rng = date_range('2012-01-01', periods=1000, freq='12H')
vals = randn(1000)
df = DataFrame(vals, index=rng)

df.ix['2012'] and df['2012-05'] (selection by year or year and month) work as expected. However:

In [314]: df.ix['2012-01-01']
Out[314]: 
0    0.107819
Name: 2012-01-01 00:00:00

The output from indexing by year/month/day is unexpected - we should receive either all values matching the day or an error (if indexing by day has not been implemented), instead we get only the first match.

The text was updated successfully, but these errors were encountered:

wesm · 2012-11-20T21:40:50Z

I've noticed this bug as well. I'll have to see how difficult it is to fix

pikeas · 2012-11-20T21:42:13Z

Probably not an issue with string parsing, as I just tried df.ix[datetime(2012, 1, 5)] with the same result.

pikeas · 2012-11-20T21:49:09Z

More info that may help you track it down:

Slicing with df.ix['1/5/2012':'1/6/2012'] works correctly, but df.ix['1/5/2012':'1/5/2012'] shows the same incorrect result.

I then tried `df.ix['1/2012':'1/2012'] to see what the "correct" response for this sort of slice is, and was surprised to see the same bug! Only this time the outputted value seems...a bit random:

In [321]: df2.ix['1/2012':'1/2012']
Out[321]: 
                   A
2012-01-20 -0.079448

Why on earth would slicing from the year to itself return a single value from the middle of the range?

Hope this helps!

Correction: df.ix['1/5/2012':'1/6/2012'] is actually incorrect - it fetches all values for 1/5, but also gets the first value from 1/6.

pikeas · 2012-11-20T22:27:31Z

(Sorry for the comment spam!)

I've just found another, possibly related, bug. Let me know if you'd like me to break this off into a separate GitHub issue.

rng_dates = DatetimeIndex('1/1/2012', '1/2/2012', '1/2/2012', '1/4/2012')
vals = np.arange(4)
df = DataFrame(vals, index=rng_dates)

Here, df.resample('D') works as expected:

Out[337]: 
              0
2012-01-01  1.0
2012-01-02  2.5
2012-01-03  NaN
2012-01-04  4.0

However, watch what happens when we change the datetimes to timestamps:

# Switching formats slightly so the parser doesn't complain
rng_times = DatetimeIndex(['1-1-2012 01:00', '1-2-2012 1:00', '1-2-2012 1:01', '1-4-2012 1:00'])
df = DataFrame(vals, index=rng_times)
df.resample('D')

Produces:

Out[355]: 
              0
2012-01-02  0.0
2012-01-03  1.5
2012-01-04  NaN
2012-01-05  3.0

I thought this might be because the days were aligning to the hour I chose (1:00), so I tried:

rng_times = DatetimeIndex(['1-1-2012 00:00', '1-2-2012 0:00', '1-2-2012 0:01', '1-4-2012 0:00'])
df = DataFrame(vals, index=rng_times)
df.resample('D')

This worked as expected, so I tried one last thing:

rng_times = DatetimeIndex(['1-1-2012 00:00', '1-2-2012 0:00', '1-2-2012 0:01', '1-4-2012 1:00'])
df = DataFrame(vals, index=rng_times)
df.resample('D')

The only change being the last data point, moved from 0:00 to 1:00, but remaining on the same day. And behold:

Out[367]: 
              0
2012-01-01  0.0
2012-01-02  1.5
2012-01-03  NaN
2012-01-04  NaN
2012-01-05  3.0

So...uh...I feel like I've gone down a rabbit hole! Am I in bug central, or am I misunderstanding how pandas date ranges work?

pikeas · 2012-11-26T20:53:25Z

Do you have any recommendations on how to work around this issue, and on when this bug can be expected to trigger?

changhiskhan · 2012-11-26T22:42:49Z

DataFrame.resample has keywords closed and label both defaulted to 'right'. If you change them to 'left' it should give you the right results for your second example. See docstring for details:

# Switching formats slightly so the parser doesn't complain
rng_times = DatetimeIndex(['1-1-2012 01:00', '1-2-2012 1:00', '1-2-2012 1:01', '1-4-2012 1:00'])
df = DataFrame(vals, index=rng_times)
df.resample('D', closed='left', label='right')

Out[47]: 
              0
2012-01-01  0.0
2012-01-02  1.5
2012-01-03  NaN
2012-01-04  3.0

Now your last example seems to be a bug in the parser. If you examine the index, '1-2-2012 0:01' is actually being parsed as midnight for some reason.

In [54]: rng_times = DatetimeIndex(['1-1-2012 00:00', '1-2-2012 0:00', '1-2-2012 0:01', '1-4-2012 0:00'])

In [55]: rng_times[2]
Out[55]: <Timestamp: 2012-01-02 00:00:00>

We'll put in a fix for this. As a workaround, use pd.to_datetime:

In [56]: rng_times = pd.to_datetime(['1-1-2012 00:00', '1-2-2012 0:00', '1-2-2012 0:01', '1-4-2012 0:00'])

In [57]: rng_times[2]
Out[57]: <Timestamp: 2012-01-02 00:01:00>

…#2306

changhiskhan · 2012-11-26T23:09:39Z

Should have broken this into two issues. This commit does not fix original issue but fixes datetime parsing bug in section starting "(Sorry for the comment spam!)"

changhiskhan · 2012-11-27T16:34:24Z

@pikeas as for the original issue, python dateutil parses '2012-01-01' as a datetime at midnight, so that's why it returns the first entry.

There is special handling inside pandas for YYYY and YYYY-MM type of strings to use as "partial date slices". I'll have to do more digging but I don't think there's an easy way to do that for YYYY-MM-DD that both retains performance and doesn't impact other use cases.

wesm · 2012-11-27T16:39:20Z

note: indexing time series with strings is purely convenience-- so we could sacrifice performance for correctness here. The main issue I saw in this is recognizing daily-only time series from intraday time series-- in some sense in this case you might want to check that there are all midnight timestamps and cache that so you know whether to return scalar values or sub-TimeSeries

changhiskhan · 2012-11-27T16:45:05Z

Oh you know what is_normalized is already cached so we can use that

On Nov 27, 2012, at 11:39 AM, Wes McKinney notifications@github.com wrote:

note: indexing time series with strings is purely convenience-- so we could sacrifice performance for correctness here. The main issue I saw in this is recognizing daily-only time series from intraday time series-- in some sense in this case you might want to check that there are all midnight timestamps and cache that so you know whether to return scalar values or sub-TimeSeries

changhiskhan added a commit that referenced this issue Nov 26, 2012

BUG: tools.parse_time_string does not parse intraday strings correctly …

b76fe2a

…#2306

wesm closed this as completed Dec 2, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug/inconsistent result when indexing DataFrame by YY-MM-DD #2306

Bug/inconsistent result when indexing DataFrame by YY-MM-DD #2306

pikeas commented Nov 20, 2012

wesm commented Nov 20, 2012

pikeas commented Nov 20, 2012

pikeas commented Nov 20, 2012

pikeas commented Nov 20, 2012

pikeas commented Nov 26, 2012

changhiskhan commented Nov 26, 2012

changhiskhan commented Nov 26, 2012

changhiskhan commented Nov 27, 2012

wesm commented Nov 27, 2012

changhiskhan commented Nov 27, 2012

Bug/inconsistent result when indexing DataFrame by YY-MM-DD #2306

Bug/inconsistent result when indexing DataFrame by YY-MM-DD #2306

Comments

pikeas commented Nov 20, 2012

wesm commented Nov 20, 2012

pikeas commented Nov 20, 2012

pikeas commented Nov 20, 2012

pikeas commented Nov 20, 2012

pikeas commented Nov 26, 2012

changhiskhan commented Nov 26, 2012

changhiskhan commented Nov 26, 2012

changhiskhan commented Nov 27, 2012

wesm commented Nov 27, 2012

changhiskhan commented Nov 27, 2012