Out-by-one error when slicing a TimeSeries with date-strings #3905

dhirschfeld · 2013-06-14T14:50:11Z

...when the hour is not zero.

Example below:

In [7]: series = pd.TimeSeries(numpy.arange(5), pd.date_range('2014-01-01 09:00', periods=5, freq='D'))
   ...: series
   ...: 
Out[7]: 
2014-01-01 09:00:00    0
2014-01-02 09:00:00    1
2014-01-03 09:00:00    2
2014-01-04 09:00:00    3
2014-01-05 09:00:00    4
Freq: D, dtype: int32

In [8]: series[pd.datetime(2014, 1, 2):pd.datetime(2014, 1, 4)]
Out[8]: 
2014-01-02 09:00:00    1
2014-01-03 09:00:00    2
Freq: D, dtype: int32

In [9]: series['02-Jan-2014':'04-Jan-2014']
Out[9]: 
2014-01-02 09:00:00    1
2014-01-03 09:00:00    2
2014-01-04 09:00:00    3   <------------------------- ???
Freq: D, dtype: int32

In [10]:

jreback · 2013-06-14T15:19:28Z

@dhirschfeld I believe the string slicing is correct here, while the datetime object slicing is off by one. Labels by definition are INCLUSIVE of the endpoints, while integer are not. Since slicing a datetimeindex is never 'integer' based, I think it is appropraite to always include endpoints.

Though a case could be made for the reverse if a frequency is defined.

What do you think?

dhirschfeld · 2013-06-14T15:30:27Z

Since each datetime in the index has hour = 9, the end date of 04-Jan-2014 00:00 isn't included in the index and is in fact strictly less than the last included date 04-Jan-2014 09:00 of the returned slice.

i.e. the returned slice has included a value which is 9 hours after my specified end point!

If I had instead specified 04-Jan-2014 09:00 as the end point then I would expect the value 3 to be returned because of the inclusive nature of pandas' label based indexing.

@jreback, surely this is a bug?

jreback · 2013-06-14T15:36:06Z

this issue added this, IOW, the last day is included, regardless of the time: #3546

jreback · 2013-06-14T15:38:15Z

@dhirschfeld when does the last day end (11:59:59.9999999) ?

your example implies it should end exactly 1 day before

dhirschfeld · 2013-06-14T16:22:58Z

IMHO that's exactly the problem with having an inclusive end point on a continuously defined interval - there is no inclusive end point as there is an infinite number of reals between any two reals.

That's beside the point in this example though as 04-Jan-2014 09:00 is strictly greater than 04-Jan-2014 00:00. Given that, if you ask for all dates less than or equal to 04-Jan-2014 00:00 you would not expect to get 04-Jan-2014 09:00.

Perhaps the misunderstanding is that I'm interpreting the string '04-Jan-2014' to mean the instant in time pd.datetime(2014, 1, 4, 0, 0, 0) (which I'd argue is the correct interpretation)?

jreback · 2013-06-14T16:27:44Z

but that's exactly the problem

the string '04-jan-2014' is ambiguous

you think it should not include anything on that day
but that is not convention

if you really mean the Instant in time then you need to be more specific

'04-jan-2014 00:00:00' and that why the datetime example does what u want, it IS specific

dhirschfeld · 2013-06-17T10:51:07Z

I'd argue that it is a strongly held convention that any unspecified hours/minutes/seconds are equal to zero. It holds in Excel, dateutil, numpy and even in the rest of pandas:

In [1]: from dateutil.parser import parser

In [2]: parser().parse('04-Jan-2014')
Out[2]: datetime.datetime(2014, 1, 4, 0, 0)

In [3]: np.datetime64('2014-01-04').astype(pd.datetime)
Out[3]: datetime.date(2014, 1, 4)

In [4]: pd.DatetimeIndex(['01-Jan-2014'])[0]
Out[4]: <Timestamp: 2014-01-01 00:00:00>

Looking at #3546 I can see the logic - it's handy to be able to get a whole period out without having to explicitly specify the start and end points of the slice e.g. series['2014'].

I'm not sure that this utility extends to slices though - it seems extremely surprising and far too magical to me to have a date string in a slice interpreted as anything other than a datetime instance as would be parsed by pandas itself in other circumstances.

I'll post a link on the mailing list to solicit other feedback as it could be that I'm in the minority by being surprised by the current behaviour. If so it may still be worthwhile to have a note in the documentation.

jreback · 2013-06-17T12:07:42Z

you can easily have your slice interpreted exactly as you want, either string or datetime, just be specific (to seconds level).

I don't think is all that magical and is quite useful; we are debating whether if you specifiy

'2014-01-01' whether that should include times on the day; I think its way more unexepected not to do this.

Your examples are not on point as they are single points in time, not slices, which is exactly how this is interpreted.

jreback · 2013-06-17T23:44:53Z

@dhirschfeld I put up #3939, please take a look and let me know if this makes sense (I know we disagree about the actual semantics, but I am documenting the current behavior, if it is changed, these will change too)

dhirschfeld · 2013-06-21T16:10:45Z

Sorry, I'm mostly offline for the next week and haven't had a chance to really look at it properly.

My interpretation of the current behaviour is that slicing with date strings works differently to slicing with date objects. When slicing with date strings the both the start and end dates are treated as period objects whose frequency is determined by the "accuracy" of the passed date object. e.g. if 2013-04 is passed in it is treated as a period with a monthly frequency whereas 2013-04-01 would be treated as a period with a daily frequency.

Treating the slice arguments as periods has the important consequence that because the end point is inclusive all timestamps falling within the last period will be included in the slice. e.g. the slice series['01-Apr-2013':'04-Apr-2013'] will include all values which occurred on 04-Apr-2013, even one which occurred at 04-Apr-2013 23:59:59

jreback · 2013-06-21T17:54:49Z

I had put this section up in the main docs:
http://pandas.pydata.org/pandas-docs/dev/timeseries.html#partial-string-indexing

your interpretation is exactly right....date strings are different than datetime objects

I think I should include an example using datetime objects (as well), as well as some of your language from above

dhirschfeld · 2013-07-01T15:21:15Z

Great - with the new documentation I'm happy for this issue to be closed.

If you were going to expand the description further it may even deserve a mention or link from the gotchas section.

jreback · 2013-07-01T15:43:48Z

great...thanks for the discussion!

jreback mentioned this issue Jun 17, 2013

DOC: expand section on partial string indexing of timeseries #3938

Closed

jreback closed this as completed Jul 1, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Out-by-one error when slicing a TimeSeries with date-strings #3905

Out-by-one error when slicing a TimeSeries with date-strings #3905

dhirschfeld commented Jun 14, 2013

jreback commented Jun 14, 2013

dhirschfeld commented Jun 14, 2013

jreback commented Jun 14, 2013

jreback commented Jun 14, 2013

dhirschfeld commented Jun 14, 2013

jreback commented Jun 14, 2013

dhirschfeld commented Jun 17, 2013

jreback commented Jun 17, 2013

jreback commented Jun 17, 2013

dhirschfeld commented Jun 21, 2013

jreback commented Jun 21, 2013

dhirschfeld commented Jul 1, 2013

jreback commented Jul 1, 2013

Out-by-one error when slicing a TimeSeries with date-strings #3905

Out-by-one error when slicing a TimeSeries with date-strings #3905

Comments

dhirschfeld commented Jun 14, 2013

jreback commented Jun 14, 2013

dhirschfeld commented Jun 14, 2013

jreback commented Jun 14, 2013

jreback commented Jun 14, 2013

dhirschfeld commented Jun 14, 2013

jreback commented Jun 14, 2013

dhirschfeld commented Jun 17, 2013

jreback commented Jun 17, 2013

jreback commented Jun 17, 2013

dhirschfeld commented Jun 21, 2013

jreback commented Jun 21, 2013

dhirschfeld commented Jul 1, 2013

jreback commented Jul 1, 2013