Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out-by-one error when slicing a TimeSeries with date-strings #3905

Closed
dhirschfeld opened this issue Jun 14, 2013 · 13 comments
Closed

Out-by-one error when slicing a TimeSeries with date-strings #3905

dhirschfeld opened this issue Jun 14, 2013 · 13 comments
Labels
API Design Datetime Datetime data dtype
Milestone

Comments

@dhirschfeld
Copy link
Contributor

...when the hour is not zero.

Example below:

In [7]: series = pd.TimeSeries(numpy.arange(5), pd.date_range('2014-01-01 09:00', periods=5, freq='D'))
   ...: series
   ...: 
Out[7]: 
2014-01-01 09:00:00    0
2014-01-02 09:00:00    1
2014-01-03 09:00:00    2
2014-01-04 09:00:00    3
2014-01-05 09:00:00    4
Freq: D, dtype: int32

In [8]: series[pd.datetime(2014, 1, 2):pd.datetime(2014, 1, 4)]
Out[8]: 
2014-01-02 09:00:00    1
2014-01-03 09:00:00    2
Freq: D, dtype: int32

In [9]: series['02-Jan-2014':'04-Jan-2014']
Out[9]: 
2014-01-02 09:00:00    1
2014-01-03 09:00:00    2
2014-01-04 09:00:00    3   <------------------------- ???
Freq: D, dtype: int32

In [10]: 
@jreback
Copy link
Contributor

jreback commented Jun 14, 2013

@dhirschfeld I believe the string slicing is correct here, while the datetime object slicing is off by one. Labels by definition are INCLUSIVE of the endpoints, while integer are not. Since slicing a datetimeindex is never 'integer' based, I think it is appropraite to always include endpoints.

Though a case could be made for the reverse if a frequency is defined.

What do you think?

@dhirschfeld
Copy link
Contributor Author

Since each datetime in the index has hour = 9, the end date of 04-Jan-2014 00:00 isn't included in the index and is in fact strictly less than the last included date 04-Jan-2014 09:00 of the returned slice.

i.e. the returned slice has included a value which is 9 hours after my specified end point!

If I had instead specified 04-Jan-2014 09:00 as the end point then I would expect the value 3 to be returned because of the inclusive nature of pandas' label based indexing.

@jreback, surely this is a bug?

@jreback
Copy link
Contributor

jreback commented Jun 14, 2013

this issue added this, IOW, the last day is included, regardless of the time: #3546

@jreback
Copy link
Contributor

jreback commented Jun 14, 2013

@dhirschfeld when does the last day end (11:59:59.9999999) ?

your example implies it should end exactly 1 day before

@dhirschfeld
Copy link
Contributor Author

IMHO that's exactly the problem with having an inclusive end point on a continuously defined interval - there is no inclusive end point as there is an infinite number of reals between any two reals.

That's beside the point in this example though as 04-Jan-2014 09:00 is strictly greater than 04-Jan-2014 00:00. Given that, if you ask for all dates less than or equal to 04-Jan-2014 00:00 you would not expect to get 04-Jan-2014 09:00.

Perhaps the misunderstanding is that I'm interpreting the string '04-Jan-2014' to mean the instant in time pd.datetime(2014, 1, 4, 0, 0, 0) (which I'd argue is the correct interpretation)?

@jreback
Copy link
Contributor

jreback commented Jun 14, 2013

but that's exactly the problem

the string '04-jan-2014' is ambiguous

you think it should not include anything on that day
but that is not convention

if you really mean the Instant in time then you need to be more specific

'04-jan-2014 00:00:00' and that why the datetime example does what u want, it IS specific

@dhirschfeld
Copy link
Contributor Author

I'd argue that it is a strongly held convention that any unspecified hours/minutes/seconds are equal to zero. It holds in Excel, dateutil, numpy and even in the rest of pandas:

In [1]: from dateutil.parser import parser

In [2]: parser().parse('04-Jan-2014')
Out[2]: datetime.datetime(2014, 1, 4, 0, 0)

In [3]: np.datetime64('2014-01-04').astype(pd.datetime)
Out[3]: datetime.date(2014, 1, 4)

In [4]: pd.DatetimeIndex(['01-Jan-2014'])[0]
Out[4]: <Timestamp: 2014-01-01 00:00:00>

Looking at #3546 I can see the logic - it's handy to be able to get a whole period out without having to explicitly specify the start and end points of the slice e.g. series['2014'].

I'm not sure that this utility extends to slices though - it seems extremely surprising and far too magical to me to have a date string in a slice interpreted as anything other than a datetime instance as would be parsed by pandas itself in other circumstances.

I'll post a link on the mailing list to solicit other feedback as it could be that I'm in the minority by being surprised by the current behaviour. If so it may still be worthwhile to have a note in the documentation.

@jreback
Copy link
Contributor

jreback commented Jun 17, 2013

you can easily have your slice interpreted exactly as you want, either string or datetime, just be specific (to seconds level).

I don't think is all that magical and is quite useful; we are debating whether if you specifiy

'2014-01-01' whether that should include times on the day; I think its way more unexepected not to do this.

Your examples are not on point as they are single points in time, not slices, which is exactly how this is interpreted.

@jreback
Copy link
Contributor

jreback commented Jun 17, 2013

@dhirschfeld I put up #3939, please take a look and let me know if this makes sense (I know we disagree about the actual semantics, but I am documenting the current behavior, if it is changed, these will change too)

@dhirschfeld
Copy link
Contributor Author

Sorry, I'm mostly offline for the next week and haven't had a chance to really look at it properly.

My interpretation of the current behaviour is that slicing with date strings works differently to slicing with date objects. When slicing with date strings the both the start and end dates are treated as period objects whose frequency is determined by the "accuracy" of the passed date object. e.g. if 2013-04 is passed in it is treated as a period with a monthly frequency whereas 2013-04-01 would be treated as a period with a daily frequency.

Treating the slice arguments as periods has the important consequence that because the end point is inclusive all timestamps falling within the last period will be included in the slice. e.g. the slice series['01-Apr-2013':'04-Apr-2013'] will include all values which occurred on 04-Apr-2013, even one which occurred at 04-Apr-2013 23:59:59

@jreback
Copy link
Contributor

jreback commented Jun 21, 2013

I had put this section up in the main docs:
http://pandas.pydata.org/pandas-docs/dev/timeseries.html#partial-string-indexing

your interpretation is exactly right....date strings are different than datetime objects

I think I should include an example using datetime objects (as well), as well as some of your language from above

@dhirschfeld
Copy link
Contributor Author

Great - with the new documentation I'm happy for this issue to be closed.

If you were going to expand the description further it may even deserve a mention or link from the gotchas section.

@jreback
Copy link
Contributor

jreback commented Jul 1, 2013

great...thanks for the discussion!

@jreback jreback closed this as completed Jul 1, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Datetime Datetime data dtype
Projects
None yet
Development

No branches or pull requests

2 participants