Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: handling of invalid datetimes in constructors #26853

Open
jorisvandenbossche opened this issue Jun 14, 2019 · 8 comments
Open

API: handling of invalid datetimes in constructors #26853

jorisvandenbossche opened this issue Jun 14, 2019 · 8 comments
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Datetime Datetime data dtype

Comments

@jorisvandenbossche
Copy link
Member

Triggered by the regression reported in #26206 and my attempt to fix it (#26848), I looked a bit into how our different constructors handle different cases of invalid datetimes.

Out of bound datetimes (cases: a non-ns unit numpy datetime64 array, datetime.datetime objects, strings):

M8[D] out of bound datetime.datetime out of bound list of M8[D] scalars out of bound string out of bound
dateframe datetime64[ns] 1842-11-22 0 ... object object object
datetimeindex OutOfBoundsDatetime OutOfBoundsDatetime OutOfBoundsDatetime OutOfBoundsDatetime
index OutOfBoundsDatetime object object object
index (explicit dtype) OutOfBoundsDatetime OutOfBoundsDatetime OutOfBoundsDatetime OutOfBoundsDatetime
pd.to_datetime OutOfBoundsDatetime OutOfBoundsDatetime OutOfBoundsDatetime OutOfBoundsDatetime
series datetime64[ns] 1842-11-22 0 ... object object object
series (explicit dtype) datetime64[ns] 1842-11-22 0 ... datetime64[ns] 1842-11-22 0 ... datetime64[ns] 1842-11-22 0 ... datetime64[ns] 1842-11-22 0 ...

Some remarks here:

  • The above table is for master (and the same for 0.24). However, on pandas 0.22, the case of out of bounds datetime64 array raised an error in all cases (so also for Series(..), that's REGR? no error anymore when converting out of bounds datetime64[non-ns] data #26206).
  • For to_datetime, all the OutOfBounds are expected, as the default is to error.
  • For the ones with a specified dtype (either explicit dtype passed or DatetimeIndex), we also expect an error. So DatetimeIndex and Index(.., dtype='M8[ns]') are fine, but Series(.., dtype='M8[ns]) is clearly broken (gives an incorrect date for all cases)
  • The inconsistencies are mainly when no dtype is enforced. In some cases (eg for datetime.datetime objects) we return a object dtype Index/Series, in other cases we raise an error.

The last item is the main question: when inferring (no dtype enforced) and in case the data are clearly timestamps (either numpy datetime64 or datetime.datetime), should we raise an error or return object dtype.

Note that for datetime.datetime, returning object dtype might be more logical as it means returning the input as is. While for datetime64[non-ns], it actually means converting a numpy dtype to object data.

@mroeschke
Copy link
Member

I agree with your suggestion that datetime.datetime should coerce to an object dtype. And I think non-ns numpy.datetime64 should raise an OutOfBoundsDatetime in any constructor since we already use the numpy.datetime64 dtype system and don't really support non-ns resolutions (wouldn't want to make an exception here).

@mroeschke mroeschke added API Design Bug Datetime Datetime data dtype labels Jun 14, 2019
@jorisvandenbossche
Copy link
Member Author

And I think non-ns numpy.datetime64 should raise an OutOfBoundsDatetime in any constructor since we already use the numpy.datetime64 dtype system

Which means: handling out of bound datetime64 differently as out of bound datetime.datetime (I don't have a strong opinion here, we just need to make a choice)
For in-bounds datetime64 that is not ns unit, we do however convert to datetime64[ns], so we do support those to some extent (meaning: in input). The main difference that might warrant this difference is that we can't store the datetime64[non-ns] 'as is', while for out of bounds datetime.datetime we can (as those are already object dtype).

@mroeschke
Copy link
Member

The main difference that might warrant this difference is that we can't store the datetime64[non-ns] 'as is', while for out of bounds datetime.datetime we can (as those are already object dtype).

Yes this is the rational that for my opinion. Any thoughts @jbrockmendel

@jbrockmendel
Copy link
Member

I agree that this is the best we can do under the circumstances.

@jorisvandenbossche
Copy link
Member Author

@jreback thoughts here? On the PR I think you argued for converting to object, also for out of bound non-ns datetime64

@jreback
Copy link
Contributor

jreback commented Jun 15, 2019

i would agree that datetime.datetime should be object as indicated above (agree with everyone)

slightl preference for converting np.datetime to datetime.datetime (and handling as object)
but this would be a bit of work to implement as we can then consistently handle all inputs.

I don’t like raising in the Series / Dataframe constructors

bottom line though willing to go along with consensus (unless the above is easy)

@jorisvandenbossche
Copy link
Member Author

I am not sure the one or the other behaviour is necessarily harder to implement. I think we should rather decide on what behaviour we want.

Are there other numpy dtypes we cannot represent? (to check what we do there?) I don't think there are? (except for np.str which we convert to object, but that's a bit a special case).

@jorisvandenbossche
Copy link
Member Author

Another inconsistency I noticed (but another case, handling of iNaT, thus the integer number):

In [66]: pd.Series([datetime(2001, 1, 2, 0, 0), pd._libs.tslib.iNaT], dtype='M8[ns]')
Out[66]: 
0   2001-01-02
1          NaT
dtype: datetime64[ns]

In [68]: pd.Index([datetime(2001, 1, 2, 0, 0), pd._libs.tslib.iNaT], dtype='M8[ns]') 
...
ValueError: mixed datetimes and integers in passed array

In [69]: pd.to_datetime([datetime(2001, 1, 2, 0, 0), pd._libs.tslib.iNaT])
...
ValueError: mixed datetimes and integers in passed array

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors Datetime Datetime data dtype
Projects
None yet
Development

No branches or pull requests

4 participants