Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: catch out-of-bounds datetime64 in Series/DataFrame constructor #26848

Conversation

jorisvandenbossche
Copy link
Member

TODO: Need to add a whatsnew and test for DataFrame.

Question:

I am now letting this raise an error. But, an alternative could be to return an object array (by simply removing the except OutOfBoundsDatetime: raise, as the code below that fall backs to object array if no dtype is specified.

This would be more consistent with if it is not an array but list of scalars:

In [5]: a = np.array(['2262-04-12'], dtype='datetime64[s]')

In [6]: pd.Series(a)   # now raising on this PR
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-04-12 00:00:00

In [7]: pd.Series([a[0]])    # already falling back to object array on master
Out[7]: 
0    2262-04-12T00:00:00
dtype: object

That is somewhat the general logic of the _try_cast. Of course, if you specify the required dtype explicitly, you should still get an error (which is actually also a bug on master, I see now).

However, that would be inconsistent with the current pd.Index behaviour, which raises when passing out of bounds datetime64 (but also returns object dtype with list of datetime64 scalars)

cc @jbrockmendel @mroeschke

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version Datetime Datetime data dtype labels Jun 14, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone Jun 14, 2019
@codecov
Copy link

codecov bot commented Jun 14, 2019

Codecov Report

Merging #26848 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26848      +/-   ##
==========================================
- Coverage   91.97%   91.97%   -0.01%     
==========================================
  Files         180      180              
  Lines       50756    50760       +4     
==========================================
+ Hits        46685    46686       +1     
- Misses       4071     4074       +3
Flag Coverage Δ
#multiple 90.56% <100%> (ø) ⬆️
#single 41.84% <33.33%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 90.56% <100%> (+0.03%) ⬆️
pandas/core/internals/construction.py 95.95% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.89% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.94% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9b081d...369eead. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 14, 2019

Codecov Report

Merging #26848 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26848      +/-   ##
==========================================
- Coverage   91.86%   91.85%   -0.01%     
==========================================
  Files         179      179              
  Lines       50700    50703       +3     
==========================================
- Hits        46576    46575       -1     
- Misses       4124     4128       +4
Flag Coverage Δ
#multiple 90.45% <100%> (ø) ⬆️
#single 41.12% <40%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 90.53% <100%> (ø) ⬆️
pandas/core/internals/construction.py 95.96% <100%> (+0.03%) ⬆️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8374a1d...c5fc3c6. Read the comment docs.

@jorisvandenbossche
Copy link
Member Author

not sure how this is inconsistent

Because we have been doing the opposite for some releases?

Before 0.23.0, all of Index, Series and DataFrame constructors have been consistently raising an error when out-of-bounds datetime64 array was passed, even if no explicit dtype was specified.

So yes, I consider converting to object an option (this is why I raised the question in the top post), but it's not necessarily the "only correct" way. For example, for Index, this would be a change in behaviour.

@jreback
Copy link
Contributor

jreback commented Jun 14, 2019

Before 0.23.0, all of Index, Series and DataFrame constructors have been consistently raising an error when out-of-bounds datetime64 array was passed, even if no explicit dtype was specified.

this is likely a bug / regression

Series constructor was designed to not raise and rather coerce

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 14, 2019

Series constructor was designed to not raise and rather coerce

I suppose you don't mean "coerce" here in meaning of to_datetime, as that is returning NaT ?

@jorisvandenbossche
Copy link
Member Author

@jreback looked at some different cases and made a general issue here: #26853

def test_constructor_datetime64_outofbound(self):
# GH-26206 out of bound non-ns unit
with pytest.raises(pd.errors.OutOfBoundsDatetime):
pd.Series(np.array(['2262-04-12'], dtype='datetime64[D]'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also test the minimum date as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's relying on a general routine that does this bound checking and which is already tested thoroughly through other means, here I mainly want to test to ensure this routine is used. Therefore, I decided to keep the tests here a bit simpler (adding is not that hard, but it just complicates the test further, and there are already several cases). Is that OK?

@pep8speaks
Copy link

pep8speaks commented Jun 17, 2019

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-21 06:16:22 UTC

pandas/core/dtypes/cast.py Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code change looks good (ideally we want to simplify that code, but that is non-trivial / out-of scope for this). can you add a whatsnew note (I think single line is ok, not sure we need a sub-section to explain, though ok for that too)

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 18, 2019

One question about the tests: I want to add the exact same for DataFrame and Index constructors. I can copy an edited version to the respective testing files for DataFrame and Index, but on the other hand, I think there is also benefit in keeping the test together (less duplication of the boiler plate around it, could do a parametrize). Do we have a place to put general constructor tests that check for consistency? (because doing it here in the series tests is also a bit out of place)

@jreback
Copy link
Contributor

jreback commented Jun 18, 2019

One question about the tests: I want to add the exact same for DataFrame and Index constructors. I can copy an edited version to the respective testing files for DataFrame and Index, but on the other hand, I think there is also benefit in keeping the test together (less duplication of the boiler plate around it). Do we have a place to put general constructor tests that check for consistency? (because doing it here in the series tests is also a bit out of place)

we have 2 mechanisms for this, neither complete solutions. test_generic you can do Series/DataFrame and test_base you can do Series/Index.

generally for this as its really the Series you are fixing is to put it in test_base and then copy a similar example to the appropriate palace in frame tests.

@jorisvandenbossche
Copy link
Member Author

My last commit does an example parametrize of Index/DataFrame/Series

generally for this as its really the Series you are fixing

yes, but that also fixes DataFrame (with passing a dict) as it is using the same code. For Index it is indeed only adding tests to ensure consistency.

@jorisvandenbossche
Copy link
Member Author

Will move to test_base, that seems most appropriate (it indeed already has some tests that checks consistency across Series, Index)

@jorisvandenbossche
Copy link
Member Author

What both the dataframe and series fix have in common is that the code (where the fixes are applied) is now centralized in construction.py. So it might make sense to also have a test_construction.py at some point.

@jorisvandenbossche
Copy link
Member Author

This should be ready. I opened a new issue for the xfail DataFrame case that I added, as this is a more general issue that we are doing unsafe astype: #26919

@jreback
Copy link
Contributor

jreback commented Jun 21, 2019

pls merge master. lgtm. merge on green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants