ENH/WIP: resolution inference in pd.to_datetime, DatetimeIndex (#55901)
* ENH: read_stata return non-nano

* GH ref

* move whatsnew

* remove outdated whatsnew

* ENH: read_stata return non-nano

* avoid Series.view

* dont go through Series

* TST: dt64 units

* BUG: cut with non-nano

* BUG: round with non-nanosecond raising OverflowError

* woops

* BUG: cut with non-nano

* TST: parametrize tests over dt64 unit

* xfail non-nano

* revert

* BUG: mixed-type mixed-timezone/awareness

* commit so i can unstash something else i hope

* ENH: infer resolution in to_datetime, DatetimeIndex

* revert commented-out

* revert commented-out

* revert commented-out

* remove commented-out

* remove comment

* revert unnecessary

* revert unnecessary

* fix window tests

* Fix resample tests

* restore comment

* revert unnecessary

* remove no-longer necessary

* revert no-longer-necessary

* revert no-longer-necessary

* update tests

* revert no-longer-necessary

* update tests

* revert bits

* update tests

* cleanup

* revert

* revert

* parametrize over unit

* update tests

* update tests

* revert no-longer-needed

* revert no-longer-necessary

* revert no-longer-necessary

* revert no-longer-necessary

* revert no-longer-necessary

* Revert no-longer-necessary

* update test

* update test

* simplify

* update tests

* update tests

* update tests

* revert no-longer-necessary

* post-merge fixup

* revert no-longer-necessary

* update tests

* update test

* update tests

* update tests

* remove commented-out

* revert no-longer-necessary

* as_unit->astype

* cleanup

* merge fixup

* revert bit

* revert no-longer-necessary, xfail

* update multithread test

* update tests

* update doctest

* update tests

* update doctests

* update tests

* update db tests

* troubleshoot db tests

* update test

* troubleshoot sql tests

* update test

* update tests

* mypy fixup

* Update test

* kludge test

* update test

* update for min-version tests

* fix adbc check

* troubleshoot minimum version deps

* troubleshoot

* troubleshoot

* troubleshoot

* whatsnew

* update abdc-driver-postgresql minimum version

* update doctest

* fix doc example

* troubleshoot test_api_custom_dateparsing_error

* troubleshoot

* troubleshoot

* troubleshoot

* troubleshoot

* troubleshoot

* troubleshoot

* update exp instead of object cast

* revert accidental

* simplify test
jbrockmendel authored May 31, 2024
1 parent a2a78d3 commit 2ea036f
Showing 77 changed files with 745 additions and 457 deletions.
63 changes: 63 additions & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -124,6 +124,69 @@ notable_bug_fix2
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_300.api_breaking.datetime_resolution_inference:

Datetime resolution inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Converting a sequence of strings, ``datetime`` objects, or ``np.datetime64`` objects to
a ``datetime64`` dtype now performs inference on the appropriate resolution (AKA unit) for the output dtype. This affects :class:`Series`, :class:`DataFrame`, :class:`Index`, :class:`DatetimeIndex`, and :func:`to_datetime`.

Previously, these would always give nanosecond resolution:

.. code-block:: ipython

    In [1]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()

    In [2]: pd.to_datetime([dt]).dtype
    Out[2]: dtype('<M8[ns]')

    In [3]: pd.Index([dt]).dtype
    Out[3]: dtype('<M8[ns]')

    In [4]: pd.DatetimeIndex([dt]).dtype
    Out[4]: dtype('<M8[ns]')

    In [5]: pd.Series([dt]).dtype
    Out[5]: dtype('<M8[ns]')

This now infers the microsecond unit ``"us"`` from the pydatetime object, matching the scalar :class:`Timestamp` behavior.

.. ipython:: python

    dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()
    pd.to_datetime([dt]).dtype
    pd.Index([dt]).dtype
    pd.DatetimeIndex([dt]).dtype
    pd.Series([dt]).dtype

Similarly, when passed a sequence of ``np.datetime64`` objects, the resolution of the passed objects is retained (or, for lower-than-second resolutions, second resolution is used).
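The retention rule described above can be sketched in plain Python. This is a hypothetical helper for illustration only, not the pandas implementation (the real logic lives in ``get_supported_reso`` in ``pandas._libs.tslibs.dtypes``); the unit strings follow NumPy's ``datetime64`` codes:

```python
# The four resolutions pandas supports for datetime64 data.
SUPPORTED_UNITS = ("s", "ms", "us", "ns")


def retained_unit(numpy_unit: str) -> str:
    """Sketch of the retention rule: supported units pass through
    unchanged, while units coarser than a second (Y, M, W, D, h, m)
    are bumped up to second resolution."""
    return numpy_unit if numpy_unit in SUPPORTED_UNITS else "s"


print(retained_unit("ms"))  # ms (retained as-is)
print(retained_unit("D"))   # s  (daily data gets second resolution)
```

For example, an array of ``np.datetime64[D]`` values would produce a ``datetime64[s]`` dtype under this rule, while ``np.datetime64[ms]`` values keep millisecond resolution.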

When passing strings, the resolution will depend on the precision of the string, again matching the :class:`Timestamp` behavior. Previously:

.. code-block:: ipython

    In [2]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
    Out[2]: dtype('<M8[ns]')

    In [3]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
    Out[3]: dtype('<M8[ns]')

    In [4]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
    Out[4]: dtype('<M8[ns]')

    In [5]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
    Out[5]: dtype('<M8[ns]')

The inferred resolution now matches that of the input strings:

.. ipython:: python

    pd.to_datetime(["2024-03-22 11:43:01"]).dtype
    pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
    pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
    pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
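The string rule can be sketched with a hypothetical helper that counts fractional-second digits, mirroring the thresholds in ``_parse_with_format`` in ``pandas/_libs/tslibs/strptime.pyx``. This is a simplified illustration, not the pandas parser: it assumes a well-formed timestamp string with no timezone offset.

```python
def unit_from_string(ts: str) -> str:
    """Infer a datetime64 unit from a timestamp string's precision."""
    # No fractional seconds -> second resolution.
    if "." not in ts:
        return "s"
    ndigits = len(ts.rsplit(".", 1)[1])
    # Thresholds mirror strptime: 1-3 digits -> ms, 4-6 -> us, 7-9 -> ns.
    if ndigits <= 3:
        return "ms"
    if ndigits <= 6:
        return "us"
    return "ns"


print(unit_from_string("2024-03-22 11:43:01"))            # s
print(unit_from_string("2024-03-22 11:43:01.002"))        # ms
print(unit_from_string("2024-03-22 11:43:01.002003"))     # us
print(unit_from_string("2024-03-22 11:43:01.002003004"))  # ns
```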

In cases with mixed-resolution inputs, the highest resolution is used:

.. code-block:: ipython

    In [2]: pd.to_datetime([pd.Timestamp("2024-03-22 11:43:01"), "2024-03-22 11:43:01.002"]).dtype
    Out[2]: dtype('<M8[ns]')

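Picking the common resolution for mixed inputs can be sketched as taking the finest unit present, so that no input value loses precision. This is a hypothetical helper for illustration, not the pandas internals:

```python
# Rank units from coarsest to finest; the finest unit present wins,
# so no input value loses precision when all are cast to one dtype.
UNIT_RANK = {"s": 0, "ms": 1, "us": 2, "ns": 3}


def common_unit(units: list[str]) -> str:
    """Return the finest resolution among the inferred per-element units."""
    return max(units, key=UNIT_RANK.__getitem__)


print(common_unit(["s", "ms"]))        # ms
print(common_unit(["us", "s", "ns"]))  # ns
```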
.. _whatsnew_300.api_breaking.deps:

Increased minimum versions for dependencies
10 changes: 2 additions & 8 deletions pandas/_libs/lib.pyx
@@ -96,16 +96,12 @@ from pandas._libs.missing cimport (
is_null_datetime64,
is_null_timedelta64,
)
from pandas._libs.tslibs.conversion cimport (
_TSObject,
convert_to_tsobject,
)
from pandas._libs.tslibs.conversion cimport convert_to_tsobject
from pandas._libs.tslibs.nattype cimport (
NPY_NAT,
c_NaT as NaT,
checknull_with_nat,
)
from pandas._libs.tslibs.np_datetime cimport NPY_FR_ns
from pandas._libs.tslibs.offsets cimport is_offset_object
from pandas._libs.tslibs.period cimport is_period_object
from pandas._libs.tslibs.timedeltas cimport convert_to_timedelta64
@@ -2497,7 +2493,6 @@ def maybe_convert_objects(ndarray[object] objects,
ndarray[uint8_t] mask
Seen seen = Seen()
object val
_TSObject tsobj
float64_t fnan = NaN

if dtype_if_all_nat is not None:
@@ -2604,8 +2599,7 @@ def maybe_convert_objects(ndarray[object] objects,
else:
seen.datetime_ = True
try:
tsobj = convert_to_tsobject(val, None, None, 0, 0)
tsobj.ensure_reso(NPY_FR_ns)
convert_to_tsobject(val, None, None, 0, 0)
except OutOfBoundsDatetime:
# e.g. test_out_of_s_bounds_datetime64
seen.object_ = True
17 changes: 11 additions & 6 deletions pandas/_libs/tslib.pyx
@@ -63,7 +63,10 @@ from pandas._libs.tslibs.conversion cimport (
get_datetime64_nanos,
parse_pydatetime,
)
from pandas._libs.tslibs.dtypes cimport npy_unit_to_abbrev
from pandas._libs.tslibs.dtypes cimport (
get_supported_reso,
npy_unit_to_abbrev,
)
from pandas._libs.tslibs.nattype cimport (
NPY_NAT,
c_nat_strings as nat_strings,
@@ -260,7 +263,7 @@ cpdef array_to_datetime(
bint dayfirst=False,
bint yearfirst=False,
bint utc=False,
NPY_DATETIMEUNIT creso=NPY_FR_ns,
NPY_DATETIMEUNIT creso=NPY_DATETIMEUNIT.NPY_FR_GENERIC,
str unit_for_numerics=None,
):
"""
@@ -288,8 +291,8 @@ cpdef array_to_datetime(
yearfirst parsing behavior when encountering datetime strings
utc : bool, default False
indicator whether the dates should be UTC
creso : NPY_DATETIMEUNIT, default NPY_FR_ns
Set to NPY_FR_GENERIC to infer a resolution.
creso : NPY_DATETIMEUNIT, default NPY_FR_GENERIC
If NPY_FR_GENERIC, conduct inference.
unit_for_numerics : str, default "ns"
Returns
@@ -389,7 +392,7 @@ cpdef array_to_datetime(
# GH#32264 np.str_ object
val = str(val)

if parse_today_now(val, &iresult[i], utc, creso):
if parse_today_now(val, &iresult[i], utc, creso, infer_reso=infer_reso):
# We can't _quite_ dispatch this to convert_str_to_tsobject
# bc there isn't a nice way to pass "utc"
item_reso = NPY_DATETIMEUNIT.NPY_FR_us
@@ -533,7 +536,9 @@ def array_to_datetime_with_tz(
if state.creso_ever_changed:
# We encountered mismatched resolutions, need to re-parse with
# the correct one.
return array_to_datetime_with_tz(values, tz=tz, creso=creso)
return array_to_datetime_with_tz(
values, tz=tz, dayfirst=dayfirst, yearfirst=yearfirst, creso=creso
)
elif creso == NPY_DATETIMEUNIT.NPY_FR_GENERIC:
# i.e. we never encountered anything non-NaT, default to "s". This
# ensures that insert and concat-like operations with NaT
6 changes: 3 additions & 3 deletions pandas/_libs/tslibs/strptime.pyx
@@ -354,7 +354,7 @@ def array_strptime(
bint exact=True,
errors="raise",
bint utc=False,
NPY_DATETIMEUNIT creso=NPY_FR_ns,
NPY_DATETIMEUNIT creso=NPY_DATETIMEUNIT.NPY_FR_GENERIC,
):
"""
Calculates the datetime structs represented by the passed array of strings
@@ -365,7 +365,7 @@
fmt : string-like regex
exact : matches must be exact if True, search if False
errors : string specifying error handling, {'raise', 'coerce'}
creso : NPY_DATETIMEUNIT, default NPY_FR_ns
creso : NPY_DATETIMEUNIT, default NPY_FR_GENERIC
Set to NPY_FR_GENERIC to infer a resolution.
"""

@@ -712,7 +712,7 @@ cdef tzinfo _parse_with_format(
elif len(s) <= 6:
item_reso[0] = NPY_DATETIMEUNIT.NPY_FR_us
else:
item_reso[0] = NPY_DATETIMEUNIT.NPY_FR_ns
item_reso[0] = NPY_FR_ns
# Pad to always return nanoseconds
s += "0" * (9 - len(s))
us = int(s)
8 changes: 5 additions & 3 deletions pandas/core/algorithms.py
@@ -346,14 +346,15 @@ def unique(values):
array([2, 1])
>>> pd.unique(pd.Series([pd.Timestamp("20160101"), pd.Timestamp("20160101")]))
array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
array(['2016-01-01T00:00:00'], dtype='datetime64[s]')
>>> pd.unique(
... pd.Series(
... [
... pd.Timestamp("20160101", tz="US/Eastern"),
... pd.Timestamp("20160101", tz="US/Eastern"),
... ]
... ],
... dtype="M8[ns, US/Eastern]",
... )
... )
<DatetimeArray>
@@ -365,7 +366,8 @@ def unique(values):
... [
... pd.Timestamp("20160101", tz="US/Eastern"),
... pd.Timestamp("20160101", tz="US/Eastern"),
... ]
... ],
... dtype="M8[ns, US/Eastern]",
... )
... )
DatetimeIndex(['2016-01-01 00:00:00-05:00'],
12 changes: 6 additions & 6 deletions pandas/core/arrays/datetimelike.py
@@ -1849,11 +1849,11 @@ def strftime(self, date_format: str) -> npt.NDArray[np.object_]:
>>> rng_tz.floor("2h", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
dtype='datetime64[s, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2h", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
dtype='datetime64[s, Europe/Amsterdam]', freq=None)
"""

_floor_example = """>>> rng.floor('h')
@@ -1876,11 +1876,11 @@ def strftime(self, date_format: str) -> npt.NDArray[np.object_]:
>>> rng_tz.floor("2h", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
dtype='datetime64[s, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2h", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
dtype='datetime64[s, Europe/Amsterdam]', freq=None)
"""

_ceil_example = """>>> rng.ceil('h')
@@ -1903,11 +1903,11 @@ def strftime(self, date_format: str) -> npt.NDArray[np.object_]:
>>> rng_tz.ceil("h", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
dtype='datetime64[s, Europe/Amsterdam]', freq=None)
>>> rng_tz.ceil("h", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
dtype='datetime64[s, Europe/Amsterdam]', freq=None)
"""


35 changes: 18 additions & 17 deletions pandas/core/arrays/datetimes.py
@@ -218,7 +218,7 @@ class DatetimeArray(dtl.TimelikeOps, dtl.DatelikeOps):  # type: ignore[misc]
... )
<DatetimeArray>
['2023-01-01 00:00:00', '2023-01-02 00:00:00']
Length: 2, dtype: datetime64[ns]
Length: 2, dtype: datetime64[s]
"""

_typ = "datetimearray"
@@ -613,7 +613,7 @@ def tz(self) -> tzinfo | None:
>>> s
0 2020-01-01 10:00:00+00:00
1 2020-02-01 11:00:00+00:00
dtype: datetime64[ns, UTC]
dtype: datetime64[s, UTC]
>>> s.dt.tz
datetime.timezone.utc
@@ -1047,7 +1047,7 @@ def tz_localize(
4 2018-10-28 02:30:00+01:00
5 2018-10-28 03:00:00+01:00
6 2018-10-28 03:30:00+01:00
dtype: datetime64[ns, CET]
dtype: datetime64[s, CET]
In some cases, inferring the DST is impossible. In such cases, you can
pass an ndarray to the ambiguous parameter to set the DST explicitly
@@ -1059,14 +1059,14 @@
0 2018-10-28 01:20:00+02:00
1 2018-10-28 02:36:00+02:00
2 2018-10-28 03:46:00+01:00
dtype: datetime64[ns, CET]
dtype: datetime64[s, CET]
If the DST transition causes nonexistent times, you can shift these
dates forward or backwards with a timedelta object or `'shift_forward'`
or `'shift_backwards'`.
>>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00',
... '2015-03-29 03:30:00']))
... '2015-03-29 03:30:00'], dtype="M8[ns]"))
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
0 2015-03-29 03:00:00+02:00
1 2015-03-29 03:30:00+02:00
@@ -1427,7 +1427,7 @@ def time(self) -> npt.NDArray[np.object_]:
>>> s
0 2020-01-01 10:00:00+00:00
1 2020-02-01 11:00:00+00:00
dtype: datetime64[ns, UTC]
dtype: datetime64[s, UTC]
>>> s.dt.time
0 10:00:00
1 11:00:00
@@ -1470,7 +1470,7 @@ def timetz(self) -> npt.NDArray[np.object_]:
>>> s
0 2020-01-01 10:00:00+00:00
1 2020-02-01 11:00:00+00:00
dtype: datetime64[ns, UTC]
dtype: datetime64[s, UTC]
>>> s.dt.timetz
0 10:00:00+00:00
1 11:00:00+00:00
@@ -1512,7 +1512,7 @@ def date(self) -> npt.NDArray[np.object_]:
>>> s
0 2020-01-01 10:00:00+00:00
1 2020-02-01 11:00:00+00:00
dtype: datetime64[ns, UTC]
dtype: datetime64[s, UTC]
>>> s.dt.date
0 2020-01-01
1 2020-02-01
@@ -1861,7 +1861,7 @@ def isocalendar(self) -> DataFrame:
>>> s
0 2020-01-01 10:00:00+00:00
1 2020-02-01 11:00:00+00:00
dtype: datetime64[ns, UTC]
dtype: datetime64[s, UTC]
>>> s.dt.dayofyear
0 1
1 32
@@ -1897,7 +1897,7 @@ def isocalendar(self) -> DataFrame:
>>> s
0 2020-01-01 10:00:00+00:00
1 2020-04-01 11:00:00+00:00
dtype: datetime64[ns, UTC]
dtype: datetime64[s, UTC]
>>> s.dt.quarter
0 1
1 2
@@ -1933,7 +1933,7 @@ def isocalendar(self) -> DataFrame:
>>> s
0 2020-01-01 10:00:00+00:00
1 2020-02-01 11:00:00+00:00
dtype: datetime64[ns, UTC]
dtype: datetime64[s, UTC]
>>> s.dt.daysinmonth
0 31
1 29
@@ -2372,9 +2372,9 @@ def _sequence_to_dt64(
data, copy = maybe_convert_dtype(data, copy, tz=tz)
data_dtype = getattr(data, "dtype", None)

if out_unit is None:
out_unit = "ns"
out_dtype = np.dtype(f"M8[{out_unit}]")
out_dtype = DT64NS_DTYPE
if out_unit is not None:
out_dtype = np.dtype(f"M8[{out_unit}]")

if data_dtype == object or is_string_dtype(data_dtype):
# TODO: We do not have tests specific to string-dtypes,
@@ -2400,7 +2400,7 @@
dayfirst=dayfirst,
yearfirst=yearfirst,
allow_object=False,
out_unit=out_unit or "ns",
out_unit=out_unit,
)
copy = False
if tz and inferred_tz:
@@ -2508,7 +2508,7 @@ def objects_to_datetime64(
utc: bool = False,
errors: DateTimeErrorChoices = "raise",
allow_object: bool = False,
out_unit: str = "ns",
out_unit: str | None = None,
) -> tuple[np.ndarray, tzinfo | None]:
"""
Convert data to array of timestamps.
@@ -2524,7 +2524,8 @@
allow_object : bool
Whether to return an object-dtype ndarray instead of raising if the
data contains more than one timezone.
out_unit : str, default "ns"
out_unit : str or None, default None
None indicates we should do resolution inference.
Returns
-------
