BUG: In 1.5rc0 casting Series to datetime64 with specific but non-nanosecond units has no effect #48574

Open
3 tasks done
zaneselvans opened this issue Sep 15, 2022 · 7 comments
Labels
Astype · Bug · Non-Nano (datetime64/timedelta64 with non-nanosecond resolution) · Regression (functionality that used to work in a prior pandas version)

Comments

@zaneselvans

zaneselvans commented Sep 15, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas.testing import assert_series_equal
print(f"{pd.__version__=}")

ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")

# Passes on pandas 1.4.4
# Fails on pandas 1.5.0rc0
# Fails on pandas 1.6.0.dev0+136.gbbf17ea692
assert_series_equal(
    ser.astype("datetime64[Y]"),
    pd.Series(["2000-01-01", "2000-01-01"], dtype="datetime64[ns]")
)

Issue Description

In pandas 1.4.4 and earlier, casting a datetime64[ns] Series to datetime64 with a non-nanosecond unit changed the contents of the Series to match that unit, even though the resulting Series still had datetime64[ns] as its type. As of 1.5rc0 this behavior seems to have changed: casting to datetime64 with other units no longer has any effect.

Based on comments in PR #48555 (closing issue #47844) referring to the numpy unit conversion it seems like this might not be the intended behavior, and it's a breaking change (we were relying on this behavior to turn month-start dates into the corresponding year-start dates). Snippet from that PR:

        elif (
            self.tz is None
            and is_datetime64_dtype(dtype)
            and dtype != self.dtype
            and is_unitless(dtype)
        ):
            # TODO(2.0): just fall through to dtl.DatetimeLikeArrayMixin.astype
            warnings.warn(
                "Passing unit-less datetime64 dtype to .astype is deprecated "
                "and will raise in a future version. Pass 'datetime64[ns]' instead",
                FutureWarning,
                stacklevel=find_stack_level(inspect.currentframe()),
            )
            # unit conversion e.g. datetime64[s]
            return self._ndarray.astype(dtype)
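
For readers unfamiliar with the distinction the snippet draws, a unit-less `datetime64` dtype differs from one with an explicit unit such as `datetime64[ns]` or `datetime64[Y]`. The sketch below inspects that difference with NumPy's `np.datetime_data`. It only approximates what pandas' internal `is_unitless` helper might check, and the name `has_explicit_unit` is made up for illustration.

```python
import numpy as np


def has_explicit_unit(dtype) -> bool:
    """Return True if a datetime64 dtype carries a unit such as 'ns' or 'Y'.

    Hypothetical helper for illustration; not part of pandas or NumPy.
    """
    # np.datetime_data returns a (unit, count) tuple; the unit of a
    # unit-less datetime64 dtype is reported as "generic".
    unit, _count = np.datetime_data(np.dtype(dtype))
    return unit != "generic"


print(has_explicit_unit("datetime64"))      # the unit-less case the warning targets
print(has_explicit_unit("datetime64[ns]"))
print(has_explicit_unit("datetime64[Y]"))
```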

Expected Behavior

I expected the dates in the series to be adjusted to be consistent with the frequency of the datetime64 type used in astype(), as illustrated in the example above.

Installed Versions

INSTALLED VERSIONS

commit : bbf17ea
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-47-generic
Version : #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.6.0.dev0+136.gbbf17ea692
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.3.0
pip : 22.2.2
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@zaneselvans added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Sep 15, 2022
zaneselvans added a commit to catalyst-cooperative/pudl that referenced this issue Sep 16, 2022
This is a (small) hack that works around changes in how pandas deals
with casting Series to `datetime64` types that have units larger than
seconds. We depended on the previous behavior. Not sure if it's
something that will get fixed in pandas, but I made this issue and it's
a breaking change, so hopefully:

pandas-dev/pandas#48574
@phofl
Member

phofl commented Sep 19, 2022

cc @jbrockmendel thoughts here?

@jbrockmendel
Member

Can fix for 1.5.1. For 2.0 we'll support [s, ms, us, ns], and I think astype to anything else will raise. Long-term the user should use .floor, I think.
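
As a rough illustration of the `.floor` suggestion, here is a minimal sketch using `Series.dt.floor` with fixed frequencies. The sample timestamps are made up for the example and are not taken from the issue.

```python
import pandas as pd

# Made-up sample data with a time-of-day component.
ser = pd.Series(pd.to_datetime(["2000-02-02 13:45:30", "2000-03-03 23:59:59"]))

# Fixed frequencies such as days and minutes can be floored directly.
floored_day = ser.dt.floor("D")
floored_minute = ser.dt.floor("min")

print(floored_day)
print(floored_minute)
```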

@phofl added this to the 1.5.1 milestone Sep 19, 2022
@phofl added the Regression (functionality that used to work in a prior pandas version) and Non-Nano (datetime64/timedelta64 with non-nanosecond resolution) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label Sep 19, 2022
@zaneselvans
Author

Messing around with this a bit, Series.dt.floor() works well for Day, Hour, & Minute frequencies, but it doesn't allow snapping to the start of years or months, so now I'm pulling out the numpy array, casting it to another frequency, and constructing a new Series. It seems like there must be a better way I'm not aware of.

import numpy as np
import pandas as pd

df = (
    pd.DataFrame()
    .assign(
        hourly=pd.Series(np.arange("2000-01-01", "2003-01-01", dtype="datetime64[h]")),
        daily=lambda x: x.hourly.dt.floor("D"),
        monthly=lambda x: pd.Series(x["hourly"].to_numpy().astype("datetime64[M]")),
        yearly=lambda x: pd.Series(x["hourly"].to_numpy().astype("datetime64[Y]")),
    )
)

df.sample(10)
                   hourly      daily    monthly     yearly
5280  2000-08-08 00:00:00 2000-08-08 2000-08-01 2000-01-01
6378  2000-09-22 18:00:00 2000-09-22 2000-09-01 2000-01-01
18764 2002-02-20 20:00:00 2002-02-20 2002-02-01 2002-01-01
25739 2002-12-08 11:00:00 2002-12-08 2002-12-01 2002-01-01
8285  2000-12-11 05:00:00 2000-12-11 2000-12-01 2000-01-01
16114 2001-11-02 10:00:00 2001-11-02 2001-11-01 2001-01-01
15167 2001-09-23 23:00:00 2001-09-23 2001-09-01 2001-01-01
3424  2000-05-22 16:00:00 2000-05-22 2000-05-01 2000-01-01
19729 2002-04-02 01:00:00 2002-04-02 2002-04-01 2002-01-01
3148  2000-05-11 04:00:00 2000-05-11 2000-05-01 2000-01-01
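
One possible alternative that stays within pandas is a round trip through periods with `Series.dt.to_period` and `Series.dt.to_timestamp`. This is not something proposed in the thread, just a sketch of an option that may suit the month-start and year-start cases. The sample values reuse a few rows from the table above.

```python
import pandas as pd

ser = pd.Series(
    pd.to_datetime(["2000-08-08 00:00", "2001-11-02 10:00", "2002-12-08 11:00"])
)

# Converting to a monthly or yearly period drops the finer-grained detail;
# to_timestamp() then returns the start of each period by default.
month_start = ser.dt.to_period("M").dt.to_timestamp()
year_start = ser.dt.to_period("Y").dt.to_timestamp()

print(pd.DataFrame({"hourly": ser, "monthly": month_start, "yearly": year_start}))
```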

@jbrockmendel
Member

but it doesn't allow snapping to the start of years or months

Good catch. I think we'd need something like pd.offsets.YearBegin().rollback(obj) but for arrays instead of scalars, xref #7449.

For 1.5.0 your best bet is, like you've found, doing the astype directly on the underlying numpy arrays.

For 1.5.1 we can restore that behavior.
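
The workaround of casting the underlying NumPy array can be wrapped in a small helper. The sketch below assumes a timezone-naive Series; the name `truncate_to_unit` is hypothetical. It casts back to `datetime64[ns]` explicitly so the result does not depend on how a given pandas version handles coarser units.

```python
import pandas as pd


def truncate_to_unit(ser: pd.Series, unit: str) -> pd.Series:
    """Truncate a tz-naive datetime Series to a NumPy datetime64 unit.

    Hypothetical helper for illustration; `unit` is a NumPy unit code
    such as "Y" or "M".
    """
    values = ser.to_numpy(dtype="datetime64[ns]")
    # NumPy drops everything finer than `unit`; casting back to ns
    # yields the start of each year, month, and so on.
    truncated = values.astype(f"datetime64[{unit}]").astype("datetime64[ns]")
    return pd.Series(truncated, index=ser.index, name=ser.name)


ser = pd.Series(["2000-02-02", "2000-03-03"], dtype="datetime64[ns]")
print(truncate_to_unit(ser, "Y"))
print(truncate_to_unit(ser, "M"))
```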

@datapythonista modified the milestones: 1.5.1, 1.5.2 Oct 20, 2022
@datapythonista modified the milestones: 1.5.2, 1.5.3 Nov 15, 2022
@datapythonista modified the milestones: 1.5.3, 1.5.4 Jan 18, 2023
@datapythonista modified the milestones: 1.5.4, 2.0 Feb 27, 2023
@MarcoGorelli modified the milestones: 2.0, 2.1 Mar 27, 2023
@MarcoGorelli
Member

moving off 2.0 as ser.astype("datetime64[Y]") raises now anyway

@lingyielia

Is there a plan to restore ser.astype("datetime64[Y]") in future versions? Or will it always raise?

@MarcoGorelli
Member

hey @lingyielia - this will continue to raise

are you trying to floor to the beginning of the year? If so, you could do

ser + pd.tseries.offsets.YearBegin() - pd.tseries.offsets.YearBegin()

In the future, it should be possible to do pd.tseries.offsets.YearBegin.rollbackward(ser)

@lithomas1 removed this from the 2.1 milestone Aug 22, 2023