subtracting CFTimeIndex can cause pd.TimedeltaIndex to overflow #3535

Closed
mathause opened this issue Nov 14, 2019 · 2 comments · Fixed by #3543

Comments

@mathause
Collaborator

mathause commented Nov 14, 2019

MCVE Code Sample

import xarray as xr
i1 = xr.cftime_range("4991-01-01", periods=1)
i2 = xr.cftime_range("7190-12-31", periods=1)
i2 - i1

Expected Output

a timedelta

Problem Description

The subtraction raises OverflowError: Python int too large to convert to C long. I originally stumbled upon this when trying to open files from a long simulation (piControl) with open_mfdataset. I have not yet figured out where this subtraction happens in open_mfdataset. (Opening the files individually and combining them with xr.concat works.)

The offending line is here:

return pd.TimedeltaIndex(np.array(self) - np.array(other))

Ultimately this is probably a pandas problem, as it tries to convert datetime.timedelta(days=803532) to '<m8[ns]'. pd.TimedeltaIndex has an (undocumented) dtype argument, but I was not able to make anything else work (e.g. '<m8[D]').
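
To make the size of the problem concrete, here is a minimal standard-library sketch of the arithmetic. It assumes only that a '<m8[ns]' value is stored as a signed 64-bit count of nanoseconds; the names INT64_MAX, NS_PER_DAY, span_ns and max_days are illustrative and not part of xarray or pandas.

```python
import datetime

# Assumption: '<m8[ns]' stores a signed 64-bit integer count of nanoseconds.
INT64_MAX = 2**63 - 1
NS_PER_DAY = 86_400 * 10**9

# The timedelta from the report above.
span = datetime.timedelta(days=803532)

# Nanoseconds needed to represent the span (Python ints do not overflow).
span_ns = span.days * NS_PER_DAY

# Largest whole number of days that fits into int64 nanoseconds.
max_days = INT64_MAX // NS_PER_DAY

print(max_days)             # 106751, i.e. roughly 292 years
print(span_ns > INT64_MAX)  # True: the span cannot be stored as int64 ns
```

So any pair of dates more than about 292 years apart would be expected to hit this limit once the difference is cast to nanosecond resolution.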

@spencerkclark

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp151.28.25-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.14.0+44.g4dce93f1
pandas: 0.25.2
numpy: 1.17.3
scipy: 1.3.1
netCDF4: 1.5.0.1
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.6.0
distributed: 2.6.0
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: 0.9.0
numbagg: None
setuptools: 41.4.0
pip: 19.3.1
conda: None
pytest: 5.2.2
IPython: 7.9.0
sphinx: 2.2.1

@mathause
Collaborator Author

mathause commented Nov 15, 2019

This happens in xr.combine_by_coords. Note that the OverflowError is "ignored in: 'pandas._libs.algos.are_diff'", so xr.combine_by_coords can return a wrong dataset (although this does not happen silently):

import xarray as xr
i1 = xr.cftime_range("4500-12-31", periods=1)
i2 = xr.cftime_range("4600-12-31", periods=1)
i3 = xr.cftime_range("5100-12-31", periods=1)

d1 = xr.DataArray([0], dims=("time", ), coords={"time": ("time", i1)}).to_dataset(name="a")
d2 = xr.DataArray([1], dims=("time", ), coords={"time": ("time", i2)}).to_dataset(name="a")
d3 = xr.DataArray([2], dims=("time", ), coords={"time": ("time", i3)}).to_dataset(name="a")

xr.combine_by_coords([d1, d2, d3]).time

returns:

<xarray.DataArray 'time' (time: 2)>
array([cftime.DatetimeGregorian(4500-12-31 00:00:00),
       cftime.DatetimeGregorian(5100-12-31 00:00:00)], dtype=object)
Coordinates:
  * time     (time) object 4500-12-31 00:00:00 5100-12-31 00:00:00

Note how d2 is missing.

Within xr.combine_by_coords the error happens here:

rank = series.rank(method="dense", ascending=ascending)

import pandas as pd

indexes = [i1, i2, i3]
# every index here is monotonically increasing, so ascending is True
ascending = True

# the code from _infer_concat_order_from_coords
first_items = pd.Index([index.take([0]) for index in indexes])

series = first_items.to_series()
rank = series.rank(method="dense", ascending=ascending)
order = rank.astype(int).values - 1

order
>>> array([0, 1, 1])

This causes the second item to be dropped.
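
One plausible way a repeated position leads to a missing dataset can be sketched without xarray. This is an assumption about the mechanism, not a copy of xarray's code: it supposes that each dataset is stored in a mapping keyed by its inferred position, so a repeated position overwrites the earlier entry. The names `names` and `combined` are hypothetical.

```python
# The order reported above; positions 1 and 1 collide.
order = [0, 1, 1]
names = ["d1", "d2", "d3"]  # stand-ins for the three datasets

# Assumption: datasets are stored in a mapping keyed by their position tuple.
combined = {}
for position, name in zip(order, names):
    combined[(position,)] = name  # a repeated key overwrites the earlier entry

print(combined)  # {(0,): 'd1', (1,): 'd3'}
```

Under that assumption, d3 replaces d2 at position 1, which is consistent with the two-element time coordinate shown earlier.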

@spencerkclark
Member

Thanks for raising this issue @mathause. In hindsight this does not surprise me. Pandas's strict use of nanosecond-resolution datetimes and timedeltas was part of the motivation for the CFTimeIndex. While convenient, because it allows us to re-use code already written in pandas, holding the result of the difference between two CFTimeIndexes in a TimedeltaIndex clearly prevents us from taking the difference between distant dates.

Perhaps a more robust (yet more complex) solution for #2484 would be to write a version of a TimedeltaIndex that does not internally cast the timedeltas to type np.timedelta64[ns], and rather leaves them as datetime.timedelta objects, which are the actual result of subtracting two sequences of cftime.datetime objects.

Regarding the combine_by_coords issue, though, there might be an easier fix. Is there a reason that first_items is an Index of length-one Indexes? It's not clear to me why that needs to be the case.

first_items = pd.Index([index.take([0]) for index in indexes])

It appears that if we just select the first value of each index (i.e. a cftime.datetime object in this example), e.g.

first_items = pd.Index([index[0] for index in indexes])

pandas's rank method works properly and combine_by_coords produces the correct result:

>>> xr.combine_by_coords([d1, d2, d3]).time
<xarray.DataArray 'time' (time: 3)>
array([cftime.DatetimeGregorian(4500, 12, 31, 0, 0, 0, 0, 4, 365),
       cftime.DatetimeGregorian(4600, 12, 31, 0, 0, 0, 0, 2, 365),
       cftime.DatetimeGregorian(5100, 12, 31, 0, 0, 0, 0, 0, 365)],
      dtype=object)
Coordinates:
  * time     (time) object 4500-12-31 00:00:00 ... 5100-12-31 00:00:00
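
The idea of ranking scalar first items can be illustrated with a pure-Python stand-in. This is only a sketch: it uses datetime.datetime objects in place of cftime.datetime objects, and the hypothetical helper dense_rank_order imitates a dense, ascending rank rather than calling pandas. Comparing scalar date objects needs no conversion to nanoseconds, so distant dates are not a problem.

```python
import datetime

# Stand-ins for index[0] of each CFTimeIndex in the example above.
first_items = [
    datetime.datetime(4500, 12, 31),
    datetime.datetime(4600, 12, 31),
    datetime.datetime(5100, 12, 31),
]


def dense_rank_order(values):
    """Return zero-based dense ranks (ascending); equal values share a rank."""
    distinct = sorted(set(values))
    position = {value: i for i, value in enumerate(distinct)}
    return [position[value] for value in values]


order = dense_rank_order(first_items)
print(order)  # [0, 1, 2]
```

Each first item gets a distinct position, so no dataset would be dropped.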
