subtracting CFTimeIndex can cause pd.TimedeltaIndex to overflow #3535

mathause · 2019-11-14T18:45:03Z

MCVE Code Sample

import xarray as xr
i1 = xr.cftime_range("4991-01-01", periods=1)
i2 = xr.cftime_range("7190-12-31", periods=1)
i2 - i1

Expected Output

a timedelta

Problem Description

The subtraction raises OverflowError: Python int too large to convert to C long. I originally stumbled upon this when trying to open files from a long simulation (piControl) with open_mfdataset. I have not yet figured out where this subtraction happens in open_mfdataset. (Opening the files individually and combining them with xr.concat works.)

The offending line is here:

xarray/xarray/coding/cftimeindex.py

Line 433 in 40588dc

return pd.TimedeltaIndex(np.array(self) - np.array(other))

Ultimately this is probably a pandas problem, as it tries to convert datetime.timedelta(days=803532) to '<m8[ns]'. pd.TimedeltaIndex has an (undocumented) dtype argument, but I was not able to make anything else work (e.g. '<m8[D]').
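To put numbers on the overflow: a nanosecond-resolution timedelta stored in a signed 64-bit integer can span only about 292 years. The sketch below checks this with the standard library alone. It uses datetime.datetime as a stand-in for the cftime dates in the example above (an assumption made for illustration), and the constant names are made up for this sketch.

```python
from datetime import datetime

INT64_MAX = 2**63 - 1            # largest value a signed 64-bit integer can hold
NS_PER_DAY = 86_400 * 10**9      # nanoseconds in one day

# Stand-in for i2 - i1 from the example above, using stdlib datetimes.
delta = datetime(7190, 12, 31) - datetime(4991, 1, 1)
delta_ns = delta.days * NS_PER_DAY

print(delta.days)                 # 803532
print(delta_ns > INT64_MAX)       # True: too large for int64 nanoseconds
print(INT64_MAX // NS_PER_DAY)    # 106751 whole days, roughly 292 years
```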

@spencerkclark

Output of `xr.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp151.28.25-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.14.0+44.g4dce93f1
pandas: 0.25.2
numpy: 1.17.3
scipy: 1.3.1
netCDF4: 1.5.0.1
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.6.0
distributed: 2.6.0
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: 0.9.0
numbagg: None
setuptools: 41.4.0
pip: 19.3.1
conda: None
pytest: 5.2.2
IPython: 7.9.0
sphinx: 2.2.1


mathause · 2019-11-15T11:05:44Z

This happens in xr.combine_by_coords. Note that the OverflowError is "ignored in: 'pandas._libs.algos.are_diff'". So xr.combine_by_coords can return a wrong dataset (although this does not happen silently):

import xarray as xr
i1 = xr.cftime_range("4500-12-31", periods=1)
i2 = xr.cftime_range("4600-12-31", periods=1)
i3 = xr.cftime_range("5100-12-31", periods=1)

d1 = xr.DataArray([0], dims=("time", ), coords={"time": ("time", i1)}).to_dataset(name="a")
d2 = xr.DataArray([1], dims=("time", ), coords={"time": ("time", i2)}).to_dataset(name="a")
d3 = xr.DataArray([2], dims=("time", ), coords={"time": ("time", i3)}).to_dataset(name="a")

xr.combine_by_coords([d1, d2, d3]).time

returns:

<xarray.DataArray 'time' (time: 2)>
array([cftime.DatetimeGregorian(4500-12-31 00:00:00),
       cftime.DatetimeGregorian(5100-12-31 00:00:00)], dtype=object)
Coordinates:
  * time     (time) object 4500-12-31 00:00:00 5100-12-31 00:00:00

Note how d2 is missing.

Within xr.combine_by_coords the error happens here:

xarray/xarray/core/combine.py

Line 98 in 7b4a286

rank = series.rank(method="dense", ascending=ascending)

import pandas as pd

indexes = [i1, i2, i3]
ascending = True  # set here because each index is monotonically increasing

# the code from _infer_concat_order_from_coords
first_items = pd.Index([index.take([0]) for index in indexes])

series = first_items.to_series()
rank = series.rank(method="dense", ascending=ascending)
order = rank.astype(int).values - 1

order
>>> array([0, 1, 1])

This causes the second item to be dropped.

spencerkclark · 2019-11-16T15:39:23Z

Thanks for raising this issue @mathause. In hindsight this does not surprise me. Pandas's strict use of nanosecond-resolution datetimes and timedeltas was part of the motivation for the CFTimeIndex. While convenient, because it allows us to re-use code already written in pandas, holding the result of the difference between two CFTimeIndexes in a TimedeltaIndex clearly prevents us from taking the difference between distant dates.

Perhaps a more robust (yet more complex) solution for #2484 would be to write a version of a TimedeltaIndex that does not internally cast the timedeltas to type np.timedelta64[ns], and rather leaves them as datetime.timedelta objects, which are the actual result of subtracting two sequences of cftime.datetime objects.
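As a rough illustration of that idea (not an actual index implementation): element-wise subtraction that leaves the results as datetime.timedelta objects has no trouble with distant dates. The helper name subtract_as_timedeltas is hypothetical, and stdlib datetime objects stand in for cftime.datetime here.

```python
from datetime import datetime, timedelta


def subtract_as_timedeltas(left, right):
    """Hypothetical helper: element-wise difference kept as datetime.timedelta objects."""
    if len(left) != len(right):
        raise ValueError("sequences must have the same length")
    return [a - b for a, b in zip(left, right)]


i1 = [datetime(4991, 1, 1)]
i2 = [datetime(7190, 12, 31)]

deltas = subtract_as_timedeltas(i2, i1)
print(deltas[0].days)  # 803532
```

datetime.timedelta stores days, seconds and microseconds separately and allows up to 999999999 days, so the range is far wider. The likely cost is that such objects cannot use pandas's vectorized timedelta operations.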

Regarding the combine_by_coords issue, though, there might be an easier fix. Is there a reason that first_items is an Index of length-one Indexes? It's not clear to me why that needs to be the case.

xarray/xarray/core/combine.py

Line 91 in 56c16e4

first_items = pd.Index([index.take([0]) for index in indexes])

It appears that if we just select the first value of each index (i.e. a cftime.datetime object in this example), e.g.

first_items = pd.Index([index[0] for index in indexes])

pandas's rank method works properly and combine_by_coords produces the correct result:

>>> xr.combine_by_coords([d1, d2, d3]).time
<xarray.DataArray 'time' (time: 3)>
array([cftime.DatetimeGregorian(4500, 12, 31, 0, 0, 0, 0, 4, 365),
       cftime.DatetimeGregorian(4600, 12, 31, 0, 0, 0, 0, 2, 365),
       cftime.DatetimeGregorian(5100, 12, 31, 0, 0, 0, 0, 0, 365)],
      dtype=object)
Coordinates:
  * time     (time) object 4500-12-31 00:00:00 ... 5100-12-31 00:00:00
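
For reference, the ordering that the ranking step is meant to produce can be reproduced in plain Python. This is only a sketch of a dense rank over the first element of each index, with stdlib datetimes standing in for the cftime dates. It is not the code path pandas uses.

```python
from datetime import datetime

# First (and only) element of each index from the example above.
indexes = [
    [datetime(4500, 12, 31)],
    [datetime(4600, 12, 31)],
    [datetime(5100, 12, 31)],
]
first_items = [index[0] for index in indexes]

# Dense rank: position of each item among the sorted unique values.
unique_sorted = sorted(set(first_items))
order = [unique_sorted.index(item) for item in first_items]

print(order)  # [0, 1, 2]
```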

spencerkclark mentioned this issue Nov 16, 2019: Minor fix to combine_by_coords to allow for the combination of CFTimeIndexes separated by large time intervals #3543 (merged)

dcherian closed this as completed in #3543 Dec 7, 2019

spencerkclark mentioned this issue Dec 18, 2019: Add support for CFTimeIndex in get_clean_interp_index #3631 (merged)
