Problems with MultiIndexes #743

DamienIrving · 2021-06-11T05:25:11Z

xclim version: 0.27
Python version: 3.9.4
Operating System: Linux

Description

I've been using xclim.analog.spatial_analogs to perform Kolmogorov Smirnov tests. I recently discovered I get a ValueError: Names should be list-like for a MultiIndex error if the dist_dim is a MultiIndex. I've created a toy example below to demonstrate the problem.

What I Did

import xarray as xr
import xclim.analog
from xclim.testing import open_dataset

ds = xr.tutorial.open_dataset('air_temperature').resample(time='D').mean(keep_attrs=True)
ds = ds.mean(['lon'], keep_attrs=True)

ds_stacked = ds.stack(sample=['time'])

ks_stacked = xclim.analog.spatial_analogs(target=ds_stacked,
                                          candidates=ds_stacked,
                                          dist_dim='sample',
                                          method='kolmogorov_smirnov')

What I Received

ValueError                                Traceback (most recent call last)
<ipython-input-21-0fe1ec90dec4> in <module>
----> 1 ks_stacked = xclim.analog.spatial_analogs(target=ds_stacked,
      2                                           candidates=ds_stacked,
      3                                           dist_dim='sample',
      4                                           method='kolmogorov_smirnov')

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xclim/analog.py in spatial_analogs(target, candidates, dist_dim, method, **kwargs)
    124         xr.DataArray(list(target.data_vars.keys()), dims=("indices",), name="indices"),
    125     )
--> 126     target = target.stack(dist=dist_dim)
    127 
    128     # Create the target DataArray:

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xarray/core/dataarray.py in stack(self, dimensions, **dimensions_kwargs)
   2070         DataArray.unstack
   2071         """
-> 2072         ds = self._to_temp_dataset().stack(dimensions, **dimensions_kwargs)
   2073         return self._from_temp_dataset(ds)
   2074 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xarray/core/dataset.py in stack(self, dimensions, **dimensions_kwargs)
   3774         result = self
   3775         for new_dim, dims in dimensions.items():
-> 3776             result = result._stack_once(dims, new_dim)
   3777         return result
   3778 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xarray/core/dataset.py in _stack_once(self, dims, new_dim)
   3727         # consider dropping levels that are unused?
   3728         levels = [self.get_index(dim) for dim in dims]
-> 3729         idx = utils.multiindex_from_product_levels(levels, names=dims)
   3730         variables[new_dim] = IndexVariable(new_dim, idx)
   3731 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xarray/core/utils.py in multiindex_from_product_levels(levels, names)
    140     labels_mesh = np.meshgrid(*split_labels, indexing="ij")
    141     labels = [x.ravel() for x in labels_mesh]
--> 142     return pd.MultiIndex(levels, labels, sortorder=0, names=names)
    143 
    144 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    314 
    315         if verify_integrity:
--> 316             new_codes = result._verify_integrity()
    317             result._codes = new_codes
    318 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
    367         # nlevels matches nor that sortorder matches actually sortorder.
    368         codes = codes or self.codes
--> 369         levels = levels or self.levels
    370 
    371         if len(levels) != len(codes):

pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in levels(self)
    722         # create new IndexEngine
    723         # https://github.com/pandas-dev/pandas/issues/31648
--> 724         result = [
    725             x._shallow_copy(name=name) for x, name in zip(self._levels, self._names)
    726         ]

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
    723         # https://github.com/pandas-dev/pandas/issues/31648
    724         result = [
--> 725             x._shallow_copy(name=name) for x, name in zip(self._levels, self._names)
    726         ]
    727         for level in result:

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in _shallow_copy(self, values, name)
   1093             return type(self).from_tuples(values, sortorder=None, names=names)
   1094 
-> 1095         result = type(self)(
   1096             levels=self.levels,
   1097             codes=self.codes,

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    306         if names is not None:
    307             # handles name validation
--> 308             result._set_names(names)
    309 
    310         if sortorder is not None:

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in _set_names(self, names, level, validate)
   1424         # Don't allow a single string for names in a MultiIndex
   1425         if names is not None and not is_list_like(names):
-> 1426             raise ValueError("Names should be list-like for a MultiIndex")
   1427         names = list(names)
   1428 

ValueError: Names should be list-like for a MultiIndex

The text was updated successfully, but these errors were encountered:

aulemahal · 2021-06-11T14:04:40Z

Hi! So the issue arise because we are trying to stack an already stacked array, we didn't think of this use case! I know this is just a MWE, but was your goal to stack multiple dims into what the spatial analogs algo uses for the different distributions / samples?
In that case, this is already what the function does, so you could pass a list of dims directly, for example:

ds = xr.tutorial.open_dataset('air_temperature').resample(time='D').mean(keep_attrs=True)
ds = ds.mean(['lon'], keep_attrs=True)

ks = xclim.analog.spatial_analogs(
    target=ds,
    candidates=ds,
    dist_dim=['time'],
    method='kolmogorov_smirnov'
)

However, if the problem is that your data, for a reason or another, arrives with a MultiIndex, then we'll have to find a workaround!

DamienIrving · 2021-06-15T01:10:47Z

Ah, that makes sense!

Our real application (rather than this MWE) is a comparison between observational data and forecast data.

The observational data has a time axis (e.g. time, lat, lon), while temporal information in the forecast data is represented by initial_date and lead_time axes (e.g. initial_date, lead_time, lat, lon, ensemble_member). For the KS test we stack initial_date, lead_time and ensemble into a sample dimension, e.g.

forecast.stack({'sample': ['ensemble_member', 'init_date', 'lead_time']})

and then create a similar sample dimension for the observational data:

obs.stack({'sample': ['time']})

We could massage the obs into a initial_date / lead_time / ensemble_member format, but those axes won't be the same length as for the forecast data (which means xclim.analog.spatial_analogs will fail)...

aulemahal · 2021-06-17T20:57:04Z

I see!

Then I think we could fix this in xclim:

if isinstance(dist_dim, str):
    # rename to "dist" instead of stacking
else:
   # Current behaviour : stack to "dist".

(
In the meantime, a workaround would be to remove the "coords" data for sample:

fc = forecast.stack({'sample': ['ensemble_member', 'init_date', 'lead_time']})
fc['sample'] = np.arange(fc_crd.size)

out = spatial_analogs(obs.rename(time='sample'), fc, dist_dim='sample')

)

DamienIrving · 2021-06-24T00:52:28Z

That workaround would work...

fc = forecast.stack({'sample': ['ensemble_member', 'init_date', 'lead_time']})
fc['sample'] = np.arange(fc['sample'].size)

ob = obs.rename(time='sample')
ob['sample'] = np.arange(ob['sample'].size)

... but only if the length of the time axis in the observations is exactly the same length as the stacked ensemble_member/init_date/lead_time axis in the forecast data (otherwise you get a ValueError: indexes along dimension 'dist' are not equal). Is the solution to pad the observation data with nans? (Somehow using the ensembles module?)

aulemahal · 2021-06-24T15:36:31Z

Oh that's a real problem, intrinsic to how xclim uses xr.apply_ufunc for the dissimilarity function! Padding with NaNs will not work, currently the method is non-nan aware, any nan will make it return nan. (Which is another thing to change I guess).

So this issue shows that we need to separate the dimensions management of target and candidates.

The way I see it now, we should remove the stacking part in spatial_analogs. When needed (as in your case for fc), this treatment should be applied by the user. Our current way is assuming too much similarity between the two inputs and I don't think there's a case to try and cover all the cases, so let's just leave that as a job for the user.

Secondly, we should internally rename the dist dimension of one of target or candidates, that will solve the non-alignable problem.

Including today, I have 4 (busy) days left before my vacations (which end on july 26th), and before the next planned xclim release (0.28). Xclim 0.29 is planned for the end of august. (In august, I will begin a project involving spatial analogs, which means a high probability of improvements to the analog module.)

So sorry for the possible delay, but a colleague might have time, I'll ping them!

DamienIrving · 2021-06-25T00:31:12Z

No problem at all about the delay - thanks for considering a reconfiguration of spatial_analogs to address our use case 😄

huard mentioned this issue Jun 28, 2021

Allow passing multi-index dimension to spatial_analogs #758

Merged

6 tasks

huard closed this as completed in #758 Jun 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problems with MultiIndexes #743

Problems with MultiIndexes #743

DamienIrving commented Jun 11, 2021

aulemahal commented Jun 11, 2021 •

edited

Loading

DamienIrving commented Jun 15, 2021

aulemahal commented Jun 17, 2021 •

edited

Loading

DamienIrving commented Jun 24, 2021

aulemahal commented Jun 24, 2021 •

edited

Loading

DamienIrving commented Jun 25, 2021

Problems with MultiIndexes #743

Problems with MultiIndexes #743

Comments

DamienIrving commented Jun 11, 2021

Description

What I Did

What I Received

aulemahal commented Jun 11, 2021 • edited Loading

DamienIrving commented Jun 15, 2021

aulemahal commented Jun 17, 2021 • edited Loading

DamienIrving commented Jun 24, 2021

aulemahal commented Jun 24, 2021 • edited Loading

DamienIrving commented Jun 25, 2021

aulemahal commented Jun 11, 2021 •

edited

Loading

aulemahal commented Jun 17, 2021 •

edited

Loading

aulemahal commented Jun 24, 2021 •

edited

Loading