Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with MultiIndexes #743

Closed
DamienIrving opened this issue Jun 11, 2021 · 6 comments · Fixed by #758
Closed

Problems with MultiIndexes #743

DamienIrving opened this issue Jun 11, 2021 · 6 comments · Fixed by #758

Comments

@DamienIrving
Copy link

  • xclim version: 0.27
  • Python version: 3.9.4
  • Operating System: Linux

Description

I've been using xclim.analog.spatial_analogs to perform Kolmogorov Smirnov tests. I recently discovered I get a ValueError: Names should be list-like for a MultiIndex error if the dist_dim is a MultiIndex. I've created a toy example below to demonstrate the problem.

What I Did

import xarray as xr
import xclim.analog
from xclim.testing import open_dataset

ds = xr.tutorial.open_dataset('air_temperature').resample(time='D').mean(keep_attrs=True)
ds = ds.mean(['lon'], keep_attrs=True)

ds_stacked = ds.stack(sample=['time'])

ks_stacked = xclim.analog.spatial_analogs(target=ds_stacked,
                                          candidates=ds_stacked,
                                          dist_dim='sample',
                                          method='kolmogorov_smirnov')

What I Received

ValueError                                Traceback (most recent call last)
<ipython-input-21-0fe1ec90dec4> in <module>
----> 1 ks_stacked = xclim.analog.spatial_analogs(target=ds_stacked,
      2                                           candidates=ds_stacked,
      3                                           dist_dim='sample',
      4                                           method='kolmogorov_smirnov')

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xclim/analog.py in spatial_analogs(target, candidates, dist_dim, method, **kwargs)
    124         xr.DataArray(list(target.data_vars.keys()), dims=("indices",), name="indices"),
    125     )
--> 126     target = target.stack(dist=dist_dim)
    127 
    128     # Create the target DataArray:

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xarray/core/dataarray.py in stack(self, dimensions, **dimensions_kwargs)
   2070         DataArray.unstack
   2071         """
-> 2072         ds = self._to_temp_dataset().stack(dimensions, **dimensions_kwargs)
   2073         return self._from_temp_dataset(ds)
   2074 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xarray/core/dataset.py in stack(self, dimensions, **dimensions_kwargs)
   3774         result = self
   3775         for new_dim, dims in dimensions.items():
-> 3776             result = result._stack_once(dims, new_dim)
   3777         return result
   3778 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xarray/core/dataset.py in _stack_once(self, dims, new_dim)
   3727         # consider dropping levels that are unused?
   3728         levels = [self.get_index(dim) for dim in dims]
-> 3729         idx = utils.multiindex_from_product_levels(levels, names=dims)
   3730         variables[new_dim] = IndexVariable(new_dim, idx)
   3731 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/xarray/core/utils.py in multiindex_from_product_levels(levels, names)
    140     labels_mesh = np.meshgrid(*split_labels, indexing="ij")
    141     labels = [x.ravel() for x in labels_mesh]
--> 142     return pd.MultiIndex(levels, labels, sortorder=0, names=names)
    143 
    144 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    314 
    315         if verify_integrity:
--> 316             new_codes = result._verify_integrity()
    317             result._codes = new_codes
    318 

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
    367         # nlevels matches nor that sortorder matches actually sortorder.
    368         codes = codes or self.codes
--> 369         levels = levels or self.levels
    370 
    371         if len(levels) != len(codes):

pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in levels(self)
    722         # create new IndexEngine
    723         # https://github.com/pandas-dev/pandas/issues/31648
--> 724         result = [
    725             x._shallow_copy(name=name) for x, name in zip(self._levels, self._names)
    726         ]

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
    723         # https://github.com/pandas-dev/pandas/issues/31648
    724         result = [
--> 725             x._shallow_copy(name=name) for x, name in zip(self._levels, self._names)
    726         ]
    727         for level in result:

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in _shallow_copy(self, values, name)
   1093             return type(self).from_tuples(values, sortorder=None, names=names)
   1094 
-> 1095         result = type(self)(
   1096             levels=self.levels,
   1097             codes=self.codes,

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    306         if names is not None:
    307             # handles name validation
--> 308             result._set_names(names)
    309 
    310         if sortorder is not None:

/g/data/e14/dbi599/miniconda3/envs/test/lib/python3.9/site-packages/pandas/core/indexes/multi.py in _set_names(self, names, level, validate)
   1424         # Don't allow a single string for names in a MultiIndex
   1425         if names is not None and not is_list_like(names):
-> 1426             raise ValueError("Names should be list-like for a MultiIndex")
   1427         names = list(names)
   1428 

ValueError: Names should be list-like for a MultiIndex
@aulemahal
Copy link
Collaborator

aulemahal commented Jun 11, 2021

Hi! So the issue arise because we are trying to stack an already stacked array, we didn't think of this use case! I know this is just a MWE, but was your goal to stack multiple dims into what the spatial analogs algo uses for the different distributions / samples?
In that case, this is already what the function does, so you could pass a list of dims directly, for example:

ds = xr.tutorial.open_dataset('air_temperature').resample(time='D').mean(keep_attrs=True)
ds = ds.mean(['lon'], keep_attrs=True)

ks = xclim.analog.spatial_analogs(
    target=ds,
    candidates=ds,
    dist_dim=['time'],
    method='kolmogorov_smirnov'
)

However, if the problem is that your data, for a reason or another, arrives with a MultiIndex, then we'll have to find a workaround!

@DamienIrving
Copy link
Author

Ah, that makes sense!

Our real application (rather than this MWE) is a comparison between observational data and forecast data.

The observational data has a time axis (e.g. time, lat, lon), while temporal information in the forecast data is represented by initial_date and lead_time axes (e.g. initial_date, lead_time, lat, lon, ensemble_member). For the KS test we stack initial_date, lead_time and ensemble into a sample dimension, e.g.

forecast.stack({'sample': ['ensemble_member', 'init_date', 'lead_time']})

and then create a similar sample dimension for the observational data:

obs.stack({'sample': ['time']})

We could massage the obs into a initial_date / lead_time / ensemble_member format, but those axes won't be the same length as for the forecast data (which means xclim.analog.spatial_analogs will fail)...

@aulemahal
Copy link
Collaborator

aulemahal commented Jun 17, 2021

I see!

Then I think we could fix this in xclim:

if isinstance(dist_dim, str):
    # rename to "dist" instead of stacking
else:
   # Current behaviour : stack to "dist".

(
In the meantime, a workaround would be to remove the "coords" data for sample:

fc = forecast.stack({'sample': ['ensemble_member', 'init_date', 'lead_time']})
fc['sample'] = np.arange(fc_crd.size)

out = spatial_analogs(obs.rename(time='sample'), fc, dist_dim='sample')

)

@DamienIrving
Copy link
Author

That workaround would work...

fc = forecast.stack({'sample': ['ensemble_member', 'init_date', 'lead_time']})
fc['sample'] = np.arange(fc['sample'].size)

ob = obs.rename(time='sample')
ob['sample'] = np.arange(ob['sample'].size)

... but only if the length of the time axis in the observations is exactly the same length as the stacked ensemble_member/init_date/lead_time axis in the forecast data (otherwise you get a ValueError: indexes along dimension 'dist' are not equal). Is the solution to pad the observation data with nans? (Somehow using the ensembles module?)

@aulemahal
Copy link
Collaborator

aulemahal commented Jun 24, 2021

Oh that's a real problem, intrinsic to how xclim uses xr.apply_ufunc for the dissimilarity function! Padding with NaNs will not work, currently the method is non-nan aware, any nan will make it return nan. (Which is another thing to change I guess).

So this issue shows that we need to separate the dimensions management of target and candidates.

The way I see it now, we should remove the stacking part in spatial_analogs. When needed (as in your case for fc), this treatment should be applied by the user. Our current way is assuming too much similarity between the two inputs and I don't think there's a case to try and cover all the cases, so let's just leave that as a job for the user.

Secondly, we should internally rename the dist dimension of one of target or candidates, that will solve the non-alignable problem.

Including today, I have 4 (busy) days left before my vacations (which end on july 26th), and before the next planned xclim release (0.28). Xclim 0.29 is planned for the end of august. (In august, I will begin a project involving spatial analogs, which means a high probability of improvements to the analog module.)

So sorry for the possible delay, but a colleague might have time, I'll ping them!

@DamienIrving
Copy link
Author

No problem at all about the delay - thanks for considering a reconfiguration of spatial_analogs to address our use case 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants