Dataset.resample() adds time dimension to independent variables #2145

Open
malmans2 opened this issue May 17, 2018 · 5 comments

@malmans2
Contributor

malmans2 commented May 17, 2018

Code Sample, a copy-pastable example if possible

ds = ds.resample(time='1D',keep_attrs=True).mean()

Problem description

I'm downsampling a dataset in time, and the dataset also contains timeless variables (variables without a time dimension).
I've noticed that resample adds the time dimension to these timeless variables.
One workaround is:

  1. Split the dataset into a timeless and a time-dependent dataset
  2. Resample the time-dependent dataset
  3. Merge the two datasets

This is not a big deal, but I was wondering if I'm missing some flag that avoids this behavior.
If not, is it something that can be easily implemented in resample?
It would be very useful for datasets with variables on staggered grids.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.17.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

xarray: 0.10.3
pandas: 0.20.2
numpy: 1.12.1
scipy: 0.19.1
netCDF4: 1.2.4
h5netcdf: 0.5.1
h5py: 2.7.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.4
distributed: 1.21.8
matplotlib: 2.0.2
cartopy: 0.16.0
seaborn: 0.7.1
setuptools: 39.1.0
pip: 9.0.1
conda: 4.5.3
pytest: 3.1.2
IPython: 6.1.0
sphinx: 1.6.2

@fmaussion
Member

Thanks for the report! Do you think you can craft a minimal working example?

@malmans2
Contributor Author

malmans2 commented May 18, 2018

In my previous comment I said that this would be useful for staggered grids, but then I realized that resample only operates on the time dimension. Anyway, here is my example:

import xarray as xr
import pandas as pd
import numpy as np

# Create coordinates
time  = pd.date_range('1/1/2018', periods=365, freq='D')
space = np.arange(10)  # use numpy directly; the pd.np alias is not available in newer pandas

# Create random variables
var_withtime1 = np.random.randn(len(time), len(space))
var_withtime2 = np.random.randn(len(time), len(space))
var_timeless1 = np.random.randn(len(space))
var_timeless2 = np.random.randn(len(space))

# Create dataset
ds = xr.Dataset({'var_withtime1': (['time', 'space'], var_withtime1),
                 'var_withtime2': (['time', 'space'], var_withtime2),
                 'var_timeless1': (['space'], var_timeless1),
                 'var_timeless2': (['space'], var_timeless2)},
                coords={'time': (['time',], time),
                        'space': (['space',], space)})

# Standard resample: this adds the time dimension to the timeless variables
ds_resampled = ds.resample(time='1M').mean()

# My workaround: this does not add the time dimension to the timeless variables
ds_withtime = ds.drop([var for var in ds.variables if 'time' not in ds[var].dims])
ds_timeless = ds.drop([var for var in ds.variables if 'time' in ds[var].dims])
ds_workaround = xr.merge([ds_timeless, ds_withtime.resample(time='1M').mean()])

Datasets:

>>> ds
<xarray.Dataset>
Dimensions:        (space: 10, time: 365)
Coordinates:
  * time           (time) datetime64[ns] 2018-01-01 2018-01-02 2018-01-03 ...
  * space          (space) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    var_withtime1  (time, space) float64 -1.137 -0.5727 -1.287 0.8102 ...
    var_withtime2  (time, space) float64 1.406 0.8448 1.276 0.02579 0.5684 ...
    var_timeless1  (space) float64 0.02073 -2.117 -0.2891 1.735 -1.535 0.209 ...
    var_timeless2  (space) float64 0.4357 -0.3257 -0.8321 0.8409 0.1454 ...

>>> ds_resampled
<xarray.Dataset>
Dimensions:        (space: 10, time: 12)
Coordinates:
  * time           (time) datetime64[ns] 2018-01-31 2018-02-28 2018-03-31 ...
  * space          (space) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    var_withtime1  (time, space) float64 0.08149 0.02121 -0.05635 0.1788 ...
    var_withtime2  (time, space) float64 0.08991 0.5728 0.05394 0.214 0.3523 ...
    var_timeless1  (time, space) float64 0.02073 -2.117 -0.2891 1.735 -1.535 ...
    var_timeless2  (time, space) float64 0.4357 -0.3257 -0.8321 0.8409 ...

>>> ds_workaround
<xarray.Dataset>
Dimensions:        (space: 10, time: 12)
Coordinates:
  * space          (space) int64 0 1 2 3 4 5 6 7 8 9
  * time           (time) datetime64[ns] 2018-01-31 2018-02-28 2018-03-31 ...
Data variables:
    var_timeless1  (space) float64 0.4582 -0.6946 -0.3451 1.183 -1.14 0.1849 ...
    var_timeless2  (space) float64 1.658 -0.1719 -0.2202 -0.1789 -1.247 ...
    var_withtime1  (time, space) float64 -0.3901 0.3725 0.02935 -0.1315 ...
    var_withtime2  (time, space) float64 0.07145 -0.08536 0.07049 0.1025 ...
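
The split/resample/merge workaround can also be wrapped in a small helper. This is only a sketch under assumptions: the helper name `resample_time_dependent` is hypothetical, it relies on `Dataset.drop_vars`, which exists in newer xarray releases but not in the 0.10.3 reported in this issue, and it uses a `'30D'` frequency instead of `'1M'` to avoid pandas frequency-alias differences between versions.

```python
import numpy as np
import pandas as pd
import xarray as xr


def resample_time_dependent(ds, freq):
    """Hypothetical helper: resample only data variables that have a 'time' dimension."""
    # Names of data variables that do not depend on time
    timeless = [name for name, var in ds.data_vars.items() if "time" not in var.dims]
    # Resample everything else, then merge the untouched timeless variables back in
    resampled = ds.drop_vars(timeless).resample(time=freq).mean()
    return xr.merge([ds[timeless], resampled])


time = pd.date_range("2018-01-01", periods=60, freq="D")
ds_small = xr.Dataset(
    {
        "var_withtime": (("time", "space"), np.random.randn(60, 3)),
        "var_timeless": (("space",), np.random.randn(3)),
    },
    coords={"time": time, "space": np.arange(3)},
)

result = resample_time_dependent(ds_small, "30D")
print(result["var_timeless"].dims)
print(result["var_withtime"].dims)
```

Selecting on `ds.data_vars` rather than `ds.variables` leaves the coordinates alone, so only data variables are split off.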

@fmaussion changed the title from "Resample add usless dimensions" to "Dataset.resample() adds time dimension to independant variables" on May 18, 2018
@fmaussion
Member

I see. Note that groupby does the same. I don't know what the rationale is behind that decision, but there might be a reason...

@shoyer
Member

shoyer commented May 22, 2018

This is not really desirable behavior, but it's an implication of how xarray implements ds.resample(time='1M').mean():

  • Resample is converted into a groupby call, e.g., ds.groupby(time_starts).mean('time')
  • .mean('time') for each grouped dataset averages over the 'time' dimension, resulting in a dataset with only a 'space' dimension, e.g.,
>>> list(ds.resample(time='1M'))[0][1].mean('time')
<xarray.Dataset>
Dimensions:        (space: 10)
Coordinates:
  * space          (space) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    var_withtime1  (space) float64 0.008982 -0.09879 0.1361 -0.2485 -0.023 ...
    var_withtime2  (space) float64 0.2621 0.06009 -0.1686 0.07397 0.1095 ...
    var_timeless1  (space) float64 0.8519 -0.4253 -0.8581 0.9085 -0.4797 ...
    var_timeless2  (space) float64 0.8006 1.954 -0.5349 0.3317 1.778 -0.7954 ...
  • concat() is used to combine grouped datasets into the final result, but it doesn't know anything about which variables were aggregated, so every data variable gets the "time" dimension added.
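
The three bullets above can be reproduced by hand. The sketch below is an illustration, not xarray's internal code: it loops over the resample groups, reduces each over `'time'`, and concatenates the results. The small dataset, the `'30D'` frequency, and the explicit `data_vars="all"` argument are assumptions chosen to keep the example short and to make the concat behavior explicit.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2018-01-01", periods=60, freq="D")
ds_demo = xr.Dataset(
    {
        "var_withtime": (("time", "space"), np.random.randn(60, 3)),
        "var_timeless": (("space",), np.random.randn(3)),
    },
    coords={"time": time, "space": np.arange(3)},
)

# Group by resample bin and average each group over 'time'
labels, reduced = [], []
for label, group in ds_demo.resample(time="30D"):
    labels.append(label)
    reduced.append(group.mean("time"))

# After the reduction, no variable in a group carries a 'time' dimension
print(reduced[0]["var_timeless"].dims)

# concat stacks every data variable along the new 'time' dimension,
# including the one that never depended on time
combined = xr.concat(reduced, dim=pd.Index(labels, name="time"), data_vars="all")
print(combined["var_timeless"].dims)
```

Once each group has been reduced, the timeless and time-dependent variables look identical to `concat`, which is why both end up with the new dimension.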

To fix this I would suggest three steps:

  1. Add a keep_dims argument to xarray reductions like mean(), indicating that a dimension should be preserved with length 1, like keepdims=True for numpy reductions (keepdims=True for xarray reductions #2170).
  2. Fix concat to only concatenate variables that already have the concatenated dimension, as discussed in concat_dim getting added to *all* variables of multifile datasets #2064.
  3. Use keep_dims=True in groupby reductions. Then the result should automatically only include aggregated dimensions. This would conveniently allow us to remove existing logic in groupby() for restoring the original order of aggregated dimensions (see _restore_dim_order()).
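
For reference, this is the numpy behavior that step 1 refers to, shown as a minimal sketch on plain arrays. The array shapes are made up for illustration. With `keepdims=True` the reduced axis survives with length 1, so reduced groups can be concatenated along it directly, while an array that never had that axis would simply not take part.

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.standard_normal((31, 10))  # one group's (time, space) block
group_b = rng.standard_normal((28, 10))  # another group's (time, space) block

# keepdims=True keeps the reduced 'time' axis with length 1
mean_a = group_a.mean(axis=0, keepdims=True)
mean_b = group_b.mean(axis=0, keepdims=True)
print(mean_a.shape)  # (1, 10)

# The length-1 axis can be concatenated directly into (n_groups, space)
stacked = np.concatenate([mean_a, mean_b], axis=0)
print(stacked.shape)  # (2, 10)
```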

@dcherian
Contributor

There is compatibility code in GroupBy._binary_op that could be removed when this is fixed. (See #6160)

@max-sixty changed the title from "Dataset.resample() adds time dimension to independant variables" to "Dataset.resample() adds time dimension to independent variables" on Jul 25, 2024