Unexpected chunking of 3d DataArray in polyfit()
#4554
Thanks for opening an issue @paigem. Sorry for the delay. @aulemahal do you have time to take a look here?
Took a look and it seems to originate from the stacking part and something in dask. In:

```python
import xarray as xr
import dask.array as dsa

nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))

stk = da.stack(zy=['z', 'y'])
print(stk.dims, stk.chunks)
# ('x', 'zy') ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))
# Merged chunks!
```

And then I went down the rabbit hole (ok, it's not that deep) and it all goes down here:

xarray/xarray/core/variable.py Line 1507 in e0358e5
In:

```python
# Let's stack as xarray does: x, z, y -> x, zy
data_t = data.transpose(2, 0, 1)  # Dask array with shape (30, 10, 20), the same as `reordered` in `Variable._stack_once`.

new_data = data_t.reshape((30, -1), merge_chunks=True)  # True is the default; this is the same call as in xarray
new_data.chunks
# ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))

new_data = data_t.reshape((30, -1), merge_chunks=False)
new_data.shape  # I'm printing shape because chunks is too large, but see the bug:
# (30, 6000)  # instead of (30, 200)!!!

# Doesn't happen when we do not transpose. So let's reshape data as z, y, x -> zy, x
new_data = data.reshape((-1, 30), merge_chunks=True)
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# Chunks were not merged? But this is the output expected by @paigem.

new_data = data.reshape((-1, 30), merge_chunks=False)
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# That's what I expected with merge_chunks=False.
```
FWIW, this is the exact same thing we do in xhistogram in order to apply a histogram over a specific group of axes. We noticed a similar problem with Dask's reshape implementation, raised here: dask/dask#5544
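A minimal sketch of that axis-grouping pattern (schematic only, not xhistogram's actual code; the shapes and chunks reuse the example above):

```python
import dask.array as dsa

arr = dsa.ones((10, 20, 30), chunks=(1, 5, 30))
# Collapse the trailing axes into a single "sample" axis so a reduction
# such as a histogram can then be applied along it, block by block.
flat = arr.reshape((arr.shape[0], -1))
print(flat.shape, flat.chunks)
```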
The reshaping will now keep chunk sizes consistent to avoid this explosion with the next dask release.
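Until that dask release, a possible interim workaround (an assumption on my part, not something proposed in this thread) is to rechunk the fitted coefficients back to the input chunking:

```python
import dask.array as dsa
import xarray as xr

da = xr.DataArray(dsa.ones((10, 20, 30), chunks=(1, 5, 30)), dims=["z", "y", "x"])

fit = da.polyfit("x", deg=1)
# Hypothetical fix-up: restore the input chunk sizes on the dims that survive the fit.
coeffs = fit.polyfit_coefficients.chunk({"z": 1, "y": 5})
print(coeffs.chunks)
```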
What happened:
When running polyfit() on a 3d chunked xarray DataArray, the output is chunked differently than the input array.

What you expected to happen:
I expect the output to have the same chunking as the input.
Minimal Complete Verifiable Example:
(from @rabernat in xgcm/xrft#116)
Example: number of chunks decreases
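A hedged reconstruction of this case (the original snippet is missing here), reusing the chunk shapes from the comment above; the exact numbers are illustrative:

```python
import dask.array as dsa
import xarray as xr

nz, ny, nx = 10, 20, 30
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=["z", "y", "x"])
print(da.chunks)  # 10 chunks along z, 4 along y, 1 along x -> 40 chunks total

fit = da.polyfit("x", deg=1)
# On the affected versions the z/y chunks are reportedly merged while
# stacking, so the coefficients come back with fewer, larger chunks.
print(fit.polyfit_coefficients.chunks)
```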
Example: number of chunks increases
(This discussion started in xgcm/xrft#116 with @rabernat and @navidcy.)
Environment:
Running on Pangeo Cloud
Output of xr.show_versions()
```
INSTALLED VERSIONS
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.19.112+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.2
scipy: 1.5.2
netCDF4: 1.5.4
pydap: installed
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.7
cfgrib: 0.9.8.4
iris: None
bottleneck: 1.3.2
dask: 2.30.0
distributed: 2.30.0
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20201009
pip: 20.2.3
conda: None
pytest: 6.1.1
IPython: 7.18.1
sphinx: 3.2.1
```