
Unexpected chunking of 3d DataArray in polyfit() #4554

Closed · paigem opened this issue Oct 30, 2020 · 4 comments

Comments

@paigem (Contributor) commented Oct 30, 2020

What happened:
When running polyfit() on a 3d chunked xarray DataArray, the output is chunked differently than the input array.

What you expected to happen:
I expect the output to have the same chunking as the input.

Minimal Complete Verifiable Example:
(from @rabernat in xgcm/xrft#116)

Example: number of chunks decreases

import dask.array as dsa
import xarray as xr

nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))

pf = da.polyfit('x', 1)
pf.polyfit_coefficients.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (20,), (30,))
# chunks on the y dimension have been consolidated!

pv = xr.polyval(da.x, pf.polyfit_coefficients).transpose('z', 'y', 'x')
pv.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (20,), (30,))
# and this propagates to polyval

# align back against the original data
(da - pv).chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))
# hides the fact that we have chunk consolidation happening upstream

Example: number of chunks increases

nz, ny, nx = (6, 10, 4)
data = dsa.ones((nz, ny, nx), chunks=(2, 10, 2))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# -> ((2, 2, 2), (10,), (2, 2))

pf = da.polyfit('y', 1)
pf.polyfit_coefficients.chunks
# -> ((2,), (1, 1, 1, 1, 1, 1), (4,))

pv = xr.polyval(da.y, pf.polyfit_coefficients).transpose('z', 'y', 'x')
pv.chunks
# -> ((1, 1, 1, 1, 1, 1), (10,), (4,))

(da - pv).chunks
# -> ((1, 1, 1, 1, 1, 1), (10,), (2, 2))

(This discussion started in xgcm/xrft#116 with @rabernat and @navidcy.)
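A possible stopgap in the meantime (a hedged sketch, not from the thread): explicitly rechunk the coefficients back to the input's chunking before combining them with the original data. Using the second example above:

# Fit along 'y'; the surviving dims 'z' and 'x' come back with altered
# chunking, so rechunk them to match the input before further arithmetic.
pf = da.polyfit('y', 1)
coeffs = pf.polyfit_coefficients.chunk({'z': 2, 'x': 2})
coeffs.chunks
# -> ((2,), (2, 2, 2), (2, 2))  # (degree, z, x), matching da along z and x

This only masks the symptom, of course; the consolidation still happens inside polyfit.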

Environment:

Running on Pangeo Cloud

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.19.112+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.2
scipy: 1.5.2
netCDF4: 1.5.4
pydap: installed
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.7
cfgrib: 0.9.8.4
iris: None
bottleneck: 1.3.2
dask: 2.30.0
distributed: 2.30.0
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20201009
pip: 20.2.3
conda: None
pytest: 6.1.1
IPython: 7.18.1
sphinx: 3.2.1

@dcherian (Contributor)

Thanks for opening an issue, @paigem. Sorry for the delay.

@aulemahal do you have time to take a look here?

@aulemahal (Contributor)

Took a look, and it seems to originate from the stacking step and something in dask.

In polyfit, we rearrange the DataArrays into 2D arrays so we can run the least squares with np/dsa.apply_along_axis. But I checked, and the chunking problem appears before any such call. MWE:

import xarray as xr
import dask.array as dsa
nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))

stk = da.stack(zy=['z', 'y'])
print(stk.dims, stk.chunks)
# ('x', 'zy') ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))
# Merged chunks!

And then I went down the rabbit hole (ok, it's not that deep), and it all comes down to this line:

new_data = reordered.data.reshape(new_shape)

In Variable._stack_once the stacking is performed and Variable.data.reshape is called. Dask itself rechunks the output, merging the chunks. There is a merge_chunks kwarg for reshape, but I think it has a bug:

# Let's stack as xarray does: x, z, y -> x, zy
data_t = data.transpose(2, 0, 1)  # Dask array with shape (30, 10, 20), the same as `reordered` in `Variable._stack_once`.

new_data = data_t.reshape((30, -1), merge_chunks=True)  # True is the default; this is the same call as in xarray
new_data.chunks
# ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))

new_data = data_t.reshape((30, -1), merge_chunks=False)
new_data.shape  # I'm printing shape because chunks is too large, but see the bug:
# (30, 6000)  # instead of (30, 200)!!!

# Doesn't happen when we do not transpose. So let's reshape data as z, y, x -> zy, x
new_data = data.reshape((-1, 30), merge_chunks=True) 
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# Chunks were not merged? But this is the output expected by paigem.

new_data = data.reshape((-1, 30), merge_chunks=False)
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# That's what I expected with merge_chunks=False.

For polyfit itself, the apply_along_axis call could be changed to an apply_ufunc with vectorize=True; I think this would avoid the problem and behave the same on the user's side. It would need some refactoring.
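To illustrate, here is a minimal sketch of what an apply_ufunc-based polyfit could look like (hypothetical helper names, not xarray's implementation; assumes a recent xarray that accepts dask_gufunc_kwargs, and that the fitted dimension spans a single chunk, as in the examples above):

import numpy as np
import xarray as xr

def _fit_1d(y, x, deg):
    # Fit one 1-D series; vectorize=True loops this over all other points.
    return np.polyfit(x, y, deg)

def polyfit_ufunc(da, dim, deg):
    # Fall back to an integer index if `dim` has no coordinate.
    x = da[dim].values if dim in da.coords else np.arange(da.sizes[dim])
    return xr.apply_ufunc(
        _fit_1d,
        da,
        kwargs={"x": x, "deg": deg},
        input_core_dims=[[dim]],
        output_core_dims=[["degree"]],
        vectorize=True,
        dask="parallelized",
        dask_gufunc_kwargs={"output_sizes": {"degree": deg + 1}},
        output_dtypes=[np.float64],
    )

Because only the fitted dimension is consumed as a core dim, the chunking of the remaining dimensions passes through untouched.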

@rabernat (Contributor)

> we rearrange the DataArrays to 2D arrays

FWIW, this is the exact same thing we do in xhistogram in order to apply histogram over a specific group of axes:

https://github.com/xgcm/xhistogram/blob/2681aee6fe04e7656c458f32277f87e76653b6e8/xhistogram/core.py#L238-L254

We noticed a similar problem with Dask's reshape implementation, raised here: dask/dask#5544
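For reference, the pattern in question is roughly this (a simplified sketch, not xhistogram's actual code):

import dask.array as dsa

# Collapse the leading axes into one so a 1-D kernel can run along it,
# then reshape back: the same round-trip that triggers dask's chunk merging.
a = dsa.ones((10, 20, 30), chunks=(1, 5, 30))
flat = a.reshape((-1, 30))        # (z, y, x) -> (zy, x)
restored = flat.reshape(a.shape)  # back to (z, y, x)
print(flat.chunks, restored.chunks)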

@phofl (Contributor)

phofl commented Aug 14, 2024

With the next dask release, reshaping will keep chunk sizes consistent and avoid this explosion.
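A quick way to check this on a given dask version (a sketch; the exact resulting chunks depend on the release installed):

import dask.array as dsa

# The transpose-then-reshape at the heart of this issue.
data = dsa.ones((10, 20, 30), chunks=(1, 5, 30))
reshaped = data.transpose(2, 0, 1).reshape((30, -1))
print(reshaped.chunks)
# Older dask merges the flattened axis into chunks of 20; with the fix,
# chunk sizes stay consistent with the input instead.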
