
Unexpected chunking of 3d DataArray in polyfit() #4554

Closed · paigem opened this issue Oct 30, 2020 · 4 comments

Comments

@paigem (Contributor) commented Oct 30, 2020

What happened:
When running polyfit() on a 3d chunked xarray DataArray, the output is chunked differently than the input array.

What you expected to happen:
I expect the output to have the same chunking as the input.

Minimal Complete Verifiable Example:
(from @rabernat in xgcm/xrft#116)

Example: number of chunks decreases

import dask.array as dsa
import xarray as xr

nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))

pf = da.polyfit('x', 1)
pf.polyfit_coefficients.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (20,), (30,))
# chunks on the y dimension have been consolidated!

pv = xr.polyval(da.x, pf.polyfit_coefficients).transpose('z', 'y', 'x')
pv.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (20,), (30,))
# and this propagates to polyval

# align back against the original data
(da - pv).chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))
# hides the fact that we have chunk consolidation happening upstream

Example: number of chunks increases

nz, ny, nx = (6, 10, 4)
data = dsa.ones((nz, ny, nx), chunks=(2, 10, 2))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# -> ((2, 2, 2), (10,), (2, 2))

pf = da.polyfit('y', 1)
pf.polyfit_coefficients.chunks
# -> ((2,), (1, 1, 1, 1, 1, 1), (4,))

pv = xr.polyval(da.y, pf.polyfit_coefficients).transpose('z', 'y', 'x')
pv.chunks
# -> ((1, 1, 1, 1, 1, 1), (10,), (4,))

(da - pv).chunks
# -> ((1, 1, 1, 1, 1, 1), (10,), (2, 2))

(This discussion started in xgcm/xrft#116 with @rabernat and @navidcy.)
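A possible stopgap in the meantime (a hedged sketch, not from the thread): explicitly rechunk the coefficients back to the input's chunking before combining them with the original data. Using the second example above:

# Fit along 'y'; the surviving dims 'z' and 'x' come back with altered
# chunking, so rechunk them to match the input before further arithmetic.
pf = da.polyfit('y', 1)
coeffs = pf.polyfit_coefficients.chunk({'z': 2, 'x': 2})
coeffs.chunks
# -> ((2,), (2, 2, 2), (2, 2))  # (degree, z, x), matching da along z and x

This only masks the symptom, of course; the consolidation still happens inside polyfit.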

Environment:

Running on Pangeo Cloud

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.19.112+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.2
scipy: 1.5.2
netCDF4: 1.5.4
pydap: installed
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.7
cfgrib: 0.9.8.4
iris: None
bottleneck: 1.3.2
dask: 2.30.0
distributed: 2.30.0
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20201009
pip: 20.2.3
conda: None
pytest: 6.1.1
IPython: 7.18.1
sphinx: 3.2.1

@dcherian (Contributor)

Thanks for opening an issue, @paigem. Sorry for the delay.

@aulemahal do you have time to take a look here?

@aulemahal (Contributor)

Took a look, and it seems to originate from the stacking step and something in dask.

In polyfit, we rearrange the DataArrays into 2D arrays so we can run the least squares with np/dsa.apply_along_axis. But I checked, and the chunking problem appears before any such call. MWE:

import xarray as xr
import dask.array as dsa
nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))

stk = da.stack(zy=['z', 'y'])
print(stk.dims, stk.chunks)
# ('x', 'zy') ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))
# Merged chunks!

And then I went down the rabbit hole (ok, it's not that deep), and it all comes down to this line:

new_data = reordered.data.reshape(new_shape)

In Variable._stack_once the stacking is performed and Variable.data.reshape is called. Dask itself rechunks the output, merging the chunks. There is a merge_chunks kwarg for reshape, but I think it has a bug:

# Let's stack as xarray does: x, z, y -> x, zy
data_t = data.transpose(2, 0, 1)  # Dask array with shape (30, 10, 20), the same as `reordered` in `Variable._stack_once`.

new_data = data_t.reshape((30, -1), merge_chunks=True)  # True is the default; this is the same call as in xarray
new_data.chunks
# ((30,), (20, 20, 20, 20, 20, 20, 20, 20, 20, 20))

new_data = data_t.reshape((30, -1), merge_chunks=False)
new_data.shape  # I'm printing shape because chunks is too large, but see the bug:
# (30, 6000)  # instead of (30, 200)!!!

# Doesn't happen when we do not transpose. So let's reshape data as z, y, x -> zy, x
new_data = data.reshape((-1, 30), merge_chunks=True) 
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# Chunks were not merged? But this is the output expected by paigem.

new_data = data.reshape((-1, 30), merge_chunks=False)
new_data.chunks
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (30,))
# That's what I expected with merge_chunks=False.

For polyfit itself, the apply_along_axis call could be changed to an apply_ufunc with vectorize=True; I think this would avoid the problem and behave the same on the user's side. It would need some refactoring.
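To illustrate, here is a minimal sketch of what an apply_ufunc-based polyfit could look like (hypothetical helper names, not xarray's implementation; assumes a recent xarray that accepts dask_gufunc_kwargs, and that the fitted dimension spans a single chunk, as in the examples above):

import numpy as np
import xarray as xr

def _fit_1d(y, x, deg):
    # Fit one 1-D series; vectorize=True loops this over all other points.
    return np.polyfit(x, y, deg)

def polyfit_ufunc(da, dim, deg):
    # Fall back to an integer index if `dim` has no coordinate.
    x = da[dim].values if dim in da.coords else np.arange(da.sizes[dim])
    return xr.apply_ufunc(
        _fit_1d,
        da,
        kwargs={"x": x, "deg": deg},
        input_core_dims=[[dim]],
        output_core_dims=[["degree"]],
        vectorize=True,
        dask="parallelized",
        dask_gufunc_kwargs={"output_sizes": {"degree": deg + 1}},
        output_dtypes=[np.float64],
    )

Because only the fitted dimension is consumed as a core dim, the chunking of the remaining dimensions passes through untouched.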

@rabernat (Contributor)

> we rearrange the DataArrays to 2D arrays

FWIW, this is the exact same thing we do in xhistogram in order to apply histogram over a specific group of axes:

https://github.com/xgcm/xhistogram/blob/2681aee6fe04e7656c458f32277f87e76653b6e8/xhistogram/core.py#L238-L254

We noticed a similar problem with Dask's reshape implementation, raised here: dask/dask#5544
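For reference, the pattern in question is roughly this (a simplified sketch, not xhistogram's actual code):

import dask.array as dsa

# Collapse the leading axes into one so a 1-D kernel can run along it,
# then reshape back: the same round-trip that triggers dask's chunk merging.
a = dsa.ones((10, 20, 30), chunks=(1, 5, 30))
flat = a.reshape((-1, 30))        # (z, y, x) -> (zy, x)
restored = flat.reshape(a.shape)  # back to (z, y, x)
print(flat.chunks, restored.chunks)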

@phofl (Contributor)

phofl commented Aug 14, 2024

With the next dask release, reshaping will keep chunk sizes consistent and avoid this explosion.
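A quick way to check this on a given dask version (a sketch; the exact resulting chunks depend on the release installed):

import dask.array as dsa

# The transpose-then-reshape at the heart of this issue.
data = dsa.ones((10, 20, 30), chunks=(1, 5, 30))
reshaped = data.transpose(2, 0, 1).reshape((30, -1))
print(reshaped.chunks)
# Older dask merges the flattened axis into chunks of 20; with the fix,
# chunk sizes stay consistent with the input instead.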
