
Undesired chunking of 3d input when using linear detrend with polyfit() #116

Closed

paigem opened this issue Oct 30, 2020 · 8 comments

@paigem
Contributor
paigem commented Oct 30, 2020

I am running the most up-to-date version of xrft, which uses xarray's polyfit() for linear detrending.

When I run xrft.dft() over one dimension with linear detrending on a 3d input, the output has different chunking than the original input. Specifically, the first axis that is not the FFT dimension comes out with smaller chunks.

For example, if I construct a simple 3d input:
[Screenshot: construction of a simple chunked 3d input array]

Then the output is as follows. Note the different chunking along the 'lon' axis.
[Screenshot: printed chunks of the input, of polyfit()'s output, and of the detrended result]

In the above image, I have printed out the original input array chunks, the chunks that result from applying polyfit(), and the resulting chunks after all detrending. It thus appears that polyfit() is rechunking the 'lon' dimension from chunks of length 2 to chunks of length 1, and that this persists after the detrending function.

Any idea if this is normal behavior? I wonder if it might be a bug in xarray's polyfit() implementation, but perhaps I have misinterpreted how it is supposed to work. I'm currently testing whether rechunking after detrending to the original chunk sizes would be a workaround for xrft's polyfit() implementation.

This rechunking doesn't cause problems if the data is not very large, but when I try to run this on, e.g., the global CESM ocean data, dask can't handle it (presumably due to very small chunk sizes/large number of tasks).

Discussed this issue with @navidcy

@rabernat
Collaborator

> I wonder if it might be a bug in xarray's polyfit() implementation, but perhaps I have misinterpreted how it is supposed to work.

This is a good hunch to pursue. My assumption would be that polyfit would not alter the chunk structure at all, particularly if you are applying polyfit over a contiguous (non-chunked) dimension. But that appears not to be the case.

I tried to make a very simple reproducible example, and I actually got the opposite behavior: chunks get consolidated rather than decimated! (This is still a problem.)

```python
import dask.array as dsa
import xarray as xr

nz, ny, nx = (10, 20, 30)
data = dsa.ones((nz, ny, nx), chunks=(1, 5, nx))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))

pf = da.polyfit('x', 1)
pf.polyfit_coefficients.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (20,), (30,))
# chunks on the y dimension have been consolidated!

pv = xr.polyval(da.x, pf.polyfit_coefficients).transpose('z', 'y', 'x')
pv.chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (20,), (30,))
# and this propagates to polyval

# align back against the original data
(da - pv).chunks
# -> ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (5, 5, 5, 5), (30,))
# hides the fact that we have chunk consolidation happening upstream
```

Repeating this with your array sizes gives results similar to yours:

```python
nz, ny, nx = (6, 10, 4)
data = dsa.ones((nz, ny, nx), chunks=(2, 10, 2))
da = xr.DataArray(data, dims=['z', 'y', 'x'])
da.chunks
# -> ((2, 2, 2), (10,), (2, 2))

pf = da.polyfit('y', 1)
pf.polyfit_coefficients.chunks
# -> ((2,), (1, 1, 1, 1, 1, 1), (4,))

pv = xr.polyval(da.y, pf.polyfit_coefficients).transpose('z', 'y', 'x')
pv.chunks
# -> ((1, 1, 1, 1, 1, 1), (10,), (4,))

(da - pv).chunks
# -> ((1, 1, 1, 1, 1, 1), (10,), (2, 2))
```

I think this is an xarray problem; we should probably open an issue in xarray: https://github.com/pydata/xarray/issues.

However, I'm not optimistic it can be fixed soon, since xarray itself relies on dask's implementation of polyfit.

So yes, in the meantime, we will need a workaround within xrft. Unfortunately this means revisiting #112. 😭 I think we can simplify things by just acknowledging that we will never detrend over a chunked dimension and just using map_blocks to parallelize the operation in a clean, transparent way.

I'm sorry that you're still bogged down in this technical detail. On the other hand, you are getting a valuable crash course in chunking and the pitfalls around it.
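For illustration, the map_blocks approach sketched above might look roughly like the following. This is a hypothetical standalone version, not xrft's actual implementation; the helper name `linear_detrend` and the array sizes are made up, and it assumes the detrended dimension is contiguous (un-chunked) so each block can be detrended in memory.

```python
import numpy as np
import dask.array as dsa
import xarray as xr

def linear_detrend(da, dim):
    """Remove a linear trend along `dim`, block by block.

    Assumes `dim` is un-chunked, so every dask block holds the full
    length of `dim` and can be detrended independently in memory.
    """
    def _detrend_block(block):
        axis = block.get_axis_num(dim)
        detrended = np.apply_along_axis(
            lambda y: y - np.polyval(np.polyfit(np.arange(y.size), y, 1),
                                     np.arange(y.size)),
            axis, block.data)
        return block.copy(data=detrended)

    # template=da guarantees the output keeps da's chunk structure
    return xr.map_blocks(_detrend_block, da, template=da)

data = dsa.ones((4, 6, 8), chunks=(2, 6, 4))
da = xr.DataArray(data, dims=["z", "y", "x"])
detrended = linear_detrend(da, "y")
```

Because each block is processed independently and the template pins the output layout, the chunk structure of `da` is preserved by construction, sidestepping whatever rechunking polyfit does internally.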

@paigem
Contributor Author

paigem commented Oct 30, 2020

> So yes, in the meantime, we will need a workaround within xrft. Unfortunately this means revisiting #112. 😭 I think we can simplify things by just acknowledging that we will never detrend over a chunked dimension and just using map_blocks to parallelize the operation in a clean, transparent way.

This plan sounds just fine to me, though I might need some help implementing map_blocks. The other option at the moment (i.e. without needing to update the xrft code) is that I can just run xrft.dft() on smaller spatial regions and stitch together a global plot. Not ideal, but likely doable. It's also worth noting that I don't think this problem would arise in 2d cases.

> I think this is an xarray problem; we should probably open an issue in xarray: https://github.com/pydata/xarray/issues.

Happy to open an issue about this in xarray, once I get back from lunch!

> I'm sorry that you're still bogged down in this technical detail. On the other hand, you are getting a valuable crash course in chunking and the pitfalls around it.

At least for now, I don't mind the technical details! I'm learning a lot along the way! 😄

@navidcy
Contributor

navidcy commented Oct 30, 2020

> I think we can simplify things by just acknowledging that we will never detrend over a chunked dimension and just using map_blocks to parallelize the operation in a clean, transparent way.

Just a note on this: we never want to detrend over a chunked dimension anyway. xarray's polyfit() can handle one, but xrft.dft() requires the dimension being transformed to be un-chunked!
Therefore, a solution could simply be to make sure that, after xarray's polyfit() has done its shenanigans, the output DataArray has the same chunking as the original one?

I.e., I'm suggesting simply replacing

xrft/xrft/xrft.py, lines 143 to 144 in 3662a85:

```python
linear_fit = xr.polyval(da[dim[0]], p.polyfit_coefficients)
return da - linear_fit
```

with

```python
linear_fit = xr.polyval(da[dim[0]], p.polyfit_coefficients)
linear_fit = linear_fit.chunk(da.chunks)  # see discussion in https://github.com/xgcm/xrft/issues/116
return da - linear_fit
```

or

```python
linear_fit = xr.polyval(da[dim[0]], p.polyfit_coefficients)
return (da - linear_fit).chunk(da.chunks)  # see discussion in https://github.com/xgcm/xrft/issues/116
```

@rabernat, @paigem ?

@paigem
Contributor Author

paigem commented Oct 30, 2020

@navidcy Yes! I agree! Your suggested fix is exactly what I am currently trying to test (to see if it runs without error on the global dataset), but my dask computation seems to have stalled. Not sure yet if that's because of the polyfit rechunking or an error on my end (likely the latter). But yes, thank you for clarifying this point!

@navidcy
Contributor

navidcy commented Oct 30, 2020

I opened a PR. It's marked as WIP since the discussion here might conclude that the solution I suggested is not ideal. Feel free to close the PR in that case ;)

@paigem
Contributor Author

paigem commented Oct 30, 2020

With @navidcy's suggested fix, I am still unable to run xrft linear detrending using polyfit() on the global CESM dataset. By running _apply_detrend() directly, I have isolated the problem to polyfit() itself: even rechunking after applying polyfit(), as in @navidcy's PR, doesn't help on a large dataset, since it is the polyfit() computation that causes the issue.

This is likely only a problem for very large datasets, since it works on a subset of the global CESM data.

@navidcy
Contributor

navidcy commented Nov 16, 2020

Is this closed via #118?

@rabernat
Collaborator

yes
