Add some warnings about rechunking to the docs #6569
Conversation
Thanks @fmaussion !
doc/user-guide/dask.rst
Outdated
1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early in the pipeline, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet. (See `Dask issue #746 <https://github.com/dask/dask/issues/746>`_.)
2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_.)
3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks.
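Taken together, the suggestions above can be sketched as a hypothetical pipeline. The dataset below is built in memory for illustration; with real files you would instead start from ``open_mfdataset`` with the small spatial chunks of suggestion 3:

```python
import numpy as np
import pandas as pd
import xarray as xr

# In practice (suggestion 3):
#     ds = xr.open_mfdataset("data_*.nc",
#                            chunks={"latitude": 10, "longitude": 10})
# Here a small chunked dataset stands in for the real files.
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), np.random.rand(365, 20, 20))},
    coords={"time": pd.date_range("2000-01-01", periods=365)},
).chunk({"latitude": 10, "longitude": 10})

# Suggestion 1: index early, *before* resample()/groupby()
region = ds.isel(latitude=slice(0, 5), longitude=slice(0, 5))
monthly = region.resample(time="1MS").mean()

# Suggestion 2: persist intermediates to disk and reload them, e.g.
#     monthly.to_netcdf("monthly_mean.nc")
#     monthly = xr.open_dataset("monthly_mean.nc")
```

The file names, variable name, and chunk sizes are placeholders; only the ordering of the operations matters here.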
"chunks of data referring to different chunks" is kinda confusing, how about "subsets of data which span multiple chunks"? Or is that not the intended meaning?
Yeah I'm also not sure what was meant here (this is like this in the docs).
"subsets of data which span multiple chunks" sounds much better.
I have a related question though. When one opens a single netCDF file, is it better to:
- subset first (data still not loaded), then chunk (convert the lazily loaded arrays from xarray to dask arrays)?
- chunk first (at the call to ``ds.chunk``), then subset?
subset first to reduce memory; otherwise the chunk will get loaded into memory, and then values will be discarded.
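A minimal sketch of the "subset first, then chunk" ordering, assuming a single file; the in-memory dataset below stands in for a lazily opened one:

```python
import numpy as np
import xarray as xr

# Stand-in for a lazily opened file, e.g. ds = xr.open_dataset("file.nc")
ds = xr.Dataset({"t2m": (("time", "lat", "lon"), np.zeros((100, 50, 50)))})

# 1. Subset first: no chunks are loaded for values you will discard.
sub = ds.isel(lat=slice(0, 10), lon=slice(0, 10))

# 2. Then chunk: the remaining data becomes dask-backed for lazy computation.
sub = sub.chunk({"time": 10})
```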
I've edited a few more sentences, to me this is ready to merge! I've been struggling with groupby a good deal of my week and I added a warning regarding groupby as well. Feel free to disagree, but I've been unable to get it to work on large datasets across multiple files yet (see pangeo discourse post).
Thanks as ever @fmaussion !
This adds some warnings at the right places when rechunking a dataset opened with ``open_mfdataset`` (see pangeo-data/rechunker#100 (comment) for context). Thanks to @dcherian for the wisdom of the day!