
Add some warnings about rechunking to the docs #6569

Merged (6 commits) May 10, 2022

Conversation

fmaussion
Member

This adds warnings, in the right places, about rechunking a dataset opened with ``open_mfdataset`` (see pangeo-data/rechunker#100 (comment) for context).

Thanks to @dcherian for the wisdom of the day!

@max-sixty
Collaborator

Thanks @fmaussion!

@max-sixty added the `plan to merge` (Final call for comments) label May 3, 2022

1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early in the pipeline, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet. (See `Dask issue #746 <https://github.com/dask/dask/issues/746>`_.) All three practices in this list are illustrated in the sketch after the list.

2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a failure case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_.)

3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
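A minimal sketch tying the three practices together (the file pattern, coordinate names, slice bounds, and chunk sizes below are made-up assumptions, not part of the original docs)::

    import xarray as xr

    # 3. Specify small spatial chunks up front when opening many files.
    ds = xr.open_mfdataset("era5_*.nc", chunks={"latitude": 10, "longitude": 10})

    # 1. Index early, before resample() or groupby(), so downstream steps
    #    only touch the blocks they actually need.
    sub = ds.sel(latitude=slice(60, 40), longitude=slice(0, 20))

    # 2. Save the intermediate result (here the temporal mean) to disk,
    #    then reload it rather than keeping the whole task graph alive.
    sub.mean("time").to_netcdf("temporal_mean.nc")
    mean = xr.open_dataset("temporal_mean.nc")

    anomalies = sub - mean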
Member

"chunks of data referring to different chunks" is kinda confusing, how about "subsets of data which span multiple chunks"? Or is that not the intended meaning?

fmaussion (Member Author)

Yeah, I'm also not sure what was meant here (it's phrased like this in the current docs).

"subsets of data which span multiple chunks" sounds much better.

I have a related question though. When one opens a single netCDF file, is it better to:

  1. subset first (data still not loaded), then chunk (i.e., convert the lazily loaded xarray arrays to dask arrays)?
  2. chunk first (at the call to ``ds.chunk``), then subset?

Contributor

Subset first, to reduce memory; otherwise the chunk will get loaded into memory, and then values will be discarded.
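For illustration, a minimal sketch of the two orderings (the file name, dimension name, and sizes are hypothetical)::

    import xarray as xr

    ds = xr.open_dataset("single_file.nc")  # arrays are lazy, file-backed

    # Preferred: subset while the data is still lazily file-backed,
    # then chunk the (much smaller) result into dask arrays.
    good = ds.isel(time=slice(0, 120)).chunk({"time": 12})

    # Chunk-first: when computed, whole chunks are read into memory only
    # for most of their values to be discarded by the later subsetting.
    bad = ds.chunk({"time": 12}).isel(time=slice(0, 120))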

@fmaussion
Member Author

I've edited a few more sentences; to me this is ready to merge!

I've been struggling with groupby for a good deal of my week, and I added a warning regarding groupby as well. Feel free to disagree, but I've been unable to get it to work on large datasets across multiple files yet (see the pangeo discourse post).

@max-sixty merged commit 218e77a into pydata:main May 10, 2022
@max-sixty
Collaborator

Thanks as ever, @fmaussion!

dcherian added a commit to dcherian/xarray that referenced this pull request May 20, 2022
* main: (24 commits)
  Fix overflow issue in decode_cf_datetime for dtypes <= np.uint32 (pydata#6598)
  Enable flox in GroupBy and resample (pydata#5734)
  Add setuptools as dependency in ASV benchmark CI (pydata#6609)
  change polyval dim ordering (pydata#6601)
  re-add timedelta support for polyval (pydata#6599)
  Minor Dataset.map docstr clarification (pydata#6595)
  New inline_array kwarg for open_dataset (pydata#6566)
  Fix polyval overloads (pydata#6593)
  Restore old MultiIndex dropping behaviour (pydata#6592)
  [docs] add Dataset.assign_coords example (pydata#6336) (pydata#6558)
  Fix zarr append dtype checks (pydata#6476)
  Add missing space in exception message (pydata#6590)
  Doc Link to accessors list in extending-xarray.rst (pydata#6587)
  Fix Dataset/DataArray.isel with drop=True and scalar DataArray indexes (pydata#6579)
  Add some warnings about rechunking to the docs (pydata#6569)
  [pre-commit.ci] pre-commit autoupdate (pydata#6584)
  terminology.rst: fix link to Unidata's "netcdf_dataset_components" (pydata#6583)
  Allow string formatting of scalar DataArrays (pydata#5981)
  Fix mypy issues & reenable in tests (pydata#6581)
  polyval: Use Horner's algorithm + support chunked inputs (pydata#6548)
  ...
dcherian added a commit to headtr1ck/xarray that referenced this pull request May 20, 2022
commit 398f1b6
Author: dcherian <deepak@cherian.net>
Date:   Fri May 20 08:47:56 2022 -0600

    Backward compatibility dask

commit bde40e4
Merge: 0783df3 4cae8d0
Author: dcherian <deepak@cherian.net>
Date:   Fri May 20 07:54:48 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main:
      concatenate docs style (pydata#6621)
      Typing for open_dataset/array/mfdataset and to_netcdf/zarr (pydata#6612)
      {full,zeros,ones}_like typing (pydata#6611)

commit 0783df3
Merge: 5cff4f1 8de7061
Author: dcherian <deepak@cherian.net>
Date:   Sun May 15 21:03:50 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main: (24 commits)
      ...

commit 5cff4f1
Merge: dfe200d 6144c61
Author: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Date:   Sun May 1 15:16:33 2022 -0700

    Merge branch 'main' into dask-datetime-to-numeric

commit dfe200d
Author: dcherian <deepak@cherian.net>
Date:   Sun May 1 11:04:03 2022 -0600

    Minor cleanup

commit 35ed378
Author: dcherian <deepak@cherian.net>
Date:   Sun May 1 10:57:36 2022 -0600

    Support dask arrays in datetime_to_numeric