some dask stuff.
dcherian committed May 23, 2019
1 parent 6a0d515 commit 3bce6a8
Showing 2 changed files with 16 additions and 4 deletions.
18 changes: 15 additions & 3 deletions doc/dask.rst
@@ -5,13 +5,13 @@ Parallel computing with Dask

xarray integrates with `Dask <http://dask.pydata.org/>`__ to support parallel
computations and streaming computation on datasets that don't fit into memory.

Currently, Dask is an entirely optional feature for xarray. However, the
benefits of using Dask are sufficiently strong that Dask may become a required
dependency in a future version of xarray.

For a full example of how to use xarray's Dask integration, read the
-`blog post introducing xarray and Dask`_.
+`blog post introducing xarray and Dask`_. More up-to-date examples
+may be found at the `Pangeo project's use-cases <http://pangeo.io/use_cases/index.html>`_.

.. _blog post introducing xarray and Dask: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/

@@ -396,4 +396,16 @@ With analysis pipelines involving both spatial subsetting and temporal resampling

2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting (see the first sketch after this list). Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_)

-3. Specify smaller chunks across space when using ``open_mfdataset()`` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).

4. Using the h5netcdf package, by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset`,
can be quicker than the default ``engine='netcdf4'``, which uses the netCDF4 package
(see the second sketch after this list).

5. Some Dask-specific tips may be found in the `Dask array best practices <https://docs.dask.org/en/latest/array-best-practices.html>`_ documentation.

6. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be
useful in identifying performance bottlenecks.

7. Installing the optional `bottleneck <https://github.com/kwgoodman/bottleneck>`_ library
will result in greatly reduced memory usage when using :py:meth:`~xarray.Dataset.rolling`
on dask arrays (see the third sketch after this list).
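
A minimal sketch of suggestion 2; the file names and the ``time`` dimension are illustrative:

.. code-block:: python

    import xarray as xr

    ds = xr.open_mfdataset("data/*.nc")  # hypothetical input files

    # Write the temporal mean to disk so the subtraction below reads a
    # finished result back, instead of the scheduler keeping every
    # computed chunk in memory.
    ds.mean(dim="time").to_netcdf("temporal_mean.nc")

    mean = xr.open_dataset("temporal_mean.nc")
    anomaly = ds - mean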
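
A sketch combining suggestions 3 and 4; the chunk sizes and file pattern are
illustrative:

.. code-block:: python

    # Small spatial chunks make spatial subsetting cheap; the h5netcdf
    # engine can be quicker than the default netCDF4 engine.
    ds = xr.open_mfdataset(
        "data/*.nc",
        chunks={"latitude": 10, "longitude": 10},
        engine="h5netcdf",
    )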
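
And a sketch of suggestion 7, assuming bottleneck is installed and ``ds`` has a
``time`` dimension; the window size is arbitrary:

.. code-block:: python

    # With bottleneck available, rolling-window reductions on dask-backed
    # arrays use much less memory.
    smoothed = ds.rolling(time=12).mean()
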
2 changes: 1 addition & 1 deletion doc/io.rst
@@ -87,7 +87,7 @@ string, e.g., to access subgroup 'bar' within group 'foo' pass
pass ``mode='a'`` to ``to_netcdf`` to ensure that each call does not delete the
file.

-Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
+Data is *always* loaded lazily from netCDF files. You can manipulate, slice and subset
Dataset and DataArray objects, and no array values are loaded into memory until
you try to perform some sort of actual computation. For an example of how these
lazy arrays work, see the OPeNDAP section below.
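
As a minimal illustration (the file name and variable are hypothetical):

.. code-block:: python

    import xarray as xr

    ds = xr.open_dataset("example.nc")

    # Slicing builds a lazy view; no array values are read yet.
    subset = ds["temperature"].isel(time=0)

    # Accessing .values (or calling .load()) triggers the actual read.
    print(subset.mean().values)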
