some dask stuff.
dcherian committed May 23, 2019
1 parent 6a0d515 commit 3bce6a8
Showing 2 changed files with 16 additions and 4 deletions.
18 changes: 15 additions & 3 deletions doc/dask.rst
@@ -5,13 +5,13 @@ Parallel computing with Dask

xarray integrates with `Dask <http://dask.pydata.org/>`__ to support parallel
computations and streaming computation on datasets that don't fit into memory.

Currently, Dask is an entirely optional feature for xarray. However, the
benefits of using Dask are sufficiently strong that Dask may become a required
dependency in a future version of xarray.

For a full example of how to use xarray's Dask integration, read the
-`blog post introducing xarray and Dask`_.
+`blog post introducing xarray and Dask`_. More up-to-date examples
+may be found at the `Pangeo project's use-cases <http://pangeo.io/use_cases/index.html>`_.

.. _blog post introducing xarray and Dask: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/

@@ -396,4 +396,16 @@ With analysis pipelines involving both spatial subsetting and temporal resampling

2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting (see the first sketch after this list). Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_)

-3. Specify smaller chunks across space when using ``open_mfdataset()`` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).

4. Using the h5netcdf package, by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset`,
can be quicker than the default ``engine='netcdf4'``, which uses the netCDF4 package
(see the second sketch after this list).

5. Some Dask-specific tips may be found in the `Dask array best practices <https://docs.dask.org/en/latest/array-best-practices.html>`_ documentation.

6. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be
useful in identifying performance bottlenecks.

7. Installing the optional `bottleneck <https://github.com/kwgoodman/bottleneck>`_ library
will result in greatly reduced memory usage when using :py:meth:`~xarray.Dataset.rolling`
on dask arrays (see the third sketch after this list).
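
A minimal sketch of suggestion 2; the file names and the ``time`` dimension are illustrative:

.. code-block:: python

    import xarray as xr

    ds = xr.open_mfdataset("data/*.nc")  # hypothetical input files

    # Write the temporal mean to disk so the subtraction below reads a
    # finished result back, instead of the scheduler keeping every
    # computed chunk in memory.
    ds.mean(dim="time").to_netcdf("temporal_mean.nc")

    mean = xr.open_dataset("temporal_mean.nc")
    anomaly = ds - mean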
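
A sketch combining suggestions 3 and 4; the chunk sizes and file pattern are
illustrative:

.. code-block:: python

    # Small spatial chunks make spatial subsetting cheap; the h5netcdf
    # engine can be quicker than the default netCDF4 engine.
    ds = xr.open_mfdataset(
        "data/*.nc",
        chunks={"latitude": 10, "longitude": 10},
        engine="h5netcdf",
    )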
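
And a sketch of suggestion 7, assuming bottleneck is installed and ``ds`` has a
``time`` dimension; the window size is arbitrary:

.. code-block:: python

    # With bottleneck available, rolling-window reductions on dask-backed
    # arrays use much less memory.
    smoothed = ds.rolling(time=12).mean()
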
2 changes: 1 addition & 1 deletion doc/io.rst
@@ -87,7 +87,7 @@ string, e.g., to access subgroup 'bar' within group 'foo' pass
pass ``mode='a'`` to ``to_netcdf`` to ensure that each call does not delete the
file.

-Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
+Data is *always* loaded lazily from netCDF files. You can manipulate, slice and subset
Dataset and DataArray objects, and no array values are loaded into memory until
you try to perform some sort of actual computation. For an example of how these
lazy arrays work, see the OPeNDAP section below.
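
As a minimal illustration (the file name and variable are hypothetical):

.. code-block:: python

    import xarray as xr

    ds = xr.open_dataset("example.nc")

    # Slicing builds a lazy view; no array values are read yet.
    subset = ds["temperature"].isel(time=0)

    # Accessing .values (or calling .load()) triggers the actual read.
    print(subset.mean().values)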
