Merge pull request #2597 from bjlittle/dask-merge-back
Dask merge back
corinnebosley authored Jun 14, 2017
2 parents 2b9c2aa + e2eeea3 commit 6bd26b7
Showing 776 changed files with 12,220 additions and 5,695 deletions.
4 changes: 2 additions & 2 deletions .travis.yml
@@ -22,7 +22,7 @@ git:
depth: 10000

install:
- export IRIS_TEST_DATA_REF="7c0e32c8812b464e467a9555bdc25dc1e0c5be0c"
- export IRIS_TEST_DATA_REF="2f3a6bcf25f81bd152b3d66223394074c9069a96"
- export IRIS_TEST_DATA_SUFFIX=$(echo "${IRIS_TEST_DATA_REF}" | sed "s/^v//")

# Install miniconda
@@ -54,7 +54,7 @@ install:
conda install --quiet --file minimal-conda-requirements.txt;
else
if [[ "$TRAVIS_PYTHON_VERSION" == 3* ]]; then
sed -e '/ecmwf_grib/d' -e '/esmpy/d' -e '/iris-grib/d' -e 's/#.\+$//' conda-requirements.txt | xargs conda install --quiet;
sed -e '/ecmwf_grib/d' -e '/esmpy/d' -e 's/#.\+$//' conda-requirements.txt | xargs conda install --quiet;
else
conda install --quiet --file conda-requirements.txt;
fi
7 changes: 0 additions & 7 deletions INSTALL
@@ -80,9 +80,6 @@ numpy 1.9 or later (http://numpy.scipy.org/)
Python package for scientific computing including a powerful N-dimensional
array object.

biggus 0.14 or later (https://github.com/SciTools/biggus)
Virtual large arrays and lazy evaluation.

scipy 0.10 or later (http://www.scipy.org/)
Python package for scientific computing.

@@ -128,10 +125,6 @@ grib-api 1.9.16 or later
edition 2 messages. A compression library such as Jasper is required
to read JPEG2000 compressed GRIB2 files.

iris-grib 0.9 or later
(https://github.com/scitools/iris-grib)
Iris interface to ECMWF's GRIB API

matplotlib 1.2.0 (http://matplotlib.sourceforge.net/)
Python package for 2D plotting.

10 changes: 5 additions & 5 deletions conda-requirements.txt
@@ -2,14 +2,14 @@
# conda create -n <name> --file conda-requirements.txt

# Mandatory dependencies
biggus
cartopy
matplotlib<1.9
netcdf4
numpy
pyke
udunits2
cf_units
dask

# Iris build dependencies
setuptools
@@ -25,12 +25,12 @@ imagehash
requests

# Optional iris dependencies
nc_time_axis
iris-grib
ecmwf_grib
esmpy>=7.0
gdal
libmo_unpack
pandas
pyugrid
mo_pack
nc_time_axis
pandas
python-stratify
pyugrid
2 changes: 0 additions & 2 deletions docs/iris/src/conf.py
@@ -158,8 +158,6 @@
'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
'matplotlib': ('http://matplotlib.org/', None),
'cartopy': ('http://scitools.org.uk/cartopy/docs/latest/', None),
'biggus': ('https://biggus.readthedocs.io/en/latest/', None),
'iris-grib': ('http://iris-grib.readthedocs.io/en/latest/', None),
}


35 changes: 35 additions & 0 deletions docs/iris/src/developers_guide/dask_interface.rst
@@ -0,0 +1,35 @@
Iris Dask Interface
*******************

Iris uses `dask <http://dask.pydata.org>`_ to manage lazy data interfaces and processing graphs.
The key principles that define this interface are:

* A call to :attr:`cube.data` will always load all of the data.

* Once this has happened:

* :attr:`cube.data` is a mutable NumPy masked array or ``ndarray``, and
* ``cube._numpy_array`` is a private NumPy masked array, accessible via :attr:`cube.data`, which may strip off the mask and return a reference to the bare ``ndarray``.

* You can use :attr:`cube.data` to set the data. This accepts:

* a NumPy array (including masked array), which is assigned to ``cube._numpy_array``, or
* a dask array, which is assigned to ``cube._dask_array``, while ``cube._numpy_array`` is set to ``None``.

* ``cube._dask_array`` may be ``None``; otherwise it is expected to be a dask array:

* this may wrap a proxy to a file collection, or
* this may wrap the NumPy array in ``cube._numpy_array``.

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.

* In order to support this mask conversion, cubes have a ``fill_value`` defined as part of their metadata, which may be ``None``.

* Array copying is kept to an absolute minimum:

* array references should always be passed, not new arrays created, unless an explicit copy operation is requested.

* To test for the presence of a dask array of any sort, we use :func:`iris._lazy_data.is_lazy_data`. This is implemented as ``hasattr(data, 'compute')``.
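
As an illustrative sketch of the two conventions above (the real helper lives in
``iris._lazy_data``, and Iris's exact mask handling may differ from this assumed
``masked_invalid`` round-trip)::

    import dask.array as da
    import numpy as np

    def is_lazy_data(data):
        # Anything with a 'compute' method is treated as lazy.
        return hasattr(data, 'compute')

    # Missing points are held as nan inside the dask array; integer data
    # would first be cast to float so that it can carry nans.
    points = np.array([1.0, np.nan, 3.0])
    lazy = da.from_array(points, chunks=points.shape)

    print(is_lazy_data(lazy))    # True
    print(is_lazy_data(points))  # False

    # Realising the data rebuilds the mask from the nan locations.
    real = np.ma.masked_invalid(lazy.compute())
    print(real.mask)             # [False  True False]
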
1 change: 1 addition & 0 deletions docs/iris/src/developers_guide/index.rst
@@ -38,3 +38,4 @@
tests.rst
deprecations.rst
release.rst
dask_interface.rst
7 changes: 4 additions & 3 deletions docs/iris/src/userguide/index.rst
@@ -6,11 +6,11 @@ Iris user guide

How to use the user guide
---------------------------
If you are reading this user guide for the first time, it is strongly recommended that you read the user guide
fully before experimenting with your own data files.


Much of the content has supplementary links to the reference documentation; you will not need to follow these
links in order to understand the guide but they may serve as a useful reference for future exploration.

.. htmlonly::
@@ -30,6 +30,7 @@ User guide table of contents
saving_iris_cubes.rst
navigating_a_cube.rst
subsetting_a_cube.rst
real_and_lazy_data.rst
plotting_a_cube.rst
interpolation_and_regridding.rst
merge_and_concat.rst
4 changes: 2 additions & 2 deletions docs/iris/src/userguide/interpolation_and_regridding.rst
@@ -176,8 +176,8 @@ For example, to mask values that lie beyond the range of the original data:
>>> scheme = iris.analysis.Linear(extrapolation_mode='mask')
>>> new_column = column.interpolate(sample_points, scheme)
>>> print(new_column.coord('altitude').points)
[ nan 494.44451904 588.88891602 683.33325195 777.77783203
872.222229 966.66674805 1061.11108398 1155.55541992 nan]
[-- 494.44451904296875 588.888916015625 683.333251953125 777.77783203125
872.2222290039062 966.666748046875 1061.111083984375 1155.555419921875 --]


.. _caching_an_interpolator:
230 changes: 230 additions & 0 deletions docs/iris/src/userguide/real_and_lazy_data.rst
@@ -0,0 +1,230 @@
.. _real_and_lazy_data:


.. testsetup:: *

import dask.array as da
import iris
import numpy as np


==================
Real and Lazy Data
==================

We have seen in the :doc:`user_guide_introduction` section of the user guide that
Iris cubes contain data and metadata about a phenomenon. The data element of a cube
is always an array, but the array may be either "real" or "lazy".

In this section of the user guide we will look specifically at the concepts of
real and lazy data as they apply to the cube and other data structures in Iris.


What is real and lazy data?
---------------------------

In Iris, we use the term **real data** to describe data arrays that are loaded
into memory. Real data is typically provided as a
`NumPy array <https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html>`_,
which has a shape and data type that are used to describe the array's data points.
Each data point takes up a small amount of memory, which means large NumPy arrays can
take up a large amount of memory.

Conversely, we use the term **lazy data** to describe data that is not loaded into memory.
(This is sometimes also referred to as **deferred data**.)
In Iris, lazy data is provided as a
`dask array <http://dask.pydata.org/en/latest/array-overview.html>`_.
A dask array also has a shape and data type,
but typically the dask array's data points are not loaded into memory.
Instead, the data points are stored on disk and only loaded into memory in
small chunks when absolutely necessary (see the section :ref:`when_real_data`
for examples of when this might happen).

The primary advantage of using lazy data is that it enables
`out-of-core processing <https://en.wikipedia.org/wiki/Out-of-core_algorithm>`_;
that is, the loading and manipulating of datasets that otherwise would not fit into memory.
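
For illustration, the distinction can be seen with a bare dask array built from
a NumPy array (a standalone sketch using dask directly, rather than Iris's
file-backed loading)::

    import dask.array as da
    import numpy as np

    real_array = np.arange(12.0).reshape(3, 4)       # values held in memory
    lazy_array = da.from_array(real_array, chunks=(3, 2))

    # Both arrays know their shape and dtype ...
    print(lazy_array.shape, lazy_array.dtype)        # (3, 4) float64

    # ... but the lazy array's values are only computed on request.
    print(lazy_array.sum().compute())                # 66.0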

You can check whether a cube has real data or lazy data by using the method
:meth:`~iris.cube.Cube.has_lazy_data`. For example::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
# Realise the lazy data.
>>> cube.data
>>> cube.has_lazy_data()
False


.. _when_real_data:

When does my data become real?
------------------------------

When you load a dataset using Iris, the data array will almost always initially be
a lazy array. This section details some operations that will realise lazy data
as well as some operations that will maintain lazy data. We use the term **realise**
to mean converting lazy data into real data.

Most operations on data arrays can be run equivalently on both real and lazy data.
If the data array is real then the operation will be run on the data array
immediately. The results of the operation will be available as soon as processing is completed.
If the data array is lazy then the operation will be deferred and the data array will
remain lazy until you request the result (such as when you call ``cube.data``)::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
>>> cube += 5
>>> cube.has_lazy_data()
True

The process by which the operation is deferred until the result is requested is
referred to as **lazy evaluation**.
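
The deferral itself can be demonstrated with a bare dask array (a minimal
sketch, independent of Iris)::

    import dask.array as da

    arr = da.arange(8, chunks=4)
    result = arr + 5           # deferred: no values are computed yet
    print(type(result))        # <class 'dask.array.core.Array'>
    print(result.compute())    # [ 5  6  7  8  9 10 11 12]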

Certain operations, including regridding and plotting, can only be run on real data.
Calling such operations on lazy data will automatically realise your lazy data.

You can also realise (and so load into memory) your cube's lazy data if you 'touch' the data.
To 'touch' the data means directly accessing the data by calling ``cube.data``,
as in the previous example.

Core data
^^^^^^^^^

Cubes have the concept of "core data": a cube's :meth:`~iris.cube.Cube.core_data` method
returns the cube's data in its current state:

* If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's real NumPy array.

For example::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True

>>> the_data = cube.core_data()
>>> type(the_data)
<class 'dask.array.core.Array'>
>>> cube.has_lazy_data()
True

# Realise the lazy data.
>>> cube.data
>>> the_data = cube.core_data()
>>> type(the_data)
<type 'numpy.ndarray'>
>>> cube.has_lazy_data()
False


Coordinates
-----------

In the same way that Iris cubes contain a data array, Iris coordinates contain a
points array and an optional bounds array.
Coordinate points and bounds arrays can also be real or lazy:

* A :class:`~iris.coords.DimCoord` will only ever have **real** points and bounds
arrays because of monotonicity checks that realise lazy arrays.
* An :class:`~iris.coords.AuxCoord` can have **real or lazy** points and bounds.
* An :class:`~iris.aux_factory.AuxCoordFactory` (or derived coordinate)
can have **real or lazy** points and bounds. If all of the
:class:`~iris.coords.AuxCoord` instances used to construct the derived coordinate
have real points and bounds then the derived coordinate will have real points
and bounds, otherwise the derived coordinate will have lazy points and bounds.

Iris cubes and coordinates have very similar interfaces, and this extends to accessing
coordinates' lazy points and bounds:

.. doctest::

>>> cube = iris.load_cube(iris.sample_data_path('hybrid_height.nc'))

>>> dim_coord = cube.coord('model_level_number')
>>> print(dim_coord.has_lazy_points())
False
>>> print(dim_coord.has_bounds())
False
>>> print(dim_coord.has_lazy_bounds())
False

>>> aux_coord = cube.coord('sigma')
>>> print(aux_coord.has_lazy_points())
True
>>> print(aux_coord.has_bounds())
True
>>> print(aux_coord.has_lazy_bounds())
True

# Realise the lazy points. This will **not** realise the lazy bounds.
>>> points = aux_coord.points
>>> print(aux_coord.has_lazy_points())
False
>>> print(aux_coord.has_lazy_bounds())
True

>>> derived_coord = cube.coord('altitude')
>>> print(derived_coord.has_lazy_points())
True
>>> print(derived_coord.has_bounds())
True
>>> print(derived_coord.has_lazy_bounds())
True

.. note::
Printing a lazy :class:`~iris.coords.AuxCoord` will realise its points and bounds arrays!


Dask processing options
-----------------------

As stated earlier in this user guide section, Iris uses dask to provide
lazy data arrays for both Iris cubes and coordinates. Iris also uses dask
functionality for processing deferred operations on lazy arrays.

Dask provides processing options to control how deferred operations on lazy arrays
are computed. This is provided via the ``dask.set_options`` interface.
We can make use of this functionality in Iris to control how the dask arrays
in Iris are processed, for example giving us the power to run Iris processing
in parallel.

Iris by default applies a single dask processing option. This specifies that
all dask processing in Iris should be run in serial (that is, without any
parallel processing enabled).

The dask processing option applied by Iris can be overridden by manually setting
dask processing options for either or both of:

* the number of parallel workers to use,
* the scheduler to use.

This must be done **before** importing Iris. For example, to specify that dask
processing within Iris should use four workers in a thread pool::

>>> from multiprocessing.pool import ThreadPool
>>> import dask
>>> dask.set_options(get=dask.threaded.get, pool=ThreadPool(4))

>>> import iris
>>> # Iris processing here...

.. note::
These dask processing options will last for the lifetime of the Python session
and must be re-applied in each new session.

Other dask processing options are also available. See the
`dask documentation <http://dask.pydata.org/en/latest/scheduler-overview.html>`_
for more information on setting dask processing options.


Further reading
---------------

This section of the Iris user guide provides a quick overview of real and lazy
data within Iris. For more details on these and related concepts,
see the whitepaper on lazy data.
